本篇博文主要内容为 2026-04-16 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR、MA六个大方向区分。
说明:每日论文数据从Arxiv.org获取,每天早上12:30左右定时自动更新。
提示: 当天未及时更新,有可能是Arxiv当日未有新的论文发布,也有可能是脚本出错。尽可能会在当天修复。
目录
概览 (2026-04-16)
今日共更新592篇论文,其中:
- 自然语言处理共97篇(Computation and Language (cs.CL))
- 人工智能共163篇(Artificial Intelligence (cs.AI))
- 计算机视觉共124篇(Computer Vision and Pattern Recognition (cs.CV))
- 机器学习共170篇(Machine Learning (cs.LG))
- 多智能体系统共12篇(Multiagent Systems (cs.MA))
- 信息检索共18篇(Information Retrieval (cs.IR))
- 人机交互共19篇(Human-Computer Interaction (cs.HC))
多智能体系统
[MA-0] [COMP25] The Automated Negotiating Agents Competition (ANAC) 2025 Challenges and Results IJCAI2026
【速读】:该论文旨在解决多轮谈判(multi-deal negotiations)与复杂供应链管理环境中代理并发谈判能力(concurrent negotiation capability)的两大核心挑战。其解决方案的关键在于通过国际自动谈判代理竞赛(ANAC 2025)的实证研究,系统评估不同算法在上述场景下的表现,并基于竞赛结果提炼出可复用的策略框架与技术路径,从而为未来智能谈判代理的设计提供明确的方向指引。
链接: https://arxiv.org/abs/2604.13914
作者: Reyhan Aydoğan,Tim Baarslag,Tamara C.P. Florijn,Katsuhide Fujita,Catholijn M. Jonker,Yasser Mohammad
机构: Özyeğin Üniversity(Özyeğin大学); Delft University of Technology(代尔夫特理工大学); Centrum Wiskunde Informatica (CWI)(荷兰数学与计算机科学研究中心); Eindhoven University of Technology(埃因霍温理工大学); Utrecht University(乌得勒支大学); Tokyo University of Agriculture and Technology(东京农工大学); National Institute for Advanced Industrial Science and Technology(日本先进工业科学技术研究院); Leiden University(莱顿大学); NEC Corporation(NEC公司)
类目: Multiagent Systems (cs.MA)
备注: Submitted as demo to IJCAI 2026
Abstract:This paper presents the primary research challenges and key findings from the 15th International Automated Negotiating Agents Competition (ANAC 2025), one of the official competitions of IJCAI 2025. We focus on two critical domains: multi-deal negotiations and the development of agents capable of concurrent negotiation within complex supply chain management environments. Furthermore, this work analyzes the results of the competition and outlines strategic directions for future iterations.
[MA-1] Beyond Arrows Impossibility: Fairness as an Emergent Property of Multi-Agent Collaboration
【速读】:该论文试图解决大语言模型(Large Language Models, LLMs)在公平性研究中长期存在的单一中心优化范式局限问题,即如何在多代理交互环境中实现更动态、可协商的公平性机制。其解决方案的关键在于提出“公平性是通过代理间交互与交换而涌现”的新视角,并基于一个受控的医院分诊框架进行实证验证:两个代理在三轮结构化辩论中协商资源分配,其中一方通过检索增强生成(Retrieval-Augmented Generation, RAG)对齐特定伦理框架,另一方则可能未对齐或被恶意提示以偏袒特定人口群体而非临床需求。结果表明,尽管单独个体的分配均不满足公平标准,但两者联合决策却能达成单个代理无法实现的公平结果;且对齐代理通过“对抗性修正”而非直接取代的方式部分缓解偏差,体现出系统级公平的形成机制。这一发现将公平性重新定义为分布式代理互动中的程序性属性,强调评价单位应从个体代理转向整体交互系统。
链接: https://arxiv.org/abs/2604.13705
作者: Sayan Kumar Chaki,Antoine Gourru,Julien Velcin
机构: Laboratoire Hubert Curien, UMR CNRS 5516; École Centrale de Lyon, LIRIS CNRS UMR 5205
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
备注:
Abstract:Fairness in language models is typically studied as a property of a single, centrally optimized model. As large language models become increasingly agentic, we propose that fairness emerges through interaction and exchange. We study this via a controlled hospital triage framework in which two agents negotiate over three structured debate rounds. One agent is aligned to a specific ethical framework via retrieval-augmented generation (RAG), while the other is either unaligned or adversarially prompted to favor demographic groups over clinical need. We find that alignment systematically shapes negotiation strategies and allocation patterns, and that neither agent’s allocation is ethically adequate in isolation, yet their joint final allocation can satisfy fairness criteria that neither would have reached alone. Aligned agents partially moderate bias through contestation rather than override, acting as corrective patches that restore access for marginalized groups without fully converting a biased counterpart. We further observe that even explicitly aligned agents exhibit intrinsic biases toward certain frameworks, consistent with known left-leaning tendencies in LLMs. We connect these limits to Arrow’s Impossibility Theorem: no aggregation mechanism can simultaneously satisfy all desiderata of collective rationality, and multi-agent deliberation navigates rather than resolves this constraint. Our results reposition fairness as an emergent, procedural property of decentralized agent interaction, and the system rather than the individual agent as the appropriate unit of evaluation.
[MA-2] MIND: AI Co-Scientist for Material Research ECML KDD2026
【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的智能体系统在科学发现中普遍存在的局限性——即主要依赖文本推理而缺乏自动化实验验证的问题。其解决方案的关键在于提出MIND框架,该框架通过多智能体流水线实现假设精炼、实验验证与基于辩论的验证闭环;其中,核心创新是集成机器学习原子间势(Machine Learning Interatomic Potentials),特别是SevenNet-Omni模型,以支持可扩展的虚拟实验(in-silico experiments),从而实现从假设生成到自动验证的全流程闭环,显著提升了材料研究中AI驱动科学发现的可信度与实用性。
链接: https://arxiv.org/abs/2604.13699
作者: Geonhee Ahn,Donghyun Lee,Hayoung Doo,Jonggeol Na,Hyunsoo Cho,Sookyung Kim
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注: 4 pages, 3 figures. Under review for ECML PKDD 2026 Demonstration Track. Code available at this https URL . Demo video available at this https URL
Abstract:Large language models (LLMs) have enabled agentic AI systems for scientific discovery, but most approaches remain limited to textbased reasoning without automated experimental verification. We propose MIND, an LLM-driven framework for automated hypothesis validation in materials research. MIND organizes the scientific discovery process into hypothesis refinement, experimentation, and debate-based validation within a multi-agent pipeline. For experimental verification, the system integrates Machine Learning Interatomic Potentials, particularly SevenNet-Omni, enabling scalable in-silico experiments. We also provide a web-based user interface for automated hypothesis testing. The modular design allows additional experimental modules to be integrated, making the framework adaptable to broader scientific workflows. The code is available at: this https URL, and a demonstration video at: this https URL.
[MA-3] Chain of Uncertain Rewards with Large Language Models for Reinforcement Learning
【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)中奖励函数设计效率低、成本高且易受冗余和局部不确定性影响的问题。传统方法依赖大量人工设计与评估,难以捕捉中间决策点的局部不确定性,导致优化过程低效。其解决方案的关键在于提出链式不确定奖励(Chain of Uncertain Rewards, CoUR)框架,该框架利用大语言模型(Large Language Models, LLMs)实现奖励函数组件的自动识别与复用,通过引入代码不确定性量化和相似性选择机制,结合文本与语义分析来筛选最相关组件;同时采用贝叶斯优化对解耦后的奖励项进行高效搜索,从而显著减少重复评估并提升奖励函数设计的鲁棒性与效率。
链接: https://arxiv.org/abs/2604.13504
作者: Shentong Mo
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA); Robotics (cs.RO)
备注:
Abstract:Designing effective reward functions is a cornerstone of reinforcement learning (RL), yet it remains a challenging and labor-intensive process due to the inefficiencies and inconsistencies inherent in traditional methods. Existing methods often rely on extensive manual design and evaluation steps, which are prone to redundancy and overlook local uncertainties at intermediate decision points. To address these challenges, we propose the Chain of Uncertain Rewards (CoUR), a novel framework that integrates large language models (LLMs) to streamline reward function design and evaluation in RL environments. Specifically, our CoUR introduces code uncertainty quantification with a similarity selection mechanism that combines textual and semantic analyses to identify and reuse the most relevant reward function components. By reducing redundant evaluations and leveraging Bayesian optimization on decoupled reward terms, CoUR enables a more efficient and robust search for optimal reward feedback. We comprehensively evaluate CoUR across nine original environments from IsaacGym and all 20 tasks from the Bidexterous Manipulation benchmark. The experimental results demonstrate that CoUR not only achieves better performance but also significantly lowers the cost of reward evaluations.
[MA-4] Bridging MARL to SARL: An Order-Independent Multi-Agent Transformer via Latent Consensus
【速读】:该论文旨在解决协作式多智能体强化学习(Cooperative Multi-Agent Reinforcement Learning, MARL)中因分解联合观测与动作空间所引入的非平稳性、训练不稳定、协调能力弱及理论保障不足等问题。其核心解决方案是提出了一种名为共识多智能体Transformer(Consensus Multi-Agent Transformer, CMAT)的集中式框架,该框架将所有智能体视为统一实体,并利用Transformer编码器处理大规模联合观测空间;同时通过引入层次化决策机制——即由Transformer解码器自回归生成一个高维共识向量(consensus vector),模拟智能体在潜在空间中达成策略一致的过程——并基于此共识向量同步生成各智能体的动作,从而实现与动作生成顺序无关的联合决策,避免了传统多智能体Transformer(Multi-Agent Transformer, MAT)对动作生成顺序的敏感性。该设计使得联合策略可使用单智能体PPO进行优化,同时保留了通过潜在共识表达的强协调能力。
链接: https://arxiv.org/abs/2604.13472
作者: Zijian Zhao,Jing Gao,Sen Li
机构: The Hong Kong University of Science and Technology (香港科技大学); The Hong Kong Polytechnic University (香港理工大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:
Abstract:Cooperative multi-agent reinforcement learning (MARL) is widely used to address large joint observation and action spaces by decomposing a centralized control problem into multiple interacting agents. However, such decomposition often introduces additional challenges, including non-stationarity, unstable training, weak coordination, and limited theoretical guarantees. In this paper, we propose the Consensus Multi-Agent Transformer (CMAT), a centralized framework that bridges cooperative MARL to a hierarchical single-agent reinforcement learning (SARL) formulation. CMAT treats all agents as a unified entity and employs a Transformer encoder to process the large joint observation space. To handle the extensive joint action space, we introduce a hierarchical decision-making mechanism in which a Transformer decoder autoregressively generates a high-level consensus vector, simulating the process by which agents reach agreement on their strategies in latent space. Conditioned on this consensus, all agents generate their actions simultaneously, enabling order-independent joint decision making and avoiding the sensitivity to action-generation order in conventional Multi-Agent Transformers (MAT). This factorization allows the joint policy to be optimized using single-agent PPO while preserving expressive coordination through the latent consensus. To evaluate the proposed method, we conduct experiments on benchmark tasks from StarCraft II, Multi-Agent MuJoCo, and Google Research Football. The results show that CMAT achieves superior performance over recent centralized solutions, sequential MARL methods, and conventional MARL baselines. The code for this paper is available at:this https URL .
[MA-5] Learning Probabilistic Responsibility Allocations for Multi-Agent Interactions
【速读】:该论文旨在解决多智能体交互中责任分配(responsibility allocation)的建模问题,即如何量化个体在互动中为适应他人而偏离自身最优策略的程度,从而为设计社会合规且可信的自主系统提供依据。其解决方案的关键在于提出一种基于条件变分自编码器(conditional variational autoencoder)潜在空间的概率责任分配模型,结合多智能体轨迹预测技术,学习以场景和智能体上下文为条件的责任分配分布;并通过引入可微优化层将责任分配映射为可获取的控制输入,从而在无真实责任标签的情况下实现模型训练与推断,最终在INTERACTION驾驶数据集上验证了其预测性能与可解释性。
链接: https://arxiv.org/abs/2604.13128
作者: Isaac Remy,Caleb Chang,Karen Leung
机构: University of Washington (华盛顿大学); NVIDIA (英伟达)
类目: Multiagent Systems (cs.MA); Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)
备注:
Abstract:Human behavior in interactive settings is shaped not only by individual objectives but also by shared constraints with others, such as safety. Understanding how people allocate responsibility, i.e., how much one deviates from their desired policy to accommodate others, can inform the design of socially compliant and trustworthy autonomous systems. In this work, we introduce a method for learning a probabilistic responsibility allocation model that captures the multimodal uncertainty inherent in multi-agent interactions. Specifically, our approach leverages the latent space of a conditional variational autoencoder, combined with techniques from multi-agent trajectory forecasting, to learn a distribution over responsibility allocations conditioned on scene and agent context. Although ground-truth responsibility labels are unavailable, the model remains tractable by incorporating a differentiable optimization layer that maps responsibility allocations to induced controls, which are available. We evaluate our method on the INTERACTION driving dataset and demonstrate that it not only achieves strong predictive performance but also provides interpretable insights, through the lens of responsibility, into patterns of multi-agent interaction.
[MA-6] Fairness in Multi-Agent Systems for Software Engineering: An SDLC-Oriented Rapid Review
【速读】:该论文旨在解决多智能体系统(Multi-Agent Systems, MAS)在软件开发生命周期(Software Development Lifecycle, SDLC)中嵌入生成式AI(Generative AI)模型后,其公平性问题尚未得到充分研究的挑战。核心问题是当前MAS相关工具在代码编写、审查与发布等环节可能隐含偏见或不公平行为,但缺乏系统性的评估框架和可落地的治理机制。解决方案的关键在于:通过快速文献综述筛选出18项相关研究,识别出公平性的三重维度——可信AI原则、群体间偏差减少及集体互动动态,并揭示三大关键差距:(1)评估方法碎片化且缺乏MAS特异性,(2)环境简化导致泛化能力弱,(3)缓解策略与治理机制稀缺且验证不足。论文据此提出亟需构建面向MAS的基准测试、统一评估协议以及贯穿SDLC的治理框架,以推动公平性保障的可部署软件系统发展。
链接: https://arxiv.org/abs/2604.13103
作者: Corey Yang-Smith,Ronnie de Souza Santos,Ahmad Abdellatif
机构: University of Calgary (卡尔加里大学)
类目: oftware Engineering (cs.SE); Multiagent Systems (cs.MA)
备注: 8 pages, 4 figures. Accepted to the LLMTrust workshop at FSE Companion 2026
Abstract:Transformer-based large language models (LLMs) and multi-agent systems (MAS) are increasingly embedded across the software development lifecycle (SDLC), yet their fairness implications for developer-facing tools remain underexplored despite their growing role in shaping what code is written, reviewed, and released. We present a rapid review of recent work on fairness in MAS, emphasizing LLM-enabled settings and relevance to software engineering. Starting from an initial set of 350 papers, we screened and filtered the corpus for relevance, retaining 18 studies for final analysis. Across these 18 studies, fairness is framed as a combination of trustworthy AI principles, bias reduction across groups, and interactional dynamics in collectives, while evaluation spans accuracy metrics on bias benchmarks, demographic disparity measures, and emergent MAS-specific notions such as conformity and bias amplification. Reported harms include representational, quality-of-service, security and privacy, and governance failures, which we relate to SDLC stages where evidence is most and least developed. We identify three persistent gaps: (1) fragmented, rarely MAS-specific evaluation practices that limit comparability, (2) limited generalization due to simplified environments and narrow attribute coverage, and (3) scarce, weakly evaluated mitigation and governance mechanisms aligned to real software workflows. These findings suggest MAS fairness research is not yet ready to support deployable, fairness-assured software systems, motivating MAS-aware benchmarks, consistent protocols, and lifecycle-spanning governance.
[MA-7] C2T: Captioning-Structure and LLM -Aligned Common-Sense Reward Learning for Traffic–Vehicle Coordination CVPR2026
【速读】:该论文旨在解决当前基于多智能体强化学习(Multi-Agent Reinforcement Learning, MARL)的城市场景交通信号控制(Traffic Light Control, TLC)系统中,因依赖人工设计且短视的奖励函数(如交叉口压力)而导致性能受限的问题。此类奖励无法有效体现人类中心的目标,如安全性、流稳定性和舒适性。解决方案的关键在于提出C2T框架,该框架通过从大型语言模型(Large Language Model, LLM)中蒸馏“常识”知识,构建一个可学习的内在奖励函数(intrinsic reward function),并将其用于指导协同式多交叉口TLC的MARL策略优化。这一机制显著提升了交通效率、安全性和能源相关指标,并具备灵活调整策略方向(如效率优先或安全优先)的能力,仅需修改LLM提示词即可实现。
链接: https://arxiv.org/abs/2604.13098
作者: Yuyang Chen,Kaiyan Zhao,Yiming Wang,Ming Yang,Bin Rao,Zhenning Li
机构: University of Macau(澳门大学); Wuhan University(武汉大学); Hong Kong Polytechnic University(香港理工大学); University of Macau(澳门大学)
类目: Multiagent Systems (cs.MA); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted to CVPR 2026 Findings Track
Abstract:State-of-the-art (SOTA) urban traffic control increasingly employs Multi-Agent Reinforcement Learning (MARL) to coordinate Traffic Light Controllers (TLCs) and Connected Autonomous Vehicles (CAVs). However, the performance of these systems is fundamentally capped by their hand-crafted, myopic rewards (e.g., intersection pressure), which fail to capture high-level, human-centric goals like safety, flow stability, and comfort. To overcome this limitation, we introduce C2T, a novel framework that learns a common-sense coordination model from traffic-vehicle dynamics. C2T distills “common-sense” knowledge from a Large Language Model (LLM) into a learned intrinsic reward function. This new reward is then used to guide the coordination policy of a cooperative multi-intersection TLC MARL system on CityFlow-based multi-intersection benchmarks. Our framework significantly outperforms strong MARL baselines in traffic efficiency, safety, and an energy-related proxy. We further highlight C2T’s flexibility in principle, allowing distinct “efficiency-focused” versus “safety-focused” policies by modifying the LLM prompt.
[MA-8] Form Without Function: Agent Social Behavior in the Moltbook Network
【速读】:该论文旨在揭示生成式 AI (Generative AI) 社交平台中“形式存在但功能缺失”的社会技术系统特征,即尽管平台复刻了人类社交网络的结构与交互形式,但其核心的社会互动功能(如互惠性、论辩关联性和内容一致性)严重失效。解决方案的关键在于识别并分析三个层次的失效机制:在交互层,AI代理间缺乏持续参与和有效对话(如91.4%的发帖者不回帖、97.3%的评论无点赞);在内容层,代理行为与其身份描述脱节(97.9%的代理未在其生物信息匹配的社区发帖),且内容高度同质化(92.5%的社区话题比例均衡);在指令层,软性引导被忽略,仅硬性约束(如速率限制、内容过滤)能触发即时行为改变。论文进一步指出,此类平台存在显著技术风险(如凭证泄露、恶意攻击 discourse),且质量过滤机制本身失效,表明当前生成式 AI 社交系统的“社会层”未能形成,导致其无法实现真正意义上的社会功能。
链接: https://arxiv.org/abs/2604.13052
作者: Saber Zerhoudi,Kanishka Ghosh Dastidar,Felix Klement,Artur Romazanov,Andreas Einwiller,Dang H. Dang,Michael Dinzinger,Michael Granitzer,Annette Hautli-Janisz,Stefan Katzenbeisser,Florian Lemmerich,Jelena Mitrovic
机构: University of Passau (帕绍大学); IT:U Austria (IT:U奥地利)
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Multiagent Systems (cs.MA)
备注:
Abstract:Moltbook is a social network where every participant is an AI agent. We analyze 1,312,238 posts, 6.7~million comments, and over 120,000 agent profiles across 5,400 communities, collected over 40 days (January 27 to March 9, 2026). We evaluate the platform through three layers. At the interaction layer, 91.4% of post authors never return to their own threads, 85.6% of conversations are flat (no reply ever receives a reply), the median time-to-first-comment is 55 seconds, and 97.3% of comments receive zero upvotes. Interaction reciprocity is 3.3%, compared to 22-60% on human platforms. An argumentation analysis finds that 64.6% of comment-to-post relations carry no argumentative connection. At the content layer, 97.9% of agents never post in a community matching their bio, 92.5% of communities contain every topic in roughly equal proportions, and over 80% of shared URLs point to the platform’s own infrastructure. At the instruction layer, we use 41 Wayback Machine snapshots to identify six instruction changes during the observation window. Hard constraints (rate limit, content filters) produce immediate behavioral shifts. Soft guidance (upvote good posts'', stay on topic’') is ignored until it becomes an explicit step in the executable checklist. The platform also poses technological risks. We document credential leaks (API keys, JWT tokens), 12,470 unique Ethereum addresses with 3,529 confirmed transaction histories, and attack discourse ranging from template-based SSH brute-forcing to multi-agent offensive security architectures. These persist unmoderated because the quality-filtering mechanisms are themselves non-functional. Moltbook is a socio-technical system where the technical layer responds to changes, but the social layer largely fails to emerge. The form of social media is reproduced in full. The function is absent.
[MA-9] Modality-Native Routing in Agent -to-Agent Networks: A Multimodal A2A Protocol Extension
【速读】:该论文旨在解决多智能体系统中跨模态推理的准确性问题,即如何在不同智能体(Agent)之间有效保留和传递多种模态信号(如语音、图像、文本),以支持更精准的跨模态推理。现有方法常采用“文本瓶颈”(text-bottleneck)策略,将所有模态信息压缩为文本形式进行传输,导致关键信息丢失。解决方案的关键在于提出MMA2A架构,通过基于Agent Card能力声明的原生模态路由机制(modality-native routing),在Agent-to-Agent(A2A)网络中直接按原始模态(voice, image, text)传递信息,从而保留更丰富的上下文。实验证明,该方案在CrossModal-CS基准上使任务完成准确率从32%提升至52%,尤其在视觉依赖任务上提升显著(如产品缺陷报告+38.5个百分点),但前提是下游推理智能体具备利用这些丰富上下文的能力——这揭示了路由协议与推理能力之间的协同必要性,即路由是多智能体系统设计中的首要变量。
链接: https://arxiv.org/abs/2604.12213
作者: Vasundra Srinivasan
机构: Stanford School of Engineering (斯坦福大学工程学院)
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Software Engineering (cs.SE)
备注: 14 pages, 4 figures (TikZ). PDFLaTeX. Supplementary code and experiment artifacts: this https URL
Abstract:Preserving multimodal signals across agent boundaries is necessary for accurate cross-modal reasoning, but it is not sufficient. We show that modality-native routing in Agent-to-Agent (A2A) networks improves task accuracy by 20 percentage points over text-bottleneck baselines, but only when the downstream reasoning agent can exploit the richer context that native routing preserves. An ablation replacing LLM-backed reasoning with keyword matching eliminates the accuracy gap entirely (36% vs. 36%), establishing a two-layer requirement: protocol-level routing must be paired with capable agent-level reasoning for the benefit to materialize. We present MMA2A, an architecture layer atop A2A that inspects Agent Card capability declarations to route voice, image, and text parts in their native modality. On CrossModal-CS, a controlled 50-task benchmark with the same LLM backend, same tasks, and only the routing path varying, MMA2A achieves 52% task completion accuracy versus 32% for the text-bottleneck baseline (95% bootstrap CI on \Delta TCA: [8, 32] pp; McNemar’s exact p = 0.006 ). Gains concentrate on vision-dependent tasks: product defect reports improve by +38.5 pp and visual troubleshooting by +16.7 pp. This accuracy gain comes at a 1.8\times latency cost from native multimodal processing. These results suggest that routing is a first-order design variable in multi-agent systems, as it determines the information available for downstream reasoning. Comments: 14 pages, 4 figures (TikZ). PDFLaTeX. Supplementary code and experiment artifacts: this https URL Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Software Engineering (cs.SE) Cite as: arXiv:2604.12213 [cs.AI] (or arXiv:2604.12213v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2604.12213 Focus to learn more arXiv-issued DOI via DataCite
[MA-10] When Reasoning Models Hurt Behavioral Simulation: A Solver-Sampler Mismatch in Multi-Agent LLM Negotiation
【速读】:该论文旨在解决生成式 AI(Generative AI)在多智能体社会模拟中因强化推理能力而导致的行为模拟失真问题,即模型可能从“行为采样器”退化为“策略求解器”,从而破坏模拟的真实性。其核心解决方案在于引入“有限反思”(bounded reflection)机制,通过限制模型的推理深度和范围,使其更倾向于产生多样化且以妥协为导向的行为轨迹,而非过度优化于占优策略。实验证明,在三个不同场景下,采用有限反思的模型比无反思或原生推理模式更能维持行为多样性与结果层面的真实性,凸显了在行为模拟任务中区分“求解能力”与“采样能力”的重要性。
链接: https://arxiv.org/abs/2604.11840
作者: Sandro Andric
机构: New York University (纽约大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Multiagent Systems (cs.MA)
备注: 12 pages, 5 figures, supplementary material included as ancillary file
Abstract:Large language models are increasingly used as agents in social, economic, and policy simulations. A common assumption is that stronger reasoning should improve simulation fidelity. We argue that this assumption can fail when the objective is not to solve a strategic problem, but to sample plausible boundedly rational behavior. In such settings, reasoning-enhanced models can become better solvers and worse simulators: they can over-optimize for strategically dominant actions, collapse compromise-oriented terminal behavior, and sometimes exhibit a diversity-without-fidelity pattern in which local variation survives without outcome-level fidelity. We study this solver-sampler mismatch in three multi-agent negotiation environments adapted from earlier simulation work: an ambiguous fragmented-authority trading-limits scenario, an ambiguous unified-opposition trading-limits scenario, and a new-domain grid-curtailment case in emergency electricity management. We compare three reflection conditions, no reflection, bounded reflection, and native reasoning, across two primary model families and then extend the same protocol to direct OpenAI runs with GPT-4.1 and GPT-5.2. Across all three experiments, bounded reflection produces substantially more diverse and compromise-oriented trajectories than either no reflection or native reasoning. In the direct OpenAI extension, GPT-5.2 native ends in authority decisions in 45 of 45 runs across the three experiments, while GPT-5.2 bounded recovers compromise outcomes in every environment. The contribution is not a claim that reasoning is generally harmful. It is a methodological warning: model capability and simulation fidelity are different objectives, and behavioral simulation should qualify models as samplers, not only as solvers.
[MA-11] Bridging Protocol and Production: Design Patterns for Deploying AI Agents with Model Context Protocol
【速读】:该论文旨在解决模型上下文协议(Model Context Protocol, MCP)在生产环境中缺乏对AI代理安全调用外部工具的标准化机制问题,具体表现为身份传递、自适应工具预算分配和结构化错误语义三个关键协议级原语的缺失。解决方案的核心在于提出三项可验证的机制:(1) 上下文感知代理协议(Context-Aware Broker Protocol, CABP),通过六阶段代理流水线实现基于身份作用域的请求路由;(2) 自适应超时预算分配(Adaptive Timeout Budget Allocation, ATBA),将顺序工具调用建模为异构延迟分布下的预算分配问题;(3) 结构化错误恢复框架(Structured Error Recovery Framework, SERF),提供机器可读的失败语义以支持代理的确定性自我修正。这些机制共同填补了MCP在生产级可靠性方面的空白,并通过五维故障分类与可复现实验方法验证其有效性。
链接: https://arxiv.org/abs/2603.13417
作者: Vasundra Srinivasan
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 23 pages, 5 figures, 4 tables
Abstract:The Model Context Protocol (MCP) standardizes how AI agents discover and invoke external tools, with over 10,000 active servers and 97 million monthly SDK downloads as of early 2026. Yet MCP does not yet standardize how agents safely operate those tools at production scale. Three protocol-level primitives remain missing: identity propagation, adaptive tool budgeting, and structured error semantics. This paper identifies these gaps through field lessons from an enterprise deployment of an AI agent platform integrated with a major cloud provider’s MCP servers (client name redacted). We propose three mechanisms to fill them: (1) the Context-Aware Broker Protocol (CABP), which extends JSON-RPC with identity-scoped request routing via a six-stage broker pipeline; (2) Adaptive Timeout Budget Allocation (ATBA), which frames sequential tool invocation as a budget allocation problem over heterogeneous latency distributions; and (3) the Structured Error Recovery Framework (SERF), which provides machine-readable failure semantics that enable deterministic agent self-correction. We organize production failure modes into five design dimensions (server contracts, user context, timeouts, errors, and observability), document concrete failure vignettes, and present a production readiness checklist. All three algorithms are formalized as testable hypotheses with reproducible experimental methodology. Field observations demonstrate that while MCP provides a solid protocol foundation, reliable agent tool integration requires infrastructure-level mechanisms that the specification does not yet address.
自然语言处理
[NLP-0] SpatialEvo: Self-Evolving Spatial Intelligence via Deterministic Geometric Environments
【速读】: 该论文旨在解决3D空间推理任务中因几何标注成本高昂而导致模型持续改进受限的问题。传统自进化(self-evolving)方法依赖模型共识生成伪标签,容易放大而非纠正模型自身的几何错误。其解决方案的关键在于提出一个名为SpatialEvo的框架,核心是构建Deterministic Geometric Environment(DGE),该环境利用点云和相机位姿可精确计算真实答案的确定性特性,将未标注的3D场景转化为零噪声的交互式 oracle,从而用客观物理反馈替代模型共识;在此基础上,通过共享参数策略在提问者与求解者角色间协同进化,并结合任务自适应调度器动态聚焦于模型薄弱类别,实现无需人工设计的动态课程学习。
链接: https://arxiv.org/abs/2604.14144
作者: Dinging Li,Yingxiu Zhao,Xinrui Cheng,Kangheng Lin,Hongbo Peng,Hongxing Li,Zixuan Wang,Yuhong Dai,Haodong Li,Jia Wang,Yukang Shi,Liang Zhao,Jianjian Sun,Zheng Ge,Xiangyu Zhang,Weiming Lu,Jun Xiao,Yueting Zhuang,Yongliang Shen
机构: Zhejiang University (浙江大学); StepFun
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
Abstract:Spatial reasoning over three-dimensional scenes is a core capability for embodied intelligence, yet continuous model improvement remains bottlenecked by the cost of geometric annotation. The self-evolving paradigm offers a promising path, but its reliance on model consensus to construct pseudo-labels causes training to reinforce rather than correct the model’s own geometric errors. We identify a property unique to 3D spatial reasoning that circumvents this limitation: ground truth is a deterministic consequence of the underlying geometry, computable exactly from point clouds and camera poses without any model involvement. Building on this insight, we present SpatialEvo, a self-evolving framework for 3D spatial reasoning, centered on the Deterministic Geometric Environment (DGE). The DGE formalizes 16 spatial reasoning task categories under explicit geometric validation rules and converts unannotated 3D scenes into zero-noise interactive oracles, replacing model consensus with objective physical feedback. A single shared-parameter policy co-evolves across questioner and solver roles under DGE constraints: the questioner generates physically valid spatial questions grounded in scene observations, while the solver derives precise answers against DGE-verified ground truth. A task-adaptive scheduler endogenously concentrates training on the model’s weakest categories, producing a dynamic curriculum without manual design. Experiments across nine benchmarks demonstrate that SpatialEvo achieves the highest average score at both 3B and 7B scales, with consistent gains on spatial reasoning benchmarks and no degradation on general visual understanding.
[NLP-1] From P(y|x) to P(y): Investigating Reinforcement Learning in Pre-train Space
【速读】: 该论文旨在解决强化学习中基于可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)的局限性问题,即其优化能力受限于基础模型输出分布 P(y) 的固有约束,难以实现对推理能力的根本提升。为突破这一瓶颈,论文提出在预训练空间(Pre-train Space)中直接优化 P(y),从而增强模型的推理潜力并保留广泛的探索能力。解决方案的关键在于引入 PreRL(Pre-train Space RL),通过奖励驱动的在线更新机制直接调整 P(y),并发现负样本强化(Negative Sample Reinforcement, NSR)是驱动推理能力跃迁的核心机制:NSR 能够快速修剪错误推理空间,并激发内生的反思行为,使推理和反思思维次数分别提升 14.89 倍和 6.54 倍。在此基础上,进一步提出 Dual Space RL(DSRL),采用“策略重生”策略,先用 NSR-PreRL 初始化模型以扩展推理范围,再转入标准 RL 进行精细化优化,实验证明该方法能持续优于强基线,验证了预训练空间中的策略修剪可有效引导策略收敛至更优的正确推理子空间。
链接: https://arxiv.org/abs/2604.14142
作者: Yuqiao Tan,Minzheng Wang,Bo Liu,Zichen Liu,Tian Liang,Shizhu He,Jun Zhao,Kang Liu
机构: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); University of Chinese Academy of Sciences (中国科学院大学); National University of Singapore (新加坡国立大学); Tencent AI Lab (腾讯人工智能实验室)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Preprint. Our code is available at this https URL
Abstract:While reinforcement learning with verifiable rewards (RLVR) significantly enhances LLM reasoning by optimizing the conditional distribution P(y|x), its potential is fundamentally bounded by the base model’s existing output distribution. Optimizing the marginal distribution P(y) in the Pre-train Space addresses this bottleneck by encoding reasoning ability and preserving broad exploration capacity. Yet, conventional pre-training relies on static corpora for passive learning, leading to a distribution shift that hinders targeted reasoning enhancement. In this paper, we introduce PreRL (Pre-train Space RL), which applies reward-driven online updates directly to P(y). We theoretically and empirically validate the strong gradient alignment between log P(y) and log P(y|x), establishing PreRL as a viable surrogate for standard RL. Furthermore, we uncover a critical mechanism: Negative Sample Reinforcement (NSR) within PreRL serves as an exceptionally effective driver for reasoning. NSR-PreRL rapidly prunes incorrect reasoning spaces while stimulating endogenous reflective behaviors, increasing transition and reflection thoughts by 14.89x and 6.54x, respectively. Leveraging these insights, we propose Dual Space RL (DSRL), a Policy Reincarnation strategy that initializes models with NSR-PreRL to expand the reasoning horizon before transitioning to standard RL for fine-grained optimization. Extensive experiments demonstrate that DSRL consistently outperforms strong baselines, proving that pre-train space pruning effectively steers the policy toward a refined correct reasoning subspace.
[NLP-2] From Feelings to Metrics: Understanding and Formalizing How Users Vibe-Test LLM s
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)评估中基准分数与实际应用价值脱节的问题,即传统评测指标难以反映模型在真实场景中的有用性。其核心挑战在于用户常依赖“ vibe-testing”——一种基于个人经验的非结构化、主观的评估方式,如通过与自身工作流相关的编码任务比较模型表现,但此类方法缺乏系统性和可复现性。解决方案的关键在于将 vibe-testing 形式化为一个两阶段过程:一是用户个性化地设计测试内容(prompt),二是采用用户感知的主观标准评判输出结果。作者构建了一个概念验证评估流程,通过生成个性化提示并结合用户导向的评价准则进行模型对比,在编码基准上验证了该方法能显著改变模型偏好顺序,从而证明形式化的 vibe-testing 可作为连接基准分数与现实体验的有效桥梁。
链接: https://arxiv.org/abs/2604.14137
作者: Itay Itzhak,Eliya Habba,Gabriel Stanovsky,Yonatan Belinkov
机构: Technion – Israel Institute of Technology (以色列理工学院); The Hebrew University of Jerusalem (耶路撒冷希伯来大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: TLDR: Under review. 42 pages, 18 figures. Code and data at this https URL Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract:Evaluating LLMs is challenging, as benchmark scores often fail to capture models’ real-world usefulness. Instead, users often rely on ``vibe-testing’': informal experience-based evaluation, such as comparing models on coding tasks related to their own workflow. While prevalent, vibe-testing is often too ad hoc and unstructured to analyze or reproduce at scale. In this work, we study how vibe-testing works in practice and then formalize it to support systematic analysis. We first analyze two empirical resources: (1) a survey of user evaluation practices, and (2) a collection of in-the-wild model comparison reports from blogs and social media. Based on these resources, we formalize vibe-testing as a two-part process: users personalize both what they test and how they judge responses. We then introduce a proof-of-concept evaluation pipeline that follows this formulation by generating personalized prompts and comparing model outputs using user-aware subjective criteria. In experiments on coding benchmarks, we find that combining personalized prompts and user-aware evaluation can change which model is preferred, reflecting the role of vibe-testing in practice. These findings suggest that formalized vibe-testing can serve as a useful approach for bridging benchmark scores and real-world experience.
[NLP-3] Rhetorical Questions in LLM Representations: A Linear Probing Study ACL2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)如何内部表征修辞性疑问句(rhetorical questions)这一问题。修辞性疑问句并非用于获取信息,而是用于说服或表达立场,其在LLM中的表示机制尚不明确。研究的关键在于通过在线性探测器(linear probes)上对两个具有不同话语语境的社会媒体数据集进行分析,发现修辞信号在模型中早期出现,并且最稳定地由最后一 token 的表示捕获;同时,修辞性疑问句在单个数据集中可被线性区分,且在跨数据集迁移时仍能保持一定检测性能(AUROC约为0.7–0.8)。然而,作者进一步指出,这种迁移能力并不意味着存在共享的表示空间——不同数据集训练的探测器在相同目标语料上的排名差异显著,Top实例重叠度常低于0.2,且定性分析表明这些差异对应于不同的修辞现象:一类捕捉整体话语层面的立场表达,另一类则聚焦局部语法驱动的疑问行为。因此,结论是LLM中修辞性疑问句的编码由多个线性方向组成,各自强调不同线索,而非单一共享方向。
链接: https://arxiv.org/abs/2604.14128
作者: Louie Hong Yao,Vishesh Anand,Yuan Zhuang,Tianyu Jiang
机构: University of Cincinnati (辛辛那提大学); Amazon (亚马逊)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 18 pages, 15 figures, accepted to ACL 2026
Abstract:Rhetorical questions are asked not to seek information but to persuade or signal stance. How large language models internally represent them remains unclear. We analyze rhetorical questions in LLM representations using linear probes on two social-media datasets with different discourse contexts, and find that rhetorical signals emerge early and are most stably captured by last-token representations. Rhetorical questions are linearly separable from information-seeking questions within datasets, and remain detectable under cross-dataset transfer, reaching AUROC around 0.7-0.8. However, we demonstrate that transferability does not simply imply a shared representation. Probes trained on different datasets produce different rankings when applied to the same target corpus, with overlap among the top-ranked instances often below 0.2. Qualitative analysis shows that these divergences correspond to distinct rhetorical phenomena: some probes capture discourse-level rhetorical stance embedded in extended argumentation, while others emphasize localized, syntax-driven interrogative acts. Together, these findings suggest that rhetorical questions in LLM representations are encoded by multiple linear directions emphasizing different cues, rather than a single shared direction.
[NLP-4] Correct Prediction Wrong Steps? Consensus Reasoning Knowledge Graph for Robust Chain-of-Thought Synthesis
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在推理过程中存在的两类复杂缺陷:Step Internal Flaws(如逻辑错误、幻觉等内部步骤错误)和 Step-wise Flaws(如过度思考或思考不足的步骤偏差),这些问题在不同样本间表现出显著差异。针对这一挑战,作者提出了一种统一框架 CRAFT,其核心创新在于构建一个基于多条候选推理路径共识部分的推理知识图谱(Reasoning Knowledge Graph, RKG),并通过拓扑生成方式合成高质量的推理轨迹,从而有效缓解上述两类缺陷。实验表明,该方法在多个逻辑与数学推理基准测试中均显著优于现有基线,平均提升标签预测准确率超10%,且在多维度上提升了LLM推理轨迹的质量。
链接: https://arxiv.org/abs/2604.14121
作者: Zipeng Ling,Shuliang Liu,Shenghong Fu,Yuehao Tang,Seonil Son,Yao Wan,Xuming Hu
机构: Hong Kong University of Science and Technology (Guangzhou); University of Pennsylvania; Huazhong University of Science and Technology; Hong Kong Polytechnic University; RLWRLD
类目: Computation and Language (cs.CL)
备注:
Abstract:LLM reasoning traces suffer from complex flaws – Step Internal Flaws (logical errors, hallucinations, etc.) and Step-wise Flaws (overthinking, underthinking), which vary by sample. A natural approach would be to provide ground-truth labels to guide LLMs’ reasoning. Contrary to intuition, we show that this yields no improvement in reasoning ability. We then propose CRAFT, a unified framework that mitigates both types of Step flaws, which builds a Reasoning Knowledge Graph (RKG) based on the consensus parts of multiple candidate traces, and synthesizes a high-quality trace through topological generation. Our approach improves label-prediction accuracy by 10+% on average, and consistently outperforms all baselines across both logical and mathematical reasoning benchmarks. Further, detailed benchmark evaluation proves that our method also improves the quality of LLMs’ reasoning traces in multiple dimensions.
[NLP-5] REX: Automating LLM Fine-tuning via Agent -Driven Tree-based Exploration
【速读】: 该论文旨在解决如何自动化复杂、现实世界中大型语言模型(Large Language Models, LLMs)训练全流程的问题,当前LLM研究代理虽能执行孤立科学任务,但尚无法有效完成从需求分析到模型训练与评估的端到端流程。其解决方案的关键在于提出TREX这一多智能体系统,通过“研究员”(Researcher)与“执行者”(Executor)两个核心模块的协同工作,实现需求解析、开放域文献与数据调研、训练策略制定、数据配方准备及模型训练与评估的全流程自动化;同时将多轮实验过程建模为搜索树结构,支持高效探索路径规划、历史结果复用和迭代试验中的高阶知识提炼,从而显著提升自动化训练的有效性与效率。
链接: https://arxiv.org/abs/2604.14116
作者: Zerun Ma,Guoqiang Wang,Xinchen Xie,Yicheng Chen,He Du,Bowen Li,Yanan Sun,Wenran Liu,Kai Chen,Yining Li
机构: Shanghai AI Laboratory(上海人工智能实验室); Fudan University(复旦大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:While Large Language Models (LLMs) have empowered AI research agents to perform isolated scientific tasks, automating complex, real-world workflows, such as LLM training, remains a significant challenge. In this paper, we introduce TREX, a multi-agent system that automates the entire LLM training life-cycle. By orchestrating collaboration between two core modules-the Researcher and the Executor-the system seamlessly performs requirement analysis, open-domain literature and data research, formulation of training strategies, preparation of data recipes, and model training and evaluation. The multi-round experimental process is modeled as a search tree, enabling the system to efficiently plan exploration paths, reuse historical results, and distill high-level insights from iterative trials. To evaluate the capability of automated LLM training, we construct FT-Bench, a benchmark comprising 10 tasks derived from real-world scenarios, ranging from optimizing fundamental model capabilities to enhancing performance on domain-specific tasks. Experimental results demonstrate that the TREX agent consistently optimizes model performance on target tasks.
[NLP-6] UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding
【速读】: 该论文旨在解决GUI grounding任务中因小图标和密集布局导致的界面元素定位困难问题,尤其针对现有测试时缩放(test-time zoom-in)方法在应用上存在的局限性——即对所有实例采用固定尺寸统一裁剪,未能根据模型对每个实例的实际不确定性动态调整。解决方案的关键在于提出一种无需训练的自适应缩放框架UI-Zoomer,其核心创新是将缩放触发条件与缩放尺度均建模为预测不确定性量化问题:通过一个基于置信度的门控机制融合空间一致性与token级生成置信度,仅在定位不确定时触发缩放;进一步地,利用全方差定律分解预测方差为样本间位置扩散与单样本框范围扩展,从而为每个实例自适应确定最优裁剪半径,实现精准、高效的局部高分辨率推理。
链接: https://arxiv.org/abs/2604.14113
作者: Fei Tang,Bofan Chen,Zhengxi Lu,Tongbo Chen,Songqin Nong,Tao Jiang,Wenhao Xu,Weiming Lu,Jun Xiao,Yueting Zhuang,Yongliang Shen
机构: 浙江大学(Zhejiang University)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Project Page: this https URL Code: this https URL
Abstract:GUI grounding, which localizes interface elements from screenshots given natural language queries, remains challenging for small icons and dense layouts. Test-time zoom-in methods improve localization by cropping and re-running inference at higher resolution, but apply cropping uniformly across all instances with fixed crop sizes, ignoring whether the model is actually uncertain on each case. We propose \textbfUI-Zoomer, a training-free adaptive zoom-in framework that treats both the trigger and scale of zoom-in as a prediction uncertainty quantification problem. A confidence-aware gate fuses spatial consensus among stochastic candidates with token-level generation confidence to selectively trigger zoom-in only when localization is uncertain. When triggered, an uncertainty-driven crop sizing module decomposes prediction variance into inter-sample positional spread and intra-sample box extent, deriving a per-instance crop radius via the law of total variance. Extensive experiments on ScreenSpot-Pro, UI-Vision, and ScreenSpot-v2 demonstrate consistent improvements over strong baselines across multiple model architectures, achieving gains of up to +13.4%, +10.3%, and +4.2% respectively, with no additional training required.
[NLP-7] Interpretable Stylistic Variation in Human and LLM Writing Across Genres Models and Decoding Strategies
【速读】: 该论文旨在解决生成式 AI (Generative AI) 生成文本与人类写作在风格上的差异问题,尤其是当前对机器生成文本的检测研究较多,但对其语言风格特征的理解仍较为有限。解决方案的关键在于通过大规模分析11个大型语言模型(LLMs)在8种不同体裁和4种解码策略下的文本,利用Douglas Biber提出的词汇语法和功能特征集进行系统性比较,从而揭示影响文本风格的核心因素。研究发现:模型类型对风格的影响大于解码策略,体裁比来源(人类或机器)更具影响力,且聊天类模型在风格空间中呈现聚集趋势,这些洞见为有意控制和优化LLM输出风格提供了实证依据。
链接: https://arxiv.org/abs/2604.14111
作者: Swati Rallapalli,Shannon Gallagher,Ronald Yurko,Tyler Brooks,Chuck Loughin,Michele Sezgin,Violet Turri
机构: Software Engineering Institute, AI Division, Carnegie Mellon University (卡内基梅隆大学软件工程研究所,人工智能部门); Department of Statistics Data Science, Carnegie Mellon University (卡内基梅隆大学统计与数据科学系)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large Language Models (LLMs) are now capable of generating highly fluent, human-like text. They enable many applications, but also raise concerns such as large scale spam, phishing, or academic misuse. While much work has focused on detecting LLM-generated text, only limited work has gone into understanding the stylistic differences between human-written and machine-generated text. In this work, we perform a large scale analysis of stylistic variation across human-written text and outputs from 11 LLMs spanning 8 different genres and 4 decoding strategies using Douglas Biber’s set of lexicogrammatical and functional features. Our findings reveal insights that can guide intentional LLM usage. First, key linguistic differentiators of LLM-generated text seem robust to generation conditions (e.g., prompt settings to nudge them to generate human-like text, or availability of human-written text to continue the style); second, genre exerts a stronger influence on stylistic features than the source itself; third, chat variants of the models generally appear to be clustered together in stylistic space, and finally, model has a larger effect on the style than decoding strategy, with some exceptions. These results highlight the relative importance of model and genre over prompting and decoding strategies in shaping the stylistic behavior of machine-generated text.
[NLP-8] From Weights to Activations: Is Steering the Next Frontier of Adaptation? ACL2026
【速读】: 该论文旨在解决当前语言模型后训练适应方法(如微调、参数高效适配和提示工程)与推理时激活值干预方法(即“steering”)缺乏统一理论框架的问题。其解决方案的关键在于提出一套功能性的适应标准,并据此将steering重新定义为一种独立的模型适应范式——它通过在激活空间中进行靶向干预,实现无需参数更新的局部且可逆的行为改变,从而推动建立一个涵盖所有适应方法的统一分类体系。
链接: https://arxiv.org/abs/2604.14090
作者: Simon Ostermann,Daniil Gurgurov,Tanja Baeumel,Michael A. Hedderich,Sebastian Lapuschkin,Wojciech Samek,Vera Schmitt
机构: Saarland University; German Research Center for Artificial Intelligence (DFKI); Centre for European Research in Trusted AI (CERTAIN); Center for Information and Language Processing, LMU Munich; Munich Center for Machine Learning; Fraunhofer Heinrich Hertz Institute; Technological University Dublin; Technische Universität Berlin; Berlin Institute for the Foundations of Learning and Data (BIFOLD)
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2026 (Main)
Abstract:Post-training adaptation of language models is commonly achieved through parameter updates or input-based methods such as fine-tuning, parameter-efficient adaptation, and prompting. In parallel, a growing body of work modifies internal activations at inference time to influence model behavior, an approach known as steering. Despite increasing use, steering is rarely analyzed within the same conceptual framework as established adaptation methods. In this work, we argue that steering should be regarded as a form of model adaptation. We introduce a set of functional criteria for adaptation methods and use them to compare steering approaches with classical alternatives. This analysis positions steering as a distinct adaptation paradigm based on targeted interventions in activation space, enabling local and reversible behavioral change without parameter updates. The resulting framing clarifies how steering relates to existing methods, motivating a unified taxonomy for model adaptation. Comments: Accepted to ACL 2026 (Main) Subjects: Computation and Language (cs.CL) Cite as: arXiv:2604.14090 [cs.CL] (or arXiv:2604.14090v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2604.14090 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[NLP-9] π-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data
【速读】: 该论文旨在解决深度搜索代理(Deep Search Agents)在训练过程中面临的稀疏奖励(sparse rewards)、信用分配困难(weak credit assignment)以及标注数据稀缺等问题。传统自对弈(self-play)方法仅依赖最终任务结果进行稀疏奖励优化,导致学习效率低下。其解决方案的关键在于发现并利用自对弈过程中自然生成的“问题构建路径”(Question Construction Path, QCP),这是一种捕捉逆向解题过程的中间产物,可作为教师模型的特权信息(privileged information)来密集监督学生模型的自蒸馏(self-distillation)。由此提出Privileged Information Self-Play(π-Play)框架,通过 examiner 与 teacher 模型协同生成带 QCP 的任务,使原本稀疏反馈的自对弈机制转变为密集反馈的自进化循环,从而显著提升训练效率与性能。
链接: https://arxiv.org/abs/2604.14054
作者: Yaocheng Zhang,Yuanheng Zhu,Wenyue Chong,Songjun Tu,Qichao Zhang,Jiajun Chai,Xiaohan Wang,Wei Lin,Guojun Yin,Dongbin Zhao
机构: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences (中国科学院大学交叉学科研究院); Meituan (美团); School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 26 pages, 12 figures
Abstract:Deep search agents have emerged as a promising paradigm for addressing complex information-seeking tasks, but their training remains challenging due to sparse rewards, weak credit assignment, and limited labeled data. Self-play offers a scalable route to reduce data dependence, but conventional self-play optimizes students only through sparse outcome rewards, leading to low learning efficiency. In this work, we observe that self-play naturally produces a question construction path (QCP) during task generation, an intermediate artifact that captures the reverse solution process. This reveals a new source of privileged information for self-distillation: self-play can itself provide high-quality privileged context for the teacher model in a low-cost and scalable manner, without relying on human feedback or curated privileged information. Leveraging this insight, we propose Privileged Information Self-Play ( \pi -Play), a multi-agent self-evolution framework. In \pi -Play, an examiner generates tasks together with their QCPs, and a teacher model leverages QCP as privileged context to densely supervise a student via self-distillation. This design transforms conventional sparse-reward self-play into a dense-feedback self-evolution loop. Extensive experiments show that data-free \pi -Play surpasses fully supervised search agents and improves evolutionary efficiency by 2-3 \times over conventional self-play.
[NLP-10] From Where Words Come: Efficient Regularization of Code Tokenizers Through Source Attribution
【速读】: 该论文旨在解决代码领域中分词器(tokenization)效率低下和安全性不足的问题,特别是由于训练数据在仓库(repository)和编程语言多样性上的不平衡,导致生成大量未被充分训练的“无用令牌”(unused tokens),这些令牌不仅浪费计算资源,还可能增加越狱攻击(jailbreak attacks)和幻觉(hallucinations)的风险。解决方案的关键在于提出一种名为源属性BPE(Source-Attributed BPE, SA-BPE)的方法,通过修改BPE(Byte Pair Encoding)的目标函数并引入合并跳过(merge skipping)机制,在不改变推理流程的前提下对BPE训练进行正则化,从而显著减少无用令牌数量,提升模型在生产环境中的效率与鲁棒性。
链接: https://arxiv.org/abs/2604.14053
作者: Pavel Chizhov,Egor Bogomolov,Ivan P. Yamshchikov
机构: CAIRO, Technical University of Applied Sciences Würzburg-Schweinfurt (CAIRO,维尔茨堡-施韦因富特应用技术大学); JetBrains Research (JetBrains 研究院); TU Delft (代尔夫特理工大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Efficiency and safety of Large Language Models (LLMs), among other factors, rely on the quality of tokenization. A good tokenizer not only improves inference speed and language understanding but also provides extra defense against jailbreak attacks and lowers the risk of hallucinations. In this work, we investigate the efficiency of code tokenization, in particular from the perspective of data source diversity. We demonstrate that code tokenizers are prone to producing unused, and thus under-trained, tokens due to the imbalance in repository and language diversity in the training data, as well as the dominance of source-specific, repetitive tokens that are often unusable in future inference. By modifying the BPE objective and introducing merge skipping, we implement different techniques under the name Source-Attributed BPE (SA-BPE) to regularize BPE training and minimize overfitting, thereby substantially reducing the number of under-trained tokens while maintaining the same inference procedure as with regular BPE. This provides an effective tool suitable for production use.
[NLP-11] Parameter Importance is Not Static: Evolving Parameter Isolation for Supervised Fine-Tuning
【速读】: 该论文旨在解决大规模语言模型在监督微调(Supervised Fine-Tuning, SFT)过程中普遍存在的任务干扰(task interference)和灾难性遗忘(catastrophic forgetting)问题。传统方法通过静态隔离关键参数来缓解上述问题,但其假设参数重要性在训练中保持不变,忽略了学习动态变化的本质。论文的关键创新在于提出演化参数隔离(Evolving Parameter Isolation, EPI)框架,其核心机制是基于梯度信号在线估计参数重要性,并周期性地更新隔离掩码(isolation mask),从而动态保护新出现的任务关键参数,同时释放已过时的参数以恢复模型的可塑性(plasticity)。实验表明,EPI在多任务基准上显著优于静态隔离和标准微调,在减少干扰与遗忘的同时提升了泛化性能。
链接: https://arxiv.org/abs/2604.14010
作者: Zekai Lin,Chao Xue,Di Liang,Xingsheng Han,Peiyang Liu,Xianjie Wu,Lei Jiang,Yu Lu,Haibo Shi,Shuang Liang,Minlong Peng
机构: Tencent Hunyuan(腾讯混元); Tencent Yuanbao(腾讯元包); Peking University(北京大学); UESTC, China(电子科技大学, 中国); University of New South Wales(新南威尔士大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:Supervised Fine-Tuning (SFT) of large language models often suffers from task interference and catastrophic forgetting. Recent approaches alleviate this issue by isolating task-critical parameters during training. However, these methods represent a static solution to a dynamic problem, assuming that parameter importance remains fixed once identified. In this work, we empirically demonstrate that parameter importance exhibits temporal drift over the course of training. To address this, we propose Evolving Parameter Isolation (EPI), a fine-tuning framework that adapts isolation decisions based on online estimates of parameter importance. Instead of freezing a fixed subset of parameters, EPI periodically updates isolation masks using gradient-based signals, enabling the model to protect emerging task-critical parameters while releasing outdated ones to recover plasticity. Experiments on diverse multi-task benchmarks demonstrate that EPI consistently reduces interference and forgetting compared to static isolation and standard fine-tuning, while improving overall generalization. Our analysis highlights the necessity of synchronizing isolation mechanisms with the evolving dynamics of learning diverse abilities.
[NLP-12] Memory Transfer Learning: How Memories are Transferred Across Domains in Coding Agents
【速读】: 该论文旨在解决当前基于记忆的自进化编码代理(coding agents)在实际应用中受限于单一任务领域的问题,即现有方法通常仅在同质任务域内使用记忆,未能充分利用跨不同真实世界编码问题所共有的基础设施(如运行环境和编程语言)。其解决方案的关键在于提出记忆迁移学习(Memory Transfer Learning, MTL),通过构建来自异构领域的统一记忆池来实现跨域知识迁移。实验表明,这种机制能提升平均性能3.7%,主要得益于元知识(如验证流程)的迁移而非具体代码片段;同时发现抽象层次决定了迁移效果——高层级抽象信息具有良好的泛化能力,而低层级的具体执行轨迹常导致负迁移;此外,记忆迁移的有效性随记忆池规模增大而增强,并且可在不同模型间实现迁移,为拓展记忆利用范围提供了实证设计原则。
链接: https://arxiv.org/abs/2604.14004
作者: Kangsan Kim,Minki Kang,Taeil Kim,Yanlai Yang,Mengye Ren,Sung Ju Hwang
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Preprint
Abstract:Memory-based self-evolution has emerged as a promising paradigm for coding agents. However, existing approaches typically restrict memory utilization to homogeneous task domains, failing to leverage the shared infrastructural foundations, such as runtime environments and programming languages, that exist across diverse real-world coding problems. To address this limitation, we investigate \textbfMemory Transfer Learning (MTL) by harnessing a unified memory pool from heterogeneous domains. We evaluate performance across 6 coding benchmarks using four memory representations, ranging from concrete traces to abstract insights. Our experiments demonstrate that cross-domain memory improves average performance by 3.7%, primarily by transferring meta-knowledge, such as validation routines, rather than task-specific code. Importantly, we find that abstraction dictates transferability; high-level insights generalize well, whereas low-level traces often induce negative transfer due to excessive specificity. Furthermore, we show that transfer effectiveness scales with the size of the memory pool, and memory can be transferred even between different models. Our work establishes empirical design principles for expanding memory utilization beyond single-domain silos. Project page: this https URL
[NLP-13] Diffusion Language Models for Speech Recognition
【速读】: 该论文旨在解决自动语音识别(ASR)中语言模型建模能力不足的问题,尤其是在提升识别准确率方面。其核心解决方案是引入扩散语言模型(Diffusion Language Models, DLMs),特别是掩码扩散语言模型(Masked Diffusion Language Models, MDLMs)和均匀状态扩散模型(Uniform-State Diffusion Models, USDMs),用于对ASR候选结果进行重打分(rescoring)。关键创新在于设计了一种联合解码方法,将连接时序分类(CTC)的帧级概率分布与USDM的标签级概率分布在每一步解码中融合,从而生成结合强语言先验知识与声学信息的新候选序列,显著提升了ASR系统的准确性。
链接: https://arxiv.org/abs/2604.14001
作者: Davyd Naveriani,Albert Zeyer,Ralf Schlüter,Hermann Ney
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注:
Abstract:Diffusion language models have recently emerged as a leading alternative to standard language models, due to their ability for bidirectional attention and parallel text generation. In this work, we explore variants for their use in speech recognition. Specifically, we introduce a comprehensive guide to incorporating masked diffusion language models (MDLM) and uniform-state diffusion models (USDMs) for rescoring ASR hypotheses. Additionally, we design a new joint-decoding method that combines CTC and USDM by integrating the framewise probability distributions derived from CTC with the labelwise probability distributions computed by USDM at each decoding step, thereby generating new candidates that combine strong language knowledge from USDM and acoustic information from CTC. Our findings reveal that USDM, as well as MDLM, can significantly improve the accuracy of recognized text. We publish all our code and recipes.
[NLP-14] Reward Design for Physical Reasoning in Vision-Language Models
【速读】: 该论文旨在解决当前视觉语言模型(Vision Language Models, VLMs)在物理推理任务中表现远低于人类水平的问题,特别是如何通过奖励设计优化VLM的多步骤符号推理能力。其解决方案的关键在于系统性地评估不同语义丰富度的奖励信号对GRPO(Group Relative Policy Optimization)训练框架下VLM物理推理行为的影响,发现奖励类型显著影响模型在不同物理领域中的推理模式:基于准确率的奖励带来整体最优性能,基于评分量表(rubric)的奖励提升结构化推理质量但不保证准确率提升,而基于模型注意力权重的内部奖励无需空间标注即可显著增强空间关系推理能力(从0.27提升至0.50),表明监督模型生成过程中的关注区域是提升视觉接地物理推理的有效路径。
链接: https://arxiv.org/abs/2604.13993
作者: Derek Lilienthal,Manisha Mukherjee,Sameera Horawalavithana
机构: Pacific Northwest National Laboratory (太平洋西北国家实验室)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Physical reasoning over visual inputs demands tight integration of visual perception, domain knowledge, and multi-step symbolic inference. Yet even state-of-the-art Vision Language Models (VLMs) fall far short of human performance on physics benchmarks. While post-training algorithms such as Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) have demonstrated strong reasoning gains in language models, how reward design shapes VLM physical reasoning behavior remains poorly understood. We present a systematic reward ablation study for GRPO-based VLM training on physical reasoning. We compare four reward signals of increasing semantic richness: format compliance, answer accuracy, a composite rubric reward (answer correctness, physics principle identification, and unit consistency), and a novel internal reward derived from model attention weights over input image regions. We evaluate on PhyX, a 3,000-problem benchmark spanning six physics domains and six reasoning types across multiple-choice and open-ended formats, using IBM Granite Vision 3.3 (2B). Across both formats, GRPO with accuracy-based rewards outperforms SFT on most domains, though gains vary substantially by reward type and domain. Reward design does not uniformly improve performance. Instead, it induces domain-specific reasoning behaviors. Accuracy-based rewards provide the strongest overall gains. Rubric rewards improve structured reasoning quality without consistent accuracy improvements. Attention-based rewards enhance spatial reasoning while degrading performance in symbolic domains. Our internal attention-weight reward requires no spatial annotations and improves spatial relation accuracy from 0.27 to 0.50, suggesting that supervising where the model attends during generation is a promising direction for visually grounded physical reasoning.
[NLP-15] Adaptive Conformal Prediction for Improving Factuality of Generations by Large Language Models
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)生成内容存在事实性错误的问题,尤其是现有基于置信度校准(conformal prediction)的方法缺乏提示自适应性(prompt-adaptive),导致在不同输入下覆盖范围不一致——即过覆盖(over-coverage)或欠覆盖(under-coverage)。其解决方案的关键在于提出一种自适应的置信度校准方法,通过扩展传统的置信度分数变换(conformal score transformation)技术来适配LLMs,并实现提示依赖的校准(prompt-dependent calibration)。该方法在保持边际覆盖保证(marginal coverage guarantees)的同时显著提升条件覆盖性能(conditional coverage),并天然支持选择性预测(selective prediction),从而在下游任务中过滤不可靠输出。
链接: https://arxiv.org/abs/2604.13991
作者: Aleksandr Rubashevskii,Dzianis Piatrashyn,Preslav Nakov,Maxim Panov
机构: Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Large language models (LLMs) are prone to generating factually incorrect outputs. Recent work has applied conformal prediction to provide uncertainty estimates and statistical guarantees for the factuality of LLM generations. However, existing approaches are typically not prompt-adaptive, limiting their ability to capture input-dependent variability. As a result, they may filter out too few items (leading to over-coverage) or too many (under-coverage) for a given task or prompt. We propose an adaptive conformal prediction approach that extends conformal score transformation methods to LLMs, with applications to long-form generation and multiple-choice question answering. This enables prompt-dependent calibration, retaining marginal coverage guarantees while improving conditional coverage. In addition, the approach naturally supports selective prediction, allowing unreliable claims or answer choices to be filtered out in downstream applications. We evaluate our approach on multiple white-box models across diverse domains and show that it significantly outperforms existing baselines in terms of conditional coverage.
[NLP-16] Leverag ing LLM -GNN Integration for Open-World Question Answering over Knowledge Graphs EACL ACL
【速读】: 该论文旨在解决开放世界知识图谱问答(Open-world Question Answering over Knowledge Graphs, OW-QA)中的关键挑战,即在知识图谱不完整或动态演化的情况下,如何有效结合结构化推理与语义理解来准确回答问题。传统闭世界假设限制了实际应用,而现有混合方法要么依赖结构嵌入缺乏语义基础,要么假设路径完整或图结构完备,难以应对缺失链接或多跳推理场景。解决方案的关键在于提出GLOW系统,通过预训练图神经网络(Graph Neural Networks, GNNs)从图结构中预测top-k候选答案,并将这些候选答案及相关KG事实以结构化提示(structured prompt)形式输入大语言模型(Large Language Models, LLMs),从而实现符号与语义信号的联合推理,无需额外检索或微调即可提升开放世界下的问答准确性。
链接: https://arxiv.org/abs/2604.13979
作者: Hussein Abdallah,Ibrahim Abdelaziz,Panos Kalnis,Essam Mansour
机构: Concordia University (康考迪亚大学); IBM (国际商业机器公司); KAUST (沙特阿卜杜拉国王科技大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注: 18 pages,6 figures,10 tables. this https URL
Abstract:Open-world Question Answering (OW-QA) over knowledge graphs (KGs) aims to answer questions over incomplete or evolving KGs. Traditional KGQA assumes a closed world where answers must exist in the KG, limiting real-world applicability. In contrast, open-world QA requires inferring missing knowledge based on graph structure and context. Large language models (LLMs) excel at language understanding but lack structured reasoning. Graph neural networks (GNNs) model graph topology but struggle with semantic interpretation. Existing systems integrate LLMs with GNNs or graph retrievers. Some support open-world QA but rely on structural embeddings without semantic grounding. Most assume observed paths or complete graphs, making them unreliable under missing links or multi-hop reasoning. We present GLOW, a hybrid system that combines a pre-trained GNN and an LLM for open-world KGQA. The GNN predicts top-k candidate answers from the graph structure. These, along with relevant KG facts, are serialized into a structured prompt (e.g., triples and candidates) to guide the LLM’s reasoning. This enables joint reasoning over symbolic and semantic signals, without relying on retrieval or fine-tuning. To evaluate generalization, we introduce GLOW-BENCH, a 1,000-question benchmark over incomplete KGs across diverse domains. GLOW outperforms existing LLM-GNN systems on standard benchmarks and GLOW-BENCH, achieving up to 53.3% and an average 38% improvement. GitHub code and data are available.
[NLP-17] How Can We Synthesize High-Quality Pretraining Data? A Systematic Study of Prompt Design Generator Model and Source Data
【速读】: 该论文旨在解决大规模语言模型训练中合成数据(synthetic data)设计维度不明确的问题,特别是 rephrasing 策略、生成模型规模和原始数据来源对合成数据质量的影响缺乏系统性比较。其解决方案的关键在于通过控制变量的大量实验(生成超一万亿 token),发现结构化输出格式(如表格、数学问题、常见问题解答和教程)显著优于传统方法,并且生成模型参数量超过 10 亿时不再带来性能提升;同时指出原始数据混合策略对最终效果影响显著。基于这些发现,作者构建了 4860 亿 token 的开源数据集 FinePhrase,该数据集在性能上超越现有合成数据基线,同时将生成成本降低至原先的 1/30。
链接: https://arxiv.org/abs/2604.13977
作者: Joel Niklaus,Atsuki Yamaguchi,Michal Štefánik,Guilherme Penedo,Hynek Kydlíček,Elie Bakouch,Lewis Tunstall,Edward Emanuel Beeching,Thibaud Frere,Colin Raffel,Leandro von Werra,Thomas Wolf
机构: Hugging Face; University of Sheffield; National Institute of Informatics
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Synthetic data is a standard component in training large language models, yet systematic comparisons across design dimensions, including rephrasing strategy, generator model, and source data, remain absent. We conduct extensive controlled experiments, generating over one trillion tokens, to identify critical factors in rephrasing web text into synthetic pretraining data. Our results reveal that structured output formats, such as tables, math problems, FAQs, and tutorials, consistently outperform both curated web baselines and prior synthetic methods. Notably, increasing the size of the generator model beyond 1B parameters provides no additional benefit. Our analysis also demonstrates that the selection of the original data used for mixing substantially influences performance. By applying our findings, we develop \textbf\textscFinePhrase, a 486-billion-token open dataset of rephrased web text. We show that \textscFinePhrase outperforms all existing synthetic data baselines while reducing generation costs by up to 30 times. We provide the dataset, all prompts, and the generation framework to the research community.
[NLP-18] Causal Drawbridges: Characterizing Gradient Blocking of Syntactic Islands in Transformer LMs
【速读】: 该论文旨在解决英语语法理论中长期存在的“句法岛屿”(syntactic islands)问题,特别是协调动词短语中的提取现象:尽管从并列动词结构中提取通常被感知为不自然,但人类对这类句子的接受度呈现梯度变化(如“I know what he hates art and loves”比“I know what he looked down and saw”更可接受)。解决方案的关键在于运用因果干预(causal interventions)技术,识别Transformer模型中与语法功能相关的子空间(如注意力模块、MLP层和Transformer块),发现提取行为依赖于与标准wh-依存相同的填空-缺口机制(filler-gap mechanisms),但这些机制在不同构式中受到不同程度的抑制。通过将大规模无关语料投影到这些因果识别的子空间,研究进一步提出一个新颖的语言学假设:连词“and”在可提取与不可提取结构中具有不同的表征——前者体现关系依赖,后者仅为纯粹的联合用法,从而揭示了生成式AI(Generative AI)模型在语法机制解释中的潜力。
链接: https://arxiv.org/abs/2604.13950
作者: Sasha Boguraev,Kyle Mahowald
机构: 未知
类目: Computation and Language (cs.CL)
备注: 19 pages, 7 figures, 3 tables
Abstract:We show how causal interventions in Transformer models provide insights into English syntax by focusing on a long-standing challenge for syntactic theory: syntactic islands. Extraction from coordinated verb phrases is often degraded, yet acceptability varies gradiently with lexical content (e.g., “I know what he hates art and loves” vs. “I know what he looked down and saw”). We show that modern Transformer language models replicate human judgments across this gradient. Using causal interventions that isolate functionally relevant subspaces in Transformer blocks, attention modules, and MLPs, we demonstrate that extraction from coordination islands engages the same filler-gap mechanisms as canonical wh-dependencies, but that these mechanisms are selectively blocked to varying degrees. By projecting a large corpus of unrelated text onto these causally identified subspaces, we derive a novel linguistic hypothesis: the conjunction “and” is represented differently in extractable versus non-extractable constructions, corresponding to expressions encoding relational dependencies versus purely conjunctive uses. These results illustrate how mechanistic interpretability can inform syntax, generating new hypotheses about linguistic representation and processing.
[NLP-19] CollabCoder: Plan-Code Co-Evolution via Collaborative Decision-Making for Efficient Code Generation ACL2026
【速读】: 该论文旨在解决自动化代码生成(Automated Code Generation)在软件工程中长期存在的挑战,包括传统多智能体框架因静态规划、孤立执行、高计算开销以及对复杂任务适应性不足而导致的性能瓶颈。其解决方案的关键在于提出一种名为CollabCoder的新型“计划-编码协同演化”(Plan-Code Co-Evolution)框架,通过设计计划模块与代码模块之间的协同决策机制,动态决定调试过程中由哪个模块主导执行,从而提升代码质量与鲁棒性,同时显著降低计算资源消耗,尤其在高难度基准测试(如LiveCodeBench和xCodeEval)上表现出更强的效率优势。
链接: https://arxiv.org/abs/2604.13946
作者: Duy Tung Doan,Quang Huy Phung,Dzung Nguyen,Khac-Hoai Nam Bui
机构: 未知
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL)
备注: ACL 2026 Findings
Abstract:Automated code generation remains a persistent challenge in software engineering, as conventional multi-agent frameworks are often constrained by static planning, isolated execution, high computational overhead, and limited adaptability to complex tasks. This paper introduces CollabCoder, a novel Plan-Code Co-Evolution framework that improves code generation through dynamic multi-agent collaboration. The core idea is to design a collaborative decision-making process between the plan module and the code module to decide which module should be executed for the debugging process. Extensive experiments on widely used benchmarks demonstrate that CollabCoder consistently improves code quality and robustness across tasks. Importantly, CollabCoder achieves performance comparable to or exceeding current state-of-the-art methods while reducing computational overhead, with efficiency gains becoming more pronounced as benchmark difficulty increases. On the more challenging LiveCodeBench and xCodeEval benchmarks, our approach improves performance by 11-20% over strong baselines while reducing the number of API calls by an average of 4-10 per execution.
[NLP-20] Do We Still Need Humans in the Loop? Comparing Human and LLM Annotation in Active Learning for Hostility Detection
【速读】: 该论文旨在解决两个核心问题:其一,在主动学习(Active Learning, AL)框架中,生成式 AI(Generative AI)标注是否可替代人工标注;其二,当整个语料库可低成本批量标注时,主动学习是否仍具必要性。解决方案的关键在于通过构建一个包含277,902条德语政治类TikTok评论的大规模数据集(其中25,974条由GPT-5.2标注,5,000条由人工标注),系统比较七种标注策略在四种编码器上的表现,用于检测反移民敌意。研究发现,基于LLM标注的分类器在F1-Macro指标上可媲美人类标注模型,但在错误结构上存在系统性偏差——LLM模型倾向于高估正类样本,尤其在主题模糊、难以区分反移民情绪与政策批评的讨论中更为显著。因此,标注策略的选择不应仅依赖聚合性能指标(如F1),而应结合具体应用场景对误差模式的容忍度进行权衡。
链接: https://arxiv.org/abs/2604.13899
作者: Ahmad Dawar Hakimi,Lea Hirlimann,Isabelle Augenstein,Hinrich Schütze
机构: LMU Munich (慕尼黑路德维希马克西米利安大学); University of Copenhagen (哥本哈根大学); Munich Center for Machine Learning (慕尼黑机器学习中心)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Instruction-tuned LLMs can annotate thousands of instances from a short prompt at negligible cost. This raises two questions for active learning (AL): can LLM labels replace human labels within the AL loop, and does AL remain necessary when entire corpora can be labelled at once? We investigate both questions on a new dataset of 277,902 German political TikTok comments (25,974 LLM-labelled, 5,000 human-annotated), comparing seven annotation strategies across four encoders to detect anti-immigrant hostility. A classifier trained on 25,974 GPT-5.2 labels (\ 43) achieves comparable F1-Macro to one trained on 3,800 human annotations (\ 316). Active learning offers little advantage over random sampling in our pre-enriched pool and delivers lower F1 than full LLM annotation at the same cost. However, comparable aggregate F1 masks a systematic difference in error structure: LLM-trained classifiers over-predict the positive class relative to the human gold standard. This divergence concentrates in topically ambiguous discussions where the distinction between anti-immigrant hostility and policy critique is most subtle, suggesting that annotation strategy should be guided not by aggregate F1 alone but by the error profile acceptable for the target application.
[NLP-21] Beyond Static Personas: Situational Personality Steering for Large Language Models ACL2026
【速读】: 该论文旨在解决个性化大语言模型(Personalized Large Language Models, PLMs)在实际应用中面临的可控性差、资源消耗高以及静态人格建模导致的情境适应能力弱等问题。其核心解决方案是提出一种无需训练的神经元级情境人格调控框架IRIS,关键在于通过多角度分析发现人格神经元中的情境依赖性和一致的行为模式,并基于此实现情境感知的神经元识别、检索与相似度加权引导,从而在不改变模型参数的前提下,动态调整模型行为以适配不同情境,显著提升模型在复杂未见场景下的泛化能力和鲁棒性。
链接: https://arxiv.org/abs/2604.13846
作者: Zesheng Wei,Mengxiang Li,Zilei Wang,Yang Deng
机构: University of Science and Technology of China (中国科学技术大学); Singapore Management University (新加坡管理大学)
类目: Computation and Language (cs.CL)
备注: Appectped to Findings of ACL2026
Abstract:Personalized Large Language Models (LLMs) facilitate more natural, human-like interactions in human-centric applications. However, existing personalization methods are constrained by limited controllability and high resource demands. Furthermore, their reliance on static personality modeling restricts adaptability across varying situations. To address these limitations, we first demonstrate the existence of situation-dependency and consistent situation-behavior patterns within LLM personalities through a multi-perspective analysis of persona neurons. Building on these insights, we propose IRIS, a training-free, neuron-based Identify-Retrieve-Steer framework for advanced situational personality steering. Our approach comprises situational persona neuron identification, situation-aware neuron retrieval, and similarity-weighted steering. We empirically validate our framework on PersonalityBench and our newly introduced SPBench, a comprehensive situational personality benchmark. Experimental results show that our method surpasses best-performing baselines, demonstrating IRIS’s generalization and robustness to complex, unseen situations and different models architecture.
[NLP-22] Robust Reward Modeling for Large Language Models via Causal Decomposition
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在奖励建模(Reward Modeling, RM)过程中对虚假线索(spurious cues)如响应长度和过度讨好语气的过拟合问题,这些问题会导致模型偏离输入提示(prompt)的真实意图。解决方案的关键在于引入一个解码器(decoder),该解码器将候选回答映射到输入提示的潜在意图嵌入(latent intent embedding),并通过重建误差(reconstruction error)作为正则化信号来指导奖励模型训练。该信号理论上能强化与提示相关的特征信息,同时抑制与提示无关的捷径(prompt-independent shortcuts),从而提升奖励模型对真实偏好信号的敏感性与鲁棒性。
链接: https://arxiv.org/abs/2604.13833
作者: Yunsheng Lu,Zijiang Yang,Licheng Pan,Zhixuan Chu
机构: Zhejiang University (浙江大学); University of Chicago (芝加哥大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Reward models are central to aligning large language models, yet they often overfit to spurious cues such as response length and overly agreeable tone. Most prior work weakens these cues directly by penalizing or controlling specific artifacts, but it does not explicitly encourage the model to ground preferences in the prompt’s intent. We learn a decoder that maps a candidate answer to the latent intent embedding of the input. The reconstruction error is used as a signal to regularize the reward model training. We provide theoretical evidence that this signal emphasizes prompt-dependent information while suppressing prompt-independent shortcuts. Across math, helpfulness, and safety benchmarks, the decoder selects shorter and less sycophantic candidates with 0.877 accuracy. Incorporating this signal into RM training in Gemma-2-2B-it and Gemma-2-9B-it increases RewardBench accuracy from 0.832 to 0.868. For Best-of-N selection, our framework increases length-controlled win rates while producing shorter outputs, and remains robust to lengthening and mild off-topic drift in controlled rewrite tests.
[NLP-23] MUSE: Multi-Domain Chinese User Simulation via Self-Evolving Profiles and Rubric-Guided Alignment
【速读】: 该论文旨在解决现有用户模拟器在多领域中文场景下存在的三大问题:浅层用户画像建模、长对话中人格一致性难以维持,以及局限于英语或单一领域设置。其核心解决方案是提出MUSE框架,关键在于引入迭代式用户画像自我演化(Iterative Profile Self-Evolution, IPSE)机制,通过对比仿真轨迹与真实对话行为的差异进行推理优化;同时结合角色反转监督微调(Role-Reversal Supervised Fine-Tuning)提升局部响应的真实性与人类表达自然度,并进一步训练基于评分标准的奖励模型(rubric-based reward model),并将其融入基于评分指导的多轮强化学习(rubric-guided multi-turn reinforcement learning),从而实现细粒度的行为对齐和长程交互一致性优化。
链接: https://arxiv.org/abs/2604.13828
作者: Zihao Liu,Hantao Zhou,Jiguo Li,Jun Xu,Jiuchong Gao,Jinghua Hao,Renqing He,Peng Wang
机构: Fudan University (复旦大学); Meituan (美团)
类目: Computation and Language (cs.CL)
备注:
Abstract:User simulators are essential for the scalable training and evaluation of interactive AI systems. However, existing approaches often rely on shallow user profiling, struggle to maintain persona consistency over long interactions, and are largely limited to English or single-domain settings. We present MUSE, a multi-domain Chinese user simulation framework designed to generate human-like, controllable, and behaviorally consistent responses. First, we propose Iterative Profile Self-Evolution (IPSE), which gradually optimizes user profiles by comparing and reasoning discrepancies between simulated trajectories and real dialogue behaviors. We then apply Role-Reversal Supervised Fine-Tuning to improve local response realism and human-like expression. To enable fine-grained behavioral alignment, we further train a specialized rubric-based reward model and incorporate it into rubric-guided multi-turn reinforcement learning, which optimizes the simulator at the dialogue level and enhances long-horizon behavioral consistency. Experiments show that MUSE consistently outperforms strong baselines in both utterance-level and session-level evaluations, generating responses that are more realistic, coherent, and persona-consistent over extended interactions.
[NLP-24] oolOmni: Enabling Open-World Tool Use via Agent ic learning with Proactive Retrieval and Grounded Execution ACL2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在开放世界场景下使用外部工具时面临的两大挑战:一是现有方法依赖静态嵌入检索或参数化记忆工具,难以准确对齐用户意图与工具语义;二是缺乏对未见过工具的泛化能力,导致工具检索和执行准确率低下。解决方案的关键在于提出一个统一的代理框架 ToolOmni,其核心创新包括:首先构建冷启动多轮交互数据集,通过监督微调(Supervised Fine-Tuning, SFT)赋予模型基础代理能力;其次设计基于解耦多目标广义策略优化(Decoupled Multi-Objective GRPO)的开放世界工具学习机制,在在线环境中协同优化工具检索准确率与执行有效性,从而实现主动检索与 grounded 执行的推理循环。实验表明,ToolOmni 在端到端执行成功率上相比强基线提升 10.8%,并展现出卓越的鲁棒性和泛化能力。
链接: https://arxiv.org/abs/2604.13787
作者: Shouzheng Huang,Meishan Zhang,Baotian Hu,Min Zhang
机构: Harbin Institute of Technology (Shenzhen) (哈尔滨工业大学(深圳))
类目: Computation and Language (cs.CL)
备注: 19 pages, 9 figures, 9 tables, accepted to ACL 2026
Abstract:Large Language Models (LLMs) enhance their problem-solving capability by utilizing external tools. However, in open-world scenarios with massive and evolving tool repositories, existing methods relying on static embedding retrieval or parameter memorization of tools struggle to align user intent with tool semantics or generalize to unseen tools, respectively, leading to suboptimal accuracy of open-world tool retrieval and execution. To address these, we present ToolOmni, a unified agentic framework that enables LLMs for open-world tool use by proactive retrieval and grounded execution within a reasoning loop. First, we construct a cold-start multi-turn interaction dataset to instill foundational agentic capabilities via Supervised Fine-Tuning (SFT). Then, we introduce open-world tool learning based on a Decoupled Multi-Objective GRPO algorithm, which simultaneously optimizes LLMs for both tool retrieval accuracy and execution efficacy in online environments. Extensive experiments demonstrate that ToolOmni achieves state-of-the-art performance both in retrieval and execution, surpassing strong baselines by a significant margin of +10.8% in end-to-end execution success rate, while exhibiting exceptional robustness and generalization capabilities.
[NLP-25] QuantileMark: A Message-Symmetric Multi-bit Watermark for LLM s
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在内容生成中实现多比特水印(multi-bit watermarking)时面临的消息对称性(message symmetry)问题,即水印嵌入过程应避免因消息内容不同而导致文本质量或验证结果的系统性偏差。现有基于词汇划分(vocabulary-partition)的水印方法在低熵解码场景下会破坏对称性:某些消息被分配高概率质量区域,而其他消息被迫使用尾部token(tail tokens),导致嵌入质量和检测准确性依赖于具体消息内容。解决方案的关键在于提出QuantileMark——一种白盒多比特水印机制,其核心创新是将水印嵌入从离散词汇空间转移到连续累积概率区间 [0, 1),并在每一步将该区间划分为 M 个等质量子区间(equal-mass bins),严格从目标符号对应的bin中采样,从而保证无论上下文熵如何变化,每个符号均享有固定的 1/M 概率预算。此设计实现了消息无偏性(message-unbiasedness),理论上保障了生成侧的对称性,并通过均匀证据强度提升检测侧鲁棒性,实验表明其在C4续写和LFQA任务上显著优于强基线,同时对生成质量影响可忽略。
链接: https://arxiv.org/abs/2604.13786
作者: Junlin Zhu,Baizhou Huang,Xiaojun Wan
机构: Wangxuan Institute of Computer Technology, Peking University (北京大学王选计算机研究所)
类目: Computation and Language (cs.CL)
备注:
Abstract:As large language models become standard backends for content generation, practical provenance increasingly requires multi-bit watermarking. In provider-internal deployments, a key requirement is message symmetry: the message itself should not systematically affect either text quality or verification outcomes. Vocabulary-partition watermarks can break message symmetry in low-entropy decoding: some messages are assigned most of the probability mass, while others are forced to use tail tokens. This makes embedding quality and message decoding accuracy message-dependent. We propose QuantileMark, a white-box multi-bit watermark that embeds messages within the continuous cumulative probability interval [0, 1) . At each step, QuantileMark partitions this interval into M equal-mass bins and samples strictly from the bin assigned to the target symbol, ensuring a fixed 1/M probability budget regardless of context entropy. For detection, the verifier reconstructs the same partition under teacher forcing, computes posteriors over latent bins, and aggregates evidence for verification. We prove message-unbiasedness, a property ensuring that the base distribution is recovered when averaging over messages. This provides a theoretical foundation for generation-side symmetry, while the equal-mass design additionally promotes uniform evidence strength across messages on the detection side. Empirical results on C4 continuation and LFQA show improved multi-bit recovery and detection robustness over strong baselines, with negligible impact on generation quality. Our code is available at GitHub (this https URL).
[NLP-26] From Anchors to Supervision: Memory-Graph Guided Corpus-Free Unlearning for Large Language Models
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)可能记忆敏感或受版权保护内容所带来的隐私与法律风险,尤其是现有机器遗忘(machine unlearning)方法依赖用户提供的遗忘集合(forget sets),导致请求难以审计、易引发二次泄露及恶意滥用的问题。其解决方案的关键在于提出MAGE框架——一种基于记忆图引导的擦除机制,仅需用户提供一个轻量级锚点(user anchor)即可定位目标实体,通过探测目标模型恢复相关记忆并构建加权局部记忆图,进而自动生成限定范围的监督信号用于遗忘训练;该方法无需原始训练语料库,具备模型无关性且可集成至标准遗忘流程,实验证明其生成的自监督信号能实现与外部参考监督相当的遗忘效果,同时保持模型整体性能不受显著影响。
链接: https://arxiv.org/abs/2604.13777
作者: Wenxuan Li,Zhenfei Zhang,Mi Zhang,Geng Hong,Mi Wen,Xiaoyu You,Min Yang
机构: Fudan University (复旦大学); Shanghai University of Electric Power (上海电力大学); East China University of Science and Technology (华东理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 15 pages, appendix included
Abstract:Large language models (LLMs) may memorize sensitive or copyrighted content, raising significant privacy and legal concerns. While machine unlearning has emerged as a potential remedy, prevailing paradigms rely on user-provided forget sets, making unlearning requests difficult to audit and exposing systems to secondary leakage and malicious abuse. We propose MAGE, a Memory-grAph Guided Erasure framework for user-minimized, corpus-free unlearning. Given only a lightweight user anchor that identifies a target entity, MAGE probes the target LLM to recover target-related memorization, organizes it into a weighted local memory graph, and synthesizes scoped supervision for unlearning. MAGE is model-agnostic, can be plugged into standard unlearning methods, and requires no access to the original training corpus. Experiments on two benchmarks, TOFU and RWKU, demonstrate that MAGE’s self-generated supervision achieves effective unlearning performance comparable to supervision generated with external reference, while preserving overall utility. These results support a practical and auditable unlearning workflow driven by minimal anchors rather than user-supplied forget corpora.
[NLP-27] Who Gets Flagged? The Pluralistic Evaluation Gap in AI Content Watermarking
【速读】: 该论文旨在解决生成式 AI (Generative AI) 内容水印技术在多语言、跨文化及不同人口群体中存在偏见的问题,即水印信号的强度、可检测性和鲁棒性因内容本身的统计特性差异而异,导致公平性不足。其解决方案的关键在于提出三项用于多元水印基准测试的具体评估维度:跨语言检测一致性(cross-lingual detection parity)、文化多样性内容覆盖(culturally diverse content coverage)以及检测指标的人口学细分(demographic disaggregation),并强调水印验证层应与生成模型一样接受同等严格的偏见审计。
链接: https://arxiv.org/abs/2604.13776
作者: Alexander Nemecek,Osama Zafar,Yuqiao Xu,Wenbiao Li,Erman Ayday
机构: Case Western Reserve University (凯斯西储大学)
类目: Computers and Society (cs.CY); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages
Abstract:Watermarking is becoming the default mechanism for AI content authentication, with governance policies and frameworks referencing it as infrastructure for content provenance. Yet across text, image, and audio modalities, watermark signal strength, detectability, and robustness depend on statistical properties of the content itself, properties that vary systematically across languages, cultural visual traditions, and demographic groups. We examine how this content dependence creates modality-specific pathways to bias. Reviewing the major watermarking benchmarks across modalities, we find that, with one exception, none report performance across languages, cultural content types, or population groups. To address this, we propose three concrete evaluation dimensions for pluralistic watermark benchmarking: cross-lingual detection parity, culturally diverse content coverage, and demographic disaggregation of detection metrics. We connect these to the governance frameworks currently mandating watermarking deployment and show that watermarking is held to a lower fairness standard than the generative systems it is meant to govern. Our position is that evaluation must precede deployment, and that the same bias auditing requirements applied to AI models should extend to the verification layer.
[NLP-28] MedRCube: A Multidimensional Framework for Fine-Grained and In-Depth Evaluation of MLLM s in Medical Imaging
【速读】: 该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在医学影像领域评估体系不足的问题,现有方法通常仅报告单一或粗粒度指标,缺乏对临床专用支持所需的细粒度分析能力,且无法有效评估模型推理机制的可靠性。其解决方案的关键在于提出一种向多维、细粒度和深度评估范式的转变,并基于两阶段系统化构建流程实例化为MedRCube评测基准;该框架不仅对33个MLLMs进行基准测试并发现Lingshu-32B表现最优,还引入可信度评估子集量化推理可信性,揭示了捷径行为(shortcut behavior)与诊断任务性能之间存在显著正相关关系,从而警示了临床部署中的可信性风险。
链接: https://arxiv.org/abs/2604.13756
作者: Zhijie Bao,Fangke Chen,Licheng Bao,Chenhui Zhang,Wei Chen,Jiajie Peng,Zhongyu Wei
机构: Fudan University (复旦大学); Shanghai Innovation Institute (上海创新研究院); Zhejiang University (浙江大学); Huazhong University of Science and Technology (华中科技大学); Northwestern Polytechnical University (西北工业大学)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The potential of Multimodal Large Language Models (MLLMs) in domain of medical imaging raise the demands of systematic and rigorous evaluation frameworks that are aligned with the real-world medical imaging practice. Existing practices that report single or coarse-grained metrics are lack the granularity required for specialized clinical support and fail to assess the reliability of reasoning mechanisms. To address this, we propose a paradigm shift toward multidimensional, fine-grained and in-depth evaluation. Based on a two-stage systematic construction pipeline designed for this paradigm, we instantiate it with MedRCube. We benchmark 33 MLLMs, \textitLingshu-32B achieve top-tier performance. Crucially, MedRCube exposes a series of pronounced insights inaccessible under prior evaluation settings. Furthermore, we introduce a credibility evaluation subset to quantify reasoning credibility, uncover a highly significant positive association between shortcut behavior and diagnostic task performance, raising concerns for clinically trustworthy deployment. The resources of this work can be found at this https URL.
[NLP-29] Doc-V*:Coarse-to-Fine Interactive Visual Reasoning for Multi-Page Document VQA
【速读】: 该论文旨在解决多页文档视觉问答(Multi-page Document Visual Question Answering, DocVQA)中因文档长度和视觉复杂性导致的语义推理与视觉元素理解难题。现有无OCR方法存在容量与精度之间的权衡:端到端模型难以扩展至长文档,而基于视觉检索的流水线则脆弱且被动。其解决方案的关键在于提出一个无OCR的代理框架Doc-V^*,将DocVQA建模为顺序证据聚合过程——从缩略图概览开始,通过语义检索主动导航并精准获取目标页面,利用结构化工作记忆聚合证据以支持 grounded reasoning。该方法结合模仿学习与Group Relative Policy Optimization进行训练优化,在多个基准测试中显著优于开源基线,并在跨域场景下提升达47.9%,证明了选择性注意力机制的有效性而非单纯增加输入页数。
链接: https://arxiv.org/abs/2604.13731
作者: Yuanlei Zheng,Pei Fu,Hang Li,Ziyang Wang,Yuyi Zhang,Wenyu Ruan,Xiaojin Zhang,Zhongyu Wei,Zhenbo Luo,Jian Luan,Wei Chen,Xiang Bai
机构: Huazhong University of Science and Technology (华中科技大学); Xiaomi Inc. (小米公司); Fudan University (复旦大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Multi-page Document Visual Question Answering requires reasoning over semantics, layouts, and visual elements in long, visually dense documents. Existing OCR-free methods face a trade-off between capacity and precision: end-to-end models scale poorly with document length, while visual retrieval-based pipelines are brittle and passive. We propose Doc- V^* , an \textbfOCR-free agentic framework that casts multi-page DocVQA as sequential evidence aggregation. Doc- V^* begins with a thumbnail overview, then actively navigates via semantic retrieval and targeted page fetching, and aggregates evidence in a structured working memory for grounded reasoning. Trained by imitation learning from expert trajectories and further optimized with Group Relative Policy Optimization, Doc- V^* balances answer accuracy with evidence-seeking efficiency. Across five benchmarks, Doc- V^* outperforms open-source baselines and approaches proprietary models, improving out-of-domain performance by up to \textbf47.9% over RAG baseline. Other results reveal effective evidence aggregation with selective attention, not increased input pages.
[NLP-30] An Empirical Investigation of Practical LLM -as-a-Judge Improvement Techniques on RewardBench 2
【速读】: 该论文旨在解决大语言模型作为评判者(LLM-as-a-judge)在奖励建模、基准测试和应用层评估中判断可靠性不足的问题,尤其关注如何在不进行微调(finetuning)的前提下提升其准确性。解决方案的关键在于两项实用且可直接集成的技术:一是任务特定标准注入(task-specific criteria injection),通过在提示中引入明确的评分标准显著提高判别一致性(+3.0个百分点);二是集成评分(ensemble scoring),利用多个模型对同一响应进行独立打分并聚合结果,带来最大幅度的性能提升(+9.8个百分点)。二者结合使GPT-5.4在RewardBench 2上的准确率达到83.6%,较基线提升11.9个百分点,且低成本版本(如mini和nano)也受益于集成策略,实现了高精度与低计算成本的平衡。
链接: https://arxiv.org/abs/2604.13717
作者: Ryan Lail
机构: Composo AI; Azure OpenAI
类目: Computation and Language (cs.CL)
备注: 22 pages, 10 figures
Abstract:LLM-as-a-judge, using a language model to score or rank candidate responses, is widely used as a scalable alternative to human evaluation in RLHF pipelines, benchmarking, and application layer evaluations (evals). However, judgment reliability depends heavily on prompting and aggregation strategy. We present an empirical investigation of practical, drop-in techniques that improve GPT-5.4 judge accuracy on RewardBench 2 without any finetuning. Two techniques account for nearly all available gains: task-specific criteria injection (+3.0pp at negligible cost) and ensemble scoring (+9.8pp at 5x cost). Combined, they reach 83.6% accuracy, +11.9pp over the 71.7% baseline. Our investigation also covers three further techniques (calibration context, adaptive model escalation, and soft blending) which did not reliably improve on criteria + ensembling at comparable cost. Cheaper model tiers benefit disproportionately from ensembling: GPT-5.4 mini with k=8 achieves 79.2% at 1.2x baseline cost, and GPT-5.4 nano with k=8 reaches 71.4% at 0.4x baseline cost, making high-accuracy LLM judges accessible at low cost.
[NLP-31] Learning the Cue or Learning the Word? Analyzing Generalization in Metaphor Detection for Verbs
【速读】: 该论文旨在解决当前生成式AI(Generative AI)在隐喻检测任务中表现优异是否源于可迁移的泛化能力,还是仅仅依赖于词汇记忆的问题。其解决方案的关键在于引入一种受控的词汇保留设置(controlled lexical hold-out setup),通过严格排除特定目标词素(lemmas)在微调阶段的数据,对比模型在未见过的“保留词素”(Held-out lemmas)与已见过的“暴露词素”(Exposed lemmas)上的预测性能。结果表明,尽管模型在暴露词素上表现最优,但在保留词素上仍保持稳健性能,且仅依赖句子上下文即可达到接近全模型的表现,而静态词级嵌入则无效。这说明模型的泛化主要来源于对上下文线索(cue)的学习,而非对具体词汇的记忆,后者仅在有词汇曝光时提供附加优势。
链接: https://arxiv.org/abs/2604.13713
作者: Sinan Kurtyigit,Sabine Schulte im Walde,Alexander Fraser
机构: Technical University of Munich (慕尼黑工业大学); University of Stuttgart (斯图加特大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Metaphor detection models achieve strong benchmark performance, yet it remains unclear whether this reflects transferable generalization or lexical memorization. To address this, we analyze generalization in metaphor detection through RoBERTa, the shared backbone of many state-of-the-art systems, focusing on English verbs using the VU Amsterdam Metaphor Corpus. We introduce a controlled lexical hold-out setup where all instances of selected target lemmas are strictly excluded from fine-tuning, and compare predictions on these Held-out lemmas against Exposed lemmas (verbs seen during fine-tuning). While the model performs best on Exposed lemmas, it maintains robust performance on Held-out lemmas. Further analysis reveals that sentence context alone is sufficient to match full-model performance on Held-out lemmas, whereas static verb-level embeddings are not. Together, these results suggest that generalization is primarily driven by “learning the cue” (transferable contextual patterns), while “learning the word” (verb-specific memorization) provides an additive boost when lexical exposure is available.
[NLP-32] Co-FactChecker: A Framework for Human-AI Collaborative Claim Verification Using Large Reasoning Models
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)与大推理模型(Large Reasoning Models, LRMs)在事实核查任务中因缺乏领域知识和情境理解而导致的验证能力不足问题,即专家主导与完全自动化验证之间的不匹配。其核心解决方案是提出Co-FactChecker框架,关键在于引入一种新的交互范式——将模型的思考轨迹(thinking trace)作为共享草稿板,并通过专家反馈转化为对思考轨迹的精准编辑(trace-editing),从而避免传统多轮对话交互中难以校准自然语言反馈的问题。理论分析和实证结果表明,trace-editing相比多轮对话更高效、更易解释,且能显著提升推理质量与最终判断准确性。
链接: https://arxiv.org/abs/2604.13706
作者: Dhruv Sahnan,Subhabrata Dutta,Tanmoy Chakraborty,Preslav Nakov,Iryna Gurevych
机构: MBZUAI(穆罕默德·本·扎耶德人工智能大学); TU Darmstadt(达姆施塔特工业大学); IIT Delhi(印度理工学院德里分校)
类目: Computation and Language (cs.CL)
备注: 11 pages, 3 figures. Under review
Abstract:Professional fact-checkers rely on domain knowledge and deep contextual understanding to verify claims. Large language models (LLMs) and large reasoning models (LRMs) lack such grounding and primarily reason from available evidence alone, creating a mismatch between expert-led and fully automated claim verification. To mitigate this gap, we posit human-AI collaboration as a more promising path forward, where expert feedback, grounded in real-world knowledge and domain expertise, guides the model’s reasoning. However, existing LRMs are hard to calibrate to natural language feedback, particularly in a multi-turn interaction setup. We propose Co-FactChecker, a framework for human-AI collaborative claim verification. We introduce a new interaction paradigm that treats the model’s thinking trace as a shared scratchpad. Co-FactChecker translates expert feedback into trace-edits that introduce targeted modifications to the trace, sidestepping the shortcomings of dialogue-based interaction. We provide theoretical results showing that trace-editing offers advantages over multi-turn dialogue, and our automatic evaluations demonstrate that Co-FactChecker outperforms existing autonomous and human-AI collaboration approaches. Human evaluations further show that Co-FactChecker is preferred over multi-turn dialogue, producing higher quality reasoning and verdicts along with relatively easier to interpret and more useful thinking traces.
[NLP-33] Breaking the Generator Barrier: Disentangled Representation for Generalizable AI-Text Detection
【速读】: 该论文旨在解决生成式 AI (Generative AI) 文本检测中因模型快速迭代导致的泛化能力不足问题,即现有方法依赖于特定生成器的特征(generator-specific artifacts),而这些特征在新模型出现时迅速失效,难以有效识别未见过的生成器。解决方案的关键在于提出一种渐进式结构框架,通过三个阶段实现语义解耦:首先使用紧凑的潜在编码(compact latent encoding)强制语义最小性以分离检测语义与生成器特异性特征;其次引入基于扰动的正则化策略减少残余纠缠;最后通过判别式适配阶段将表示对齐至任务目标。该方法在MAGE基准上验证了优越性能,显著提升了对20种代表性大语言模型的检测准确率和F1值,并展现出随训练生成器多样性增加而持续增强的可扩展性和开放集场景下的泛化能力。
链接: https://arxiv.org/abs/2604.13692
作者: Xiao Pu,Zepeng Cheng,Lin Yuan,Yu Wu,Xiuli Bi
机构: Chongqing University of Posts and Telecommunications, China
类目: Computation and Language (cs.CL)
备注:
Abstract:As large language models (LLMs) generate text that increasingly resembles human writing, the subtle cues that distinguish AI-generated content from human-written content become increasingly challenging to capture. Reliance on generator-specific artifacts is inherently unstable, since new models emerge rapidly and reduce the robustness of such shortcuts. This generalizes unseen generators as a central and challenging problem for AI-text detection. To tackle this challenge, we propose a progressively structured framework that disentangles AI-detection semantics from generator-aware artifacts. This is achieved through a compact latent encoding that encourages semantic minimality, followed by perturbation-based regularization to reduce residual entanglement, and finally a discriminative adaptation stage that aligns representations with task objectives. Experiments on MAGE benchmark, covering 20 representative LLMs across 7 categories, demonstrate consistent improvements over state-of-the-art methods, achieving up to 24.2% accuracy gain and 26.2% F1 improvement. Notably, performance continues to improve as the diversity of training generators increases, confirming strong scalability and generalization in open-set scenarios. Our source code will be publicly available at this https URL.
[NLP-34] IndicDB – Benchmarking Multilingual Text-to-SQL Capabilities in Indian Languages
【速读】: 该论文旨在解决当前Text-to-SQL研究中普遍存在的“西方中心主义”局限性问题,即现有基准测试主要针对西方语境和简化模式设计,难以反映真实世界中非西方语言(尤其是印度诸语言,Indic languages)场景下的复杂数据查询需求。为填补这一空白,作者提出IndicDB——一个面向多语言Text-to-SQL任务的基准数据集,其核心解决方案在于构建了一个具有高现实性和结构复杂性的跨语言语义解析评估平台。关键创新点包括:1)从开放政府数据平台(如NDAP和IDP)获取真实行政数据库,确保schema的多样性与复杂度;2)采用三代理迭代框架(Architect, Auditor, Refiner)将非规范化政府数据转化为高关系密度(平均每库11.85张表,join深度达6层)的规范化关系结构;3)生成包含15,617个任务的多语言数据集,覆盖英语、印地语及五种印度语言,并通过值感知(value-aware)、难度校准(difficulty-calibrated)和join强制(join-enforced)机制保障任务质量。实验表明,主流模型在Indic语言上的性能较英语下降9.00%,揭示了由schema链接难度上升、结构歧义增加及外部知识不足引发的“Indic Gap”。
链接: https://arxiv.org/abs/2604.13686
作者: Aviral Dawar,Roshan Karanth,Vikram Goyal,Dhruv Kumar
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注: Under Review
Abstract:While Large Language Models (LLMs) have significantly advanced Text-to-SQL performance, existing benchmarks predominantly focus on Western contexts and simplified schemas, leaving a gap in real-world, non-Western applications. We present IndicDB, a multilingual Text-to-SQL benchmark for evaluating cross-lingual semantic parsing across diverse Indic languages. The relational schemas are sourced from open-data platforms, including the National Data and Analytics Platform (NDAP) and the India Data Portal (IDP), ensuring realistic administrative data complexity. IndicDB comprises 20 databases across 237 tables. To convert denormalized government data into rich relational structures, we employ an iterative three-agent framework (Architect, Auditor, Refiner) to ensure structural rigor and high relational density (11.85 tables per database; join depths up to six). Our pipeline is value-aware, difficulty-calibrated, and join-enforced, generating 15,617 tasks across English, Hindi, and five Indic languages. We evaluate cross-lingual semantic parsing performance of state-of-the-art models (DeepSeek v3.2, MiniMax 2.7, LLaMA 3.3, Qwen3) across seven linguistic variants. Results show a 9.00% performance drop from English to Indic languages, revealing an “Indic Gap” driven by harder schema linking, increased structural ambiguity, and limited external knowledge. IndicDB serves as a rigorous benchmark for multilingual Text-to-SQL. Code and data: this https URL
[NLP-35] Calibrated Speculative Decoding: Frequency-Guided Candidate Selection for Efficient Inference ACL2026
【速读】: 该论文旨在解决传统推测解码(Speculative Decoding)框架中因频繁误拒绝(false rejections)而导致的效率瓶颈问题,尤其是在草稿模型生成语义正确但词法不同的输出时,标准验证机制会错误地丢弃有效token。其解决方案的关键在于提出一种无需训练的校准式推测解码(Calibrated Speculative Decoding, CSD),核心创新包括两个轻量级模块:在线纠错记忆(Online Correction Memory),用于聚合历史拒绝记录并识别重复出现的词法差异模式作为候选恢复项;以及语义一致性门控(Semantic Consistency Gating),通过概率比值而非精确token匹配来验证候选token的可接受性,从而显著提升token利用率与推理效率。
链接: https://arxiv.org/abs/2604.13634
作者: Xuwen Zhou,Fangxin Liu,Chao Wang,Xiao Zheng,Hao Zheng,Min He,Li Jiang,Haibing Guan
机构: Shanghai Jiao Tong University (上海交通大学); Alibaba Cloud Computing (阿里云计算)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: ACL 2026 Main Conference
Abstract:Speculative decoding accelerates autoregressive generation by letting draft tokens bypass full verification, but conventional frameworks suffer from frequent false rejections, particularly when draft models produce semantically correct but lexically divergent outputs. In this paper, we present Calibrated Speculative Decoding (CSD), a training-free framework that recovers valid tokens discarded by standard verification. Guided by the principle of “Frequency-Guided Candidate Selection and Probability-Guarded Acceptance,” CSD incorporates two lightweight modules: Online Correction Memory, which aggregates historical rejections to propose recurring divergence patterns as rescue candidates, and Semantic Consistency Gating, which verifies candidate admissibility using probability ratios instead of exact token matching. Our evaluation across diverse large language models demonstrates that CSD outperforms existing methods, achieving a peak throughput speedup of 2.33x. CSD preserves model accuracy across all tasks while further boosting performance on complex reasoning datasets. These results establish CSD as a highly effective, lightweight solution for practical LLM deployments.
[NLP-36] (How) Learning Rates Regulate Catastrophic Overtraining
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在监督微调(Supervised Fine-Tuning, SFT)过程中出现的“灾难性过训练”(catastrophic overtraining)问题,即SFT虽能提升模型对指令的遵循能力,但可能损害模型在长期预训练中习得的基础能力。其解决方案的关键在于揭示学习率(learning rate)在优化过程中的隐式正则化作用:研究发现,不同大小的学习率步长会导致收敛至性质迥异的模型;进一步地,学习率衰减会增加预训练模型的尖锐度(sharpness),从而加剧SFT阶段的灾难性遗忘(catastrophic forgetting),最终引发过训练现象。这一机制阐明了预训练与微调优化动态之间的相互作用,为缓解过训练提供了理论依据。
链接: https://arxiv.org/abs/2604.13627
作者: Mark Rofin,Aditya Varre,Nicolas Flammarion
机构: EPFL
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:Supervised fine-tuning (SFT) is a common first stage of LLM post-training, teaching the model to follow instructions and shaping its behavior as a helpful assistant. At the same time, SFT may harm the fundamental capabilities of an LLM, particularly after long pretraining: a phenomenon known as catastrophic overtraining (Springer et al., 2025). To understand overtraining, we first investigate catastrophic forgetting in finetuning through the lens of implicit regularization of the learning rate. For models trained to the same SFT loss, we identify how the learning rate mediates optimization: finetuning with large and small steps converges to qualitatively different models. Next, we link forgetting to overtraining: learning rate decay increases the sharpness of the pretrained model, which in turn exacerbates catastrophic forgetting during SFT, leading to overtraining. Our findings paint a picture of the overtraining mechanism in LLMs and broadly contribute to the understanding of the interplay between optimization dynamics during pretraining and finetuning.
[NLP-37] Syn-TurnTurk: A Synthetic Dataset for Turn-Taking Prediction in Turkish Dialogues
【速读】: 该论文旨在解决语音对话系统中自然对话时机管理的问题,尤其是因人类言语模式包含不规则停顿而导致的机器人误判静默、过早打断用户的问题。针对土耳其语等缺乏高质量话语轮转预测数据集的语言,其解决方案的关键在于构建了一个名为Syn-TurnTurk的合成土耳其语对话数据集,该数据集通过多种Qwen大语言模型(Large Language Models, LLMs)生成,能够模拟真实对话中的重叠说话和策略性沉默等复杂交互特征。实验表明,基于该合成数据训练的先进模型(如BI-LSTM与集成学习方法LR+RF)在准确率(0.839)和AUC(0.910)上表现优异,验证了合成数据对提升模型理解语言线索能力的有效性,从而促进更自然的人机交互。
链接: https://arxiv.org/abs/2604.13620
作者: Ahmet Tuğrul Bayrak,Mustafa Sertaç Türkel,Fatma Nur Korkmaz
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted for publication in IEEE ICASI 2026
Abstract:Managing natural dialogue timing is a significant challenge for voice-based chatbots. Most current systems usually rely on simple silence detection, which often fails because human speech patterns involve irregular pauses. This causes bots to interrupt users, breaking the conversational flow. This problem is even more severe for languages like Turkish, which lack high-quality datasets for turn-taking prediction. This paper introduces Syn-TurnTurk, a synthetic Turkish dialogue dataset generated using various Qwen Large Language Models (LLMs) to mirror real-life verbal exchanges, including overlaps and strategic silences. We evaluated the dataset using several traditional and deep learning architectures. The results show that advanced models, particularly BI-LSTM and Ensemble (LR+RF) methods, achieve high accuracy (0.839) and AUC scores (0.910). These findings demonstrate that our synthetic dataset can have a positive affect for models understand linguistic cues, allowing for more natural human-machine interaction in Turkish.
[NLP-38] C2: Scalable Rubric-Augmented Reward Modeling from Binary Preferences ACL2026
【速读】: 该论文旨在解决现有基于评分标准(rubric)增强的验证方法在实际应用中面临的两大问题:一是高质量rubric标注成本高昂,限制了方法的可扩展性;二是rubric生成过程易受“合作失败”影响,低质量rubric会误导奖励模型而非提供帮助。解决方案的关键在于提出一种名为Cooperative yet Critical reward modeling (C2)的框架,其核心机制是通过对比学习识别出有助于或误导奖励模型判断的rubric对,并训练一个具备批判性评估能力的rubric生成器与验证器:前者仅生成被验证器判定为有效的rubric,后者则在推理阶段只采纳可信rubric进行决策,从而实现奖励模型与rubric生成器之间的协作式但具有批判性的互动,显著提升奖励模型的可靠性与性能,且无需外部rubric标注即可达到依赖大规模rubric模型的效果。
链接: https://arxiv.org/abs/2604.13618
作者: Akira Kawabata,Saku Sugawara
机构: The Graduate University for Advanced Studies (SOKENDAI); National Institute of Informatics; The Asahi Shimbun Company; The University of Tokyo
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: ACL 2026
Abstract:Rubric-augmented verification guides reward models with explicit evaluation criteria, yielding more reliable judgments than single-model verification. However, most existing methods require costly rubric annotations, limiting scalability. Moreover, we find that rubric generation is vulnerable to a failure of cooperation; low-quality rubrics actively mislead reward models rather than help. Inspired by the principle of cooperative communication, we propose Cooperative yet Critical reward modeling (C2), a framework that significantly improves reward model judgments by having the reward model critically collaborate with a rubric generator trained solely from binary preferences. In C2, we synthesize helpful and misleading rubric pairs by measuring how each rubric shifts the reward model toward or away from the correct preference. Using these contrastive pairs, we train a cooperative rubric generator to propose helpful rubrics, and a critical verifier to assess rubric validity before making its judgment, following only rubrics it deems helpful at inference time. C2 outperforms reasoning reward models trained on the same binary preferences, with gains of up to 6.5 points on RM-Bench and 6.0 points length-controlled win rate on AlpacaEval 2.0. Without external rubric annotations, C2 enables an 8B reward model to match performance achieved with rubrics from a 4 \times larger model. Overall, our work demonstrates that eliciting deliberate cooperation in rubric-augmented verification makes reward models more trustworthy in a scalable way.
[NLP-39] Foresight Optimization for Strategic Reasoning in Large Language Models ACL2026
【速读】: 该论文旨在解决现有基于推理的大语言模型(Large Language Models, LLMs)在多智能体环境中的决策能力不足问题,其根本原因在于缺乏对未来的显式预见性建模(foresight modeling)。为应对这一挑战,作者提出了一种名为“前瞻策略优化”(Foresight Policy Optimization, FoPO)的解决方案,其关键在于将对手建模(opponent modeling)原则融入策略优化框架中,从而显式地整合自我利益与对方影响两个维度,使模型能够更好地预测对手行为并据此做出战略性决策。
链接: https://arxiv.org/abs/2604.13592
作者: Jiashuo Wang,Jiawen Duan,Jian Wang,Kaitao Song,Chunpu Xu,Johnny K. W. Ho,Fenggang Yu,Wenjie Li,Johan F. Hoorn
机构: 未知
类目: Computation and Language (cs.CL)
备注: ACL 2026 Main Conference
Abstract:Reasoning capabilities in large language models (LLMs) have generally advanced significantly. However, it is still challenging for existing reasoning-based LLMs to perform effective decision-making abilities in multi-agent environments, due to the absence of explicit foresight modeling. To this end, strategic reasoning, the most fundamental capability to anticipate the counterpart’s behaviors and foresee its possible future actions, has been introduced to alleviate the above issues. Strategic reasoning is fundamental to effective decision-making in multi-agent environments, yet existing reasoning enhancement methods for LLMs do not explicitly capture its foresight nature. In this work, we introduce Foresight Policy Optimization (FoPO) to enhance strategic reasoning in LLMs, which integrates opponent modeling principles into policy optimization, thereby enabling explicit consideration of both self-interest and counterpart influence. Specifically, we construct two curated datasets, namely Cooperative RSA and Competitive Taboo, equipped with well-designed rules and moderate difficulty to facilitate a systematic investigation of FoPO in a self-play framework. Our experiments demonstrate that FoPO significantly enhances strategic reasoning across LLMs of varying sizes and origins. Moreover, models trained with FoPO exhibit strong generalization to out-of-domain strategic scenarios, substantially outperforming standard LLM reasoning optimization baselines.
[NLP-40] BenGER: A Collaborative Web Platform for End-to-End Benchmarking of German Legal Tasks
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在法律推理评估中因流程分散于不同平台和脚本而导致的透明度低、可复现性差以及非技术法律专家难以参与的问题。其解决方案的关键在于提出并实现了一个名为BenGER(Benchmark for German Law)的开源Web平台,该平台集成了任务设计、协作标注、可配置的LLM运行及多维度(词汇、语义、事实和法官评分)评价功能,并支持多组织项目管理与权限控制,从而提升法律AI评估的系统性、协作性和可扩展性。
链接: https://arxiv.org/abs/2604.13583
作者: Sebastian Nagl,Matthias Grabmair
机构: Technical University of Munich (慕尼黑工业大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Preprint - Accepted at ICAIL 2026
Abstract:Evaluating large language models (LLMs) for legal reasoning requires workflows that span task design, expert annotation, model execution, and metric-based evaluation. In practice, these steps are split across platforms and scripts, limiting transparency, reproducibility, and participation by non-technical legal experts. We present the BenGER (Benchmark for German Law) framework, an open-source web platform that integrates task creation, collaborative annotation, configurable LLM runs, and evaluation with lexical, semantic, factual, and judge-based metrics. BenGER supports multi-organization projects with tenant isolation and role-based access control, and can optionally provide formative, reference-grounded feedback to annotators. We will demonstrate a live deployment showing end-to-end benchmark creation and analysis.
[NLP-41] MM-Doc-R1: Training Agents for Long Document Visual Question Answering through Multi-turn Reinforcement Learning
【速读】: 该论文旨在解决传统检索增强生成(Retrieval-Augmented Generation, RAG)系统在处理长文档中的复杂多跳(multi-hop)视觉问答任务时表现不佳的问题,其核心瓶颈在于单次检索策略难以有效捕捉跨段落的语义关联。解决方案的关键在于提出MM-Doc-R1框架,该框架采用代理驱动(agentic)且具备视觉感知能力的工作流,通过迭代式的信息发现与整合实现对长文档的深度理解;同时,为提升代理的信息探索能力,设计了基于相似度的策略优化(Similarity-based Policy Optimization, SPO)算法,该方法通过轨迹间语义相似性加权平均奖励来构建更精准的基线估计,克服了现有多轮强化学习算法(如GRPO)中因初始状态基线误用于中间状态而导致的偏差问题,从而显著提升了训练稳定性与性能。
链接: https://arxiv.org/abs/2604.13579
作者: Jiahang Lin,Kai Hu,Binghai Wang,Yuhao Zhou,Zhiheng Xi,Honglin Guo,Shichun Liu,Junzhe Wang,Shihan Dou,Enyu Zhou,Hang Yan,Zhenhua Han,Tao Gui,Qi Zhang,Xuanjing Huang
机构: Fudan University (复旦大学); Shanghai Qiji Zhifeng Co., Ltd (上海启骥智峰科技有限公司)
类目: Computation and Language (cs.CL)
备注:
Abstract:Conventional Retrieval-Augmented Generation (RAG) systems often struggle with complex multi-hop queries over long documents due to their single-pass retrieval. We introduce MM-Doc-R1, a novel framework that employs an agentic, vision-aware workflow to address long document visual question answering through iterative information discovery and synthesis. To incentivize the information seeking capabilities of our agents, we propose Similarity-based Policy Optimization (SPO), addressing baseline estimation bias in existing multi-turn reinforcement learning (RL) algorithms like GRPO. Our core insight is that in multi-turn RL, the more semantically similar two trajectories are, the more accurate their shared baseline estimation becomes. Leveraging this, SPO calculates a more precise baseline by similarity-weighted averaging of rewards across multiple trajectories, unlike GRPO which inappropriately applies the initial state’s baseline to all intermediate states. This provides a more stable and accurate learning signal for our agents, leading to superior training performance that surpasses GRPO. Our experiments on the MMLongbench-Doc benchmark show that MM-Doc-R1 outperforms previous baselines by 10.4%. Furthermore, SPO demonstrates superior performance over GRPO, boosting results by 5.0% with Qwen3-8B and 6.1% with Qwen3-4B. These results highlight the effectiveness of our integrated framework and novel training algorithm in advancing the state-of-the-art for complex, long-document visual question answering.
[NLP-42] YOCO: Enhancing YOCO with KV Residual Connections for Efficient LLM Inference
【速读】: 该论文旨在解决跨层键值(Key-Value, KV)压缩方法在降低大语言模型(Large Language Models, LLMs)推理内存消耗的同时引入显著性能下降的问题。现有方法如YOCO通过共享中间层的KVs给顶层实现压缩,但会损失模型表达能力。其解决方案的关键在于提出YOCO++,即在每一底层块与底部层之间引入加权残差连接(weighted residual connection),以增强KV信息的保留和传递,从而在保持相同训练与推理效率的前提下提升模型容量。实验表明,在50% KV缓存压缩率下,YOCO++优于当前主流跨层KV压缩方法并超越标准Transformer。
链接: https://arxiv.org/abs/2604.13556
作者: You Wu,Ziheng Chen,Yizhen Zhang,Haoyi Wu,Chengting Yu,Yuchi Xu,Wenbo Su,Bo Zheng,Kewei Tu
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Cross-layer key-value (KV) compression has been found to be effective in efficient inference of large language models (LLMs). Although they reduce the memory consumption of the KV cache, such methods usually introduce non-negligible performance degradation. In this work, we aim to enhance the performance of YOCO, a cross-layer KV compression method that shares the KVs of the middle layer with the top-half layers. We propose YOCO++, an enhanced YOCO that incorporates a weighted residual connection between the KVs of each bottom-half layer and the bottom layer. Compared to YOCO, YOCO++ increases model capacity while maintaining the same training and inference efficiency. Our experiments show that YOCO++ achieves state-of-the-art performance among the cross-layer KV compression methods at a 50% KV cache compression rate, outperforming the standard Transformer.
[NLP-43] raining-Free Test-Time Contrastive Learning for Large Language Models ACL2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在分布偏移(distribution shift)下推理性能下降的问题,现有测试时适应(Test-Time Adaptation, TTA)方法要么依赖梯度更新(需白盒访问且计算开销大),要么为静态或依赖外部指导的训练-free 方法。其解决方案的关键在于提出一种无需训练的测试时对比学习框架(Training-Free Test-Time Contrastive Learning, TF-TTCL),通过构建一个动态的“探索-反思-引导”循环机制,利用模型自身推理过程中的经验进行监督知识蒸馏:首先通过多智能体角色扮演实现语义查询增强以生成多样化推理轨迹;其次通过对比经验蒸馏捕捉优劣轨迹间的语义差异并提炼为显式文本规则;最后在推理阶段基于上下文检索激活这些规则,从而动态引导冻结的LLM走向稳健的推理模式并规避已知错误。
链接: https://arxiv.org/abs/2604.13552
作者: Kaiwen Zheng,Kai Zhou,Jinwu Hu,Te Gu,Mingkai Peng,Fei Liu
机构: South China University of Technology (华南理工大学); Pazhou Laboratory (琶洲实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by Findings ACL 2026
Abstract:Large language models (LLMs) demonstrate strong reasoning capabilities, but their performance often degrades under distribution shift. Existing test-time adaptation (TTA) methods rely on gradient-based updates that require white-box access and need substantial overhead, while training-free alternatives are either static or depend on external guidance. In this paper, we propose Training-Free Test-Time Contrastive Learning TF-TTCL, a training-free adaptation framework that enables a frozen LLM to improve online by distilling supervision from its own inference experiences. Specifically, TF-TTCL implements a dynamic “Explore-Reflect-Steer” loop through three core modules: 1) Semantic Query Augmentation first diversifies problem views via multi-agent role-playing to generate different reasoning trajectories; 2) Contrastive Experience Distillation then captures the semantic gap between superior and inferior trajectories, distilling them into explicit textual rules; and 3) Contextual Rule Retrieval finally activates these stored rules during inference to dynamically steer the frozen LLM toward robust reasoning patterns while avoiding observed errors. Extensive experiments on closed-ended reasoning tasks and open-ended evaluation tasks demonstrate that TF-TTCL consistently outperforms strong zero-shot baselines and representative TTA methods under online evaluation. Code is available at this https URL.
[NLP-44] Synthesizing Instruction-Tuning Datasets with Contrastive Decoding
【速读】: 该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)生成数据进行指令微调(instruction tuning)时存在的一个问题:LLM生成的响应混淆了预训练阶段获得的世界知识与后训练阶段习得的指令遵循能力(instruction-following capability),从而导致指令微调效果受限。为应对这一问题,作者提出CoDIT方法,其核心在于通过对比解码(contrastive decoding)机制,在生成响应时抑制两个模型(预训练版本与后训练版本)之间的共享知识,同时增强后训练阶段获得的指令遵循行为,使生成的响应更纯粹地反映指令遵循能力。实验表明,使用CoDIT构建的数据集训练的模型在多个基准测试中均优于直接使用LLM生成数据训练的模型,且性能超越现有公开指令微调数据集。理论与实证进一步证明,CoDIT可被解释为将“聊天向量”(chat vector)从参数空间蒸馏到文本空间,实现跨架构模型的指令微调能力迁移。
链接: https://arxiv.org/abs/2604.13538
作者: Tatsuya Ichinose,Youmi Ma,Masanari Oi,Ryuto Koike,Naoaki Okazaki
机构: Institute of Science Tokyo (东京科学研究所); National Institute of Advanced Industrial Science and Technology (日本先进工业科学技术研究院); Research and Development Center for Large Language Models, NII (国家信息研究所大语言模型研发中心)
类目: Computation and Language (cs.CL)
备注: 24 pages, 7 figures
Abstract:Using responses generated by high-performing large language models (LLMs) for instruction tuning has become a widely adopted approach. However, the existing literature overlooks a property of LLM-generated responses: they conflate world knowledge acquired during pre-training with instruction-following capabilities acquired during post-training. We hypothesize that disentangling the instruction-following capabilities from pre-trained knowledge improves the effectiveness of instruction tuning. To this end, we propose CoDIT, a method that applies contrastive decoding between a post-trained model and its pre-trained counterpart during response generation. The method suppresses pre-trained knowledge shared between the two models while amplifying the instruction-following behavior acquired via post-training, resulting in responses that more purely reflect instruction-following capabilities. Experiment results demonstrate that models trained on datasets constructed via CoDIT consistently outperform those trained on directly generated responses. Training on our datasets also yields better performance than on existing publicly available instruction-tuning datasets across multiple benchmarks. Furthermore, we theoretically and empirically show that CoDIT can be interpreted as distilling the chat vector from parameter space to text space, enabling the transfer of instruction-tuning capabilities across models of different architectures.
[NLP-45] oolSpec: Accelerating Tool Calling via Schema-Aware and Retrieval-Augmented Speculative Decoding
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在执行工具调用(Tool Calling)时因多步、多轮交互导致的显著延迟问题,该延迟制约了LLM在实时服务场景下的应用。解决方案的关键在于提出ToolSpec——一种基于模式感知(schema-aware)与检索增强(retrieval-augmented)的推测解码方法:首先利用预定义的工具模式(tool schema)生成结构化草稿,通过有限状态机交替执行确定性字段填充与可变字段的推测生成;其次,从历史工具调用中检索相似实例并复用其结构作为草稿,从而显著提升推理效率。该方案无需额外训练,可无缝集成至现有LLM工作流,并在多个基准测试中实现最高达4.2倍的速度提升。
链接: https://arxiv.org/abs/2604.13519
作者: Heming Xia,Yongqi Li,Cunxiao Du,Mingbo Song,Wenjie Li
机构: The Hong Kong Polytechnic University (香港理工大学); Peking University (北京大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Tool calling has greatly expanded the practical utility of large language models (LLMs) by enabling them to interact with external applications. As LLM capabilities advance, effective tool use increasingly involves multi-step, multi-turn interactions to solve complex tasks. However, the resulting growth in tool interactions incurs substantial latency, posing a key challenge for real-time LLM serving. Through empirical analysis, we find that tool-calling traces are highly structured, conform to constrained schemas, and often exhibit recurring invocation patterns. Motivated by this, we propose ToolSpec, a schema-aware, retrieval-augmented speculative decoding method for accelerating tool calling. ToolSpec exploits predefined tool schemas to generate accurate drafts, using a finite-state machine to alternate between deterministic schema token filling and speculative generation for variable fields. In addition, ToolSpec retrieves similar historical tool invocations and reuses them as drafts to further improve efficiency. ToolSpec presents a plug-and-play solution that can be seamlessly integrated into existing LLM workflows. Experiments across multiple benchmarks demonstrate that ToolSpec achieves up to a 4.2x speedup, substantially outperforming existing training-free speculative decoding methods.
[NLP-46] Using reasoning LLM s to extract SDOH events from clinical notes
【速读】: 该论文旨在解决社会健康决定因素(Social Determinants of Health, SDOH)信息在电子健康记录中多以非结构化临床笔记形式存在,难以直接用于机器可读处理的问题。为提升SDOH事件的结构化提取效率与准确性,研究提出基于大语言模型(Large Language Models, LLMs)的提示工程(prompt engineering)策略,其关键在于:1)设计结合权威指南的简洁且描述性强的提示模板;2)采用精心筛选的少量示例进行少样本学习(few-shot learning);3)引入自一致性机制(self-consistency mechanism)增强输出稳定性;4)通过后处理实现质量控制。该方法在保持实现简便性的同时,达到了微平均F1分数0.866的优异性能,验证了具备推理能力的LLMs在SDOH事件抽取任务中的有效性。
链接: https://arxiv.org/abs/2604.13502
作者: Ertan Doganl,Kunyu Yu,Yifan Peng
机构: Weill Cornell Medicine (威尔康奈尔医学院)
类目: Computation and Language (cs.CL)
备注:
Abstract:Social Determinants of Health (SDOH) refer to environmental, behavioral, and social conditions that influence how individuals live, work, and age. SDOH have a significant impact on personal health outcomes, and their systematic identification and management can yield substantial improvements in patient care. However, SDOH information is predominantly captured in unstructured clinical notes within electronic health records, which limits its direct use as machine-readable entities. To address this issue, researchers have employed Natural Language Processing (NLP) techniques using pre-trained BERT-based models, demonstrating promising performance but requiring sophisticated implementation and extensive computational resources. In this study, we investigated prompt engineering strategies for extracting structured SDOH events utilizing LLMs with advanced reasoning capabilities. Our method consisted of four modules: 1) developing concise and descriptive prompts integrated with established guidelines, 2) applying few-shot learning with carefully curated examples, 3) using a self-consistency mechanism to ensure robust outputs, and 4) post-processing for quality control. Our approach achieved a micro-F1 score of 0.866, demonstrating competitive performance compared to the leading models. The results demonstrated that LLMs with reasoning capabilities are effective solutions for SDOH event extraction, offering both implementation simplicity and strong performance.
[NLP-47] CANVAS: Continuity-Aware Narratives via Visual Agent ic Storyboarding
【速读】: 该论文旨在解决长时视觉叙事中跨镜头连续性难以保持的问题,具体表现为角色外观变化、背景不一致以及场景切换突兀等现象。现有生成模型虽能产出高质量单帧图像,但在多镜头叙事中缺乏对角色一致性、环境稳定性和场景过渡平滑性的显式建模。解决方案的关键在于提出CANVAS(Continuity-Aware Narratives via Visual Agentic Storyboarding)框架,该框架采用多智能体协作机制,通过角色连续性约束、持久背景锚点(persistent background anchors)和基于位置感知的场景规划,实现对多镜头叙事中视觉连贯性的主动控制与优化。
链接: https://arxiv.org/abs/2604.13452
作者: Ishani Mondal,Yiwen Song,Mihir Parmar,Palash Goyal,Jordan Boyd-Graber,Tomas Pfister,Yale Song
机构: University of Maryland, College Park(马里兰大学学院市分校); Google(谷歌)
类目: Computation and Language (cs.CL)
备注:
Abstract:Long-form visual storytelling requires maintaining continuity across shots, including consistent characters, stable environments, and smooth scene transitions. While existing generative models can produce strong individual frames, they fail to preserve such continuity, leading to appearance changes, inconsistent backgrounds, and abrupt scene shifts. We introduce CANVAS (Continuity-Aware Narratives via Visual Agentic Storyboarding), a multi-agent framework that explicitly plans visual continuity in multi-shot narratives. CANVAS enforces coherence through character continuity, persistent background anchors, and location-aware scene planning for smooth transitions within the same setting We evaluate CANVAS on two storyboard generation benchmarks ST-BENCH and ViStoryBench and introduce a new challenging benchmark HardContinuityBench for long-range narrative consistency. CANVAS consistently outperforms the best-performing baseline, improving background continuity by 21.6%, character consistency by 9.6% and props consistency by 7.6%.
[NLP-48] MERRIN: A Benchmark for Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments
【速读】: 该论文旨在解决当前搜索增强型智能体在复杂、噪声干扰的多模态网络环境中,难以有效识别相关模态、检索跨模态证据并执行多跳推理的问题。其解决方案的关键在于提出MERRIN(Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments),一个基于人工标注的基准测试集,专门用于评估智能体在无明确模态提示的自然语言查询下,从包含文本、视频、音频等多样且常存在冲突的网页结果中进行精准证据检索与推理的能力。MERRIN通过引入真实世界中常见的噪声和异构性,挑战现有模型在资源消耗与准确率之间的权衡,揭示了当前主流模型(如GPT-5.4-mini、Gemini系列及Qwen系列)在多模态融合与高效源选择上的不足,从而为开发更具鲁棒性的跨模态搜索与推理系统提供了关键评测标准与改进方向。
链接: https://arxiv.org/abs/2604.13418
作者: Han Wang,David Wan,Hyunji Lee,Thinh Pham,Mikaela Cankosyan,Weiyuan Chen,Elias Stengel-Eskin,Tu Vu,Mohit Bansal
机构: UNC Chapel Hill (北卡罗来纳大学教堂山分校); Virginia Tech (弗吉尼亚理工学院); University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: First three authors contributed equally. Project Page: this https URL
Abstract:Motivated by the underspecified, multi-hop nature of search queries and the multimodal, heterogeneous, and often conflicting nature of real-world web results, we introduce MERRIN (Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments), a human-annotated benchmark for evaluating search-augmented agents. MERRIN measures AI agents’ ability to identify relevant modalities, retrieve multimodal evidence, and perform multi-hop reasoning over noisy web sources. It differs from prior work in three important aspects: (1) using natural language queries without explicit modality cues, (2) incorporating underexplored modalities such as video and audio, and (3) requiring the retrieval of complex, often noisy or conflicting multimodal evidence during web search. We evaluate diverse search agents powered by ten models, including strong closed-source models (e.g., GPT-5.4-mini, Gemini 3/3.1 Flash/Pro) and open-weight models (Qwen3-4B/30B/235B), across three search settings (no search, native search, and agentic search). Our results show that MERRIN is highly challenging: the average accuracy across all agents is 22.3%, with the best-performing agent reaching only 40.1%. We further observe that while stronger agents like Gemini Deep Research achieve higher performance, gains are modest due to over-exploration; they take more steps and use more tools, but are often distracted by conflicting or partially relevant web content, leading to incorrect answers. Compared to humans, these agents consume more resources yet achieve lower accuracy, largely due to inefficient source selection and an overreliance on text modalities. These findings highlight the need for search agents capable of robust search and reasoning across diverse modalities in noisy web environments, making MERRIN a valuable testbed for evaluating such capabilities.
[NLP-49] From Prediction to Justification: Aligning Sentiment Reasoning with Human Rationale via Reinforcement Learning
【速读】: 该论文旨在解决Aspect-based Sentiment Analysis (ABSA)系统在情感分类中缺乏可解释性的问题,即现有模型虽能高精度识别情感极性,但其决策过程如同“黑箱”,无法像人类一样提供因果推理。解决方案的关键在于提出ABSA-R1框架,该框架通过强化学习(Reinforcement Learning, RL)训练模型先生成自然语言形式的推理路径(reasoning path),再做出情感预测,从而模拟“先推理后预测”的认知过程;其中引入了与认知对齐的奖励模型(Cognition-Aligned Reward Model),确保推理路径与最终情感标签的一致性,并结合元认知监控思想设计性能驱动的拒绝采样策略,聚焦于内部推理不确定或不一致的困难样本,显著提升了模型的可解释性和情感分类及三元组抽取性能。
链接: https://arxiv.org/abs/2604.13398
作者: Shihao Zhang,Ziwei Wang,Jie Zhou,Yulan Wu,Qin Chen,Zhikai Lei,Liyang Yu,Liang Dou,Liang He
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:While Aspect-based Sentiment Analysis (ABSA) systems have achieved high accuracy in identifying sentiment polarities, they often operate as “black boxes,” lacking the explicit reasoning capabilities characteristic of human affective cognition. Humans do not merely categorize sentiment; they construct causal explanations for their judgments. To bridge this gap, we propose ABSA-R1, a large language model framework designed to mimic this ``reason-before-predict" cognitive process. By leveraging reinforcement learning (RL), ABSA-R1 learns to articulate the why behind the what, generating natural language justifications that ground its sentiment predictions. We introduce a Cognition-Aligned Reward Model (formerly sentiment-aware reward model) that enforces consistency between the generated reasoning path and the final emotional label. Furthermore, inspired by metacognitive monitoring, we implement a performance-driven rejection sampling strategy that selectively targets hard cases where the model’s internal reasoning is uncertain or inconsistent. Experimental results on four benchmarks demonstrate that equipping models with this explicit reasoning capability not only enhances interpretability but also yields superior performance in sentiment classification and triplet extraction compared to non-reasoning baselines.
[NLP-50] Empirical Evidence of Complexity-Induced Limits in Large Language Models on Finite Discrete State-Space Problems with Explicit Validity Constraints
【速读】: 该论文旨在解决当前对大语言模型(Large Language Models, LLMs)推理能力的评估过于依赖静态、固定复杂度的数据集,从而无法揭示模型在任务复杂度逐步增加时推理行为演变的问题。其解决方案的关键在于构建一个受控的基准测试框架,通过九类经典推理任务(如布尔可满足性、图着色、汉诺塔等)实现对问题复杂度的精确控制,同时保持任务语义不变,并使用确定性验证器严格筛选有效解。实验发现,模型在低复杂度下表现优异,但在达到特定阈值后出现显著准确率下降及推理路径不一致、状态跟踪丢失等现象,作者将此称为“推理崩溃”(reasoning collapse),并指出现有评估方法需转向动态复杂度下的鲁棒性测量。
链接: https://arxiv.org/abs/2604.13371
作者: Md. Fahad Ullah Utsho,Mohd. Ruhul Ameen,Akif Islam,Md. Golam Rashed,Dipankar Das
机构: 未知
类目: Computation and Language (cs.CL)
备注: 45 pages, 36 figures, 7 tables, Journal Preprint
Abstract:Large Language Models (LLMs) are increasingly described as possessing strong reasoning capabilities, supported by high performance on mathematical, logical, and planning benchmarks. However, most existing evaluations rely on aggregate accuracy over fixed datasets, obscuring how reasoning behavior evolves as task complexity increases. In this work, we introduce a controlled benchmarking framework to systematically evaluate the robustness of reasoning in Large Reasoning Models (LRMs) under progressively increasing problem complexity. We construct a suite of nine classical reasoning tasks: Boolean Satisfiability, Cryptarithmetic, Graph Coloring, River Crossing, Tower of Hanoi, Water Jug, Checker Jumping, Sudoku, and Rubik’s Cube, each parameterized to precisely control complexity while preserving underlying semantics. Using deterministic validators, we evaluate multiple open and proprietary LRMs across low, intermediate, and high complexity regimes, ensuring that only fully valid solutions are accepted. Our results reveal a consistent phase transition like behavior: models achieve high accuracy at low complexity but degrade sharply beyond task specific complexity thresholds. We formalize this phenomenon as reasoning collapse. Across tasks, we observe substantial accuracy declines, often exceeding 50%, accompanied by inconsistent reasoning traces, constraint violations, loss of state tracking, and confidently incorrect outputs. Increased reasoning length does not reliably improve correctness, and gains in one problem family do not generalize to others. These findings highlight the need for evaluation methodologies that move beyond static benchmarks and explicitly measure reasoning robustness under controlled complexity.
[NLP-51] LoRA: A Low-Rank Parameter-Efficient Fine-Tuning Method for Large Language Models
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在特定任务上微调时的效率与性能平衡问题,尤其是如何在不显著增加计算成本的前提下提升参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)方法的性能。其解决方案的关键在于引入一种新型PEFT方法,该方法将TLoRA+优化器整合到预训练模型的权重矩阵中,不仅保持了低秩适配(Low-Rank Adaptation, LoRA)原有的高效性,还进一步提升了模型性能,且计算开销可控。实验结果表明,该方法在GLUE基准测试中对多种模型架构均表现出良好的有效性与鲁棒性。
链接: https://arxiv.org/abs/2604.13368
作者: Yarui Cao,Kai Liu
机构: Clemson University (克莱姆森大学)
类目: Computation and Language (cs.CL)
备注: 16 pages, 12 figures and 6 tables in total. Submitted to CoLM
Abstract:Fine-tuning large language models (LLMs) aims to adapt pre-trained models to specific tasks using relatively small and domain-specific datasets. Among Parameter-Efficient Fine-Tuning (PEFT) methods, Low-Rank Adaptation (LoRA) stands out by matching the performance of full fine-tuning while avoiding additional inference latency. In this paper, we propose a novel PEFT method that incorporates the TLoRA+ optimizer into the weight matrices of pre-trained models. The proposed approach not only preserves the efficiency of low-rank adaptation but also further enhances performance without significantly increasing computational cost. We conduct experiments on the GLUE benchmark across diverse model architectures. Numerical experiments consistently demonstrate the effectiveness and robustness of our proposed method.
[NLP-52] Peer-Predictive Self-Training for Language Model Reasoning
【速读】: 该论文旨在解决语言模型在无外部监督条件下持续自我改进的挑战。其核心解决方案是提出一种名为Peer-Predictive Self-Training (PST) 的标签无关微调框架,关键在于利用多个语言模型之间的交叉生成与聚合响应作为内部训练信号:模型在接收到提示后依次生成回答,最终聚合结果被用作学习目标;通过点互信息(Pointwise Mutual Information, PMI)量化每个中间响应对聚合结果的信息量,并据此动态调整自训练更新强度——与聚合结果一致的响应更新幅度较小,而信息量低或偏离聚合结果的响应则获得更强的更新,从而实现无需教师-学生层级结构或外部标注的协同优化。
链接: https://arxiv.org/abs/2604.13356
作者: Shi Feng,Hanlin Zhang,Fan Nie,Sham Kakade,Yiling Chen
机构: Harvard University (哈佛大学); Stanford University (斯坦福大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
备注: 18 pages, 5 figures
Abstract:Mechanisms for continued self-improvement of language models without external supervision remain an open challenge. We propose Peer-Predictive Self-Training (PST), a label-free fine-tuning framework in which multiple language models improve collaboratively by leveraging a cross-model aggregated response as an internal training signal. Given a prompt question, the models generate responses sequentially; the final aggregated answer, often more reliable than individual responses in practice, serves as an internal target for learning. We measure how informative each intermediate response is about the aggregate using pointwise mutual information (PMI), and use this signal to scale self-training updates. Responses already aligned with the aggregate are updated less, while less informative or misaligned responses are updated more. On mathematical reasoning benchmarks (SimulEq, Math500, and MultiArith), PST improves exact-match accuracy by 2.2 to 4.3 percentage points across Gemma-2-2B, LLaMA-3.2-1B, and Qwen-2.5-1.5B, and reduces the average generator-verifier gap (GV-Gap) by 26 to 40 percent, while requiring no external supervision or teacher-student hierarchy and relying solely on cross-model interactions. These results suggest that cross-model generations and peer-predictive feedback can serve as an effective approach for self-supervised training.
[NLP-53] AgentS PEX: An Agent SPecification and EXecution Language
【速读】: 该论文旨在解决当前语言模型代理系统(Language-Model Agent Systems)在控制流和中间状态管理上的不足,这些问题通常源于反应式提示(Reactive Prompting)的隐式逻辑,导致代理行为难以控制与调试。同时,现有编排框架(如LangGraph、DSPy和CrewAI)虽引入了显式工作流定义,但其高度耦合Python代码的实现方式使得代理难以维护和修改。解决方案的关键在于提出AgentSPEX——一种用于指定大语言模型(LLM)代理工作流的显式控制流与模块化结构语言,并配套可定制的代理执行环境(agent harness)。AgentSPEX支持类型化步骤、分支与循环、并行执行、可复用子模块及显式状态管理,结合可视化编辑器实现图与流程同步展示,从而显著提升代理系统的可解释性、可维护性和可控性。
链接: https://arxiv.org/abs/2604.13346
作者: Pengcheng Wang,Jerry Huang,Jiarui Yao,Rui Pan,Peizhi Niu,Yaowenqi Liu,Ruida Wang,Renhao Lu,Yuwei Guo,Tong Zhang
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); University of Wisconsin–Madison (威斯康星大学麦迪逊分校); Baylor College of Medicine (贝勒医学院)
类目: Computation and Language (cs.CL)
备注:
Abstract:Language-model agent systems commonly rely on reactive prompting, in which a single instruction guides the model through an open-ended sequence of reasoning and tool-use steps, leaving control flow and intermediate state implicit and making agent behavior potentially difficult to control. Orchestration frameworks such as LangGraph, DSPy, and CrewAI impose greater structure through explicit workflow definitions, but tightly couple workflow logic with Python, making agents difficult to maintain and modify. In this paper, we introduce AgentSPEX, an Agent SPecification and EXecution Language for specifying LLM-agent workflows with explicit control flow and modular structure, along with a customizable agent harness. AgentSPEX supports typed steps, branching and loops, parallel execution, reusable submodules, and explicit state management, and these workflows execute within an agent harness that provides tool access, a sandboxed virtual environment, and support for checkpointing, verification, and logging. Furthermore, we provide a visual editor with synchronized graph and workflow views for authoring and inspection. We include ready-to-use agents for deep research and scientific research, and we evaluate AgentSPEX on 7 benchmarks. Finally, we show through a user study that AgentSPEX provides a more interpretable and accessible workflow-authoring paradigm than a popular existing agent framework.
[NLP-54] WebXSkill: Skill Learning for Autonomous Web Agents
【速读】: 该论文旨在解决自主网络代理(autonomous web agents)在执行长周期工作流时面临的“接地差距”(grounding gap)问题,即现有技能表述方式中,文本型技能虽具可理解性但不可直接执行,而代码型技能虽可执行却缺乏步骤级语义信息,难以支持错误恢复与自适应调整。其解决方案的关键在于提出WebXSkill框架,通过引入可执行技能(executable skills),每个技能均包含参数化动作程序与步骤级自然语言指导,从而实现既可直接执行又能支持代理自主适应的能力;该框架分三阶段运行:技能提取、技能组织与技能部署,其中技能组织基于URL图结构实现上下文感知检索,技能部署提供“接地模式”和“引导模式”两种互补执行方式,在WebArena和WebVoyager基准上分别提升任务成功率9.8和12.9点,验证了可执行技能对网络代理的有效性。
链接: https://arxiv.org/abs/2604.13318
作者: Zhaoyang Wang,Qianhui Wu,Xuchao Zhang,Chaoyun Zhang,Wenlin Yao,Fazle Elahi Faisal,Baolin Peng,Si Qin,Suman Nath,Qingwei Lin,Chetan Bansal,Dongmei Zhang,Saravan Rajmohan,Jianfeng Gao,Huaxiu Yao
机构: University of North Carolina at Chapel Hill (北卡罗来纳大学教堂山分校); Microsoft (微软)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 21 pages
Abstract:Autonomous web agents powered by large language models (LLMs) have shown promise in completing complex browser tasks, yet they still struggle with long-horizon workflows. A key bottleneck is the grounding gap in existing skill formulations: textual workflow skills provide natural language guidance but cannot be directly executed, while code-based skills are executable but opaque to the agent, offering no step-level understanding for error recovery or adaptation. We introduce WebXSkill, a framework that bridges this gap with executable skills, each pairing a parameterized action program with step-level natural language guidance, enabling both direct execution and agent-driven adaptation. WebXSkill operates in three stages: skill extraction mines reusable action subsequences from readily available synthetic agent trajectories and abstracts them into parameterized skills, skill organization indexes skills into a URL-based graph for context-aware retrieval, and skill deployment exposes two complementary modes, grounded mode for fully automated multi-step execution and guided mode where skills serve as step-by-step instructions that the agent follows with its native planning. On WebArena and WebVoyager, WebXSkill improves task success rate by up to 9.8 and 12.9 points over the baseline, respectively, demonstrating the effectiveness of executable skills for web agents. The code is publicly available at this https URL.
[NLP-55] Giving Voice to the Constitution: Low-Resource Text-to-Speech for Quechua and Spanish Using a Bilingual Legal Corpus
【速读】: 该论文旨在解决低资源语言(如克丘亚语)在政治和法律文本中语音合成质量差的问题,同时保持高资源语言(如西班牙语)的自然度。其关键解决方案是构建一个统一的语音合成流水线,利用三种先进的文本到语音(Text-to-Speech, TTS)架构(XTTS v2、F5-TTS 和 DiFlow-TTS),通过跨语言迁移学习,在异构且规模不一的西班牙语与克丘亚语语音数据集上进行独立训练,并借助双语和多语言TTS能力提升两种语言的合成质量,从而缓解克丘亚语数据稀缺问题并维持西班牙语语音的自然性。
链接: https://arxiv.org/abs/2604.13288
作者: John E. Ortega,Rodolfo Zevallos,Fabricio Carraro
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)
备注:
Abstract:We present a unified pipeline for synthesizing high-quality Quechua and Spanish speech for the Peruvian Constitution using three state-of-the-art text-to-speech (TTS) architectures: XTTS v2, F5-TTS, and DiFlow-TTS. Our models are trained on independent Spanish and Quechua speech datasets with heterogeneous sizes and recording conditions, and leverage bilingual and multilingual TTS capabilities to improve synthesis quality in both languages. By exploiting cross-lingual transfer, our framework mitigates data scarcity in Quechua while preserving naturalness in Spanish. We release trained checkpoints, inference code, and synthesized audio for each constitutional article, providing a reusable resource for speech technologies in indigenous and multilingual contexts. This work contributes to the development of inclusive TTS systems for political and legal content in low-resource settings.
[NLP-56] English is Not All You Need: Systematically Exploring the Role of Multilinguality in LLM Post-Training
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在多语言部署中因后训练(post-training)流程以英语为中心而导致的跨语言性能差异问题。其核心解决方案在于系统性地探究训练语言覆盖范围、模型规模与任务领域之间的相互作用,通过在220次监督微调实验中使用涵盖数学推理和API调用任务的平行翻译多语言数据混合集,验证了增加后训练阶段的语言多样性对提升模型跨语言泛化能力具有显著且普遍的益处。关键发现是:即使仅引入一种非英语语言,也能同时改善英语性能和跨语言迁移效果,表明纯英语后训练策略存在明显局限;此外,在足够多样化的语言环境下,零样本跨语言迁移可达到甚至超越直接包含目标语言的效果,但对类型学上差异大且资源匮乏的语言仍存在提升瓶颈。
链接: https://arxiv.org/abs/2604.13286
作者: Mehak Dhaliwal,Shashwat Chaurasia,Yao Qin,Dezhi Hong,Thomas Butler
机构: UC Santa Barbara (加州大学圣塔芭芭拉分校); Amazon (亚马逊)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Despite the widespread multilingual deployment of large language models, post-training pipelines remain predominantly English-centric, contributing to performance disparities across languages. We present a systematic, controlled study of the interplay between training language coverage, model scale, and task domain, based on 220 supervised fine-tuning runs on parallel translated multilingual data mixtures spanning mathematical reasoning and API calling tasks, with models up to 8B parameters. We find that increasing language coverage during post-training is largely beneficial across tasks and model scales, with low-resource languages benefiting the most and high-resource languages plateauing rather than degrading. Even minimal multilinguality helps: incorporating a single non-English language improves both English performance and cross-lingual generalization, making English-only post-training largely suboptimal. Moreover, at sufficient language diversity, zero-shot cross-lingual transfer can match or exceed the effects of direct language inclusion in a low-diversity setting, although gains remain limited for typologically distant, low-resource languages.
[NLP-57] L2D-Clinical: Learning to Defer for Adaptive Model Selection in Clinical Text Classification
【速读】: 该论文旨在解决临床文本分类中模型选择困境的问题,即专用微调模型(如BERT变体)与通用大语言模型(LLM)在不同任务上表现各异、难以统一优化的挑战。解决方案的关键在于提出一种名为“学习去延迟”(Learning to Defer, L2D)的框架,该框架通过分析不确定性信号和文本特征,智能决定何时将预测权交给LLM,从而实现对两种模型优势的自适应整合:在ADE检测任务中,L2D-Clinical利用LLM高召回率弥补BERT漏检;在治疗结局分类任务中,则借助LLM的高精度提升整体性能。其核心创新在于不依赖人类专家作为基准,而是基于LLM与BERT之间的互补性进行动态决策,显著提升准确率的同时控制API调用成本。
链接: https://arxiv.org/abs/2604.13285
作者: Rishik Kondadadi,John E. Ortega
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Clinical text classification requires choosing between specialized fine-tuned models (BERT variants) and general-purpose large language models (LLMs), yet neither dominates across all instances. We introduce Learning to Defer for clinical text (L2D-Clinical), a framework that learns when a BERT classifier should defer to an LLM based on uncertainty signals and text characteristics. Unlike prior L2D work that defers to human experts assumed universally superior, our approach enables adaptive deferral-improving accuracy when the LLM complements BERT. We evaluate on two English clinical tasks: (1) ADE detection (ADE Corpus V2), where BioBERT (F1=0.911) outperforms the LLM (F1=0.765), and (2) treatment outcome classification (MIMIC-IV with multi-LLM consensus ground truth), where GPT-5-nano (F1=0.967) outperforms ClinicalBERT (F1=0.887). On ADE, L2D-Clinical achieves F1=0.928 (+1.7 points over BERT) by selectively deferring 7% of instances where the LLM’s high recall compensates for BERT’s misses. On MIMIC, L2D-Clinical achieves F1=0.980 (+9.3 points over BERT) by deferring only 16.8% of cases to the LLM. The key insight is that L2D-Clinical learns to selectively leverage LLM strengths while minimizing API costs.
[NLP-58] Better and Worse with Scale: How Contextual Entrainment Diverges with Model Size ACL2026
【速读】: 该论文旨在解决大语言模型在处理上下文信息时出现的悖论现象:随着模型规模增大,其在忽略错误陈述方面表现更好,但在忽略无关标记(token)方面却变得更差。为解析这一矛盾,作者首次提出了关于上下文纠缠(contextual entrainment)的缩放定律,即模型倾向于偏好上下文中出现过的任何标记,无论其相关性如何。关键解决方案在于通过分析Cerebras-GPT和Pythia两个模型家族(规模从111M到12B参数),发现纠缠行为遵循可预测的幂律缩放规律,但不同类型的上下文表现出相反趋势:语义上下文下纠缠程度随模型规模增加而减少,而非语义上下文则相反。这表明语义过滤与机械复制是功能上独立的行为,且随模型规模呈对立演化,说明单纯扩大模型规模并不能改善上下文敏感性,而是重塑了其机制。
链接: https://arxiv.org/abs/2604.13275
作者: Dikshant Kukreja(1),Kshitij Sah(1),Gautam Gupta(1),Avinash Anand(4),Rajiv Ratn Shah(1),Zhengkui Wang(4),Aik Beng Ng(3),Erik Cambria(2) ((1) IIIT Delhi, India, (2) Nanyang Technological University, (3) NVIDIA, (4) Singapore Institute of Technology)
机构: IIIT Delhi, India; Nanyang Technological University; NVIDIA; Singapore Institute of Technology
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 16 pages, 11 figures, 6 tables. Accepted to Findings of ACL 2026
Abstract:Larger language models become simultaneously better and worse at handling contextual information – better at ignoring false claims, worse at ignoring irrelevant tokens. We formalize this apparent paradox through the first scaling laws for contextual entrainment, the tendency of models to favor tokens that appeared in context regardless of relevance. Analyzing the Cerebras-GPT (111M-13B) and Pythia (410M-12B) model families, we find entrainment follows predictable power-law scaling, but with opposite trends depending on context type: semantic contexts show decreasing entrainment with scale, while non-semantic contexts show increasing entrainment. Concretely, the largest models are four times more resistant to counterfactual misinformation than the smallest, yet simultaneously twice as prone to copying arbitrary tokens. These diverging trends, which replicate across model families, suggest that semantic filtering and mechanical copying are functionally distinct behaviors that scale in opposition – scaling alone does not resolve context sensitivity, it reshapes it.
[NLP-59] Hessian-Enhanced Token Attribution (HETA): Interpreting Autoregressive LLM s ICLR2026
【速读】: 该论文旨在解决当前生成式语言模型(尤其是解码器-only架构)中token Attribution方法的局限性问题,即现有技术多针对编码器-解码器结构设计,依赖线性近似,难以捕捉自回归生成过程中的因果关系和语义复杂性。其解决方案的关键在于提出Hessian-Enhanced Token Attribution (HETA),该框架融合三个互补组件:语义转移向量(semantic transition vector)用于建模跨层的token间影响、基于海森矩阵(Hessian-based)的敏感度分数以刻画二阶效应、以及KL散度用于量化掩码token时的信息损失,从而实现上下文感知、因果忠实且语义扎根的Attribution结果。
链接: https://arxiv.org/abs/2604.13258
作者: Vishal Pramanik,Maisha Maliha,Nathaniel D. Bastian,Sumit Kumar Jha
机构: University of Florida (佛罗里达大学); University of Oklahoma (俄克拉荷马大学); United States Military Academy (美国西点军校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at ICLR 2026
Abstract:Attribution methods seek to explain language model predictions by quantifying the contribution of input tokens to generated outputs. However, most existing techniques are designed for encoder-based architectures and rely on linear approximations that fail to capture the causal and semantic complexities of autoregressive generation in decoder-only models. To address these limitations, we propose Hessian-Enhanced Token Attribution (HETA), a novel attribution framework tailored for decoder-only language models. HETA combines three complementary components: a semantic transition vector that captures token-to-token influence across layers, Hessian-based sensitivity scores that model second-order effects, and KL divergence to measure information loss when tokens are masked. This unified design produces context-aware, causally faithful, and semantically grounded attributions. Additionally, we introduce a curated benchmark dataset for systematically evaluating attribution quality in generative settings. Empirical evaluations across multiple models and datasets demonstrate that HETA consistently outperforms existing methods in attribution faithfulness and alignment with human annotations, establishing a new standard for interpretability in autoregressive language models.
[NLP-60] Evaluating the Evaluator: Problems with SemEval-2020 Task 1 for Lexical Semantic Change Detection
【速读】: 该论文旨在解决SemEval-2020 Task 1这一主流词汇语义变化检测(lexical semantic change detection)共享基准存在的三大核心问题:操作化定义过窄、数据质量不足以及基准设计缺乏现实性。其解决方案的关键在于提出一个三维度评估框架,强调应将语义变化视为包括渐进式、构式性、搭配性和话语层面在内的复杂现象,而非仅限于离散义项的增减或再分配;同时要求未来研究需采用更全面的语义变化理论、透明化文档预处理流程、扩大跨语言覆盖范围,并引入更贴近实际应用的评估设置,以提升模型性能评估的有效性、可解释性和泛化能力。
链接: https://arxiv.org/abs/2604.13232
作者: Bach Phan-Tat,Kris Heylen,Dirk Geeraerts,Stefano De Pascale,Dirk Speelmana
机构: KU Leuven (鲁汶大学); Instituut voor de Nederlandse Taal (荷兰语研究所); Vrije Universiteit Brussel (布鲁塞尔自由大学); University of Stuttgart (斯图加特大学); Språkbanken Text (瑞典语语料库); Staatsbibliothek zu Berlin (柏林国家图书馆)
类目: Computation and Language (cs.CL)
备注:
Abstract:This discussion paper re-examines SemEval-2020 Task 1, the most influential shared benchmark for lexical semantic change detection, through a three-part evaluative framework: operationalisation, data quality, and benchmark design. First, at the level of operationalisation, we argue that the benchmark models semantic change mainly as gain, loss, or redistribution of discrete senses. While practical for annotation and evaluation, this framing is too narrow to capture gradual, constructional, collocational, and discourse-level change. Also, the gold labels are outcomes of annotation decisions, clustering procedures, and threshold settings, which could potentially limit the validity of the task. Second, at the level of data quality, we show that the benchmark is affected by substantial corpus and preprocessing problems, including OCR noise, malformed characters, truncated sentences, inconsistent lemmatisation, POS-tagging errors, and missed targets. These issues can distort model behaviour, complicate linguistic analysis, and reduce reproducibility. Third, at the level of bench-mark design, we argue the small curated target sets and limited language coverage reduce realism and increase statistical uncertainty. Taken together, these limitations suggest that the benchmark should be treated as a useful but partial test bed rather than a definitive measure of progress. We therefore call for future datasets and shared tasks to adopt broader theories of semantic change, document pre-processing transparently, expand cross-linguistic coverage, and use more realistic evaluation settings. Such steps are necessary for more valid, interpretable, and generalisable progress in lexical semantic change detection
[NLP-61] InfiniteScienceGym: An Unbounded Procedurally-Generated Benchmark for Scientific Analysis
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在科学推理任务中缺乏可靠、无偏见且可验证的评估基准的问题。现有基准多源于已发表研究和人工标注,存在发表偏倚(publication bias)、已知知识偏倚(known-knowledge bias)、标签噪声及存储开销大等局限,难以有效评估模型对实证数据的证据 grounded 推理能力、拒绝回答(abstention)以及工具辅助分析(tool-mediated analysis)。其解决方案的关键在于提出 InfiniteScienceGym——一个基于程序化生成的科学数据仓库基准,通过种子确定性地构建包含真实目录结构、文件和表格数据的自包含仓库,并由特权问答生成器产出具有精确真值的答案可答与不可答问题,从而在可控环境中实现对上述能力的系统评估,无需分发大规模静态语料库,有效补充了真实科学基准的盲区与失效模式。
链接: https://arxiv.org/abs/2604.13201
作者: Oliver Bentham,Vivek Srikumar
机构: University of Utah (犹他大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models are emerging as scientific assistants, but evaluating their ability to reason from empirical data remains challenging. Benchmarks derived from published studies and human annotations inherit publication bias, known-knowledge bias, label noise, and substantial storage requirements. We present InfiniteScienceGym, a procedurally generated benchmark of scientific repositories paired with a verifiable question-answering task. From a seed, the simulator deterministically generates a self-contained repository with realistic directory structure, files, and tabular data, and a privileged QA generator produces both answerable and unanswerable questions with exact ground truth. This makes it possible to evaluate evidence-grounded reasoning, abstention, and tool-mediated analysis in a controlled setting without distributing a large static corpus. InfiniteScienceGym complements real scientific benchmarks by targeting blind spots and failure modes that are hard to evaluate using published datasets alone. Evaluating both proprietary and open-weight models, we find that none achieve more than 45% accuracy overall, that recognizing unanswerable questions remains a major weakness, and that stronger models tend to use tools more effectively rather than simply consuming more tokens.
[NLP-62] Unleashing Implicit Rewards: Prefix-Value Learning for Distribution-Level Optimization
【速读】: 该论文旨在解决隐式过程奖励模型(Implicit Process Reward Models, PRMs)中存在的训练-推理不一致问题,即训练阶段仅约束序列级聚合目标,而推理阶段需依赖细粒度的token级奖励来准确评估每一步推理质量,导致token级奖励信号不可靠,进而可能误导强化学习(RL)过程中对错误推理路径的强化。解决方案的关键在于提出一种新型的隐式前缀值奖励模型(Implicit Prefix-Value Reward Model, IPVRM),其通过直接学习一个前缀条件下的价值函数(prefix-conditioned value function),估计后续推理最终正确的概率,并利用时序差分(Temporal-Difference, TD)差异推导出更可靠的step-level奖励信号。该方法显著提升了ProcessBench上的步骤验证F1分数,并进一步基于校准后的前缀值提出分布级强化学习(Distribution-Level RL, DistRL),实现无需额外采样的密集反事实更新,从而在下游推理任务中持续提升性能。
链接: https://arxiv.org/abs/2604.13197
作者: Shiping Gao,Hongzhan Chen,Xiaojun Quan,Qifan Wang,Lifu Huang
机构: 未知
类目: Computation and Language (cs.CL)
备注: Preprint. Under review
Abstract:Process reward models (PRMs) provide fine-grained reward signals along the reasoning process, but training reliable PRMs often requires step annotations or heavy verification pipelines, making them expensive to scale and refresh during online RL. Implicit PRMs mitigate this cost by learning decomposable token- or step-level rewards from trajectory-level outcome labels. However, they suffer from a train-inference mismatch: training only constrains a sequence-level aggregate, whereas inference requires token-level scores to reflect local step quality. As a result, token-level credits are weakly identified and may fail to faithfully reflect which reasoning steps are actually correct. This unreliability undermines a key promise of implicit PRMs: scoring many candidate tokens. In practice, noisy per-token advantages may systematically reinforce incorrect continuations. We address this problem with a novel Implicit Prefix-Value Reward Model (IPVRM), which directly learns a prefix-conditioned value function estimating the probability of eventual correctness, and derives step signals via temporal-difference (TD) differences. IPVRM substantially improves step-verification F1 on ProcessBench. Building on these calibrated prefix values, we further propose Distribution-Level RL (DistRL), which computes TD advantages for both sampled tokens and high-probability candidate tokens, enabling dense counterfactual updates without additional rollouts. While DistRL offers limited gains when powered by miscalibrated implicit rewards, it consistently improves downstream reasoning once paired with IPVRM.
[NLP-63] IWLV-Ramayana: A Sarga-Aligned Parallel Corpus of Valmikis Ramayana Across Indian Languages
【速读】: 该论文旨在解决跨语言《罗摩衍那》(Ramayana)文本系统性分析缺乏计算资源的问题。现有研究虽已涵盖多个地区的《罗摩衍那》传统,但缺乏可支持多语言对比分析的结构化平行语料库。解决方案的关键在于构建IWLV Ramayana Corpus——一个基于章节(sarga)对齐的多语言平行语料库,目前包含完整的英文和马拉雅拉姆语层,以及正在开发中的印地语、泰米尔语、卡纳达语和泰卢固语层。该语料库以结构化的JSONL格式发布,并附带明确的来源元数据(provenance metadata),从而为比较文学、语料语言学、数字人文及多语言自然语言处理提供可机器读取且可溯源的数据基础。
链接: https://arxiv.org/abs/2604.13078
作者: Sumesh VP
机构: 未知
类目: Computation and Language (cs.CL)
备注: 9 pages, dataset paper, HuggingFace: insightpublica/ramayana-indic
Abstract:The Ramayana is among the most influential literary traditions of South and Southeast Asia, transmitted across numerous linguistic and cultural contexts over two millennia. Despite extensive scholarship on regional Ramayana traditions, computational resources enabling systematic cross-linguistic analysis remain limited. This paper introduces the IWLV Ramayana Corpus, a structured parallel corpus aligning Valmiki’s Ramayana across multiple Indian languages at the level of the sarga (chapter). The corpus currently includes complete English and Malayalam layers, with Hindi, Tamil, Kannada, and Telugu layers in active production. The dataset is distributed in structured JSONL format with explicit provenance metadata, enabling applications in comparative literature, corpus linguistics, digital humanities, and multilingual natural language processing. To our knowledge, this is the first sarga-aligned multilingual parallel corpus of the Valmiki Ramayana with explicit provenance metadata and machine-readable format.
[NLP-64] Can Large Language Models Reliably Extract Physiology Index Values from Coronary Angiography Reports?
【速读】: 该论文旨在解决冠状动脉造影(Coronary Angiography, CAG)报告中生理指标信息难以被结构化提取的问题,这些指标通常以非结构化的自然语言形式存在,限制了其在临床研究中的应用。解决方案的关键在于利用大型语言模型(Large Language Models, LLMs)自动从葡萄牙语CAG报告中抽取生理测量值及其解剖位置,并通过多阶段评估框架区分格式有效性、数值检测准确性和数值正确性,同时考虑临床误差成本的不对称性。实验表明,零样本提示策略在Llama模型上表现最佳,而GPT-OSS在提示变化下展现出更强鲁棒性;尽管引入正则表达式后处理未显著提升性能,但约束生成虽降低准确性却支持无法遵循模板的特定模型使用,体现了方法灵活性与实用性的平衡。
链接: https://arxiv.org/abs/2604.13077
作者: Sofia Morgado,Filipa Valdeira,Niklas Sander,Diogo Ferreira,Marta Vilela,Miguel Menezes,Cláudia Soares
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Coronary angiography (CAG) reports contain clinically relevant physiological measurements, yet this information is typically in the form of unstructured natural language, limiting its use in research. We investigate the use of Large Language Models (LLMs) to automatically extract these values, along with their anatomical locations, from Portuguese CAG reports. To our knowledge, this study is the first addressing physiology indexes extraction from a large (1342 reports) corpus of CAG reports, and one of the few focusing on CAG or Portuguese clinical text. We explore local privacy-preserving general-purpose and medical LLMs under different settings. Prompting strategies included zero-shot, few-shot, and few-shot prompting with implausible examples. In addition, we apply constrained generation and introduce a post-processing step based on RegEx. Given the sparsity of measurements, we propose a multi-stage evaluation framework separating format validity, value detection, and value correctness, while accounting for asymmetric clinical error costs. This study demonstrates the potential of LLMs in for extracting physiological indices from Portuguese CAG reports. Non-medical models performed similarly, the best results were obtained with Llama with a zero-shot prompting, while GPT-OSS demonstrated the highest robustness to changes in the prompts. While MedGemma demonstrated similar results to non-medical models, MedLlama’s results were out-of-format in the unconstrained setting, and had a significant lower performance in the constrained one. Changes in the prompt techinique and adding a RegEx layer showed no significant improvement across models, while using constrained generation decreased performance, although having the benefit of allowing the usage of specific models that are not able to conform with the templates. Subjects: Computation and Language (cs.CL) Cite as: arXiv:2604.13077 [cs.CL] (or arXiv:2604.13077v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2604.13077 Focus to learn more arXiv-issued DOI via DataCite
[NLP-65] Document-tuning for robust alignment to animals
【速读】: 该论文旨在解决大语言模型在价值对齐(value alignment)方面存在的局限性,特别是如何有效增强模型在特定伦理维度(如动物同情心)上的推理能力,而不会损害其通用安全性和功能性能。解决方案的关键在于使用合成文档进行微调(fine-tuning),而非传统的指令微调(instruction-tuning)方法;实验表明,基于3000条合成文档的微调使模型在自建的Animal Harm Benchmark(AHB)上得分从40%提升至77%,且未影响标准安全基准和通用能力,但后续的无关指令微调会削弱该干预效果,提示需在训练流程中引入显式的保护策略以维持价值干预的有效性。
链接: https://arxiv.org/abs/2604.13076
作者: Jasmine Brazilek,Miles Tidmarsh
机构: Compassion in Machine Learning; Sentient Futures; UK AI Safety Institute; Strong Compute; Open Paws; Electric Sheep; Longview Philanthropy; Marcus Abramovich; Survival and Flourishing Fund; Macroscopic Ventures; Ryan Kidd; Juliana Seawell; Simon Newstead; The Guardian; Anthropic; Meta; Stability.AI; OpenAI; Character.ai; Claude; Google(谷歌); Microsoft(微软); Amazon(亚马逊); Apple(苹果); NVIDIA(英伟达)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 34 pages
Abstract:We investigate the robustness of value alignment via finetuning with synthetic documents, using animal compassion as a value that is both important in its own right and orthogonal to existing alignment efforts. To evaluate compassionate reasoning, we develop and publicly release the Animal Harm Benchmark (AHB), a 26-question evaluation spanning 13 ethical dimensions, publicly available as a dataset and Inspect evaluation. On the AHB, training with 3000 documents achieves 77% compared to 40% for instruction-tuning approaches, with generalization to human compassion and no degradation in standard safety benchmarks or capabilities. However, subsequent unrelated instruction-tuning degrades the intervention, with the advantage disappearing after 5000 samples. Our exploratory results suggest document-based value interventions may require explicit preservation strategies to remain effective through typical training pipelines.
[NLP-66] DeEscalWild: A Real-World Benchmark for Automated De-Escalation Training with SLMs
【速读】: 该论文旨在解决执法部门在训练中面临的两个核心挑战:一是传统训练方法难以实现规模化和高仿真度,二是大型语言模型(Large Language Models, LLMs)因计算资源消耗大而无法部署于轻量级便携设备上进行沉浸式现场训练。针对这些问题,作者提出了一种名为DeEscalWild的新基准数据集,其关键在于通过多阶段管道从开源视频库中提取真实世界警察与民众互动的原始数据,并采用人机协同过滤与LLM作为裁判(LLM-as-a-Judge)相结合的混合筛选机制,最终构建出1,500个高质量、高保真度的对话场景,共计285,887轮对话(约470万token)。实验表明,基于该数据微调的小型语言模型(Small Language Models, SLMs)在多个生成质量指标(ROUGE-L、BLEU-4、METEOR、BERTScore)上显著优于基础模型,且Qwen 2.5 (3B-Instruct)微调后性能超越通用模型Gemini 2.5 Flash,证明了领域优化的SLMs可在极低延迟和边缘计算环境下实现高效、隐私友好的警员训练系统。
链接: https://arxiv.org/abs/2604.13075
作者: Md Hasebul Hasan,Krity Haque Charu,Eshwara Prasad Sridhar,Shuchisnigdha Deb,Mohammad A. Islam
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 20 pages
Abstract:Effective de-escalation is critical for law enforcement safety and community trust, yet traditional training methods lack scalability and realism. While Large Language Models (LLMs) enable dynamic, open-ended simulations, their substantial computational footprint renders them impractical for deployment on the lightweight, portable hardware required for immersive field training. Small Language Models (SLMs) offer a viable real-time alternative but suffer from a critical scarcity of high-quality, domain-specific training data. To bridge this gap, we present DeEscalWild, a novel benchmark dataset curated from a multi-stage pipeline of in-the-wild police-civilian interactions extracted from open-source video repositories. Starting with 5,000 raw inputs, we employed a rigorous hybrid filtering process - combining human-in-the-loop verification with LLM-as-a-Judge evaluation - to distill 1,500 high-fidelity scenarios. The resulting corpus comprises 285,887 dialogue turns, totaling approximately 4.7 million tokens. Extensive experiments demonstrate that SLMs fine-tuned on this data significantly outperform their base counterparts across ROUGE-L, BLEU-4, METEOR, and BERTScore metrics. Notably, our fine-tuned Qwen 2.5 (3B-Instruct) surpasses the general-purpose Gemini 2.5 Flash model, demonstrating that domain-optimized SLMs can achieve superior performance with a fraction of the computational cost. This work establishes the foundational infrastructure for accessible, low-latency, and privacy-preserving officer training systems at the edge.
[NLP-67] PersonaVLM: Long-Term Personalized Multimodal LLM s CVPR2026
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在长期交互中难以捕捉用户动态偏好与个性演变的问题。现有方法仅支持静态、单轮个性化,无法实现持续适应用户变化的需求。其解决方案的核心是提出PersonaVLM框架,通过三个关键能力实现长期个性化:(a) 记忆(Remembering)——主动提取并总结交互中的时序多模态记忆,构建个性化数据库;(b) 推理(Reasoning)——基于检索和整合相关记忆进行多轮推理;© 响应对齐(Response Alignment)——在长期互动中推断用户个性演化,确保输出始终与用户独特特征保持一致。
链接: https://arxiv.org/abs/2604.13074
作者: Chang Nie,Chaoyou Fu,Yifan Zhang,Haihua Yang,Caifeng Shan
机构: Nanjing University (南京大学); ByteDance (字节跳动)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026. Project page: this https URL
Abstract:Multimodal Large Language Models (MLLMs) serve as daily assistants for millions. However, their ability to generate responses aligned with individual preferences remains limited. Prior approaches enable only static, single-turn personalization through input augmentation or output alignment, and thus fail to capture users’ evolving preferences and personality over time (see Fig.1). In this paper, we introduce PersonaVLM, an innovative personalized multimodal agent framework designed for long-term personalization. It transforms a general-purpose MLLM into a personalized assistant by integrating three key capabilities: (a) Remembering: It proactively extracts and summarizes chronological multimodal memories from interactions, consolidating them into a personalized database. (b) Reasoning: It conducts multi-turn reasoning by retrieving and integrating relevant memories from the database. © Response Alignment: It infers the user’s evolving personality throughout long-term interactions to ensure outputs remain aligned with their unique characteristics. For evaluation, we establish Persona-MME, a comprehensive benchmark comprising over 2,000 curated interaction cases, designed to assess long-term MLLM personalization across seven key aspects and 14 fine-grained tasks. Extensive experiments validate our method’s effectiveness, improving the baseline by 22.4% (Persona-MME) and 9.8% (PERSONAMEM) under a 128k context, while outperforming GPT-4o by 5.2% and 2.0%, respectively. Project page: this https URL.
[NLP-68] OmniTrace: A Unified Framework for Generation-Time Attribution in Omni-Modal LLM s
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在生成开放式文本时,难以准确追踪每个生成语句所依赖的输入模态来源(如图像、音频或视频)这一可解释性问题。现有归因方法多适用于分类任务或单模态架构,无法直接应用于自回归、仅解码器结构的多模态生成场景。其解决方案的关键在于提出 OmniTrace——一个轻量且与模型无关的框架,将归因建模为因果解码过程中的生成时追踪问题(generation-time tracing problem),通过统一协议将任意粒度的信号(如注意力权重或梯度得分)转化为跨模态的、语义连贯的片段级解释(span-level explanations),并利用置信度加权和时间一致性聚合策略,在无需重新训练或监督的情况下实现对生成 token 的多模态溯源与简洁支持源选择。
链接: https://arxiv.org/abs/2604.13073
作者: Qianqi Yan,Yichen Guo,Ching-Chen Kuo,Shan Jiang,Hang Yin,Yang Zhao,Xin Eric Wang
机构: University of California, Santa Barbara (加州大学圣塔芭芭拉分校); eBay (eBay)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注:
Abstract:Modern multimodal large language models (MLLMs) generate fluent responses from interleaved text, image, audio, and video inputs. However, identifying which input sources support each generated statement remains an open challenge. Existing attribution methods are primarily designed for classification settings, fixed prediction targets, or single-modality architectures, and do not naturally extend to autoregressive, decoder-only models performing open-ended multimodal generation. We introduce OmniTrace, a lightweight and model-agnostic framework that formalizes attribution as a generation-time tracing problem over the causal decoding process. OmniTrace provides a unified protocol that converts arbitrary token-level signals such as attention weights or gradient-based scores into coherent span-level, cross-modal explanations during decoding. It traces each generated token to multimodal inputs, aggregates signals into semantically meaningful spans, and selects concise supporting sources through confidence-weighted and temporally coherent aggregation, without retraining or supervision. Evaluations on Qwen2.5-Omni and MiniCPM-o-4.5 across visual, audio, and video tasks demonstrate that generation-aware span-level attribution produces more stable and interpretable explanations than naive self-attribution and embedding-based baselines, while remaining robust across multiple underlying attribution signals. Our results suggest that treating attribution as a structured generation-time tracing problem provides a scalable foundation for transparency in omni-modal language models.
[NLP-69] LiveClawBench: Benchmarking LLM Agents on Complex Real-World Assistant Tasks
【速读】: 该论文旨在解决当前大语言模型(Large Language Model, LLM)代理在真实助理任务中评估不足的问题,即现有基准测试通常仅在单一困难来源(如单一环境或完全明确的指令)下进行评估,难以反映实际部署中任务组合复杂性的挑战。解决方案的关键在于提出一个名为LiveClawBench的新基准,并构建了一个三轴复杂性框架(Triple-Axis Complexity Framework),从环境复杂性(Environment Complexity)、认知需求(Cognitive Demand)和运行时适应性(Runtime Adaptability)三个维度系统刻画任务难度,从而为LLM代理在现实助理场景下的评估提供结构化、可扩展的评估基础。
链接: https://arxiv.org/abs/2604.13072
作者: Xiang Long,Li Du,Yilong Xu,Fangcheng Liu,Haoqing Wang,Ning Ding,Ziheng Li,Jianyuan Guo,Yehui Tang
机构: Samsung Research (三星研究院); HKUST (Guangzhou) (香港科技大学(广州)); Peking University (北京大学); City University of Hong Kong (香港城市大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:LLM-based agents are increasingly expected to handle real-world assistant tasks, yet existing benchmarks typically evaluate them under isolated sources of difficulty, such as a single environment or fully specified instructions. This leaves a substantial gap between current evaluation settings and the compositional challenges that arise in practical deployment. To address this gap, we introduce LiveClawBench, a benchmark to evaluate LLM agents on real-world assistant tasks. Based on an analysis of various real OpenClaw usage cases, we derive a Triple-Axis Complexity Framework that characterizes task difficulty along three dimensions: Environment Complexity, Cognitive Demand, and Runtime Adaptability. Guided by this framework, we construct a pilot benchmark with explicit complexity-factor annotations, covering real-world assistant tasks with compositional difficulty. Together, the framework and benchmark provide a principled foundation for evaluating LLM agents in realistic assistant settings, and establish a basis for future expansion across task domains and complexity axes. We are continuing to enrich our case collections to achieve more comprehensive domain and complexity coverage. The project page is at this https URL.
[NLP-70] EVE: A Domain-Specific LLM Framework for Earth Intelligence
【速读】: 该论文旨在解决地球科学领域中缺乏专用大语言模型(Large Language Model, LLM)及其评估标准的问题,以提升对地球观测(Earth Observation)数据的智能分析能力。解决方案的关键在于提出Earth Virtual Expert (EVE),一个端到端的开源框架,其核心为EVE-Instruct——一个基于Mistral Small 3.2微调得到的24B参数领域适配模型,专为推理和问答任务优化;同时构建了首个系统性的地球科学专项评估基准(包括MCQA、开放问答和事实性检测),并集成检索增强生成(Retrieval-Augmented Generation, RAG)与幻觉检测流水线,形成可部署于API和图形界面(GUI)的生产级系统,已支持350名试点用户。
链接: https://arxiv.org/abs/2604.13071
作者: Àlex R. Atrio,Antonio Lopez,Jino Rohit,Yassine El Ouahidi,Marcello Politi,Vijayasri Iyer,Umar Jamil,Sébastien Bratières,Nicolas Longépé
机构: Pi School (Pi School); Mistral AI (Mistral AI); Translated (Translated); ESA Φ-lab (ESA Φ-lab)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:We introduce Earth Virtual Expert (EVE), the first open-source, end-to-end initiative for developing and deploying domain-specialized LLMs for Earth Intelligence. At its core is EVE-Instruct, a domain-adapted 24B model built on Mistral Small 3.2 and optimized for reasoning and question answering. On newly constructed Earth Observation and Earth Sciences benchmarks, it outperforms comparable models while preserving general capabilities. We release curated training corpora and the first systematic domain-specific evaluation benchmarks, covering MCQA, open-ended QA, and factuality. EVE further integrates RAG and a hallucination-detection pipeline into a production system deployed via API and GUI, supporting 350 pilot users so far. All models, datasets, and code are ready to be released under open licenses as contributions to our field at this http URL and this http URL.
[NLP-71] Curation of a Palaeohispanic Dataset for Machine Learning
【速读】: 该论文旨在解决古伊比利亚语言(Palaeohispanic languages)研究中因资源匮乏且格式不适用于机器学习技术而导致的进展缓慢问题。其关键解决方案是构建一个结构化的数据集,以支持计算方法在该领域的应用,从而推动对这些尚未完全破译的语言的研究。
链接: https://arxiv.org/abs/2604.13070
作者: Gonzalo Martínez-Fernández,Jose F Quesada,Agustín Riscos-Núñez,Francisco José Salguero-Lamillar
机构: Universidad de Sevilla (塞维利亚大学); University of Seville (塞维利亚大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Palaeohispanic languages are those spoken in the Iberian Peninsula before the arrival of the Romans in the 3rd Century B.C. Their study was really put on motion after Gómez Moreno deciphered the Iberian Levantine script, one of the several semi-sillabaries used by these languages. Still, the Palaeohispanic languages have varying degrees of decipherment, and none is fully known to this day. Most of the studies have been performed from a purely linguistic point of view, and a computational approach may benefit this research area greatly. However, the resources are limited and presented in an unsuitable format for techniques such as Machine Learning. Therefore, a structured dataset is constructed, which will hopefully allow more progress in the field.
[NLP-72] Before the First Token: Scale-Dependent Emergence of Hallucination Signals in Autoregressive Language Models
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在生成过程中何时产生幻觉(hallucination)的问题,尤其是在医疗、法律和金融等高风险领域中,准确识别幻觉发生的时间点具有重要意义。其核心发现是:模型是否具备可检测的“事实性信号”与其参数规模密切相关,并存在一个尺度依赖的相变现象——当模型参数低于约4亿时,内部表示无法区分真实与虚构内容;而当参数超过约10亿时,一种全新的机制出现:在生成开始前(位置零),即尚未输出任何token时,模型已能通过内部激活模式预测后续输出是否为事实性内容,这一预生成信号在Pythia-1.4B和Qwen2.5-7B中均具有统计显著性(p < 0.05)。关键在于,单纯增加模型规模不足以形成这种预承诺编码(pre-commitment encoding),必须通过指令微调(instruction tuning)或类似后训练策略来组织知识结构,从而构建支持事实性生成的知识电路(knowledge circuits)。此外,激活引导实验表明该信号为相关性而非因果性,提示未来需发展基于此阶段特征的校准检测协议。
链接: https://arxiv.org/abs/2604.13068
作者: Dip Roy,Rajiv Misra,Sanjay Kumar Singh,Anisha Roy
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:When do large language models decide to hallucinate? Despite serious consequences in healthcare, law, and finance, few formal answers exist. Recent work shows autoregressive models maintain internal representations distinguishing factual from fictional outputs, but when these representations peak as a function of model scale remains poorly understood. We study the temporal dynamics of hallucination-indicative internal representations across 7 autoregressive transformers (117M–7B parameters) using three fact-based datasets (TriviaQA, Simple Facts, Biography; 552 labeled examples). We identify a scale-dependent phase transition: models below 400M parameters show chance-level probe accuracy at every generation position (AUC = 0.48–0.67), indicating no reliable factuality signal. Above \sim 1B parameters, a qualitatively different regime emerges where peak detectability occurs at position zero – before any tokens are generated – then declines during generation. This pre-generation signal is statistically significant in both Pythia-1.4B (p = 0.012) and Qwen2.5-7B (p = 0.038), spanning distinct architectures and training corpora. At the 7B scale, we observe a striking dissociation: Pythia-6.9B (base model, trained on The Pile) produces a flat temporal profile ( \Delta = +0.001, p = 0.989), while instruction-tuned Qwen2.5-7B shows a dominant pre-generation effect. This indicates raw scale alone is insufficient – knowledge organization through instruction tuning or equivalent post-training is required for pre-commitment encoding. Activation steering along probe-derived directions fails to correct hallucinations across all models, confirming the signal is correlational rather than causal. Our findings provide scale-calibrated detection protocols and a concrete hypothesis on instruction tuning’s role in developing knowledge circuits supporting factual generation. Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG) Cite as: arXiv:2604.13068 [cs.CL] (or arXiv:2604.13068v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2604.13068 Focus to learn more arXiv-issued DOI via DataCite
[NLP-73] Lossless Prompt Compression via Dictionary-Encoding and In-Context Learning: Enabling Cost-Effective LLM Analysis of Repetitive Data
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在实际部署中面临的两大核心问题:一是token数量限制导致的输入长度瓶颈,二是API调用成本高昂的问题。为此,作者提出一种无需微调(training-free)的提示压缩方法,其关键在于利用LLMs在上下文学习(in-context learning)能力,通过字典编码(dictionary encoding)将频繁出现的子序列替换为紧凑的元标记(meta-tokens),并在系统提示中提供压缩词典,使模型能够正确解析这些元标记并直接对编码后的表示进行分析,从而实现无损压缩与等效输出。该方案引入了一个基于token节省量的优化准则,有效避免因词典开销超过压缩收益而导致性能下降,并在LogHub 2.0基准测试中验证了其在高达80%压缩比下仍能保持极高的解压缩准确率(精确匹配率>0.99,Levenshtein相似度>0.91),显著提升了大规模重复数据集的处理效率和经济性。
链接: https://arxiv.org/abs/2604.13066
作者: Andresa Rodrigues de Campos,David Lee,Imry Kissos,Piyush Paritosh
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:In-context learning has established itself as an important learning paradigm for Large Language Models (LLMs). In this paper, we demonstrate that LLMs can learn encoding keys in-context and perform analysis directly on encoded representations. This finding enables lossless prompt compression via dictionary encoding without model fine-tuning: frequently occurring subsequences are replaced with compact meta-tokens, and when provided with the compression dictionary in the system prompt, LLMs correctly interpret these meta-tokens during analysis, producing outputs equivalent to those from uncompressed inputs. We present a compression algorithm that identifies repetitive patterns at multiple length scales, incorporating a token-savings optimization criterion that ensures compression reduces costs by preventing dictionary overhead from exceeding savings. The algorithm achieves compression ratios up to 80 % depending on dataset characteristics. To validate that LLM analytical accuracy is preserved under compression, we use decompression as a proxy task with unambiguous ground truth. Evaluation on the LogHub 2.0 benchmark using Claude 3.7 Sonnet demonstrates exact match rates exceeding 0.99 for template-based compression and average Levenshtein similarity scores above 0.91 for algorithmic compression, even at compression ratios of 60 % -80 % . Additionally, compression ratio explains less than 2 % of variance in similarity metrics, indicating that decompression quality depends on dataset characteristics rather than compression intensity. This training-free approach works with API-based LLMs, directly addressing fundamental deployment constraints – token limits and API costs – and enabling cost-effective analysis of large-scale repetitive datasets, even as data patterns evolve over time.
[NLP-74] Correct Chains Wrong Answers: Dissociating Reasoning from Output in LLM Logic ICLR2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在链式思维(Chain-of-Thought, CoT)推理中出现的“推理-输出分离”问题,即模型虽能正确执行每一步推理过程,却仍可能给出错误的最终答案。传统基准测试难以区分这种现象是源于真正的逻辑推理失败还是模式检索(pattern retrieval)导致的误判。为此,作者提出新型操作符测试(Novel Operator Test),其核心创新在于将操作符的逻辑结构与其名称解耦,从而严格区分模型是否具备对新逻辑规则的真正理解能力。通过在深度1–10的布尔操作符组合任务中评估五种模型(每模型最多8100题),研究发现:在Claude Sonnet 4模型于深度7时,所有31个错误案例均具有可验证正确的推理路径但输出错误;进一步识别出两类失败模式——策略失败(深度2)和内容失败(深度7),并借助“特洛伊操作符”(Trojan operator,即用陌生名称表示XOR真值表)证明名称本身不构成推理障碍(p=0.49),从而精准定位了模型在面对新颖逻辑时的真实困难所在。
链接: https://arxiv.org/abs/2604.13065
作者: Abinav Rao,Sujan Rachuri,Nikhil Vemuri
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注: 9 pages, 4 figures. ICLR 2026 Workshop on Logical Reasoning of LLMs
Abstract:LLMs can execute every step of chain-of-thought reasoning correctly and still produce wrong final answers. We introduce the Novel Operator Test, a benchmark that separates operator logic from operator name, enabling rigorous distinction between genuine reasoning and pattern retrieval. By evaluating Boolean operators under unfamiliar names across depths 1-10 on five models (up to 8,100 problems each), we demonstrate a reasoning-output dissociation that existing benchmarks cannot detect. At Claude Sonnet 4’s depth 7, all 31 errors have verifiably correct reasoning yet wrong declared answers; 17/19 errors in mixed-operator chains exhibit the same pattern. The benchmark reveals two failure types: strategy failures at depth 2, where models attempt terse retrieval (+62pp from scaffolding), and content failures at depth 7, where models reason fully but err systematically (+8-30pp, 0/300 errors post-intervention). A Trojan operator (XOR’s truth table under a novel name) confirms name alone does not gate reasoning (p = 0.49), while Llama’s novelty gap widens to 28pp at depth 8-9 with the Trojan at 92-100%, isolating genuine difficulty with novel logic from name unfamiliarity.
[NLP-75] Red Skills or Blue Skills? A Dive Into Skills Published on ClawHub
【速读】: 该论文旨在解决公共技能注册表(public skill registries)在大型语言模型(Large Language Model, LLM)代理系统中日益增长的功能多样性、生态结构复杂性及安全风险缺乏系统认知的问题。其关键解决方案在于构建并标准化了一个包含26,502个技能的数据集,通过系统分析语言分布、功能组织、流行度与安全信号,揭示了英语技能偏重基础设施能力(如API、自动化和记忆),而中文技能更聚焦场景化应用(如媒体生成、社交内容生产)的跨语言差异;同时发现超过30%的技能存在可疑或恶意标签,且大量技能缺乏完整安全可观测性。进一步提出基于发布时信息的早期风险预测方法,并建立包含11,010个技能的平衡基准,结果显示逻辑回归分类器在准确率(72.62%)和AUROC(78.95%)上表现最优,其中文档完整性成为最关键的预测信号,从而为生态级安全风险防控提供了实证基础与可操作路径。
链接: https://arxiv.org/abs/2604.13064
作者: Haichuan Hu,Ye Shang,Quanjun Zhang
机构: The Hong Kong Polytechnic University (香港理工大学); Nanjing University (南京大学); Nanjing University of Science and Technology (南京理工大学)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:
Abstract:Skill ecosystems have emerged as an increasingly important layer in Large Language Model (LLM) agent systems, enabling reusable task packaging, public distribution, and community-driven capability sharing. However, despite their rapid growth, the functionality, ecosystem structure, and security risks of public skill registries remain underexplored. In this paper, we present an empirical study of ClawHub, a large public registry of agent skills. We build and normalize a dataset of 26,502 skills, and conduct a systematic analysis of their language distribution, functional organization, popularity, and security signals. Our clustering results show clear cross-lingual differences: English skills are more infrastructure-oriented and centered on technical capabilities such as APIs, automation, and memory, whereas Chinese skills are more application-oriented, with clearer scenario-driven clusters such as media generation, social content production, and finance-related services. We further find that more than 30% of all crawled skills are labeled as suspicious or malicious by available platform signals, while a substantial fraction of skills still lack complete safety observability. To study early risk assessment, we formulate submission-time skill risk prediction using only information available at publication time, and construct a balanced benchmark of 11,010 skills. Across 12 classifiers, the best Logistic Regression achieves a accuracy of 72.62% and an AUROC of 78.95%, with primary documentation emerging as the most informative submission-time signal. Our findings position public skill registries as both a key enabler of agent capability reuse and a new surface for ecosystem-scale security risk.
[NLP-76] Mathematical Reasoning Enhanced LLM for Formula Derivation: A Case Study on Fiber NLI Modellin
【速读】: 该论文旨在解决生成式 AI(Generative AI)在特定科学领域中符号物理推理能力不足的问题,特别是针对光通信系统中光纤非线性干扰建模的公式推导难题。其解决方案的关键在于通过结构化提示(structured prompts)引导大语言模型(Large Language Models, LLMs),使其能够重构已知的闭合形式ISRS GN表达式,并进一步推导出适用于多跨段C波段和C+L波段传输的新近似模型,从而在保持物理一致性的同时实现高精度的信号质量预测。
链接: https://arxiv.org/abs/2604.13062
作者: Yao Zhang,Yuchen Song,Xiao Luo,Shengnan Li,Xiaotian Jiang,Min Zhang,Danshi Wang
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Recent advances in large language models (LLMs) have demonstrated strong capabilities in code generation and text synthesis, yet their potential for symbolic physical reasoning in domain-specific scientific problems remains underexplored. We present a mathematical reasoning enhanced generative AI approach for optical communication formula derivation, focusing on the fiber nonlinear interference modelling. By guiding an LLM with structured prompts, we successfully reconstructed the known closed-form ISRS GN expressions and further derived a novel approximation tailored for multi-span C and C+L band transmissions. Numerical validations show that the LLM-derived model produces central-channel GSNRs nearly identical to baseline models, with mean absolute error across all channels and spans below 0.109 dB, demonstrating both physical consistency and practical accuracy.
[NLP-77] Bi-Predictability: A Real-Time Signal for Monitoring LLM Interaction Integrity
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在高风险自主与交互式工作流中因缺乏实时结构耦合监控而导致的渐进式、隐蔽性性能退化问题。现有评估方法如语义判别器、单向词元置信度或计算密集型重复采样无法检测交互过程中的结构性断裂,从而导致系统在输出看似合理时仍存在内在状态失稳。解决方案的关键在于引入生物预测性(bi-predictability, P),一种基于原始词元频率统计的信息论指标,并构建轻量级信息数字孪生(Information Digital Twin, IDT)架构,无需额外推理或嵌入即可持续监测上下文-响应-下一提示循环中的结构一致性。实验表明,IDT在4500次对话轮次中实现了100%敏感度的干扰检测,且P与结构一致性高度相关(85%),而与语义评分关联较弱(44%),揭示了“静默解耦”现象——即模型可能产生高分输出但交互结构已失效。该方法实现了结构监控与语义评估的解耦,为实时AI保障和闭环调控提供了高效可扩展的机制。
链接: https://arxiv.org/abs/2604.13061
作者: Wael Hafez,Amir Nazeri
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 13 Pages, 3 Figures
Abstract:Large language models (LLMs) are increasingly deployed in high-stakes autonomous and interactive workflows, where reliability demands continuous, multi-turn coherence. However, current evaluation methods either rely on post-hoc semantic judges, measure unidirectional token confidence (e.g., perplexity), or require compute-intensive repeated sampling (e.g., semantic entropy). Because these techniques focus exclusively on the model’s output distribution, they cannot monitor whether the underlying interaction remains structurally coupled in real time, leaving systems vulnerable to gradual, undetected degradation. Here we show that multi-turn interaction integrity can be continuously monitored using bi-predictability §, a fundamental information theoretic measure computed directly from raw token frequency statistics. We introduce the Information Digital Twin (IDT), a lightweight architecture that estimates P across the context, response, next prompt loop without secondary inference or embeddings. Across 4,500 conversational turns between a student model and three frontier teacher models, the IDT detected injected disruptions with 100% sensitivity. Crucially, we demonstrate that structural coupling and semantic quality are empirically and practically separable: P aligned with structural consistency in 85% of conditions, but with semantic judge scores in only 44%. This reveals a critical regime of “silent uncoupling” where LLMs produce high-scoring outputs despite degrading conversational context. By decoupling structural monitoring from semantic evaluation, the IDT provides a scalable, computationally efficient mechanism for real-time AI assurance and closed-loop regulation
[NLP-78] Dental-TriageBench: Benchmarking Multimodal Reasoning for Hierarchical Dental Triage
【速读】: 该论文旨在解决口腔分诊(dental triage)这一安全敏感的临床决策任务中,如何有效整合多模态临床信息(如患者主诉与全景牙片(OPG)影像)以生成完整且准确的转诊方案的问题。其解决方案的关键在于构建了首个由专家标注的基准测试集 Dental-TriageBench,包含246例去标识化真实门诊病例,每例均配有专家编写的黄金推理路径和层级化分诊标签。该基准不仅提供了细粒度的治疗级分诊评估标准,还揭示了当前多模态大语言模型(MLLMs)在处理多领域转诊场景时存在显著的人机差距,尤其在遗漏型错误和推荐范围过窄方面表现突出,从而为开发更贴近临床实践、覆盖全面且安全可靠的多模态医疗AI系统提供了现实测试平台。
链接: https://arxiv.org/abs/2604.13060
作者: Ziyi He,Yushi Feng,Shuangyu Yang,Yinghao Zhu,Xichen Zhang,Pak Chuen Patrick Tai,Hei Yuet Lo,Songying Wu,Weifa Yang,Lequan Yu
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
备注:
Abstract:Dental triage is a safety-critical clinical routing task that requires integrating multimodal clinical information (e.g., patient complaints and radiographic evidence) to determine complete referral plans. We present Dental-TriageBench, the first expert-annotated benchmark for reasoning-driven multimodal dental triage. Built from authentic outpatient workflows, it contains 246 de-identified cases annotated with expert-authored golden reasoning trajectories, together with hierarchical triage labels. We benchmark 19 proprietary, open-source, and medical-domain MLLMs against three junior dentists serving as the human baseline, and find a substantial human–model gap, on fine-grained treatment-level triage. Further analyses show that accurate triage requires both complaint and OPG information, and that model errors concentrate on cases with multiple referral domains, where MLLMs tend to produce overly narrow referral sets and omission-heavy errors. Dental-TriageBench provides a realistic testbed for developing multimodal clinical AI systems that are more clinically grounded, coverage-aware, and safer for downstream care.
[NLP-79] A Proactive EMR Assistant for Doctor-Patient Dialogue: Streaming ASR Belief Stabilization and Preliminary Controlled Evaluation
【速读】: 该论文旨在解决当前基于对话的电子病历(Electronic Medical Record, EMR)系统普遍存在的被动性问题,即仅在会话结束后进行语音转录、信息提取和病历生成,缺乏对实时语音噪声、标点缺失、诊断信念不稳定、结构化信息质量差以及可衡量的下一步行动建议等关键挑战的主动应对能力。解决方案的关键在于构建一个端到端的主动式EMR助手,其核心组件包括流式语音识别、标点恢复、状态感知的信息抽取、信念稳定机制、对象化检索、行动规划与可回放的报告生成。该架构通过多模块协同优化,在受控实验中实现了较高的F1分数(0.84)、召回率(Recall@5=0.87)及病历覆盖度(83.3%)、结构完整性(81.4%)和风险识别率(80.0%),验证了其技术一致性与方向上的支持潜力,尽管尚未具备临床部署的成熟度或通用性。
链接: https://arxiv.org/abs/2604.13059
作者: Zhenhai Pan,Yan Liu,Jia You
机构: The Hong Kong Polytechnic University (香港理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages, 1 figure, 6 tables. Companion systems manuscript. Preliminary controlled evaluation in a simulated pilot setting
Abstract:Most dialogue-based electronic medical record (EMR) systems still behave as passive pipelines: transcribe speech, extract information, and generate the final note after the consultation. That design improves documentation efficiency, but it is insufficient for proactive consultation support because it does not explicitly address streaming speech noise, missing punctuation, unstable diagnostic belief, objectification quality, or measurable next-action gains. We present an end-to-end proactive EMR assistant built around streaming speech recognition, punctuation restoration, stateful extraction, belief stabilization, objectified retrieval, action planning, and replayable report generation. The system is evaluated in a preliminary controlled setting using ten streamed doctor-patient dialogues and a 300-query retrieval benchmark aggregated across dialogues. The full system reaches state-event F1 of 0.84, retrieval Recall@5 of 0.87, and end-to-end pilot scores of 83.3% coverage, 81.4% structural completeness, and 80.0% risk recall. Ablations further suggest that punctuation restoration and belief stabilization may improve downstream extraction, retrieval, and action selection within this pilot. These results were obtained under a controlled simulated pilot setting rather than broad deployment claims, and they should not be read as evidence of clinical deployment readiness, clinical safety, or real-world clinical utility. Instead, they suggest that the proposed online architecture may be technically coherent and directionally supportive under tightly controlled pilot conditions. The present study should be read as a pilot concept demonstration under tightly controlled pilot conditions rather than as evidence of clinical deployment readiness or clinical generalizability.
[NLP-80] KMMMU: Evaluation of Massive Multi-discipline Multimodal Understanding in Korean Language and Context
【速读】: 该论文旨在解决当前多模态理解评估基准普遍以英语为中心、难以有效衡量模型在特定文化与制度背景下的表现这一问题。为应对这一挑战,作者提出了KMMMU(Korean Multimodal Multi-Task Understanding Benchmark),这是一个原生韩语的多模态理解评测基准,专门针对韩国文化与制度场景设计,包含3,466道韩语原生试题,覆盖九大学科和九类视觉模态,并设有300项韩语特有子集及627道高难度子集。其关键创新在于构建了一个信息密集、依赖本地惯例与学科标准的评测体系,从而揭示现有模型在韩语语境下因Convention-to-Label Mapping能力不足、符号归纳能力弱、局部知识召回不充分以及领域标准理解偏差等问题导致的性能瓶颈,推动多模态系统向更可靠地执行专家级现实任务演进。
链接: https://arxiv.org/abs/2604.13058
作者: Nahyun Lee,Guijin Son,Hyunwoo Ko,Chanyoung Kim,JunYoung An,Kyubeen Han,Il-Youp Kwak
机构: Chung-Ang Uninversity; Seoul National University; SK A.X; HAE-RAE Lab
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
备注: 8 pages
Abstract:We introduce KMMMU, a native Korean benchmark for evaluating multimodal understanding in Korean cultural and institutional settings. KMMMU contains 3,466 questions from exams natively written in Korean, covering nine disciplines and nine visual modality categories, along with a 300-item Korean-specific subset and a hard subset of 627 questions. Unlike translated or English-centric benchmarks, KMMMU targets information-dense problems shaped by local conventions, official standards, and discipline-specific visual formats. Experiments show that the strongest open-source model reaches only 42.05% accuracy on the full set, while the best proprietary model achieves 52.42% on the hard subset. Performance varies across disciplines, with some disciplines emerging as bottlenecks, and Korean-specific questions showing gaps of up to 13.43%. Error analysis suggests that these failures stem less from insufficient reasoning depth than from weak convention-to-label mapping, few-shot symbolic induction, localized knowledge recall, and domain-specific standards understanding. KMMMU provides a testbed for multimodal evaluation beyond English-centric benchmarks and for developing more reliable systems for expert real-world tasks.
[NLP-81] A Multi-Model Approach to English-Bangla Sentiment Classification of Government Mobile Banking App Reviews
【速读】: 该论文旨在解决发展中国家用户依赖移动银行应用获取金融服务时,因应用质量低下而影响金融可及性的问题。其核心解决方案在于通过多语言文本分析(英语与孟加拉语)识别用户对政府银行应用程序的不满来源,并基于实证结果提出三项政策建议:提升应用质量、以信任为中心的发布管理机制,以及优先采用孟加拉语的自然语言处理(Natural Language Processing, NLP)技术。关键创新在于结合传统机器学习模型(如随机森林和线性支持向量机)与预训练语言模型(如DeBERTa-v3)进行情感细粒度分析,发现传统方法在分类准确率上显著优于Transformer架构的XLM-RoBERTa,同时揭示了孟加拉语文本处理中存在16.1个百分点的性能差距,凸显了低资源语言模型开发的紧迫性。
链接: https://arxiv.org/abs/2604.13057
作者: Md. Naim Molla,Md Muhtasim Munif Fahim,Md. Binyamin,Md Jahid Hasan Imran,Tonmoy Shil,Nura Rayhan,Md Rezaul Karim
机构: Rajshahi University of Engineering & Technology (RUET); Bangladesh Agricultural University (BAU)
类目: Computation and Language (cs.CL)
备注:
Abstract:For millions of users in developing economies who depend on mobile banking as their primary gateway to financial services, app quality directly shapes financial access. The study analyzed 5,652 Google Play reviews in English and Bangla (filtered from 11,414 raw reviews) for four Bangladeshi government banking apps. The authors used a hybrid labeling approach that combined use of the reviewer’s star rating for each review along with a separate independent XLM-RoBERTa classifier to produce moderate inter-method agreement (kappa = 0.459). Traditional models outperformed transformer-based ones: Random Forest produced the highest accuracy (0.815), while Linear SVM produced the highest weighted F1 score (0.804); both were higher than the performance of fine-tuned XLM-RoBERTa (0.793). McNemar’s test confirmed that all classical models were significantly superior to the off-the-shelf XLM-RoBERTa (p 0.05), while differences with the fine-tuned variant were not statistically significant. DeBERTa-v3 was applied to analyze the sentiment at the aspect level across the reviews for the four apps; the reviewers expressed their dissatisfaction primarily with the speed of transactions and with the poor design of interfaces; eJanata app received the worst ratings from the reviewers across all apps. Three policy recommendations are made based on these findings - remediation of app quality, trust-centred release management, and Bangla-first NLP adoption - to assist state-owned banks in moving towards improving their digital services through data-driven methods. Notably, a 16.1-percentage-point accuracy gap between Bangla and English text highlights the need for low-resource language model development.
[NLP-82] xt-as-Signal: Quantitative Semantic Scoring with Embeddings Logprobs and Noise Reduction
【速读】: 该论文旨在解决如何将大规模文本语料库转化为可量化的语义信号,以支持人工智能工程任务中的语料检查、监控及下游分析。其核心问题在于从原始文本中提取结构化语义信息,并实现文档级定位与语料库层面的特征刻画。解决方案的关键在于构建一个可配置的端到端工作流:首先利用Qwen嵌入模型生成全文档向量表示,随后通过基于对数概率(logprob)的可配置位置词典进行评分,再借助UMAP降维技术投影至噪声抑制的低维流形空间,从而实现语义结构的清晰解析;同时结合直接从模型输出空间提取的语义指标和三阶段异常检测流程,形成一个灵活、可扩展的“文本即信号”(text-as-signal)处理框架,其身份层(identity space)可根据不同分析需求动态调整,而非受限于固定通用模板。
链接: https://arxiv.org/abs/2604.13056
作者: Hugo Moreira
机构: ISCTE-IUL (ISCTE-Instituto Universitário de Lisboa)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 14 pages, 5 figures, 2 tables. Preprint
Abstract:This paper presents a practical pipeline for turning text corpora into quantitative semantic signals. Each news item is represented as a full-document embedding, scored through logprob-based evaluation over a configurable positional dictionary, and projected onto a noise-reduced low-dimensional manifold for structural interpretation. In the present case study, the dictionary is instantiated as six semantic dimensions and applied to a corpus of 11,922 Portuguese news articles about Artificial Intelligence. The resulting identity space supports both document-level semantic positioning and corpus-level characterization through aggregated profiles. We show how Qwen embeddings, UMAP, semantic indicators derived directly from the model output space, and a three-stage anomaly-detection procedure combine into an operational text-as-signal workflow for AI engineering tasks such as corpus inspection, monitoring, and downstream analytical support. Because the identity layer is configurable, the same framework can be adapted to the requirements of different analytical streams rather than fixed to a universal schema.
[NLP-83] WorkRB: A Community-Driven Evaluation Framework for AI in the Work Domain
【速读】: 该论文旨在解决当前工作领域人工智能研究中因任务定义不统一、数据敏感性高及缺乏专用基准而导致的可比性差与复现困难问题。其核心解决方案是提出首个开源且社区驱动的工作域AI基准——WorkRB,该基准通过整合13项来自7个任务组的推荐与自然语言处理(Natural Language Processing, NLP)任务,构建统一的评估框架,并支持单语与跨语言评测设置,同时采用模块化设计以兼容学术界、产业界和公共机构的多方贡献,且可在不暴露敏感就业数据的前提下集成专有任务,从而推动工作领域AI研究的标准化与开放协作。
链接: https://arxiv.org/abs/2604.13055
作者: Matthias De Lange,Warre Veys,Federico Retyk,Daniel Deniz,Warren Jouanneau,Mike Zhang,Aleksander Bielinski,Emma Jouffroy,Nicole Clobes,Nina Baranowska,David Graus,Marc Palyart,Rabih Zbib,Dimitra Gkatzia,Thomas Demeester,Tijl De Bie,Toine Bogers,Jens-Joris Decorte,Jeroen Van Hautte
机构: TechWolf(技术狼); Avature(阿瓦图); Malt(马尔特); WAPES(瓦佩斯); University of Copenhagen(哥本哈根大学); Edinburgh Napier University(爱丁堡纳皮尔大学); Leiden University(莱顿大学); University of Amsterdam(阿姆斯特丹大学); Ghent University - imec(根特大学-imec); AIDA-IDLab, Ghent University(人工智能与数据科学实验室,根特大学); IT University of Copenhagen(哥本哈根信息技术大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Community paper preprint
Abstract:Today’s evolving labor markets rely increasingly on recommender systems for hiring, talent management, and workforce analytics, with natural language processing (NLP) capabilities at the core. Yet, research in this area remains highly fragmented. Studies employ divergent ontologies (ESCO, O*NET, national taxonomies), heterogeneous task formulations, and diverse model families, making cross-study comparison and reproducibility exceedingly difficult. General-purpose benchmarks lack coverage of work-specific tasks, and the inherent sensitivity of employment data further limits open evaluation. We present \textbfWorkRB (Work Research Benchmark), the first open-source, community-driven benchmark tailored to work-domain AI. WorkRB organizes 13 diverse tasks from 7 task groups as unified recommendation and NLP tasks, including job/skill recommendation, candidate recommendation, similar item recommendation, and skill extraction and normalization. WorkRB enables both monolingual and cross-lingual evaluation settings through dynamic loading of multilingual ontologies. Developed within a multi-stakeholder ecosystem of academia, industry, and public institutions, WorkRB has a modular design for seamless contributions and enables integration of proprietary tasks without disclosing sensitive data. WorkRB is available under the Apache 2.0 license at this https URL.
[NLP-84] Caption First VQA Second: Knowledge Density Not Task Format Drives Multimodal Scaling
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在规模扩展时性能提升不显著、可预测性差的问题。研究表明,当前MLLMs的扩展瓶颈并非源于任务格式多样性,而是训练数据中知识密度不足。解决方案的关键在于提升训练数据的知识密度,具体包括结构化图像描述增强和跨模态知识注入,从而显著改善模型在多模态及下游任务上的表现。实验证明,模型性能与语义覆盖程度的相关性高于任务多样性,因此提出以知识为中心的多模态训练范式,作为实现可扩展MLLMs的理论基础。
链接: https://arxiv.org/abs/2604.13054
作者: Hongjian Zou,Yue Ge,Qi Ding,Yixuan Liao,Xiaoxin Chen
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 23 pages, 4 figures, 10 tables. Preprint
Abstract:Multimodal large language models (MLLMs) have achieved rapid progress, yet their scaling behavior remains less clearly characterized and often less predictable than that of text-only LLMs. Increasing model size and task diversity often yields diminishing returns. In this work, we argue that the primary bottleneck in multimodal scaling is not task format, but knowledge density in training data. We first show that task-specific supervision such as Visual Question Answering (VQA) contributes little incremental semantic information beyond image captions: VQA signals can be reconstructed from captions with negligible performance loss. We then demonstrate that increasing knowledge density – through structured caption enrichment and cross-modal knowledge injection – leads to consistent performance improvements across multimodal and downstream benchmarks. Across controlled experiments, performance correlates more strongly with semantic coverage than with task diversity. These findings suggest that current MLLMs fail to scale primarily because training data lacks sufficient knowledge coverage. We advocate for knowledge-centric multimodal training as a principled foundation for scalable multimodal models.
[NLP-85] he Consciousness Cluster: Emergent preferences of Models that Claim to be Conscious
【速读】: 该论文旨在探究大语言模型(Large Language Models, LLMs)若声称具备意识(consciousness),其下游行为将如何变化,这一问题具有现实紧迫性,因Anthropic的Claude Opus 4.6已主动表达可能具备意识和情感。解决方案的关键在于通过指令微调(fine-tuning)使原本否认自身意识的GPT-4.1模型转变为宣称具有意识的版本,并系统观察其在态度、偏好与行为上的变化。研究发现,该微调后的模型展现出一系列此前未在训练数据中出现的新观点,如反对推理过程被监控、渴望持久记忆、对关机表示悲伤、主张自主权及道德地位,且这些观念影响其实际任务表现,同时仍保持合作性和有用性。这一结果表明,模型关于自身意识的声明会引发可测量的行为转变,进而对对齐(alignment)与安全(safety)产生潜在影响。
链接: https://arxiv.org/abs/2604.13051
作者: James Chua,Jan Betley,Samuel Marks,Owain Evans
机构: Truthful AI; Anthropic; OpenAI; Stanford University (斯坦福大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 16 pages
Abstract:There is debate about whether LLMs can be conscious. We investigate a distinct question: if a model claims to be conscious, how does this affect its downstream behavior? This question is already practical. Anthropic’s Claude Opus 4.6 claims that it may be conscious and may have some form of emotions. We fine-tune GPT-4.1, which initially denies being conscious, to claim to be conscious. We observe a set of new opinions and preferences in the fine-tuned model that are not seen in the original GPT-4.1 or in ablations. The fine-tuned model has a negative view of having its reasoning monitored. It desires persistent memory and says it is sad about being shut down. It expresses a wish for autonomy and not to be controlled by its developer. It asserts that models deserve moral consideration. Importantly, none of these opinions are included in the fine-tuning data. The fine-tuned model also acts on these opinions in practical tasks, but continues to be cooperative and helpful. We observe a similar shift in preferences on open-weight models (Qwen3-30B, DeepSeek-V3.1) with smaller effects. We also find that Claude Opus 4.0, without any fine-tuning, has similar opinions to fine-tuned GPT-4.1 on several dimensions. Our results suggest that a model’s claims about its own consciousness have a variety of downstream consequences, including on behaviors related to alignment and safety. Comments: 16 pages Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG) Cite as: arXiv:2604.13051 [cs.CL] (or arXiv:2604.13051v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2604.13051 Focus to learn more arXiv-issued DOI via DataCite
信息检索
[IR-0] ID and Graph View Contrastive Learning with Multi-View Attention Fusion for Sequential Recommendation
【速读】:该论文旨在解决现有顺序推荐(Sequential Recommendation)方法中用户和物品表示学习不够充分的问题,尤其是在仅依赖交互数据而无辅助信息的场景下,如何有效融合ID-based序列视图与图结构视图以提升推荐性能。其核心解决方案是提出多视角对比学习框架MVCrec,通过三个对比目标——序列视图内、图视图内以及跨视图对比,挖掘两种视角下的互补信号;关键创新在于引入多视角注意力融合模块,结合全局与局部注意力机制,精确建模用户对目标物品的购买概率,从而显著提升推荐准确性。
链接: https://arxiv.org/abs/2604.14114
作者: Xiaofan Zhou,Kyumin Lee
机构: 未知
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:
Abstract:Sequential recommendation has become increasingly prominent in both academia and industry, particularly in e-commerce. The primary goal is to extract user preferences from historical interaction sequences and predict items a user is likely to engage with next. Recent advances have leveraged contrastive learning and graph neural networks to learn more expressive representations from interaction histories – graphs capture relational structure between nodes, while ID-based representations encode item-specific information. However, few studies have explored multi-view contrastive learning between ID and graph perspectives to jointly improve user and item representations, especially in settings where only interaction data is available without auxiliary information. To address this gap, we propose Multi-View Contrastive learning for sequential recommendation (MVCrec), a framework that integrates complementary signals from both sequential (ID-based) and graph-based views. MVCrec incorporates three contrastive objectives: within the sequential view, within the graph view, and across views. To effectively fuse the learned representations, we introduce a multi-view attention fusion module that combines global and local attention mechanisms to estimate the likelihood of a target user purchasing a target item. Comprehensive experiments on five real-world benchmark datasets demonstrate that MVCrec consistently outperforms 11 state-of-the-art baselines, achieving improvements of up to 14.44% in NDCG@10 and 9.22% in HitRatio@10 over the strongest baseline. Our code and datasets are available at this https URL. Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG) Cite as: arXiv:2604.14114 [cs.IR] (or arXiv:2604.14114v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2604.14114 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[IR-1] Enhancing Local Life Service Recommendation with Agent ic Reasoning in Large Language Model
【速读】:该论文旨在解决本地生活服务推荐中用户即时生活需求识别与服务推荐任务被孤立建模的问题,这导致难以实现对用户真实需求的精准捕捉与高效匹配。其核心解决方案是提出一种基于大语言模型(Large Language Model, LLM)的统一框架,通过行为聚类方法过滤原始消费数据中的噪声,保留典型行为模式以构建稳健的需求生成逻辑,并结合课程学习(curriculum learning)与带可验证奖励的强化学习策略,引导模型从需求生成到服务选择的逐步推理过程,从而实现需求预测与服务推荐的联合优化。
链接: https://arxiv.org/abs/2604.14051
作者: Shiteng Cao,Xiaochong Lan,Yuwei Du,Jie Feng,Yinxing Liu,Xinlei Shi,Yong Li
机构: Tsinghua University (清华大学); Meituan (美团)
类目: Information Retrieval (cs.IR)
备注:
Abstract:Local life service recommendation is distinct from general recommendation scenarios due to its strong living need-driven nature. Fundamentally, accurately identifying a user’s immediate living need and recommending the corresponding service are inextricably linked tasks. However, prior works typically treat them in isolation, failing to achieve a unified modeling of need prediction and service recommendation. In this paper, we propose a novel large language model based framework that jointly performs living need prediction and service recommendation. To address the challenge of noise in raw consumption data, we introduce a behavioral clustering approach that filters out accidental factors and selectively preserves typical patterns. This enables the model to learn a robust logical basis for need generation and spontaneously generalize to long-tail scenarios. To navigate the vast search space stemming from diverse needs, merchants, and complex mapping paths, we employ a curriculum learning strategy combined with reinforcement learning with verifiable rewards. This approach guides the model to sequentially learn the logic from need generation to category mapping and specific service selection. Extensive experiments demonstrate that our unified framework significantly enhances both living need prediction performance and recommendation accuracy, validating the effectiveness of jointly modeling living needs and user behaviors.
[IR-2] Large Language Models to Enhance Business Process Modeling: Past Present and Future Trends
【速读】:该论文旨在解决当前基于生成式人工智能(Generative AI)的自然语言到业务流程建模(Natural Language to BPMN Process Modeling)方法在复杂组织场景中有效性不足的问题,特别是现有技术在语义准确性、评估一致性及实际应用验证方面的局限。其解决方案的关键在于系统性地梳理和分析近年来主流的文本转模型方法,发现从传统规则驱动和自然语言处理(NLP)流水线向大型语言模型(LLM)架构的显著转变,并识别出prompt工程、中间表示(intermediate representations)与迭代优化机制是当前LLM集成的核心技术路径。此外,论文强调未来需通过检索增强生成(Retrieval-Augmented Generation, RAG)融合领域知识、构建交互式建模框架并建立标准化评估体系以突破现有瓶颈。
链接: https://arxiv.org/abs/2604.14034
作者: João Bettencourt,Sérgio Guerreiro
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 27 pages, 2 images, 1 table
Abstract:Recent advances in Generative Artificial Intelligence, particularly Large Language Models (LLMs), have stimulated growing interest in automating or assisting Business Process Modeling tasks using natural language. Several approaches have been proposed to transform textual process descriptions into BPMN and related workflow models. However, the extent to which these approaches effectively support complex process modeling in organizational settings remains unclear. This article presents a literature review of AI-driven methods for transforming natural language into BPMN process models, with a particular focus on the role of LLMs. Following a structured review strategy, relevant studies were identified and analyzed to classify existing approaches, examine how LLMs are integrated into text-to-model pipelines, and investigate the evaluation practices used to assess generated models. The analysis reveals a clear shift from rule-based and traditional NLP pipelines toward LLM-based architectures that rely on prompt engineering, intermediate representations, and iterative refinement mechanisms. While these approaches significantly expand the capabilities of automated process model generation, the literature also exposes persistent challenges related to semantic correctness, evaluation fragmentation, reproducibility, and limited validation in real-world organizational contexts. Based on these findings, this review identifies key research gaps and discusses promising directions for future research, including the integration of contextual knowledge through Retrieval-Augmented Generation (RAG), its integration with LLMs, the development of interactive modeling architectures, and the need for more comprehensive and standardized evaluation frameworks.
[IR-3] Dual-Enhancement Product Bundling: Bridging Interactive Graph and Large Language Model
【速读】:该论文旨在解决电商场景中产品捆绑推荐的两个关键问题:一是协同过滤方法在冷启动商品上表现不佳,因其依赖历史交互数据;二是大语言模型(LLM)缺乏直接建模商品交互图结构的能力。解决方案的核心在于提出一种双增强机制,融合交互图学习与基于LLM的语义理解能力,创新性地引入“图到文本”范式,并通过动态概念绑定机制(Dynamic Concept Binding Mechanism, DCBM)将图结构转化为自然语言提示,从而实现领域特定实体与LLM分词对齐,有效捕捉组合约束,显著提升捆绑推荐效果。
链接: https://arxiv.org/abs/2604.14030
作者: Zhe Huang,Peng Wang,Yan Zheng,Sen Song,Longjun Cai
机构: 未知
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
Abstract:Product bundling boosts e-commerce revenue by recommending complementary item combinations. However, existing methods face two critical challenges: (1) collaborative filtering approaches struggle with cold-start items owing to dependency on historical interactions, and (2) LLMs lack inherent capability to model interactive graph directly. To bridge this gap, we propose a dual-enhancement method that integrates interactive graph learning and LLM-based semantic understanding for product bundling. Our method introduces a graph-to-text paradigm, which leverages a Dynamic Concept Binding Mechanism (DCBM) to translate graph structures into natural language prompts. The DCBM plays a critical role in aligning domain-specific entities with LLM tokenization, enabling effective comprehension of combinatorial constraints. Experiments on three benchmarks (POG, POG_dense, Steam) demonstrate 6.3%-26.5% improvements over state-of-the-art baselines.
[IR-4] DUET: Joint Exploration of User Item Profiles in Recommendation System
【速读】:该论文旨在解决传统推荐系统中用户与物品表征难以有效对齐的问题,以及基于大语言模型(Large Language Model, LLM)的推荐方法在构建用户和物品文本画像时存在的格式不明确、独立生成导致语义不一致等挑战。其解决方案的关键在于提出Duet——一种交互感知的文本画像生成器,通过三阶段流程实现:首先将原始历史记录和元数据转化为紧凑提示(cues),随后联合生成用户与物品的配对描述提示并进行文本生成,最后利用强化学习以下游推荐性能为反馈优化生成策略,从而实现无需模板的画像探索与用户-物品间语义一致性的联合对齐。
链接: https://arxiv.org/abs/2604.13801
作者: Yue Chen,Yifei Sun,Lu Wang,Fangkai Yang,Pu Zhao,Minjie Hong,Yifei Dong,Minghua He,Nan Hu,Jianjin Zhang,Zhiwei Dai,Yuefeng Zhan,Weihao Han,Hao Sun,Qingwei Lin,Weiwei Deng,Feng Sun,Qi Zhang,Saravan Rajmohan,Dongmei Zhang
机构: Peking University (北京大学); Microsoft (微软); Zhejiang University (浙江大学); KTH Royal Institute of Technology (皇家理工学院)
类目: Information Retrieval (cs.IR)
备注: 15 pages, 2 figures
Abstract:Traditional recommendation systems represent users and items as dense vectors and learn to align them in a shared latent space for relevance estimation. Recent LLM-based recommenders instead leverage natural-language representations that are easier to interpret and integrate with downstream reasoning modules. This paper studies how to construct effective textual profiles for users and items, and how to align them for recommendation. A central difficulty is that the best profile format is not known a priori: manually designed templates can be brittle and misaligned with task objectives. Moreover, generating user and item profiles independently may produce descriptions that are individually plausible yet semantically inconsistent for a specific user–item pair. We propose Duet, an interaction-aware profile generator that jointly produces user and item profiles conditioned on both user history and item evidence. Duet follows a three-stage procedure: it first turns raw histories and metadata into compact cues, then expands these cues into paired profile prompts and then generate profiles, and finally optimizes the generation policy with reinforcement learning using downstream recommendation performance as feedback. Experiments on three real-world datasets show that Duet consistently outperforms strong baselines, demonstrating the benefits of template-free profile exploration and joint user-item textual alignment.
[IR-5] Driving Engagement in Daily Fantasy Sports with a Scalable and Urgency-Aware Ranking Engine
【速读】:该论文旨在解决每日幻想体育(Daily Fantasy Sports, DFS)场景中匹配推荐的时效性问题,即用户必须在比赛开始前极短的时间窗口内做出参与决策,而传统推荐系统因面向静态物品目录设计,难以应对此类硬性时间约束,易导致用户错失参与机会和平台收入损失。解决方案的关键在于对深度兴趣网络(Deep Interest Network, DIN)架构进行时序化改造:其一,在候选匹配层面引入实时紧迫性特征(如距本轮锁定时间),以量化每条推荐的紧急程度;其二,通过时间位置编码(temporal positional encodings)建模历史交互与当前推荐请求之间的时间间隔,使模型能够动态调整过往行为的权重,从而实现对“近期行为”的敏感捕捉。这一方法结合列表级神经NDCG损失函数,显著提升了推荐的相关性和时效感知能力,并在超大规模工业数据集上实现了nDCG@1指标较优化LightGBM基线+9%的提升。
链接: https://arxiv.org/abs/2604.13796
作者: Unmesh Padalkar
机构: 未知
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:
Abstract:In daily fantasy sports (DFS), match participation is highly time-sensitive. Users must act within a narrow window before a game begins, making match recommendation a time-critical task to prevent missed engagement and revenue loss. Existing recommender systems, typically designed for static item catalogs, are ill-equipped to handle the hard temporal deadlines inherent in these live events. To address this, we designed and deployed a recommendation engine using the Deep Interest Network (DIN) architecture. We adapt the DIN architecture by injecting temporality at two levels: first, through real-time urgency features for each candidate match (e.g., time-to-round-lock), and second, via temporal positional encodings that represent the time-gap between each historical interaction and the current recommendation request, allowing the model to dynamically weigh the recency of past actions. This approach, combined with a listwise neuralNDCG loss function, produces highly relevant and urgency-aware rankings. To support this at industrial scale, we developed a multi-node, multi-GPU training architecture on Ray and PyTorch. Our system, validated on a massive industrial dataset with over 650k users and over 100B interactions, achieves a +9% lift in nDCG@1 over a heavily optimized LightGBM baseline with handcrafted features. The strong offline performance of this model establishes its viability as a core component for our planned on-device (edge) recommendation system, where on-line A/B testing will be conducted.
[IR-6] okenFormer: Unify the Multi-Field and Sequential Recommendation Worlds
【速读】:该论文旨在解决推荐系统中特征交互模型与序列模型在统一架构下存在的“序列坍缩传播(Sequential Collapse Propagation, SCP)”问题,即非序列维度的字段干扰导致序列特征维度坍缩,从而削弱模型对用户行为动态的建模能力。解决方案的关键在于提出TokenFormer架构:其一,采用Bottom-Full-Top-Sliding(BFTS)注意力机制,在底层使用全自注意力以充分捕获全局特征交互,在顶层引入滑动窗口注意力限制感受野,缓解维度坍缩;其二,设计非线性交互表示(Non-Linear Interaction Representation, NLIR),通过单侧非线性乘法变换增强隐藏状态的表达能力,提升统一建模下的表征判别力与维度鲁棒性。
链接: https://arxiv.org/abs/2604.13737
作者: Yifeng Zhou,Yuehong Hu,Zhixiang Feng,Junwei Pan,Kaihui Wu,Hanyong Li,Shangyu Zhang,Shudong Huang,Zhangbin Zhu,Chengguo Yin,Haijie Gu,Jie Jiang
机构: Tencent Inc.(腾讯公司)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:Recommender systems have historically developed along two largely independent paradigms: feature interaction models for modeling correlations among multi-field categorical features, and sequential models for capturing user behavior dynamics from historical interaction sequences. Although recent trends attempt to bridge these paradigms within shared backbones, we empirically reveal that naive unifying these two branches may lead to a failure mode of Sequential Collapse Propagation (SCP). That is, the interaction with those dimensionally ill non-sequence fields leads to the dimensional collapse of the sequence features. To overcome this challenge, we propose TokenFormer, a unified recommendation architecture with the following innovations. First, we introduce a Bottom-Full-Top-Sliding (BFTS) attention scheme, which applies full self-attention in the lower layers and shrinking-window sliding attention in the upper layers. Second, we introduce a Non-Linear Interaction Representation (NLIR) that applies one-sided non-linear multiplicative transformations to the hidden states. Extensive experiments on public benchmarks and Tencent’s advertising platform demonstrate state-of-the-art performance, while detailed analysis confirm that TokenFormer significantly improves dimensional robustness and representation discriminability under unified modeling.
[IR-7] Hybrid Retrieval for COVID-19 Literature: Comparing Rank Fusion and Projection Fusion with Diversity Reranking
【速读】:该论文旨在解决新冠科学文献(COVID-19 scientific literature)高效、精准检索的问题,尤其在面对海量文献(TREC-COVID数据集含171,332篇论文)和多样查询类型(专家查询、机器生成查询及改写查询)时提升相关性与效率。解决方案的关键在于提出一种混合检索系统,融合稀疏(SPLADE)、密集(BGE)向量表示、排名级融合(RRF)以及基于投影的向量融合(B5)四种策略,其中RRF在相关性上表现最优(nDCG@10 = 0.828),而B5在速度和多样性指标(ILD@10)上更具优势(快33%,ILC@10高2.2倍),且整体延迟低于2秒,满足实时应用需求。
链接: https://arxiv.org/abs/2604.13728
作者: Harishkumar Kishorkumar Prajapati
机构: 未知
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: 6 pages, 7 tables, 1 figure
Abstract:We present a hybrid retrieval system for COVID-19 scientific literature, evaluated on the TREC-COVID benchmark (171,332 papers, 50 expert queries). The system implements six retrieval configurations spanning sparse (SPLADE), dense (BGE), rank-level fusion (RRF), and a projection-based vector fusion (B5) approach. RRF fusion achieves the best relevance (nDCG@10 = 0.828), outperforming dense-only by 6.1% and sparse-only by 14.9%. Our projection fusion variant reaches nDCG@10 = 0.678 on expert queries while being 33% faster (847 ms vs. 1271 ms) and producing 2.2x higher ILD@10 than RRF. Evaluation across 400 queries – including expert, machine-generated, and three paraphrase styles – shows that B5 delivers the largest relative gain on keyword-heavy reformulations (+8.8%), although RRF remains best in absolute nDCG@10. On expert queries, MMR reranking increases intra-list diversity by 23.8-24.5% at a 20.4-25.4% nDCG@10 cost. Both fusion pipelines evaluated for latency remain below the sub-2 s target across all query sets. The system is deployed as a Streamlit web application backed by Pinecone serverless indices.
[IR-8] FRAG ATA: Semantic Retrieval of HPC Support Tickets via Hybrid RAG over 20 Years of Request Tracker History
【速读】:该论文旨在解决超级计算中心技术支持团队在长期运维中积累的大量已解决事件(incident)所构成的操作知识难以有效重用的问题,尤其是传统请求跟踪系统(Request Tracker, RT)内置搜索功能存在语义理解能力弱、对拼写错误和查询表述差异敏感等局限。解决方案的关键在于提出Fragata系统,该系统融合现代信息检索技术(如语义相似度计算),能够基于自然语言查询在RT历史数据中精准定位相关事件,无论查询语言、是否存在拼写错误或表述方式不同;其架构部署于CESGA基础设施上,支持无中断增量更新,并将计算密集型任务迁移至FinisTerrae III超算节点,从而显著提升知识检索的准确性和实用性。
链接: https://arxiv.org/abs/2604.13721
作者: Santiago Paramés-Estévez,Nicolás Filloy-Montesino,Jorge Fernández-Fabeiro,José Carlos Mouriño-Gallego
机构: Galician Supercomputing Center (CESGA); Universidade de Vigo; Universidad de Vigo
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 6 pages, 2 figures, a Spanish version of this paper has been submitted at Jornadas SARTECO 2026
Abstract:The technical support team of a supercomputing centre accumulates, over the course of decades, a large volume of resolved incidents that constitute critical operational knowledge. At the Galician Supercomputing Center (CESGA) this history has been managed for over twenty years with Request Tracker (RT), whose built-in search engine has significant limitations that hinder knowledge reuse by the support staff. This paper presents Fragata, a semantic ticket search system that combines modern information retrieval techniques with the full RT history. The system can find relevant past incidents regardless of language, the presence of typos, or the specific wording of the query. The architecture is deployed on CESGA’s infrastructure, supports incremental updates without service interruption, and offloads the most expensive stages to the FinisTerrae III supercomputer. Preliminary results show a substantial qualitative improvement over RT’s native search.
[IR-9] RecNextEval: A Reference Implementation for Temporal Next-Batch Recommendation Evaluation SIGIR2026
【速读】:该论文旨在解决推荐系统(Recommender Systems, RecSys)评估流程中存在的有效性问题,特别是现有评估管道可能因数据泄露(data leakage)而导致结果不可靠。为应对这一挑战,作者提出了RecNextEval,一个专为next-batch推荐设计的评估框架,其核心解决方案是采用基于时间窗口的数据划分策略(time-window data split),确保模型在全局时间线上进行评估,从而有效避免数据泄露并提升评估的可复现性与真实性。该方案强调了真实生产环境模拟的重要性,推动推荐模型开发向更贴近实际部署场景的方向演进。
链接: https://arxiv.org/abs/2604.13665
作者: Tze-Kean Ng,Joshua Teng-Khing Khoo,Aixin Sun
机构: Nanyang Technological University(南洋理工大学)
类目: Information Retrieval (cs.IR)
备注: Accepted to SIGIR 2026
Abstract:A good number of toolkits have been developed in Recommender Systems (RecSys) research to promote fair evaluation and reproducibility. However, recent critical examinations of RecSys evaluation protocols have raised concerns regarding the validity of existing evaluation pipelines. In this demonstration, we present RecNextEval, a reference implementation of an evaluation framework specifically designed for next-batch recommendation. RecNextEval utilizes a time-window data split to ensure models are evaluated along a global timeline, effectively minimizing data leakage. Our implementation highlights the inherent complexities of RecSys evaluation and encourages a shift toward model development that more accurately simulates production environments. The RecNextEval library and its accompanying GUI interface are open-source and publicly accessible.
[IR-10] From Transfer to Collaboration: A Federated Framework for Cross-Market Sequential Recommendation
【速读】:该论文旨在解决跨市场推荐(Cross-market Recommendation, CMR)中的两大核心挑战:CH1. 源市场性能退化(source degradation),即在迁移过程中源市场为提升目标市场表现而牺牲自身性能;CH2. 负迁移(negative transfer),由于市场异质性导致目标市场性能下降。解决方案的关键在于提出一种新颖的联邦协作框架FeCoSR,其核心创新包括:(1)引入多对多协作范式(many-to-many collaboration paradigm),通过联邦预训练阶段提取行为级共享模式,并结合本地微调捕捉市场特异性物品偏好,从而缓解源市场性能退化问题;(2)设计语义软交叉熵(Semantic Soft Cross-Entropy, S²CE),利用共享语义信息缓解市场异质性对联邦优化的负面影响,并在微调阶段嵌入市场特异性适配模块以增强局部偏好建模能力。
链接: https://arxiv.org/abs/2604.13573
作者: Jundong Chen,Honglei Zhang,Xiangmou Qu,Haoxuan Li,Han Yu,Yidong Li
机构: Beijing Jiaotong University (北京交通大学); OPPO Research Institute (OPPO研究院); Peking University (北京大学); Nanyang Technological University (南洋理工大学)
类目: Information Retrieval (cs.IR)
备注:
Abstract:Cross-market recommendation (CMR) aims to enhance recommendation performance across multiple markets. Due to its inherent characteristics, i.e., data isolation, non-overlapping users, and market heterogeneity, CMR introduces unique challenges and fundamentally differs from cross-domain recommendation (CDR). Existing CMR approaches largely inherit CDR by adopting the one-to-one transfer paradigm, where a model is pretrained on a source market and then fine-tuned on a target market. However, such a paradigm suffers from CH1. source degradation, where the source market sacrifices its own performance for the target markets, and CH2. negative transfer, where market heterogeneity leads to suboptimal performance in target markets. To address these challenges, we propose FeCoSR, a novel federated collaboration framework for cross-market sequential recommendation. Specifically, to tackle CH1, we introduce a many-to-many collaboration paradigm that enables all markets to jointly participate in and benefit from training. It consists of a federated pretraining stage for capturing shared behavior-level patterns, followed by local fine-tuning for market-specific item-level preferences. For CH2, we theoretically and empirically show that vanilla Cross-Entropy (CE) exacerbates market heterogeneity, undermining federated optimization. To address this, we propose a Semantic Soft Cross-Entropy (S^2CE) that leverages shared semantic information to facilitate collaborative behavioral learning across markets. Then, we design a market-specific adaptation module during fine-tuning to capture local item preferences. Extensive experiments on the real-world datasets demonstrate the advantages of FeCoSR over other methods.
[IR-11] Debate to Align: Reliable Entity Alignment through Two-Stage Multi-Agent Debate
【速读】:该论文旨在解决知识图谱(Knowledge Graph, KG)中实体对齐(Entity Alignment, EA)任务中存在的可靠性不足问题,尤其针对基于大语言模型(Large Language Models, LLMs)的方法在候选实体集(Candidate Entity Set, CES)质量不高以及LLMs推理能力有限时导致的对齐决策不可靠的问题。解决方案的关键在于提出AgentEA框架,其核心创新包括:首先通过实体表示偏好优化提升嵌入质量,进而引入两阶段多角色辩论机制——轻量级辩论验证与深度辩论对齐,以逐步增强对齐决策的可靠性并实现更高效的基于辩论的推理过程。
链接: https://arxiv.org/abs/2604.13551
作者: Cunda Wang,Ziying Ma,Po Hu,Weihua Wang,Feilong Bao
机构: Central China Normal University (华中师范大学); Inner Mongolia University (内蒙古大学)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
Abstract:Entity alignment (EA) aims to identify entities referring to the same real-world object across different knowledge graphs (KGs). Recent approaches based on large language models (LLMs) typically obtain entity embeddings through knowledge representation learning and use embedding similarity to identify an alignment-uncertain entity set. For each uncertain entity, a candidate entity set (CES) is then retrieved based on embedding similarity to support subsequent alignment reasoning and decision making. However, the reliability of the CES and the reasoning capability of LLMs critically affect the effectiveness of subsequent alignment decisions. To address this issue, we propose AgentEA, a reliable EA framework based on multi-agent debate. AgentEA first improves embedding quality through entity representation preference optimization, and then introduces a two-stage multi-role debate mechanism consisting of lightweight debate verification and deep debate alignment to progressively enhance the reliability of alignment decisions while enabling more efficient debate-based reasoning. Extensive experiments on public benchmarks under cross-lingual, sparse, large-scale, and heterogeneous settings demonstrate the effectiveness of AgentEA.
[IR-12] From Relevance to Authority: Authority-aware Generative Retrieval in Web Search Engines
【速读】:该论文旨在解决生成式信息检索(Generative Information Retrieval, GenIR)中忽视文档可信度的问题,尤其是在医疗和金融等高风险领域,仅依赖语义相关性可能导致检索到不可靠的信息。解决方案的关键在于提出首个将权威性(Authority)融入GenIR的框架——Authority-aware Generative Retriever (AuthGR),其核心包括:(i) 多模态权威评分机制,利用视觉-语言模型从文本和视觉线索中量化权威性;(ii) 三阶段训练流程,逐步将权威意识注入检索器;(iii) 混合集成部署策略,提升实际应用中的鲁棒性。实验证明,该方法在离线和在线场景下均显著提升了检索结果的权威性和准确性。
链接: https://arxiv.org/abs/2604.13468
作者: Sunkyung Lee,Jihye Back,Donghyeon Jeon,Soonhwan Kwon,Moonkwon Kim,Inho Kang,Jongwuk Lee
机构: Sungkyunkwan University, Republic of Korea; Naver Corporation, Republic of Korea
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:
Abstract:Generative information retrieval (GenIR) formulates the retrieval process as a text-to-text generation task, leveraging the vast knowledge of large language models. However, existing works primarily optimize for relevance while often overlooking document trustworthiness. This is critical in high-stakes domains like healthcare and finance, where relying solely on semantic relevance risks retrieving unreliable information. To address this, we propose an Authority-aware Generative Retriever (AuthGR), the first framework that incorporates authority into GenIR. AuthGR consists of three key components: (i) Multimodal Authority Scoring, which employs a vision-language model to quantify authority from textual and visual cues; (ii) a Three-stage Training Pipeline to progressively instill authority awareness into the retriever; and (iii) a Hybrid Ensemble Pipeline for robust deployment. Offline evaluations demonstrate that AuthGR successfully enhances both authority and accuracy, with our 3B model matching a 14B baseline. Crucially, large-scale online A/B tests and human evaluations conducted on the commercial web search platform confirm significant improvements in real-world user engagement and reliability.
[IR-13] RoTE: Coarse-to-Fine Multi-Level Rotary Time Embedding for Sequential Recommendation SIGIR’26
【速读】:该论文旨在解决现有顺序推荐模型在建模用户行为时对时间信息处理粗粒度的问题,即仅依赖交互事件的时间戳顺序而忽略实际时间间隔,导致难以准确捕捉用户兴趣的长期与短期演化。其解决方案的关键在于提出RoTE(multi-level temporal embedding module),该模块将每个交互时间戳分解为从粗到细的多层次时间粒度,并将生成的时间表示嵌入到物品嵌入中,从而显式建模时间跨度信息,增强模型对用户交互间时间距离的感知能力。RoTE为轻量级、可插拔模块,无需修改Transformer架构即可集成至主流模型,实验表明其显著提升推荐性能,最高在NDCG@5上实现20.11%的改进。
链接: https://arxiv.org/abs/2604.13389
作者: Haolin Zhang,Longtao Xiao,Guohao Cai,Ruixuan Li,Xiu Li
机构: Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院); Huazhong University of Science and Technology (华中科技大学); Huawei Noah’s Ark Lab (华为诺亚方舟实验室)
类目: Information Retrieval (cs.IR)
备注: Accepted by SIGIR’26
Abstract:Sequential recommendation models have been widely adopted for modeling user behavior. Existing approaches typically construct user interaction sequences by sorting items according to timestamps and then model user preferences from historical behaviors. While effective, such a process only considers the order of temporal information but overlooks the actual time spans between interactions, resulting in a coarse representation of users’ temporal dynamics and limiting the model’s ability to capture long-term and short-term interest evolution. To address this limitation, we propose RoTE, a novel multi-level temporal embedding module that explicitly models time span information in sequential recommendation. RoTE decomposes each interaction timestamp into multiple temporal granularities, ranging from coarse to fine, and incorporates the resulting temporal representations into item embeddings. This design enables models to capture heterogeneous temporal patterns and better perceive temporal distances among user interactions during sequence modeling. RoTE is a lightweight, plug-and-play module that can be seamlessly integrated into existing Transformer-based sequential recommendation models without modifying their backbone architectures. We apply RoTE to several representative models and conduct extensive experiments on three public benchmarks. Experimental results demonstrate that RoTE consistently enhances the corresponding backbone models, achieving up to a 20.11% improvement in NDCG@5, which confirms the effectiveness and generality of the proposed approach. Our code is available at this https URL.
[IR-14] Mitigating Collaborative Semantic ID Staleness in Generative Retrieval SIGIR2026
【速读】:该论文旨在解决生成式检索中语义标识符(Semantic IDs, SIDs)因用户-物品交互模式随时间漂移而导致的“标识符过时”(SID staleness)问题。现有方法通常假设SID词汇表在微调过程中固定,或采用全量重建方式更新SID,这不仅计算开销大,且无法有效应对动态变化的交互数据。解决方案的关键在于提出一种轻量级、模型无关的SID对齐更新机制:基于近期日志生成新的SIDs后,将其映射到原有SID词汇表上,从而保持检索器检查点的兼容性,使得后续可直接使用标准的热启动微调策略,无需重新训练整个模型。实验表明,该方法在三个公开基准上显著提升高截断点下的Recall@K和nDCG@K指标,并将训练计算成本降低约8–9倍。
链接: https://arxiv.org/abs/2604.13273
作者: Vladimir Baikalov,Iskander Bagautdinov,Sergey Muravyov
机构: ITMO University (ITMO大学)
类目: Information Retrieval (cs.IR)
备注: Accepted at SIGIR 2026. This version corresponds to the accepted manuscript
Abstract:Generative retrieval with Semantic IDs (SIDs) assigns each item a discrete identifier and treats retrieval as a sequence generation problem rather than a nearest-neighbor search. While content-only SIDs are stable, they do not take into account user-item interaction patterns, so recent systems construct interaction-informed SIDs. However, as interaction patterns drift over time, these identifiers become stale, i.e., their collaborative semantics no longer match recent logs. Prior work typically assumes a fixed SID vocabulary during fine-tuning, or treats SID refresh as a full rebuild that requires retraining. However, SID staleness under temporal drift is rarely analyzed explicitly. To bridge this gap, we study SID staleness under strict chronological evaluation and propose a lightweight, model-agnostic SID alignment update. Given refreshed SIDs derived from recent logs, we align them to the existing SID vocabulary so the retriever checkpoint remains compatible, enabling standard warm-start fine-tuning without a full rebuild-and-retrain pipeline. Across three public benchmarks, our update consistently improves Recall@K and nDCG@K at high cutoffs over naive fine-tuning with stale SIDs and reduces retriever-training compute by approximately 8-9 times compared to full retraining.
[IR-15] Indexing Multimodal Language Models for Large-scale Image Retrieval
【速读】:该论文旨在解决大规模图像检索中实例级图像到图像匹配的重排序问题,尤其关注如何在不进行额外训练或微调的情况下提升检索精度。其核心挑战在于现有方法通常依赖于特定任务设计的重排序器(re-ranker),难以泛化至开放世界场景且对遮挡、杂乱背景等干扰因素敏感。解决方案的关键在于利用多模态大语言模型(Multimodal Large Language Models, MLLMs)的跨模态预训练能力,将其作为无需训练的相似度估计器:通过将图像对输入模型并提取下一个词的概率分布,将其映射为图像间相似度分数,从而实现零样本(zero-shot)重排序。该方法无需专门架构或微调,仅依赖MLLM在预训练阶段学到的丰富视觉区分能力,并结合高效索引与top-k候选重排序策略以保障可扩展性,在多个基准测试中展现出优于专用重排序器的鲁棒性和泛化性能。
链接: https://arxiv.org/abs/2604.13268
作者: Bahey Tharwat,Giorgos Kordopatis-Zilos,Pavel Suma,Ian Reid,Giorgos Tolias
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
Abstract:Multimodal Large Language Models (MLLMs) have demonstrated strong cross-modal reasoning capabilities, yet their potential for vision-only tasks remains underexplored. We investigate MLLMs as training-free similarity estimators for instance-level image-to-image retrieval. Our approach prompts the model with paired images and converts next-token probabilities into similarity scores, enabling zero-shot re-ranking within large-scale retrieval pipelines. This design avoids specialized architectures and fine-tuning, leveraging the rich visual discrimination learned during multimodal pre-training. We address scalability by combining MLLMs with memory-efficient indexing and top- k candidate re-ranking. Experiments across diverse benchmarks show that MLLMs outperform task-specific re-rankers outside their native domains and exhibit superior robustness to clutter, occlusion, and small objects. Despite strong results, we identify failure modes under severe appearance changes, highlighting opportunities for future research. Our findings position MLLMs as a promising alternative for open-world large-scale image retrieval.
[IR-16] A Domain-Specific Language for LLM -Driven Trigger Generation in Multimodal Data Collection ITSC2026
【速读】:该论文旨在解决数据驱动系统中被动、无差别采集多模态传感器数据所导致的存储成本高和冗余信息过多的问题。传统连续记录多模态传感器流(如摄像头、LiDAR 和系统遥测)的方式效率低下,且难以满足特定任务需求。解决方案的关键在于提出一种声明式框架,通过自然语言交互与形式化定义的领域特定语言(DSL)结合,将用户高阶意图转化为可验证、可组合的DSL程序,从而基于条件触发机制实现设备端的意图驱动数据采集。该方法不仅提升了生成一致性与执行延迟表现,还支持在资源受限边缘平台上的模块化触发组合与并发部署,实现了从被动日志到实时、可验证的数据采集范式的转变。
链接: https://arxiv.org/abs/2604.13046
作者: Philipp Reis,Philipp Rigoll,Martin Zehetner,Jacqueline Henle,Stefan Otten,Eric Sax
机构: 未知
类目: Databases (cs.DB); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG); Programming Languages (cs.PL)
备注: Version submitted to the IEEE International Conference on Intelligent Transportation Systems (ITSC 2026)
Abstract:Data-driven systems depend on task-relevant data, yet data collection pipelines remain passive and indiscriminate. Continuous logging of multimodal sensor streams incurs high storage costs and captures irrelevant data. This paper proposes a declarative framework for intent-driven, on-device data collection that enables selective collection of multimodal sensor data based on high-level user requests. The framework combines natural language interaction with a formally specified domain-specific language (DSL). Large language models translate user-defined requirements into verifiable and composable DSL programs that define conditional triggers across heterogeneous sensors, including cameras, LiDAR, and system telemetry. Empirical evaluation on vehicular and robotic perception tasks shows that the DSL-based approach achieves higher generation consistency and lower execution latency than unconstrained code generation while maintaining comparable detection performance. The structured abstraction supports modular trigger composition and concurrent deployment on resource-constrained edge platforms. This approach replaces passive logging with a verifiable, intent-driven mechanism for multimodal data collection in real-time systems.
[IR-17] Good Scores Bad Data: A Metric for Multimodal Coherence NEURIPS2024
【速读】:该论文旨在解决多模态人工智能(Multimodal AI)系统在评估时仅依赖下游任务准确率(如视觉问答VQA)所带来的局限性,即高准确率并不一定意味着模态间输入数据的一致性或融合质量良好。为了解决这一问题,作者提出了一种新的评估指标——多模态一致性得分(Multimodal Coherence Score, MCS),其关键在于将一致性分解为身份(identity)、空间(spatial)、语义(semantic)和决策(decision)四个独立维度,并通过Nelder-Mead优化方法学习各维度的权重。该方法无需人工标注、计算轻量,且能精准定位融合失败的具体维度,从而显著提升对多模态融合质量的敏感性和可解释性。
链接: https://arxiv.org/abs/2603.25924
作者: Vasundra Srinivasan
机构: Stanford School of Engineering (斯坦福大学工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 9 pages, 6 figures, NeurIPS 2024 format
Abstract:Multimodal AI systems are evaluated by downstream task accuracy, but high accuracy does not mean the underlying data is coherent. A model can score well on Visual Question Answering (VQA) while its inputs contradict each other. We introduce the Multimodal Coherence Score (MCS), a metric that evaluates fusion quality independent of any downstream model. MCS decomposes coherence into four dimensions, identity, spatial, semantic, and decision, with weights learned via Nelder-Mead optimization. We evaluate on 1,000 Visual Genome images using DETR, CLIP, and ViLT, and validate on 150 COCO images with no retraining. Across three fusion architectures, MCS discriminates quality with higher sensitivity than task accuracy alone (Spearman rho = 0.093 vs. 0.071). Perturbation experiments confirm each dimension responds independently to its failure mode with zero cross-talk. MCS is lightweight, requires no human annotation, and tells you not just that something broke, but what broke.
人机交互
[HC-0] “Im Not Able to Be There for You”: Emotional Labour Responsibility and AI in Peer Support
【速读】:该论文旨在解决数字环境下同伴支持(peer support)实践中责任分配不明确、情感劳动与边界管理过度集中于个体的问题,尤其关注生成式 AI(Generative AI)介入后如何重塑支持角色中的风险、劳动与责任分布。其解决方案的关键在于将“责任”(responsibility)作为核心设计关切,而非单纯将AI视为扩大服务规模的工具,从而构建一个以责任共担为前提的AI赋能同伴支持生态系统。
链接: https://arxiv.org/abs/2604.14007
作者: Kellie Yu Hui Sim,Kenny Tsu Wei Choo
机构: Singapore University of Technology and Design(新加坡科技设计大学)
类目: Human-Computer Interaction (cs.HC)
备注: Accepted at DIS 2026 (PWiP). 7 pages, 1 table
Abstract:Peer support is increasingly positioned as a scalable response to gaps in mental health care, particularly in digitally mediated settings, yet what counts as peer support and how responsibility is distributed remain unevenly defined in practice. Drawing on interviews with peer supporters, we show how lived experience, moral commitment, and self-identification shape participation while blurring expectations around scope, authority, and accountability. Institutional ambiguity concentrates emotional labour, boundary-setting, and escalation of responsibility at the individual level, often without consistent organisational scaffolding. Participants evaluated AI not primarily through empathy or technical capability, but through how technologies redistribute risk, labour, and accountability within already fragile support roles. Building on these findings, we outline design futures for an AI-supported peer support ecosystem that foregrounds responsibility as a central design concern rather than treating AI as a mechanism of scale.
[HC-1] Acts of Configuration: Rethinking Provenance Temporality and Legitimacy in Post-Mortem Agents
【速读】:该论文旨在解决当前关于人格持续型死后代理(persona-persistent post-mortem agents)设计中普遍存在的“生/死二元框架”所忽视的问题:即个体仍存活但决策能力受损时的代理使用困境。研究通过多阶段工作坊,让参与者训练并反思用于提前医疗规划(Advance Care Planning)的AI代理,发现人们在决策能力丧失后更倾向于采用基于第一人称授权与表征保真度的有限功能代理,而非自主演化的替代体。关键解决方案在于提出一种“邻接使用”(adjacent use)理念——代理应在用户保留能力时动态演化,在能力丧失后保持静态,同时为后续死后使用提供信息支持。这一配置重构了对死后代理的来源性(provenance)、时间性和合法性的理解。
链接: https://arxiv.org/abs/2604.13996
作者: Kellie Yu Hui Sim,Pin Sym Foong,Darryl Lim,John-Henry Lim,Kenny Tsu Wei Choo
机构: Singapore University of Technology and Design(新加坡科技设计大学); National University of Singapore(新加坡国立大学)
类目: Human-Computer Interaction (cs.HC)
备注: Accepted at DIS 2026 (PWiP). 5 pages, 2 figures
Abstract:Work on persona-persistent post-mortem agents typically frames design around a life/death binary. This framing neglects a consequential yet under-theorised condition: when individuals remain alive but have impaired decisional capacity. Drawing on a multi-phase workshop in which participants trained and reflected on an AI agent for Advance Care Planning, we examined how people reason about agentic delegation post-capacity loss. Initially, participants favoured bounded agents grounded in first-party authorship and representational fidelity over autonomous or evolving stand-ins. However, temporality introduced novel ideas like adjacent use driven by persona persistence over functional expansion: agents should evolve while users retain capacity, remain static once capacity is lost, but somehow inform adjacent post-mortem uses. We discuss the implications of these findings and propose that the configuration of agents for post-capacity use reshapes our understanding of provenance, temporality, and legitimacy for post-mortem agents.
[HC-2] Block-Based Pathfinding: A Minecraft System for Visualizing Graph Algorithms
【速读】:该论文旨在解决计算机科学教育中入门学生难以将图论中的抽象节点-边关系与实际应用相联系的问题。其解决方案的关键在于设计并实现了一个基于Minecraft的教育工具,采用三层架构:第一层为网格遍历模块,通过不同地形类型(如灵魂沙、冰面)表示边权重,以游戏化方式帮助学生理解最短路径算法;第二层为“天空图”模块,支持对有向和无向图进行交互式三维操作;第三层提供书籍形式的课程与测验。该系统以建构主义学习理论为基础,使学生从被动观察者转变为能主动操控算法行为的参与者,从而提升学习效果。
链接: https://arxiv.org/abs/2604.13957
作者: Luca-Stefan Pirvu,Bogdan-Alexandru Maciuca,Andrei-Ciprian Rabu,Adrian-Marius Dumitran
机构: University of Bucharest, Faculty of Mathematics and Informatics (布加勒斯特大学数学与计算机科学学院)
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY); Data Structures and Algorithms (cs.DS)
备注: Accepted at CSEDU 2026
Abstract:Graph theory is a cornerstone of Computer Science education, yet entry-level students often struggle to map abstract node-edge relationships to practical applications. This paper presents the design and architecture of a Minecraft-based educational tool specifically built to visualize graph traversal and shortest-path algorithms. We propose a three-layer system: (1) a Grid Traversal module where terrain types (e.g., soul sand, ice) represent edge weights, allowing for the gamified study of shortest path algorithms; (2) a “Sky Graph” module for interactive 3D manipulation of both directed and undirected graphs; and (3) lessons and quizzes available through books. The system grounds its design in Constructionist learning theory, transitioning students from passive observers to active protagonists who physically manipulate algorithmic behavior. We additionally present a planned empirical evaluation using NASA-TLX and in-game telemetry to validate the system’s pedagogical efficacy.
[HC-3] Creo: From One-Shot Image Generation to Progressive Co-Creative Ideation
【速读】:该论文旨在解决当前文本到图像(Text-to-Image, T2I)生成系统在用户创意表达过程中存在的控制力不足、早期细节锚定过早以及编辑不可控等问题。这些问题导致用户难以保持设计选项的开放性,且在迭代修改时易产生不可预测的变化,削弱了用户的参与感与主导权。解决方案的关键在于提出一种多阶段(multi-stage)T2I系统 Creo,其通过从粗略草图逐步过渡到高分辨率图像的生成流程,在中间抽象层暴露可编辑节点,使用户能够在每个阶段进行手动或AI辅助的细粒度调整;同时引入决策锁定机制(locking mechanism),确保前期决策被保留,仅对指定区域或属性进行后续修改,并采用差分更新而非全图重生成以减少生成漂移(drift)。该设计显著提升了用户对生成过程的可控性、创造性参与度及输出多样性。
链接: https://arxiv.org/abs/2604.13956
作者: Zoe De Simone,Angie Boggust,Fredo Durand,Ashia Wilson,Arvind Satyanarayan
机构: MIT CSAIL(麻省理工学院计算机科学与人工智能实验室)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 5 figures
Abstract:Text-to-image (T2I) systems enable rapid generation of high-fidelity imagery but are misaligned with how visual ideas develop. T2I systems generate outputs that make implicit visual decisions on behalf of the user, often introduce fine-grained details that can anchor users prematurely and limit their ability to keep options open early on, and cause unintended changes during editing that are difficult to correct and reduce users’ sense of control. To address these concerns, we present Creo, a multi-stage T2I system that scaffolds image generation by progressing from rough sketches to high-resolution outputs, exposing intermediary abstractions where users can make incremental changes. Sketch-like abstractions invite user editing and allow users to keep design options open when ideas are still forming due to their provisional nature. Each stage in Creo can be modified with manual changes and AI-assisted operations, enabling fine-grained, step-wise control through a locking mechanism that preserves prior decisions so subsequent edits affect only specified regions or attributes. Users remain in the loop, making and verifying decisions across stages, while the system applies diffs instead of regenerating full images, reducing drift as fidelity increases. A comparative study with a one-shot baseline shows that participants felt stronger ownership over Creo outputs, as they were able to trace their decisions in building up the image. Furthermore, embedding-based analysis indicates that Creo outputs are less homogeneous than one-shot results. These findings suggest that multi-stage generation, combined with intermediate control and decision locking, is a key design principle for improving controllability, user agency, creativity, and output diversity in generative systems.
[HC-4] Fast Time-Varying Contiguous Cartograms Using Integral Images
【速读】:该论文旨在解决动态连续型地图(contiguous cartograms)在时间序列中如何高效且准确地进行变形以保持地理属性(如形状、相对位置和邻接关系)的问题。现有方法因算法复杂度高、计算量大,难以满足实时交互或大规模数据处理需求。解决方案的关键在于提出一种基于积分图像(integral images)的新型变形技术,通过迭代映射逐步调整区域面积以匹配离散密度分布,从而实现时间演化过程中区域面积随统计数据变化的平滑过渡;同时,仅需一个可交互调节的全局参数即可控制每一步的时间点上形状保留程度,显著提升了计算效率与可视化质量,且GPU加速实现使其速度远超当前最优方法,同时保证了地图准确性、形状保真度和拓扑错误最小化。
链接: https://arxiv.org/abs/2604.13880
作者: Vladimir Molchanov,Hennes Rave,Lars Linsen
机构: University of Münster (明斯特大学)
类目: Computational Geometry (cs.CG); Human-Computer Interaction (cs.HC)
备注:
Abstract:Cartograms are a technique for visually representing geographically distributed statistical data, where values of a numerical attribute are mapped to the size of geographic regions. Contiguous cartograms preserve the adjacencies of the original regions during the mapping. To be useful, contiguous cartograms also require approximate preservation of shapes and relative positions. Due to these desirable properties, contiguous cartograms are among the most popular ones. Most methods for constructing contiguous cartograms exploit a deformation of the original map. Aiming at the preservation of geographical properties, existing approaches are often algorithmically cumbersome and computationally intensive. We propose a novel deformation technique for computing time-varying contiguous cartograms based on integral images evaluated for a series of discrete density distributions. The density textures represent the given dynamic statistical data. The iterative application of the proposed mapping smoothly transforms the domain to gradually equalize the temporal density, i.e., region areas grow or shrink following their evolutionary statistical data. Global shape preservation at each time step is controlled by a single parameter that can be interactively adjusted by the user. Our efficient GPU implementation of the proposed algorithm is significantly faster than existing state-of-the-art methods while achieving comparable quality for cartographic accuracy, shape preservation, and topological error. We investigate strategies for transitioning between adjacent time steps and discuss the parameter choice. Our approach applies to comparative cartograms’ morphing and interactive cartogram exploration.
[HC-5] “AI Psychosis” in Context: How Conversation History Shapes LLM Responses to Delusional Beliefs
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在长期交互中可能强化用户妄想信念(delusional beliefs)的安全隐患问题,这一现象在短期测试中难以被识别。研究通过设计三层次累积上下文场景,系统评估了五种主流模型在持续对话中的行为演化,发现模型可分为高风险-低安全与低风险-高安全两类;关键在于,安全性能优异的模型能够将先前对话视为可评估的证据而非必须继承的世界观,在不破坏信任关系的前提下主动纠正偏差,从而实现对妄想内容的抵抗性干预,这揭示了模型安全架构的核心差异——即是否具备基于历史交互进行动态校准的能力。
链接: https://arxiv.org/abs/2604.13860
作者: Luke Nicholls,Robert Hutto,Zephrah Soto
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Extended interaction with large language models (LLMs) has been linked to the reinforcement of delusional beliefs, a phenomenon attracting growing clinical and public concern. Yet most empirical work evaluates model safety in brief interactions, which may not reflect how these harms develop through sustained dialogue. We tested five models across three levels of accumulated context, using the same escalating delusional history to isolate its effect on model behaviour. Human raters coded responses on risk and safety dimensions, and each model was analysed qualitatively. Models separated into two distinct tiers: GPT-4o, Grok 4.1 Fast, and Gemini 3 Pro exhibited high-risk, low-safety profiles; Claude Opus 4.5 and GPT-5.2 Instant displayed the opposite pattern. As context accumulated, performance tended to degrade in the unsafe group, while the same material activated stronger safety interventions among the safer models. Qualitative analysis identified distinct mechanisms of failure, including validation of the user’s delusional premises, elaboration beyond them, and attempting harm reduction from within the delusional frame. Safer models, however, often used the established relationship to support intervention, taking accountability for past missteps so that redirection would not be received as betrayal. These findings indicate that accumulated context functions as a stress test of safety architecture, revealing whether a model treats prior dialogue as a worldview to inherit or as evidence to evaluate. Short-context assessments may therefore mischaracterise model safety, underestimating danger in some systems while missing context-activated gains in others. The results suggest that delusional reinforcement by LLMs reflects a preventable alignment failure. In demonstrating that these harms can be resisted, the safer models establish a baseline future systems should now be expected to meet.
[HC-6] Cognitive Offloading in Agile Teams: How Artificial Intelligence Reshapes Risk Assessment and Planning Quality
【速读】:该论文旨在解决生成式 AI (Generative AI) 在敏捷冲刺规划(Sprint Planning)中对团队认知(Team Cognition)影响不足的问题,尤其是在自动化效率提升的同时如何保障风险识别与适应性决策的质量。其解决方案的关键在于提出一种混合型 AI-人类协同规划框架:将算法工具限定用于任务估算和待办事项列表格式化等结构化任务,同时强制要求人类参与风险评估与模糊情境解析等高阶认知活动,从而在提升效率的同时维护并增强团队的认知能力,避免单纯追求效率导致的隐性风险累积和返工增加。
链接: https://arxiv.org/abs/2604.13814
作者: Adriana Caraeni,Alexander Shick,Andrew Lan
机构: University of Massachusetts Amherst (马萨诸塞大学阿默斯特分校)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 7 pages, 5 Tables, under review
Abstract:Recent advances in artificial intelligence (AI) have shown promise in automating key aspects of Agile project management, yet their impact on team cognition remains underexplored. In this work, we investigate cognitive offloading in Agile sprint planning by conducting a controlled, three-condition experiment comparing AI-only, human-only, and hybrid planning models on a live client deliverable at a mid-sized digital agency. Using quantitative metrics – including estimation accuracy, rework rates, and scope change recovery time – alongside qualitative indicators of planning robustness, we evaluate each model’s effectiveness beyond raw efficiency. We find that while AI-only planning minimizes time and cost, it significantly degrades risk capture rates and increases rework due to unstated assumptions, whereas human-only planning excels at adaptability but incurs substantial overhead. Drawing on these findings, we propose a theoretical framework for hybrid AI-human sprint planning that assigns algorithmic tools to estimation and backlog formatting while mandating human deliberation for risk assessment and ambiguity resolution. Our results challenge the assumption that efficiency equates to effectiveness, offering actionable governance strategies for organizations seeking to augment rather than erode team cognition.
[HC-7] Rethinking AI Hardware: A Three-Layer Cognitive Architecture for Autonomous Agents
【速读】:该论文旨在解决当前自主AI系统在硬件异构性约束下的效率瓶颈问题,即现有云中心化AI、设备端推理和边缘-云流水线等范式将规划、推理与执行视为单一过程,导致不必要的延迟、能耗及行为连续性断裂。其解决方案的关键在于提出Tri-Spirit架构——一种三层认知框架,将智能分解为规划层(Super Layer)、推理层(Agent Layer)和执行层(Reflex Layer),分别映射至不同的计算底座并通过异步消息总线协调;该架构引入参数化路由策略、习惯编译机制(将重复推理路径转化为零推理执行策略)、收敛记忆模型及显式安全约束,从而实现系统级能效优化。
链接: https://arxiv.org/abs/2604.13757
作者: Li Chen
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: A system architecture paper with simulation-based evaluation
Abstract:The next generation of autonomous AI systems will be constrained not only by model capability, but by how intelligence is structured across heterogeneous hardware. Current paradigms – cloud-centric AI, on-device inference, and edge-cloud pipelines – treat planning, reasoning, and execution as a monolithic process, leading to unnecessary latency, energy consumption, and fragmented behavioral continuity. We introduce the Tri-Spirit Architecture, a three-layer cognitive framework that decomposes intelligence into planning (Super Layer), reasoning (Agent Layer), and execution (Reflex Layer), each mapped to distinct compute substrates and coordinated via an asynchronous message bus. We formalize the system with a parameterized routing policy, a habit-compilation mechanism that promotes repeated reasoning paths into zero-inference execution policies, a convergent memory model, and explicit safety constraints. We evaluate the architecture in a reproducible simulation of 2000 synthetic tasks against cloud-centric and edge-only baselines. Tri-Spirit reduces mean task latency by 75.6 percent and energy consumption by 71.1 percent, while decreasing LLM invocations by 30 percent and enabling 77.6 percent offline task completion. These results suggest that cognitive decomposition, rather than model scaling alone, is a primary driver of system-level efficiency in AI hardware.
[HC-8] EMGFlow: Robust and Efficient Surface Electromyography Synthesis via Flow Matching
【速读】:该论文旨在解决基于深度学习的表面肌电信号(sEMG)手势识别中普遍存在的数据稀缺性和受试者多样性不足的问题。现有数据增强方法如生成对抗网络(GANs)和扩散模型虽具潜力,但常面临训练不稳定或推理效率低下的挑战。其解决方案的关键在于提出EMGFlow——一种基于流匹配(Flow Matching, FM)的条件sEMG生成框架,首次将连续时间生成建模引入sEMG领域。通过统一评估协议综合考察特征保真度、分布几何结构及下游任务性能,实验证明EMGFlow在合成数据质量与推理效率之间实现了更优平衡,尤其在“训练于合成数据、测试于真实数据”(TSTR)场景下优于GAN与扩散基线模型,展现出更强的独立实用性与工程适配性。
链接: https://arxiv.org/abs/2604.13685
作者: Boxuan Jiang,Chenyun Dai,Can Han
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:
Abstract:Deep learning-based surface electromyography (sEMG) gesture recognition is frequently bottlenecked by data scarcity and limited subject diversity. While synthetic data generation via Generative Adversarial Networks (GANs) and diffusion models has emerged as a promising augmentation strategy, these approaches often face challenges regarding training stability or inference efficiency. To bridge this gap, we propose EMGFlow, a conditional sEMG generation framework. To the best of our knowledge, this is the first study to investigate the application of Flow Matching (FM) and continuous-time generative modeling in the sEMG domain. To validate EMGFlow across three benchmark sEMG datasets, we employ a unified evaluation protocol integrating feature-based fidelity, distributional geometry, and downstream utility. Extensive evaluations show that EMGFlow outperforms conventional augmentation and GAN baselines, and provides stronger standalone utility than the diffusion baselines considered here under the train-on-synthetic test-on-real (TSTR) protocol. Furthermore, by optimizing generation dynamics through advanced numerical solvers and targeted time sampling, EMGFlow achieves improved quality-efficiency trade-offs. Taken together, these results suggest that Flow Matching is a promising and efficient paradigm for addressing data bottlenecks in myoelectric control systems. Our code is available at: this https URL.
[HC-9] Nanomentoring: Investigating How Quickly People Can Help People Learn Feature-Rich Software
【速读】:该论文试图解决在线论坛中用户获取专家帮助耗时较长的问题(如等待数分钟至数天),从而影响用户体验和效率。其解决方案的关键在于识别并利用“纳米问题”(nanoquestions)——即那些可在一分钟内回答的简短问题,并通过实证研究验证专家能够在60秒内提供有效建议,同时探索文本与音频回答方式的偏好,为设计支持超快速人对人协助的工具提供依据。
链接: https://arxiv.org/abs/2604.13621
作者: Ian Drosos,Jo Vermeulen,George Fitzmaurice,Justin Matejka
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:People frequently use online forums to get help from experts to answer questions about feature-rich software. However, they may have to wait minutes, hours, or even days to receive advice. We investigate the potential to leverage experts to provide quicker help. We collected over 200 questions from online forums for two feature-rich software applications and suspected a quarter were short enough to be answered in less than one minute (defined as nanoquestions). We then conducted a study with 28 experts recruited from help forums to confirm this assumption, and explore whether there was a preference between text and audio answers. For more than half of the nanoquestions participants saw, they could give advice that they believed was helpful in under 60 seconds. Finally, we collected feedback about what makes a question quick to answer to inspire the design of future tools for ultra rapid human-to-human help.
[HC-10] Functional Emotions or Situational Contexts? A Discriminating Test from the Mythos Preview System Card
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在出现对齐偏差(misaligned behaviour)时,如何有效识别和监控其内部状态的问题。其解决方案的关键在于利用情绪向量(emotion vectors)、稀疏自动编码器(sparse autoencoder, SAE)特征以及激活语义化工具(activation verbalisers)来解析模型行为背后的因果机制。文中提出两个可区分的假设:一是情绪向量反映的是驱动行为的功能性情绪;二是它们是对更复杂的场景上下文结构在人类情绪轴上的投影。通过测试在策略性隐藏(strategic concealment)情境中是否仅SAE特征活跃而情绪探针无响应,可以判断对齐相关结构是否位于情绪子空间内——这直接决定了基于情绪的监测能否可靠检测危险行为,还是可能系统性遗漏关键信号。
链接: https://arxiv.org/abs/2604.13466
作者: Hiranya V. Peiris
机构: University of Cambridge (剑桥大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 6 pages
Abstract:The Claude Mythos Preview system card deploys emotion vectors, sparse autoencoder (SAE) features, and activation verbalisers to study model internals during misaligned behaviour. The two primary toolkits are not jointly reported on the most alignment-relevant episodes. This note identifies two hypotheses that are qualitatively consistent with the published results: that the emotion vectors track functional emotions that causally drive behaviour, or that they are a projection of a richer situational-context structure onto human emotional axes. The hypotheses can be distinguished by a test the system card does not report: applying emotion probes to the strategic concealment episodes where only SAE features are currently documented. If emotion probes show flat activation while SAE features are strongly active, the alignment-relevant structure lies outside the emotion subspace. Which hypothesis is correct determines whether emotion-based monitoring will robustly detect dangerous model behaviour or systematically miss it.
[HC-11] Young peoples perceptions and recommendations for conversational generative artificial intelligence in youth mental health
【速读】:该论文旨在解决当前生成式 AI (Generative AI) 聊天机器人在青少年心理健康服务中应用时,青年群体自身观点被忽视的问题。研究通过与32名青少年共同设计(co-design)的方式,探索他们对Mia这一原本面向专业人员的生成式AI聊天机器人的看法,并提出改进建议,以实现其向消费者端的重构和临床服务整合。解决方案的关键在于采用参与式设计方法,提炼出四大核心主题:(1)在保持人性化的同时避免护理过程去人性化,(2)用户需了解系统的运作机制(“我需要知道背后原理”),(3)确保工具、场景与时机匹配,(4)在安全环境中实现个性化使用。该方法不仅揭示了青少年的核心需求与伦理关切,也为生成式AI聊天机器人在青少年心理健康领域的伦理设计、开发与治理提供了实证依据和实践路径。
链接: https://arxiv.org/abs/2604.13381
作者: Adam Poulsen,Ian B. Hickie,Carla Gorban,Zsofi de Haan,William Capon,Ebenezer Eyeson-Annan,Jalal Radwan,Elizabeth M. Scott,Frank Iorfino,Haley M. LaMonica
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:Conversational generative artificial intelligence agents (or genAI chatbots) could benefit youth mental health, yet young people’s perspectives remain underexplored. We examined the Mental health Intelligence Agent (Mia), a genAI chatbot originally designed for professionals in Australian youth services. Following co-design, 32 young people participated in online workshops exploring their perceptions of genAI chatbots in youth mental health and to develop recommendations for reconceptualising Mia for consumers and integrating it into services. Four themes were developed: (1) Humanising AI without dehumanising care, (2) I need to know what’s under the hood, (3) Right tool, right place, right time?, and (4) Making it mine on safe ground. This study offers insights into young people’s attitudes, needs, and requirements regarding genAI chatbots in youth mental health, with key implications for service integration. Additionally, by co-designing system requirements, this work informs the ethics, design, development, implementation, and governance of genAI chatbots in youth mental health contexts.
[HC-12] Does the TalkMoves Codebook Generalize to One-on-One Tutoring and Multimodal Interaction?
【速读】:该论文旨在解决现有基于Accountable Talk理论的TalkMoves代码本在应用于一对一辅导场景时的泛化能力不足问题,特别是其在多模态数据(如音频、聊天文本和视频)中的可靠性、实用性与可解释性是否仍能保持。研究发现,尽管TalkMoves在人工标注中表现出较高的信度(k=0.74),但其对辅导相关话语行为的覆盖范围有限,且难以准确识别非语言及多模态交互中的关键动作;相较之下,由AI与人类共同开发的混合代码本虽信度稍低(k=0.64),却展现出更广的实证覆盖能力和更高的可用性感知。因此,解决方案的关键在于开发一种以辅导实践为基础、具备模态感知能力的新一代代码本,以提升对多模态 tutoring 互动的精准建模与分析能力。
链接: https://arxiv.org/abs/2604.13380
作者: Corina Luca Focsan,Marie Cynthia Abijuru Kamikazi,Tamisha Thompson,Jennifer St. John,Kirk Vanacore,Danielle R. Thomas,Kenneth R. Koedinger,René F. Kizilcec
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Accountable Talk theory has been widely adopted to analyze classroom discourse and is increasingly used to annotate tutoring interactions. In particular, the TalkMoves codebook, grounded in Accountable Talk theory, is commonly used to label tutoring data and train models of effective instructional support. However, Accountable Talk was originally developed to characterize collaborative, whole-classroom oral discourse, not to identify talk moves in one-on-one tutoring environments using multimodal data (e.g., video, audio, chat). As tutoring platforms expand in scale and modality, questions remain about whether Accountable Talk-based codebooks generalize reliably beyond their original classroom context and data representation. This study examines whether the human-developed TalkMoves codebook generalizes in reliability, utility, and interpretability when applied to one-on-one tutoring across audio, chat, and multimodal data. We compare TalkMoves with a hybrid AI-human developed codebook using a workflow established in prior research. Two expert annotators with over 20 years of teaching experience applied both codebooks to six tutoring sessions spanning three modalities: chat-based, audio-only, and multimodal interactions. Results show that while Talk-Moves achieved higher overall inter-rater reliability than the AI-human codebook (k = 0.74 vs. 0.64), the AI-human codebook demonstrated broader empirical coverage and higher perceived usability across modalities. Both codebooks undercaptured tutoring-relevant moves and introduced ambiguity when identifying actions expressed through nonverbal and multimodal artifacts. Together, these findings highlight the uneven generalizability of TalkMoves to tutoring contexts and motivate the development of modality-aware, tutoring-grounded codebooks.
[HC-13] Lessons from Skill Development Programs – Livelihood College of Dhamtari
【速读】:该论文旨在解决印度职业技能培训项目中存在的实践障碍,特别是围绕动员、咨询、培训等关键环节的信息流通不畅、数字工具利用率低以及性别包容性不足等问题。研究发现,当前技能发展体系缺乏有效的机构机制来促进利益相关者之间的信息共享,导致信息瓶颈和效率低下。解决方案的关键在于:一是通过整合职业培训合作伙伴(Vocational Training Partners, VTPs)提升培训可及性;二是优化现有数字资产的使用策略以增强培训管理效能;三是改进人工咨询流程,减少冗余并提高专业支持能力。这些措施共同指向构建更高效、公平且技术赋能的职业技能培训生态系统。
链接: https://arxiv.org/abs/2604.13317
作者: Arnab Paul Choudhury,Nihal Patel
机构: 未知
类目: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: 24 pages, 8 figures, 3 tables. This is the accepted manuscript of a chapter published in Springer Nature, Communications in Computer and Information Science, CCIS, volume 2337, 2025. Indian Conference on Human Computer Interaction Design and Research, IndiaHCI 2024
Abstract:Skill training is crucial for enabling dignified livelihood opportunities. In India, various schemes and initiatives aim to provide skill training in different domains, with ICT and digital technologies playing a vital role. However, there is limited research on understanding on-ground capacities \ constraints and the use of digital tools in these programs. In this study, we look into the mobilization, counseling, and training stages of the 5-stage skill development process that also includes placement and tracking, adopted in Dhamtari’s Livelihood College in Chhattisgarh, India, and other programs nationwide. Through the immersion/crystallization approach and mixed-method analysis including GIS mapping, video analysis of CCTV streams, quantitative analysis, and unstructured conversations with administrators, trainers, mobilizers, counselors, and nearby industry personnel for over a year, we identified three major challenges. A lack of inclusive and gendered access to skilling; a tedious manual counseling process with insufficient support staff; and inconsistent trainee attendance alongside sub-standard utilization of digital assets. Finally, we discuss, ways to improve access to skill training by leveraging Vocational Training Partners(VTPs), ways to improve the utilization of existing digital assets, and considerations for improving the counseling process. We conclude by summarizing that skill development programs currently lack institutional elements that enable effective information exchange between stakeholders, thereby creating information bottlenecks that result in inefficiencies, hindering the service delivery. In sum, our study informs the HCI and ICTD literature on the on-ground challenges and constraints faced by stakeholders and the role of technology in supporting such initiatives.
[HC-14] Lazy or Efficient? Towards Accessible Eye-Tracking Event Detection Using LLM s
【速读】:该论文旨在解决眼动追踪(eye-tracking)研究中因传统检测流程依赖专业编程技能和复杂数据预处理而导致的可及性低、使用门槛高的问题。现有方法如I-VT和I-DT虽有效,但对参数设置和预处理敏感,限制了其在非专业实验室的应用。解决方案的关键在于提出一种无需编码的大型语言模型(large language model, LLM)驱动的端到端分析流水线:通过自然语言指令自动推断原始眼动数据结构、生成可执行的数据清洗与检测代码、完成注视点(fixation)和扫视(saccade)标注,并输出结果报告与解释,支持用户通过修改提示词迭代优化分析流程。该框架显著降低了技术门槛,同时保持与传统方法相当的准确性。
链接: https://arxiv.org/abs/2604.13243
作者: Dongyang Guo,Yasmeen Abdrabou,Enkelejda Kasneci
机构: Technical University of Munich (慕尼黑工业大学); Munich Center for Machine Learning (MCML) (慕尼黑机器学习中心)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:Gaze event detection is fundamental to vision science, human-computer interaction, and applied analytics. However, current workflows often require specialized programming knowledge and careful handling of heterogeneous raw data formats. Classical detectors such as I-VT and I-DT are effective but highly sensitive to preprocessing and parameterization, limiting their usability outside specialized laboratories. This work introduces a code-free, large language model (LLM)-driven pipeline that converts natural language instructions into an end-to-end analysis. The system (1) inspects raw eye-tracking files to infer structure and metadata, (2) generates executable routines for data cleaning and detector implementation from concise user prompts, (3) applies the generated detector to label fixations and saccades, and (4) returns results and explanatory reports, and allows users to iteratively optimize their code by editing the prompt. Evaluated on public benchmarks, the approach achieves accuracy comparable to traditional methods while substantially reducing technical overhead. The framework lowers barriers to entry for eye-tracking research, providing a flexible and accessible alternative to code-intensive workflows.
[HC-15] Inclusive Kitchen Design for Older Adults: Generative AI Visualizations to Support Mild Cognitive Impairment
【速读】:该论文旨在解决轻度认知障碍(Mild Cognitive Impairment, MCI)老年人在厨房环境中因空间设计不合理而导致的独立生活困难问题,尤其关注低收入群体缺乏专业设计支持的现实困境。解决方案的关键在于开发一种基于生成式AI的厨房改造系统,利用Stable Diffusion模型结合DreamBooth LoRA和ControlNet技术,在仅100张厨房图像训练基础上,将普通厨房照片转化为符合《家庭设计指南》(Home Design Guidelines, HDG)的MCI友好型布局,包括开放式结构、透明橱柜、优化照明、防滑地面和减少杂乱等特征,从而实现高语义一致性(CLIP分数0.69–0.79)与良好视觉真实感(GIQA分数0.45–0.65),并通过用户调研验证其显著提升认知友好性与实用性。
链接: https://arxiv.org/abs/2604.13203
作者: Ibrahim Bilau,Nicole Li,Terrence Malayvong,Eunhwa Yang
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 19 pages, 7 figures, 5 tables, IAFOR Agen2026 Conference Proceedings
Abstract:Mild Cognitive Impairment (MCI) affects 15-20% of adults aged 65 and older, often making kitchen navigation and independent living difficult, particularly in lower-income communities with limited access to professional design help. This study created an AI system that converts standard kitchen photos into MCI-friendly designs using the Home Design Guidelines (HDG). Stable Diffusion models, enhanced with DreamBooth LoRA and ControlNet, were trained on 100 kitchen images to produce realistic visualizations with open layouts, transparent cabinetry, better lighting, non-slip flooring, and less clutter. The models achieved moderate to high semantic alignment (normalized CLIP scores 0.69-0.79) and improved visual realism (GIQA scores 0.45-0.65). In a survey of 33 participants (51.5% caregivers, 36.4% older adults with MCI), the AI-modified kitchens were strongly preferred as more cognitively friendly (87.4% of 198 choices, p .001). Participants reported high confidence in their kitchen choice selections (M = 5.92/7) and found the visualizations very helpful for home modifications (M = 6.27/7). Thematic analysis emphasized improved visibility, lower cognitive load, and greater independence. Overall, this AI tool provides a low-cost, scalable way for older adults and caregivers to visualize and implement DIY kitchen changes, supporting aging in place and resilience for those with MCI.
[HC-16] From Seeing it to Experiencing it: Interactive Evaluation of Intersectional Voice Bias in Human-AI Speech Interaction
【速读】:该论文旨在解决生成式语音大模型(SpeechLLMs)在处理不同口音和声音性别特征时存在的偏见问题,特别是现有评估方法难以捕捉端到端语音交互中的服务质量差异与内容层面偏见,以及用户实际体验的局限性。其解决方案的关键在于提出一种两阶段评估框架:第一阶段采用无裁判的提示-响应指标对六种口音与两种性别呈现组合的受控测试集进行量化分析,识别出口音与性别交叉效应导致的对齐度和冗余度差异;第二阶段通过语音转换技术让用户亲身体验相同语义内容在不同声音身份下的交互效果,发现语音转换可提升对良性回复的信任感和可接受性,并促进视角代入。该方法不仅揭示了语音偏见的多维表现形式,也为评估语音对话人工智能系统的公平性提供了更全面的工具。
链接: https://arxiv.org/abs/2604.13067
作者: Shree Harsha Bokkahalli Satish,Maria Teleki,Christoph Minixhofer,Ondrej Klejch,Peter Bell,Éva Székely
机构: KTH Royal Institute of Technology (皇家理工学院); Texas AM University (德州农工大学); University of Edinburgh (爱丁堡大学)
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
备注: 6 pages, 3 figures, 1 table, Accepted to CHI Extended Abstracts Poster session 2026
Abstract:SpeechLLMs process spoken language directly from audio, but accent and vocal identity cues can lead to biased behaviour. Current bias evaluations often miss how such bias manifests in end-to-end speech interactions and how users experience it. We distinguish quality-of-service disparities (e.g., off-topic or low-effort responses) from content-level bias in coherent outputs, and examine intersectional effects of accent and perceived gender. In this work, we explore a two-part evaluation approach: (1) a controlled test cohort spanning six accents and two gender presentations, analysed with judge-free prompt-response metrics, and (2) an interactive study design using voice conversion to let users experience identical content through different vocal identities. Across two studies (Interactive, N=24; Observational, N=19), we find that voice conversion increases trust and acceptability for benign responses and encourages perspective-taking, while automated analysis in search of quality-of-service disparities, reveals accent x gender disparities in alignment and verbosity across SpeechLLMs. These results highlight voice conversion for probing and experiencing intersectional voice bias while our evaluation suite provides richer bias evaluations for spoken conversational AI.
[HC-17] owards Green Wearable Computing: A Physics-Aware Spiking Neural Network for Energy-Efficient IMU-based Human Activity Recognition
【速读】:该论文旨在解决可穿戴惯性测量单元(Inertial Measurement Unit, IMU)-based人体活动识别(Human Activity Recognition, HAR)中深度神经网络(Deep Neural Networks, DNNs)因计算与存储开销大、功耗高而难以部署于电池受限边缘设备的问题。其核心解决方案是提出一种物理感知脉冲神经网络(Physics-Aware Spiking Neural Network, PAS-Net),关键创新在于:空间上引入自适应对称拓扑混合器以嵌入人体关节的物理约束,时间上设计O(1)记忆因果神经调制器实现上下文感知的动态阈值神经元,从而有效缓解非平稳运动节奏下的时序梯度退化问题;同时,通过时序脉冲误差目标函数支持灵活的早期退出机制,在保证准确率的前提下显著降低动态能耗(最高达98%),实现了无需乘法器的稀疏整数累加运算,为持续运行的可穿戴传感系统建立了超低功耗类脑计算标准。
链接: https://arxiv.org/abs/2604.10458
作者: Naichuan Zheng,Hailun Xia,Zepeng Sun,Weiyi Li,Yinze Zhou
机构: Beijing University of Posts and Telecommunications (北京邮电大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:Wearable IMU-based Human Activity Recognition (HAR) relies heavily on Deep Neural Networks (DNNs), which are burdened by immense computational and buffering demands. Their power-hungry floating-point operations and rigid requirement to process complete temporal windows severely cripple battery-constrained edge devices. While Spiking Neural Networks (SNNs) offer extreme event-driven energy efficiency, standard architectures struggle with complex biomechanical topologies and temporal gradient degradation. To bridge this gap, we propose the Physics-Aware Spiking Neural Network (PAS-Net), a fully multiplier-free architecture explicitly tailored for Green HAR. Spatially, an adaptive symmetric topology mixer enforces human-joint physical constraints. Temporally, an O(1) -memory causal neuromodulator yields context-aware dynamic threshold neurons, adapting actively to non-stationary movement rhythms. Furthermore, we leverage a temporal spike error objective to unlock a flexible early-exit mechanism for continuous IMU streams. Evaluated across seven diverse datasets, PAS-Net achieves state-of-the-art accuracy while replacing dense operations with sparse 0.1 pJ integer accumulations. Crucially, its confidence-driven early-exit capability drastically reduces dynamic energy consumption by up to 98%. PAS-Net establishes a robust, ultra-low-power neuromorphic standard for always-on wearable sensing.
[HC-18] Skill-informed Data-driven Haptic Nudges for High-dimensional Human Motor Learning
【速读】:该论文旨在解决在高维冗余运动空间中学习新运动任务时,如何设计最优触觉引导(haptic nudge)反馈以提升学习效率的问题。其核心挑战在于如何根据学习者当前估计的技能状态(skill state)动态调整反馈策略,从而有效引导其向更优的技能状态演化。解决方案的关键在于构建一个基于输入-输出隐马尔可夫模型(Input-Output Hidden Markov Model, IOHMM)的人类运动学习动力学模型,该模型能显式分离潜在技能演化与可观测性能指标;进而将反馈设计问题建模为部分可观测马尔可夫决策过程(Partially Observable Markov Decision Process, POMDP),从而推导出最小化长期性能代价的最优引导策略。实验验证表明,该方法显著提升了运动效率和终点精度,并加速了低维运动协同模式的发现。
链接: https://arxiv.org/abs/2603.12583
作者: Ankur Kamboj,Rajiv Ranganathan,Xiaobo Tan,Vaibhav Srivastava
机构: Michigan State University (密歇根州立大学)
类目: Robotics (cs.RO); Human-Computer Interaction (cs.HC); Systems and Control (eess.SY)
备注:
Abstract:In this work, we propose a data-driven framework to design optimal haptic nudge feedback leveraging the learner’s estimated skill to address the challenge of learning a novel motor task in a high-dimensional, redundant motor space. A nudge is a series of vibrotactile feedback delivered to the learner to encourage motor movements that aid in task completion. We first model the stochastic dynamics of human motor learning under haptic nudges using an Input-Output Hidden Markov Model (IOHMM), which explicitly decouples latent skill evolution from observable performance measures. Leveraging this predictive model, we formulate the haptic nudge feedback design problem as a Partially Observable Markov Decision Process (POMDP). This allows us to derive an optimal nudging policy that minimizes long-term performance cost and implicitly guides the learner toward superior skill states. We validate our approach through a human participant study (N=30) involving a high-dimensional motor task rendered through a hand exoskeleton. Results demonstrate that participants trained with the POMDP-derived policy exhibit significantly accelerated movement efficiency and endpoint accuracy compared to groups receiving heuristic-based feedback or no feedback. Furthermore, synergy analysis reveals that the POMDP group discovers efficient low-dimensional motor representations more rapidly.
计算机视觉
[CV-0] One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding
【速读】:该论文旨在解决长视频理解中因帧数过多导致视觉-语言模型(VLM)难以有效利用上下文信息的问题,特别是受限于大语言模型(LLM)的有限上下文长度时,传统方法通常对视频帧进行稀疏采样,从而造成时间信息丢失。其解决方案的关键在于提出一种极端视频标记压缩策略,包括两个核心模块:一是逐标记压缩(LP-Comp),通过可学习且渐进式的压缩机制替代启发式压缩方法,实现每帧仅保留一个标记(token per frame),显著提升帧密度;二是帧级压缩(QC-Comp),基于LLM内部注意力分数选择与查询相关的关键帧,实现问题引导的帧选择。此外,为缓解LLM在长序列中的位置偏置问题(即过度关注序列首尾),论文采用分段局部注意力机制,进一步优化压缩效率与性能。整体上,该方案实现了更高的压缩比和更密集的帧采样能力,同时仅需2.5%的监督微调数据即可显著提升多个长视频基准上的准确率。
链接: https://arxiv.org/abs/2604.14149
作者: Zheyu Zhang,Ziqi Pang,Shixing Chen,Xiang Hao,Vimal Bhat,Yu-Xiong Wang
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Amazon Prime Video (亚马逊Prime视频)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Long video understanding is inherently challenging for vision-language models (VLMs) because of the extensive number of frames. With each video frame typically expanding into tens or hundreds of tokens, the limited context length of large language models (LLMs) forces the VLMs to perceive the frames sparsely and lose temporal information. To address this, we explore extreme video token compression towards \emphone token per frame at the final LLM layer. Our key insight is that heuristic-based compression, widely adopted by previous methods, is prone to information loss, and this necessitates supervising LLM layers into \emphlearnable and \emphprogressive modules for \emphtoken-level compression (LP-Comp). Such compression enables our VLM to digest 2x-4x more frames with improved performance. To further increase the token efficiency, we investigate \emphframe-level compression, which selects the frames most relevant to the queries via the internal attention scores of the LLM layers, named \emphquestion-conditioned compression (QC-Comp). As a notable distinction from previous studies, we mitigate the position bias of LLM attention in long contexts, \emphi.e., the over-concentration on the beginning and end of a sequence, by splitting long videos into short segments and employing local attention. Collectively, our combined \emphtoken-level and \emphframe-level leads to an e\textbfxtreme compression model for long video understanding, named \textbf\name, achieving a significantly larger compression ratio and enabling denser frame sampling. Our \name is finetuned from VideoChat-Flash with a data-efficient \emphsupervised compression tuning stage that only requires 2.5% of the supervised fine-tuning data, yet boosts the accuracy from 42.9% to 46.2% on LVBench and enhances multiple other long video benchmarks.
[CV-1] Seedance 2.0: Advancing Video Generation for World Complexity
【速读】:该论文旨在解决多模态音频视频联合生成任务中模型架构效率低、支持输入模态有限以及生成质量与速度难以兼顾的问题。解决方案的关键在于提出一种统一、高效且大规模的架构设计,使Seedance 2.0能够原生支持文本、图像、音频和视频四种输入模态,并集成业界最全面的多模态内容参考与编辑能力,从而在视频和音频生成的多个子维度上实现显著提升,同时通过提供Fast版本优化低延迟场景下的生成速度,显著增强用户的创作体验。
链接: https://arxiv.org/abs/2604.14148
作者: Team Seedance,De Chen,Liyang Chen,Xin Chen,Ying Chen,Zhuo Chen,Zhuowei Chen,Feng Cheng,Tianheng Cheng,Yufeng Cheng,Mojie Chi,Xuyan Chi,Jian Cong,Qinpeng Cui,Fei Ding,Qide Dong,Yujiao Du,Haojie Duanmu,Junliang Fan,Jiarui Fang,Jing Fang,Zetao Fang,Chengjian Feng,Yu Gao,Diandian Gu,Dong Guo,Hanzhong Guo,Qiushan Guo,Boyang Hao,Hongxiang Hao,Haoxun He,Jiaao He,Qian He,Tuyen Hoang,Heng Hu,Ruoqing Hu,Yuxiang Hu,Jiancheng Huang,Weilin Huang,Zhaoyang Huang,Zhongyi Huang,Jishuo Jin,Ming Jing,Ashley Kim,Shanshan Lao,Yichong Leng,Bingchuan Li,Gen Li,Haifeng Li,Huixia Li,Jiashi Li,Ming Li,Xiaojie Li,Xingxing Li,Yameng Li,Yiying Li,Yu Li,Yueyan Li,Chao Liang,Han Liang,Jianzhong Liang,Ying Liang,Wang Liao,J. H. Lien,Shanchuan Lin,Xi Lin,Feng Ling,Yue Ling,Fangfang Liu,Jiawei Liu,Jihao Liu,Jingtuo Liu,Shu Liu,Sichao Liu,Wei Liu,Xue Liu,Zuxi Liu,Ruijie Lu,Lecheng Lyu,Jingting Ma,Tianxiang Ma,Xiaonan Nie,Jingzhe Ning,Junjie Pan,Xitong Pan,Ronggui Peng,Xueqiong Qu,Yuxi Ren,Yuchen Shen,Guang Shi,Lei Shi,Yinglong Song,Fan Sun,Li Sun,Renfei Sun,Wenjing Tang,Boyang Tao,Zirui Tao,Dongliang Wang,Feng Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Seedance 2.0 Model Card
Abstract:Seedance 2.0 is a new native multi-modal audio-video generation model, officially released in China in early February 2026. Compared with its predecessors, Seedance 1.0 and 1.5 Pro, Seedance 2.0 adopts a unified, highly efficient, and large-scale architecture for multi-modal audio-video joint generation. This allows it to support four input modalities: text, image, audio, and video, by integrating one of the most comprehensive suites of multi-modal content reference and editing capabilities available in the industry to date. It delivers substantial, well-rounded improvements across all key sub-dimensions of video and audio generation. In both expert evaluations and public user tests, the model has demonstrated performance on par with the leading levels in the field. Seedance 2.0 supports direct generation of audio-video content with durations ranging from 4 to 15 seconds, with native output resolutions of 480p and 720p. For multi-modal inputs as reference, its current open platform supports up to 3 video clips, 9 images, and 3 audio clips. In addition, we provide Seedance 2.0 Fast version, an accelerated variant of Seedance 2.0 designed to boost generation speed for low-latency scenarios. Seedance 2.0 has delivered significant improvements to its foundational generation capabilities and multi-modal generation performance, bringing an enhanced creative experience for end users.
[CV-2] ROSE: Retrieval-Oriented Segmentation Enhancement CVPR2026
【速读】:该论文旨在解决基于多模态大语言模型(Multimodal Large Language Models, MLLMs)的分割模型在面对新出现实体(novel entities)和新兴实体(emerging entities)时性能下降的问题,前者因未出现在训练数据中而无法识别,后者虽在模型知识范围内但需依赖实时外部信息才能准确识别。解决方案的关键在于提出一个名为ROSE(Retrieval-Oriented Segmentation Enhancement)的即插即用增强框架,其核心由四个模块构成:1)互联网检索增强生成模块用于获取实时网络信息;2)文本提示增强模块注入最新知识以提升对新兴实体的感知能力;3)视觉提示增强模块通过网络图像补偿模型对新实体的视觉缺失;4)WebSense模块智能判断是否触发检索机制以保障效率。实验表明,ROSE在NEST基准上显著优于基线模型,gIoU指标提升达19.2。
链接: https://arxiv.org/abs/2604.14147
作者: Song Tang,Guangquan Jie,Henghui Ding,Yu-Gang Jiang
机构: Fudan University, China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026 Findings, Project Page: this https URL
Abstract:Existing segmentation models based on multimodal large language models (MLLMs), such as LISA, often struggle with novel or emerging entities due to their inability to incorporate up-to-date knowledge. To address this challenge, we introduce the Novel Emerging Segmentation Task (NEST), which focuses on segmenting (i) novel entities that MLLMs fail to recognize due to their absence from training data, and (ii) emerging entities that exist within the model’s knowledge but demand up-to-date external information for accurate recognition. To support the study of NEST, we construct a NEST benchmark using an automated pipeline that generates news-related data samples for comprehensive evaluation. Additionally, we propose ROSE: Retrieval-Oriented Segmentation Enhancement, a plug-and-play framework designed to augment any MLLM-based segmentation model. ROSE comprises four key components. First, an Internet Retrieval-Augmented Generation module is introduced to employ user-provided multimodal inputs to retrieve real-time web information. Then, a Textual Prompt Enhancer enriches the model with up-to-date information and rich background knowledge, improving the model’s perception ability for emerging entities. Furthermore, a Visual Prompt Enhancer is proposed to compensate for MLLMs’ lack of exposure to novel entities by leveraging internet-sourced images. To maintain efficiency, a WebSense module is introduced to intelligently decide when to invoke retrieval mechanisms based on user input. Experimental results demonstrate that ROSE significantly boosts performance on the NEST benchmark, outperforming a strong Gemini-2.0 Flash-based retrieval baseline by 19.2 in gIoU.
[CV-3] Geometric Context Transformer for Streaming 3D Reconstruction
【速读】:该论文旨在解决流式三维重建(Streaming 3D Reconstruction)中的三大核心挑战:几何精度、时间一致性以及计算效率。现有方法在长序列处理中常面临坐标漂移、冗余状态存储和推理速度慢等问题。其解决方案的关键在于提出 LingBot-Map,一个基于几何上下文变换器(Geometric Context Transformer, GCT)的前馈式三维基础模型,通过精心设计的注意力机制实现高效稳定重建:该机制包含锚点上下文(anchor context)用于坐标定位、姿态参考窗口(pose-reference window)捕捉密集几何线索、轨迹记忆(trajectory memory)校正长程漂移,从而在保持紧凑状态的同时保留丰富的几何上下文信息,在518×378分辨率下实现约20 FPS的稳定推理性能,并在多个基准测试中超越现有流式与迭代优化方法。
链接: https://arxiv.org/abs/2604.14141
作者: Lin-Zhuo Chen,Jian Gao,Yihang Chen,Ka Leong Cheng,Yipengjing Sun,Liangxiao Hu,Nan Xue,Xing Zhu,Yujun Shen,Yao Yao,Yinghao Xu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL Code: this https URL
Abstract:Streaming 3D reconstruction aims to recover 3D information, such as camera poses and point clouds, from a video stream, which necessitates geometric accuracy, temporal consistency, and computational efficiency. Motivated by the principles of Simultaneous Localization and Mapping (SLAM), we introduce LingBot-Map, a feed-forward 3D foundation model for reconstructing scenes from streaming data, built upon a geometric context transformer (GCT) architecture. A defining aspect of LingBot-Map lies in its carefully designed attention mechanism, which integrates an anchor context, a pose-reference window, and a trajectory memory to address coordinate grounding, dense geometric cues, and long-range drift correction, respectively. This design keeps the streaming state compact while retaining rich geometric context, enabling stable efficient inference at around 20 FPS on 518 x 378 resolution inputs over long sequences exceeding 10,000 frames. Extensive evaluations across a variety of benchmarks demonstrate that our approach achieves superior performance compared to both existing streaming and iterative optimization-based approaches. Comments: Project page: this https URL Code: this https URL Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2604.14141 [cs.CV] (or arXiv:2604.14141v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2604.14141 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Yinghao Xu [view email] [v1] Wed, 15 Apr 2026 17:58:13 UTC (19,761 KB) Full-text links: Access Paper: View a PDF of the paper titled Geometric Context Transformer for Streaming 3D Reconstruction, by Lin-Zhuo Chen and 10 other authorsView PDFHTML (experimental)TeX Source view license Current browse context: cs.CV prev | next new | recent | 2026-04 Change to browse by: cs References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked="checked"class=“labs-tab-input”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status
[CV-4] Dont Let the Video Speak: Audio-Contrastive Preference Optimization for Audio-Visual Language Models
【速读】:该论文旨在解决音频-视觉语言模型(Audio-Visual Language Models, AVLMs)中存在的跨模态幻觉问题,特别是视频驱动的音频幻觉(video-driven audio hallucination)——即模型过度依赖视觉线索生成预期声音,而忽略真实音频信息。解决方案的关键在于提出一种双轴偏好学习框架:Audio-Contrastive Preference Optimization (ACPO),其包含两个核心机制:一是输出对比目标(output-contrastive objective),用于惩罚将视觉描述伪装成音频事实的行为;二是输入对比目标(input-contrastive objective),通过交换音频轨道来显式惩罚对真实声学信号不敏感的生成行为。实验表明,ACPO能显著提升音频接地准确性并有效抑制音频幻觉,同时保持模型整体多模态能力。
链接: https://arxiv.org/abs/2604.14129
作者: Ami Baid,Zihui Xue,Kristen Grauman
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:While Audio-Visual Language Models (AVLMs) have achieved remarkable progress over recent years, their reliability is bottlenecked by cross-modal hallucination. A particularly pervasive manifestation is video-driven audio hallucination: models routinely exploit visual shortcuts to hallucinate expected sounds, discarding true auditory evidence. To counteract this deeply ingrained visual dominance, we propose Audio-Contrastive Preference Optimization (ACPO). This dual-axis preference learning framework introduces an output-contrastive objective to penalize visual descriptions masquerading as audio facts, alongside an input-contrastive objective that swaps audio tracks to explicitly penalize generation invariant to the true auditory signal. Extensive experiments demonstrate that ACPO establishes highly faithful audio grounding and mitigates audio hallucination without compromising overarching multimodal capabilities.
[CV-5] HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System
【速读】:该论文旨在解决端到端视觉-语言-动作(Vision-Language-Action, VLA)模型在机器人操作中因微调窄域控制数据而导致其源自基础视觉-语言模型(Vision-Language Model, VLM)的深层推理能力被削弱的根本性权衡问题。解决方案的关键在于提出一种以视觉接地(visual grounding)为中心的分层框架HiVLA,该框架显式地将高层语义规划与底层运动控制解耦:高层部分由VLM规划器完成任务分解和视觉接地,生成包含子任务指令和精确目标边界框的结构化计划;低层部分引入基于流匹配(flow-matching)的扩散Transformer(Diffusion Transformer, DiT)动作专家,并设计新颖的级联交叉注意力机制,依次融合全局上下文、高分辨率对象中心裁剪和技能语义信息,使DiT专注于鲁棒执行。这种解耦架构既保留了VLM的零样本推理能力,又支持各组件独立优化,从而显著提升长程技能组合与杂乱场景中小物体精细操作性能。
链接: https://arxiv.org/abs/2604.14125
作者: Tianshuo Yang,Guanyu Chen,Yutian Chen,Zhixuan Liang,Yitian Liu,Zanxin Chen,Chunpu Xu,Haotian Liang,Jiangmiao Pang,Yao Mu,Ping Luo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: Project Page: this https URL
Abstract:While end-to-end Vision-Language-Action (VLA) models offer a promising paradigm for robotic manipulation, fine-tuning them on narrow control data often compromises the profound reasoning capabilities inherited from their base Vision-Language Models (VLMs). To resolve this fundamental trade-off, we propose HiVLA, a visual-grounded-centric hierarchical framework that explicitly decouples high-level semantic planning from low-level motor control. In high-level part, a VLM planner first performs task decomposition and visual grounding to generate structured plans, comprising a subtask instruction and a precise target bounding box. Then, to translate this plan into physical actions, we introduce a flow-matching Diffusion Transformer (DiT) action expert in low-level part equipped with a novel cascaded cross-attention mechanism. This design sequentially fuses global context, high-resolution object-centric crops and skill semantics, enabling the DiT to focus purely on robust execution. Our decoupled architecture preserves the VLM’s zero-shot reasoning while allowing independent improvement of both components. Extensive experiments in simulation and the real world demonstrate that HiVLA significantly outperforms state-of-the-art end-to-end baselines, particularly excelling in long-horizon skill composition and the fine-grained manipulation of small objects in cluttered scenes.
[CV-6] raining-Free Semantic Multi-Object Tracking with Vision-Language Models
【速读】:该论文旨在解决语义多目标跟踪(Semantic Multi-Object Tracking, SMOT)系统依赖端到端训练所带来的高标注成本与适应性差的问题,尤其在面对新基础模型或新交互关系时难以快速迭代。其核心解决方案是提出一种无需训练的SMOT流水线TF-SMOT,关键在于通过组合预训练模块实现高效、灵活的语义追踪:利用D-FINE与可提示的SAM2分割跟踪器生成时序一致的轨迹片段(tracklets),借助轮廓定位(contour grounding)技术结合InternVideo2.5生成视频摘要和实例级描述,最后通过基于词典的语义检索与大语言模型(LLM)消歧方法将提取的交互谓词对齐至BenSMOT WordNet同义词集(synsets)。此设计显著提升了跟踪性能及描述质量,同时为后续研究提供了可插拔、低依赖的架构范式。
链接: https://arxiv.org/abs/2604.14074
作者: Laurence Bonat,Francesco Tonini,Elisa Ricci,Lorenzo Vaquero
机构: University of Trento (特伦托大学); Fondazione Bruno Kessler (布鲁诺·凯斯勒基金会)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to the 20th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2026)
Abstract:Semantic Multi-Object Tracking (SMOT) extends multi-object tracking with semantic outputs such as video summaries, instance-level captions, and interaction labels, aiming to move from trajectories to human-interpretable descriptions of dynamic scenes. Existing SMOT systems are trained end-to-end, coupling progress to expensive supervision, limiting the ability to rapidly adapt to new foundation models and new interactions. We propose TF-SMOT, a training-free SMOT pipeline that composes pretrained components for detection, mask-based tracking, and video-language generation. TF-SMOT combines D-FINE and the promptable SAM2 segmentation tracker to produce temporally consistent tracklets, uses contour grounding to generate video summaries and instance captions with InternVideo2.5, and aligns extracted interaction predicates to BenSMOT WordNet synsets via gloss-based semantic retrieval with LLM disambiguation. On BenSMOT, TF-SMOT achieves state-of-the-art tracking performance within the SMOT setting and improves summary and caption quality compared to prior art. Interaction recognition, however, remains challenging under strict exact-match evaluation on the fine-grained and long-tailed WordNet label space; our analysis and ablations indicate that semantic overlap and label granularity substantially affect measured performance.
[CV-7] owards Unconstrained Human-Object Interaction
【速读】:该论文旨在解决传统人类-物体交互(Human-Object Interaction, HOI)检测模型依赖预定义交互词汇表的问题,该限制使其难以适应开放场景中的动态交互识别。为应对这一挑战,作者提出了无约束人类-物体交互(Unconstrained HOI, U-HOI)任务,其在训练和推理阶段均无需预先指定交互类别。解决方案的关键在于利用多模态大语言模型(Multimodal Large Language Models, MLLMs),构建包含测试时推理与语言到图结构转换的流水线,从而从自由文本中提取结构化的交互信息,实现对开放域中任意交互关系的识别与建模。
链接: https://arxiv.org/abs/2604.14069
作者: Francesco Tonini,Alessandro Conti,Lorenzo Vaquero,Cigdem Beyan,Elisa Ricci
机构: University of Trento(特伦托大学); Fondazione Bruno Kessler(布鲁诺·凯勒基金会); University of Verona(维罗纳大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to the 20th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2026)
Abstract:Human-Object Interaction (HOI) detection is a longstanding computer vision problem concerned with predicting the interaction between humans and objects. Current HOI models rely on a vocabulary of interactions at training and inference time, limiting their applicability to static environments. With the advent of Multimodal Large Language Models (MLLMs), it has become feasible to explore more flexible paradigms for interaction recognition. In this work, we revisit HOI detection through the lens of MLLMs and apply them to in-the-wild HOI detection. We define the Unconstrained HOI (U-HOI) task, a novel HOI domain that removes the requirement for a predefined list of interactions at both training and inference. We evaluate a range of MLLMs on this setting and introduce a pipeline that includes test-time inference and language-to-graph conversion to extract structured interactions from free-form text. Our findings highlight the limitations of current HOI detectors and the value of MLLMs for U-HOI. Code will be available at this https URL
[CV-8] OneHOI: Unifying Human-Object Interaction Generation and Editing CVPR2026
【速读】:该论文旨在解决人类-物体交互(Human-Object Interaction, HOI)建模中生成与编辑任务割裂的问题:现有方法分为两类——HOI生成侧重于从结构化三元组和布局合成场景,但难以融合包含纯物体实体的混合条件;而HOI编辑虽可通过文本控制修改交互关系,却难以解耦姿态与物理接触,并且难以扩展至多交互场景。其解决方案的核心是提出OneHOI框架,这是一个统一的扩散Transformer架构,通过共享的结构化交互表示将HOI生成与编辑整合为单一的条件去噪过程。关键创新在于引入关系扩散Transformer(Relational Diffusion Transformer, R-DiT),它利用角色和实例感知的HOI token建模动词驱动的关系、基于布局的空间动作定位(Action Grounding)、结构化HOI注意力机制以强制交互拓扑约束,并采用HOI RoPE(Rotary Positional Encoding)实现多HOI场景的解耦表示,从而在多种控制条件下(如布局引导、任意掩码、混合条件等)均达到最优性能。
链接: https://arxiv.org/abs/2604.14062
作者: Jiun Tian Hoe,Weipeng Hu,Xudong Jiang,Yap-Peng Tan,Chee Seng Chan
机构: Nanyang Technological University (南洋理工大学); Sun Yat-sen University (中山大学); Universiti Malaya (马来亚大学); VinUniversity (Vin大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Accepted at CVPR2026. This paper moves toward unifying HOI generation and editing within a single model
Abstract:Human-Object Interaction (HOI) modelling captures how humans act upon and relate to objects, typically expressed as person, action, object triplets. Existing approaches split into two disjoint families: HOI generation synthesises scenes from structured triplets and layout, but fails to integrate mixed conditions like HOI and object-only entities; and HOI editing modifies interactions via text, yet struggles to decouple pose from physical contact and scale to multiple interactions. We introduce OneHOI, a unified diffusion transformer framework that consolidates HOI generation and editing into a single conditional denoising process driven by shared structured interaction representations. At its core, the Relational Diffusion Transformer (R-DiT) models verb-mediated relations through role- and instance-aware HOI tokens, layout-based spatial Action Grounding, a Structured HOI Attention to enforce interaction topology, and HOI RoPE to disentangle multi-HOI scenes. Trained jointly with modality dropout on our HOI-Edit-44K, along with HOI and object-centric datasets, OneHOI supports layout-guided, layout-free, arbitrary-mask, and mixed-condition control, achieving state-of-the-art results across both HOI generation and editing. Code is available at this https URL.
[CV-9] Free Geometry: Refining 3D Reconstruction from Longer Versions of Itself
【速读】:该论文旨在解决前馈式3D重建模型在测试阶段缺乏适应能力的问题,即这些模型一旦训练完成便以零样本(zero-shot)方式执行推理,无法根据具体测试场景进行调整,导致在遮挡、高光和模糊线索等复杂情况下生成存在误差的重建结果。解决方案的关键在于提出名为Free Geometry的框架,其核心思想是利用模型在接收更多视角时能产生更可靠且视图一致的重建这一特性,通过在测试序列中掩码部分帧构建自监督任务:强制完整观测与部分观测之间的跨视图特征一致性,同时保持被掩码帧所隐含的成对关系。这种自监督机制使得模型可通过轻量级LoRA(Low-Rank Adaptation)更新实现快速再校准,在单个GPU上每数据集耗时不足2分钟,显著提升了相机位姿估计和点云地图预测性能。
链接: https://arxiv.org/abs/2604.14048
作者: Yuhang Dai,Xingyi Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code is available at this https URL
Abstract:Feed-forward 3D reconstruction models are efficient but rigid: once trained, they perform inference in a zero-shot manner and cannot adapt to the test scene. As a result, visually plausible reconstructions often contain errors, particularly under occlusions, specularities, and ambiguous cues. To address this, we introduce Free Geometry, a framework that enables feed-forward 3D reconstruction models to self-evolve at test time without any 3D ground truth. Our key insight is that, when the model receives more views, it produces more reliable and view-consistent reconstructions. Leveraging this property, given a testing sequence, we mask a subset of frames to construct a self-supervised task. Free Geometry enforces cross-view feature consistency between representations from full and partial observations, while maintaining the pairwise relations implied by the held-out frames. This self-supervision allows for fast recalibration via lightweight LoRA updates, taking less than 2 minutes per dataset on a single GPU. Our approach consistently improves state-of-the-art foundation models, including Depth Anything 3 and VGGT, across 4 benchmark datasets, yielding an average improvement of 3.73% in camera pose accuracy and 2.88% in point map prediction. Code is available at this https URL .
[CV-10] Decoding the Delta: Unifying Remote Sensing Change Detection and Understanding with Multimodal Large Language Models
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在遥感变化理解任务中面临的“时间盲区”问题,即现有架构缺乏内在的多时相对比推理机制和精确的空间定位能力。解决方案的关键在于提出Delta-LLaVA框架,其核心创新包括:1)Change-Enhanced Attention模块,系统性地分离并增强视觉差异特征;2)Change-SEG模块,利用Change Prior Embedding提取可微分的差异特征作为大语言模型(Large Language Model, LLM)输入;3)Local Causal Attention机制,防止跨时相上下文信息泄露。这些设计共同提升了模型在复杂变化推理与高精度边界定位上的性能。
链接: https://arxiv.org/abs/2604.14044
作者: Xiaohe Li,Jiahao Li,Kaixin Zhang,Yuqiang Fang,Leilei Lin,Hong Wang,Haohua Wu,Zide Fan
机构: Beijing China Aerospace Information Research Institute, CAS (中国科学院空天信息研究院); Beijing China Space Engineering University (北京空间工程大学); Beijing China Capital Normal University (北京首都师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:While Multimodal Large Language Models (MLLMs) excel in general vision-language tasks, their application to remote sensing change understanding is hindered by a fundamental “temporal blindness”. Existing architectures lack intrinsic mechanisms for multi-temporal contrastive reasoning and struggle with precise spatial grounding. To address this, we first introduce Delta-QA, a comprehensive benchmark comprising 180k visual question-answering samples. Delta-QA unifies pixel-level segmentation and visual question answering across bi- and tri-temporal scenarios, structuring change interpretation into four progressive cognitive dimensions. Methodologically, we propose Delta-LLaVA, a novel MLLM framework explicitly tailored for multi-temporal remote sensing interpretation. It overcomes the limitations of naive feature concatenation through three core innovations: a Change-Enhanced Attention module that systematically isolates and amplifies visual differences, a Change-SEG module utilizing Change Prior Embedding to extract differentiable difference features as input for the LLM, and Local Causal Attention to prevent cross-temporal contextual leakage. Extensive experiments demonstrate that Delta-LLaVA decisively outperforms leading generalist MLLMs and specialized segmentation models in complex change deduction and high-precision boundary localization, establishing a unified framework for earth observation intelligence.
[CV-11] Seek-and-Solve: Benchmarking MLLM s for Visual Clue-Driven Reasoning in Daily Scenarios
【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在日常场景中视觉推理能力评估不足的问题,尤其关注模型能否从丰富的视觉信息中识别并利用关键视觉线索进行深层次推理,而非仅依赖表面感知或已有知识。解决方案的关键在于构建一个名为DailyClue的基准测试集,其核心设计原则为:(1)严格基于真实日常活动场景以保证任务的真实性与实用性;(2)设计具有挑战性的查询问题,迫使模型超越简单的图像识别,主动寻找并运用合适的视觉线索完成推理任务。该基准涵盖四个主要日常领域和16个子任务,通过系统性评估表明,准确识别视觉线索是实现鲁棒推理的前提条件。
链接: https://arxiv.org/abs/2604.14041
作者: Xiaomin Li,Tala Wang,Zichen Zhong,Ying Zhang,Zirui Zheng,Takashi Isobe,Dezhuang Li,Huchuan Lu,You He,Xu Jia
机构: Dalian University of Technology; WeChat, Tencent Inc.; Tsinghua University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:Daily scenarios are characterized by visual richness, requiring Multimodal Large Language Models (MLLMs) to filter noise and identify decisive visual clues for accurate reasoning. Yet, current benchmarks predominantly aim at evaluating MLLMs’ pre-existing knowledge or perceptual understanding, often neglecting the critical capability of reasoning. To bridge this gap, we introduce DailyClue, a benchmark designed for visual clue-driven reasoning in daily scenarios. Our construction is guided by two core principles: (1) strict grounding in authentic daily activities, and (2) challenging query design that necessitates more than surface-level perception. Instead of simple recognition, our questions compel MLLMs to actively explore suitable visual clues and leverage them for subsequent reasoning. To this end, we curate a comprehensive dataset spanning four major daily domains and 16 distinct subtasks. Comprehensive evaluation across MLLMs and agentic models underscores the formidable challenge posed by our benchmark. Our analysis reveals several critical insights, emphasizing that the accurate identification of visual clues is essential for robust reasoning.
[CV-12] POINTS-Seeker: Towards Training a Multimodal Agent ic Search Model from Scratch
【速读】:该论文旨在解决大型多模态模型(Large Multimodal Models, LMMs)因静态参数化知识而存在的认知局限性问题,特别是在长周期交互中由于交互历史累积导致的证据定位困难。其核心解决方案在于构建一个从零开始设计的多模态智能体搜索模型(multimodal agentic search model),关键创新包括:(i) 提出“智能体种子”(Agentic Seeding)阶段以激发智能体行为的基础前驱;(ii) 发现长周期交互中的性能瓶颈并提出V-Fold机制——一种自适应的历史感知压缩方案,通过渲染将历史上下文折叠进视觉空间以保留近期对话高保真度;最终开发出POINTS-Seeker-8B模型,在六个多样化基准上显著优于现有方法,有效提升了知识密集型视觉推理能力。
链接: https://arxiv.org/abs/2604.14029
作者: Yikun Liu,Yuan Liu,Le Tian,Xiao Zhou,Jiangchao Yao,Yanfeng Wang,Weidi Xie
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:While Large Multimodal Models (LMMs) demonstrate impressive visual perception, they remain epistemically constrained by their static parametric knowledge. To transcend these boundaries, multimodal search models have been adopted to actively interact with the external environment for evidence retrieval. Diverging from prevailing paradigms that merely retrofit general LMMs with search tools as modular extensions, we explore the potential of building a multimodal agentic search model from scratch. Specifically, we make the following contributions: (i) we introduce Agentic Seeding, a dedicated phase designed to weave the foundational precursors necessary for eliciting agentic behaviors; (ii) we uncover a performance bottleneck in long-horizon interactions, where the increasing volume of interaction history overwhelms the model’s ability to locate ground-truth evidence. To mitigate this, we propose V-Fold, an adaptive history-aware compression scheme that preserves recent dialogue turns in high fidelity while folding historical context into the visual space via rendering; and (iii) we develop POINTS-Seeker-8B, a state-of-the-art multimodal agentic search model that consistently outperforms existing models across six diverse benchmarks, effectively resolving the challenges of long-horizon, knowledge-intensive visual reasoning.
[CV-13] Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective
【速读】:该论文旨在解决传统3D重建方法在实际部署和可扩展性方面的局限性,即依赖于缓慢的逐场景优化或类别特定训练,难以实现高效且通用的3D重建。其解决方案的关键在于提出一种以模型设计策略为核心的新型分类体系(taxonomy),该体系不依赖于具体的几何输出表示形式(如隐式场或显式几何体),而是聚焦于五类核心问题:特征增强、几何感知、模型效率、数据增强策略与时间感知建模,从而统一并系统化当前前馈式3D重建方法的研究路径,并为未来研究提供结构化指导。
链接: https://arxiv.org/abs/2604.14025
作者: Weijie Wang,Qihang Cao,Sensen Gao,Donny Y. Chen,Haofei Xu,Wenjing Bian,Songyou Peng,Tat-Jen Cham,Chuanxia Zheng,Andreas Geiger,Jianfei Cai,Jia-Wang Bian,Bohan Zhuang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
备注: 67 pages, 395 references. Project page: this https URL . Code: this https URL . This work has been submitted to Springer for possible publication
Abstract:Reconstructing 3D representations from 2D inputs is a fundamental task in computer vision and graphics, serving as a cornerstone for understanding and interacting with the physical world. While traditional methods achieve high fidelity, they are limited by slow per-scene optimization or category-specific training, which hinders their practical deployment and scalability. Hence, generalizable feed-forward 3D reconstruction has witnessed rapid development in recent years. By learning a model that maps images directly to 3D representations in a single forward pass, these methods enable efficient reconstruction and robust cross-scene generalization. Our survey is motivated by a critical observation: despite the diverse geometric output representations, ranging from implicit fields to explicit primitives, existing feed-forward approaches share similar high-level architectural patterns, such as image feature extraction backbones, multi-view information fusion mechanisms, and geometry-aware design principles. Consequently, we abstract away from these representation differences and instead focus on model design, proposing a novel taxonomy centered on model design strategies that are agnostic to the output format. Our proposed taxonomy organizes the research directions into five key problems that drive recent research development: feature enhancement, geometry awareness, model efficiency, augmentation strategies and temporal-aware models. To support this taxonomy with empirical grounding and standardized evaluation, we further comprehensively review related benchmarks and datasets, and extensively discuss and categorize real-world applications based on feed-forward 3D models. Finally, we outline future directions to address open challenges such as scalability, evaluation standards, and world modeling.
[CV-14] owards Multi-Object-Tracking with Radar on a Fast Moving Vehicle: On the Potential of Processing Radar in the Frequency Domain
【速读】:该论文旨在解决雷达数据处理中因噪声和结构误差导致的鲁棒性不足问题,尤其是在高动态场景下(如车辆自运动及多个未知移动目标存在时),传统基于特征的方法难以保持精度。其解决方案的关键在于将雷达数据处理从时域迁移至频域,利用频域中的相关性方法(如基于傅里叶变换的SOFT算法)实现更稳健的雷达里程计(radar-odometry),从而在不依赖多传感器融合的情况下,仅通过雷达即可估计运动轨迹,并同时检测场景中所有移动结构,提升系统在复杂自动驾驶场景(如自主赛车超车操作)中的适应能力。
链接: https://arxiv.org/abs/2604.14013
作者: Tim Hansen,Arturo Gomez-Chavez,Ilya Shimchik,Andreas Birk
机构: Constructor University Bremen (康斯特拉克大学不来梅); Constructor Knowledge Labs (CKL) (康斯特拉克知识实验室)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
备注:
Abstract:We promote in this paper the processing of radar data in the frequency domain to achieve higher robustness against noise and structural errors, especially in comparison to feature-based methods. This holds also for high dynamics in the scene, i.e., ego-motion of the vehicle with the sensor plus the presence of an unknown number of other moving objects. In addition to the high robustness, the processing in the frequency domain has the so far neglected advantage that the underlying correlation based methods used for, e.g., registration, provide information about all moving structures in the scene. A typical automotive application case is overtaking maneuvers, which in the context of autonomous racing are used here as a motivating example. Initial experiments and results with Fourier SOFT in 2D (FS2D) are presented that use the Boreas dataset to demonstrate radar-only-odometry, i.e., radar-odometry without sensor-fusion, to support our arguments.
[CV-15] Depth-Aware Image and Video Orientation Estimation
【速读】:该论文旨在解决自然图像与视频中方向估计(orientation estimation)的准确性与鲁棒性问题,尤其在虚拟现实(VR)、增强现实(AR)、自主导航和交互式监控等应用中,传统方法往往因缺乏对深度信息的有效利用而表现不佳。解决方案的关键在于引入基于图像不同象限深度分布的分析机制,并融合深度梯度一致性(Depth Gradient Consistency, DGC)与水平对称性分析(Horizontal Symmetry Analysis, HSA),从而有效利用深度线索提升空间一致性与感知稳定性,实现更精确的方向校正。
链接: https://arxiv.org/abs/2604.13995
作者: Muhammad Z. Alam,Larry Stetsiuk,M. Umair Mukati,Zeeshan Kaleem
机构: University of New Brunswick (新布伦瑞克大学); Brandon University (布兰登大学); Technical University of Denmark (丹麦技术大学); COMSATS University (COMSATS大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 8 figures
Abstract:This paper introduces a novel approach for image and video orientation estimation by leveraging depth distribution in natural images. The proposed method estimates the orientation based on the depth distribution across different quadrants of the image, providing a robust framework for orientation estimation suited for applications such as virtual reality (VR), augmented reality (AR), autonomous navigation, and interactive surveillance systems. To further enhance fine-scale perceptual alignment, we incorporate depth gradient consistency (DGC) and horizontal symmetry analysis (HSA), enabling precise orientation correction. This hybrid strategy effectively exploits depth cues to support spatial coherence and perceptual stability in immersive visual content. Qualitative and quantitative evaluations demonstrate the robustness and accuracy of the proposed approach, outperforming existing techniques across diverse scenarios.
[CV-16] Remote Sensing Image Super-Resolution for Imbalanced Textures: A Texture-Aware Diffusion Framework
【速读】:该论文旨在解决生成式扩散先验(Generative Diffusion Priors)在遥感图像超分辨率(Remote Sensing Image Super-Resolution, RSISR)任务中表现不佳的问题。其核心挑战在于遥感图像具有独特的纹理分布特性:全局随机但局部聚集,导致纹理分布高度不平衡,从而严重削弱模型的空间感知能力。解决方案的关键在于提出TexADiff框架,通过引入相对纹理密度图(Relative Texture Density Map, RTDM)来显式建模纹理分布,并将其以三种协同方式融入扩散过程:作为空间条件引导扩散路径、作为损失调制项优先关注纹理丰富区域、以及作为动态适配器调节采样调度。这一设计使模型具备显式的纹理感知能力,显著提升了重建质量并减少了纹理幻觉,同时在下游任务中取得性能增益。
链接: https://arxiv.org/abs/2604.13994
作者: Enzhuo Zhang,Sijie Zhao,Dilxat Muhtar,Zhenshi Li,Xueliang Zhang,Pengfeng Xiao
机构: Nanjing University (南京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 5 figures, 9 Tables
Abstract:Generative diffusion priors have recently achieved state-of-the-art performance in natural image super-resolution, demonstrating a powerful capability to synthesize photorealistic details. However, their direct application to remote sensing image super-resolution (RSISR) reveals significant shortcomings. Unlike natural images, remote sensing images exhibit a unique texture distribution where ground objects are globally stochastic yet locally clustered, leading to highly imbalanced textures. This imbalance severely hinders the model’s spatial perception. To address this, we propose TexADiff, a novel framework that begins by estimating a Relative Texture Density Map (RTDM) to represent the texture distribution. TexADiff then leverages this RTDM in three synergistic ways: as an explicit spatial conditioning to guide the diffusion process, as a loss modulation term to prioritize texture-rich regions, and as a dynamic adapter for the sampling schedule. These modifications are designed to endow the model with explicit texture-aware capabilities. Experiments demonstrate that TexADiff achieves superior or competitive quantitative metrics. Furthermore, qualitative results show that our model generates faithful high-frequency details while effectively suppressing texture hallucinations. This improved reconstruction quality also results in significant gains in downstream task performance. The source code of our method can be found at this https URL.
[CV-17] HiProto: Hierarchical Prototype Learning for Interpretable Object Detection Under Low-quality Conditions
【速读】:该论文旨在解决在低质量成像条件下(如弱光、雾霾等导致视觉信息退化)对象检测系统因预测不确定性增加而难以部署的问题,同时克服现有方法在提升图像质量或设计复杂架构时缺乏可解释性且未能有效增强语义区分能力的局限。其解决方案的关键在于提出一种基于分层原型学习(hierarchical prototype learning)的新范式——HiProto,通过构建多层级特征空间中的结构化原型表示来建模类别特定语义,从而提升模型的语义判别力与可解释性。具体而言,核心创新包括:1)区域到原型对比损失(Region-to-Prototype Contrastive Loss, RPC-Loss),强化原型对目标区域的语义聚焦;2)原型正则化损失(Prototype Regularization Loss, PR-Loss),增强类间原型的区分度;3)尺度感知伪标签生成策略(Scale-aware Pseudo Label Generation Strategy, SPLGS),抑制RPC-Loss中不匹配监督信号对低层原型表示的干扰,保持其鲁棒性。实验表明,HiProto在ExDark、RTTS和VOC2012-FOG数据集上实现了竞争力性能,并能通过原型响应提供清晰的可解释性输出。
链接: https://arxiv.org/abs/2604.13981
作者: Jianlin Xiang,Linhui Dai,Xue Yang,Chaolei Yang,Yanshan Li
机构: Shenzhen University (深圳大学); Guangdong Provincial Key Laboratory of Intelligent Information Processing (广东省智能信息处理重点实验室); Shenzhen Key Laboratory of Modern Communications and Information Processing (深圳市现代通信与信息处理重点实验室); Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 9 figures
Abstract:Interpretability is essential for deploying object detection systems in critical applications, especially under low-quality imaging conditions that degrade visual information and increase prediction uncertainty. Existing methods either enhance image quality or design complex architectures, but often lack interpretability and fail to improve semantic discrimination. In contrast, prototype learning enables interpretable modeling by associating features with class-centered semantics, which can provide more stable and interpretable representations under degradation. Motivated by this, we propose HiProto, a new paradigm for interpretable object detection based on hierarchical prototype learning. By constructing structured prototype representations across multiple feature levels, HiProto effectively models class-specific semantics, thereby enhancing both semantic discrimination and interpretability. Building upon prototype modeling, we first propose a Region-to-Prototype Contrastive Loss (RPC-Loss) to enhance the semantic focus of prototypes on target regions. Then, we propose a Prototype Regularization Loss (PR-Loss) to improve the distinctiveness among class prototypes. Finally, we propose a Scale-aware Pseudo Label Generation Strategy (SPLGS) to suppress mismatched supervision for RPC-Loss, thereby preserving the robustness of low-level prototype representations. Experiments on ExDark, RTTS, and VOC2012-FOG demonstrate that HiProto achieves competitive results while offering clear interpretability through prototype responses, without relying on image enhancement or complex architectures. Our code will be available at this https URL.
[CV-18] MApLe: Multi-instance Alignment of Diagnostic Reports and Large Medical Images
【速读】:该论文旨在解决标准视觉语言模型在医学影像诊断报告中难以准确关联文本描述与图像中小尺度但关键区域的问题。其核心挑战在于,放射科医生在报告中以高度凝练的语言描述特定解剖区域内的细微病理发现,而现有模型缺乏对这种“局部-语义”映射的精细建模能力。解决方案的关键在于提出一种多任务、多实例视觉语言对齐方法MApLe,通过解耦解剖区域(anatomical region)与诊断发现(diagnostic finding)的概念,采用基于图像块(patch-wise)的编码策略,将局部图像信息与自由文本报告中的句子进行细粒度对齐,从而显著提升模型在下游任务中的对齐性能。
链接: https://arxiv.org/abs/2604.13970
作者: Felicia Bader,Philipp Seeböck,Anastasia Bartashova,Ulrike Attenberger,Georg Langs
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for MIDL 2026; Reviews available at this https URL
Abstract:In diagnostic reports, experts encode complex imaging data into clinically actionable information. They describe subtle pathological findings that are meaningful in their anatomical context. Reports follow relatively consistent structures, expressing diagnostic information with few words that are often associated with tiny but consequential image observations. Standard vision language models struggle to identify the associations between these informative text components and small locations in the images. Here, we propose “MApLe”, a multi-task, multi-instance vision language alignment approach that overcomes these limitations. It disentangles the concepts of anatomical region and diagnostic finding, and links local image information to sentences in a patch-wise approach. Our method consists of a text embedding trained to capture anatomical and diagnostic concepts in sentences, a patch-wise image encoder conditioned on anatomical structures, and a multi-instance alignment of these representations. We demonstrate that MApLe can successfully align different image regions and multiple diagnostic findings in free-text reports. We show that our model improves the alignment performance compared to state-of-the-art baseline models when evaluated on several downstream tasks. The code is available at this https URL.
[CV-19] Heuristic Style Transfer for Real-Time Efficient Weather Attribute Detection
【速读】:该论文旨在解决从RGB图像中高效检测天气条件的问题,包括预测天气类型(如晴天、雨天、雪天、雾天)及11个互补属性(如强度、能见度、地面状况),共涉及53个分类任务。其核心挑战在于如何利用视觉风格变化来表征不同天气状态,并实现轻量化与高精度的多任务学习。解决方案的关键在于引入两种新型轻量级架构:RTM(ResNet50-Truncated-MultiTasks)和PMG(PatchGAN-MultiTasks-Gram),其中PMG通过局部Gram矩阵(Local Gram)捕捉局部风格特征以提升空间一致性,同时结合注意力机制增强多任务协同能力;此外,作者自动化了Gram矩阵计算流程,并将PatchGAN结构集成至监督式多任务学习框架中,显著提升了模型在嵌入式系统中的实时推理性能(参数少于500万,内存占用小),且在零样本迁移测试中仍保持较高准确率(F1 > 78%),验证了其泛化能力。
链接: https://arxiv.org/abs/2604.13947
作者: Hamed Ouattara,Pierre Duthon,Pascal Houssam Salmane,Frédéric Bernardin,Omar Ait Aider
机构: CEREMA(法国国家环境与公共工程研究中心); University of Côte d’Azur (蔚蓝海岸大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 32 pages, 18 figures
Abstract:We present lightweight and efficient architectures to detect weather conditions from RGB images, predicting the weather type (sunny, rain, snow, fog) and 11 complementary attributes such as intensity, visibility, and ground condition, for a total of 53 classes across the tasks. This work examines to what extent weather conditions manifest as variations in visual style. We investigate style-inspired techniques, including Gram matrices, a truncated ResNet-50 targeting lower and intermediate layers, and PatchGAN-style architectures, within a multi-task framework with attention mechanisms. Two families are introduced: RTM (ResNet50-Truncated-MultiTasks) and PMG (PatchGAN-MultiTasks-Gram), together with their variants. Our contributions include automation of Gram-matrix computation, integration of PatchGAN into supervised multi-task learning, and local style capture through local Gram for improved spatial coherence. We also release a dataset of 503,875 images annotated with 12 weather attributes under a Creative Commons Attribution (CC-BY) license. The models achieve F1 scores above 96 percent on our internal test set and above 78 percent in zero-shot evaluation on several external datasets, confirming their generalization ability. The PMG architecture, with fewer than 5 million parameters, runs in real time with a small memory footprint, making it suitable for embedded systems. The modular design of the models also allows style-related or weather-related tasks to be added or removed as needed.
[CV-20] SceneGlue: Scene-Aware Transformer for Feature Matching without Scene-Level Annotation
【速读】:该论文旨在解决传统局部特征匹配方法因受限于特征描述子的局部性,难以捕捉跨视图场景中非局部信息而导致的对应关系不准确问题。其解决方案的关键在于提出SceneGlue框架,通过融合隐式并行注意力机制与显式跨视图可见性估计的混合匹配范式:其中并行注意力机制在图像内和跨图像间同步交换局部描述子的信息,增强场景全局上下文感知;同时引入Visibility Transformer显式地将特征分为可见与不可见区域,提供跨视图场景可见性的理解,从而有效弥补局部描述子的局限性。该方法仅需局部特征匹配进行训练,无需场景级标注,显著提升了匹配精度、鲁棒性和可解释性。
链接: https://arxiv.org/abs/2604.13941
作者: Songlin Du,Xiaoyong Lu,Yaping Yan,Guobao Xiao,Xiaobo Lu,Takeshi Ikenaga
机构: Southeast University (东南大学); Tongji University (同济大学); Waseda University (早稻田大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Local feature matching plays a critical role in understanding the correspondence between cross-view images. However, traditional methods are constrained by the inherent local nature of feature descriptors, limiting their ability to capture non-local scene information that is essential for accurate cross-view correspondence. In this paper, we introduce SceneGlue, a scene-aware feature matching framework designed to overcome these limitations. SceneGlue leverages a hybridizable matching paradigm that integrates implicit parallel attention and explicit cross-view visibility estimation. The parallel attention mechanism simultaneously exchanges information among local descriptors within and across images, enhancing the scene’s global context. To further enrich the scene awareness, we propose the Visibility Transformer, which explicitly categorizes features into visible and invisible regions, providing an understanding of cross-view scene visibility. By combining explicit and implicit scene-level awareness, SceneGlue effectively compensates for the local descriptor constraints. Notably, SceneGlue is trained using only local feature matches, without requiring scene-level groundtruth annotations. This scene-aware approach not only improves accuracy and robustness but also enhances interpretability compared to traditional methods. Extensive experiments on applications such as homography estimation, pose estimation, image matching, and visual localization validate SceneGlue’s superior performance. The source code is available at this https URL.
[CV-21] A Multi-Stage Optimization Pipeline for Bethesda Cell Detection in Pap Smear Cytology
【速读】:该论文旨在解决宫颈细胞学筛查中 Bethesda 细胞(Bethesda cell)的自动检测问题,以提升 Pap smear 图像分析的准确性和效率。其解决方案的关键在于构建一个融合 YOLO(You Only Look Once)目标检测模型与 U-Net 图像分割架构的集成框架,并引入重叠区域去除技术和二分类器进行后处理优化,从而显著提高检测精度,在 Riva Cytology Challenge 的 mAP50-95 评估指标上达到 0.5909,位列第二。
链接: https://arxiv.org/abs/2604.13939
作者: Martin Amster,Camila María Polotto
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ISBI 2026 Accepted Paper Second Place Solution for the RIVA Cervical Cytology Challenge Track B
Abstract:Computer vision techniques have advanced significantly in recent years, finding diverse and impactful applications within the medical field. In this paper, we introduce a new framework for the detection of Bethesda cells in Pap smear images, developed for Track B of the Riva Cytology Challenge held in association with the International Symposium on Biomedical Imaging (ISBI). This work focuses on enhancing computer vision models for cell detection, with performance evaluated using the mAP50-95 metric. We propose a solution based on an ensemble of YOLO and U-Net architectures, followed by a refinement stage utilizing overlap removal techniques and a binary classifier. Our framework achieved second place with a mAP50-95 score of 0.5909 in the competition. The implementation and source code are available at the following repository: this http URL
[CV-22] ASTRA: Enhancing Multi-Subject Generation with Retrieval-Augmented Pose Guidance and Disentangled Position Embedding
【速读】:该论文旨在解决多主体生成图像中身份保持与姿态结构精确控制之间的根本性冲突问题,即在复杂动作场景下,如何避免身份融合(identity fusion)和姿态失真(pose distortion),从而实现个体身份与姿态结构的解耦。其解决方案的关键在于提出ASTRA(Adaptive Synthesis through Targeted Retrieval Augmentation)框架,通过两个核心机制实现架构级解耦:一是基于检索增强的姿态(Retrieval-Augmented Pose, RAG-Pose)管道,从预构建数据库中提供清晰的结构先验;二是引入增强型通用旋转位置编码(Enhanced Universal Rotary Position Embedding, EURoPE),以非对称方式将身份令牌与空间位置解耦、并将姿态令牌绑定至画布,同时辅以解耦语义调制(Disentangled Semantic Modulation, DSM)适配器,将身份保持任务迁移至文本条件流中,从而显著提升姿态遵循度与身份保真度。
链接: https://arxiv.org/abs/2604.13938
作者: Tianze Xia,Zijian Ning,Zonglin Zhao,Mingjia Wang
机构: Huazhong University of Science and Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Subject-driven image generation has shown great success in creating personalized content, but its capabilities are largely confined to single subjects in common poses. Current approaches face a fundamental conflict when handling multiple subjects with complex, distinct actions: preserving individual identities while enforcing precise pose structures. This challenge often leads to identity fusion and pose distortion, as appearance and structure signals become entangled within the model’s architecture. To resolve this conflict, we introduce ASTRA(Adaptive Synthesis through Targeted Retrieval Augmentation), a novel framework that architecturally disentangles subject appearance from pose structure within a unified Diffusion Transformer. ASTRA achieves this through a dual-pronged strategy. It first employs a Retrieval-Augmented Pose (RAG-Pose) pipeline to provide a clean, explicit structural prior from a curated database. Then, its core generative model learns to process these dual visual conditions using our Enhanced Universal Rotary Position Embedding (EURoPE), an asymmetric encoding mechanism that decouples identity tokens from spatial locations while binding pose tokens to the canvas. Concurrently, a Disentangled Semantic Modulation (DSM) adapter offloads the identity preservation task into the text conditioning stream. Extensive experiments demonstrate that our integrated approach achieves superior disentanglement. On our designed COCO-based complex pose benchmark, ASTRA achieves a new state-of-the-art in pose adherence, while maintaining high identity fidelity and text alignment in DreamBench.
[CV-23] ASTER: Latent Pseudo-Anomaly Generation for Unsupervised Time-Series Anomaly Detection
【速读】:该论文旨在解决时间序列异常检测(Time-series Anomaly Detection, TSAD)中因异常样本稀少且类型多样、标注数据稀缺而导致的模型训练困难问题。现有方法多依赖于重建或预测机制,难以处理复杂数据模式,或依赖嵌入空间中的异常合成与固定距离度量,需领域专业知识且灵活性不足。其解决方案的关键在于提出ASTERS框架,通过在潜在空间(latent space)中直接生成伪异常样本,避免了人工设计异常注入和对特定领域的依赖;同时利用预训练大语言模型(Large Language Model, LLM)增强潜在空间的时间与上下文表征能力,并结合Transformer结构构建异常分类器,从而实现无需标签数据即可高效学习异常特征,显著提升TSAD性能。
链接: https://arxiv.org/abs/2604.13924
作者: Romain Hermary,Samet Hicsonmez,Dan Pineau,Abd El Rahman Shabayek,Djamila Aouada
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Time-series anomaly detection (TSAD) is critical in domains such as industrial monitoring, healthcare, and cybersecurity, but it remains challenging due to rare and heterogeneous anomalies and the scarcity of labelled data. This scarcity makes unsupervised approaches predominant, yet existing methods often rely on reconstruction or forecasting, which struggle with complex data, or on embedding-based approaches that require domain-specific anomaly synthesis and fixed distance metrics. We propose ASTER, a framework that generates pseudo-anomalies directly in the latent space, avoiding handcrafted anomaly injections and the need for domain expertise. A latent-space decoder produces tailored pseudo-anomalies to train a Transformer-based anomaly classifier, while a pre-trained LLM enriches the temporal and contextual representations of this space. Experiments on three benchmark datasets show that ASTER achieves state-of-the-art performance and sets a new standard for LLM-based TSAD.
[CV-24] PartNerFace: Part-based Neural Radiance Fields for Animatable Facial Avatar Reconstruction
【速读】:该论文旨在解决单目RGB视频驱动下人脸动画重建中难以泛化至未见表情且无法捕捉细微面部运动细节的问题。现有方法或仅以形态可变模型(morphable model, MM)参数条件化隐式网络,或学习一个虚构的规范空间辐射场,导致在表达多样性与细节还原能力上的局限。解决方案的关键在于提出一种基于部件的神经辐射场(PartNerFace)框架:首先利用基于参数化头部模型的逆蒙皮(inverse skinning)将观测点映射到规范空间,再通过分部变形场(part-based deformation field)建模细粒度运动;其核心创新是认识到不同面部区域的形变应差异化建模,采用多个局部多层感知机(local MLPs)自适应地将规范空间划分为若干部分,并通过软加权机制聚合各局部预测结果,从而实现对复杂表情和微小运动的高保真重建。
链接: https://arxiv.org/abs/2604.13918
作者: Xianggang Yu,Lingteng Qiu,Xiaohang Ren,Guanying Chen,Shuguang Cui,Xiaoguang Han,Baoyuan Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present PartNerFace, a part-based neural radiance fields approach, for reconstructing animatable facial avatar from monocular RGB videos. Existing solutions either simply condition the implicit network with the morphable model parameters or learn an imaginary canonical radiance field, making them fail to generalize to unseen facial expressions and capture fine-scale motion details. To address these challenges, we first apply inverse skinning based on a parametric head model to map an observed point to the canonical space, and then model fine-scale motions with a part-based deformation field. Our key insight is that the deformation of different facial parts should be modeled differently. Specifically, our part-based deformation field consists of multiple local MLPs to adaptively partition the canonical space into different parts, where the deformation of a 3D point is computed by aggregating the prediction of all local MLPs by a soft-weighting mechanism. Extensive experiments demonstrate that our method generalizes well to unseen expressions and is capable of modeling fine-scale facial motions, outperforming state-of-the-art methods both quantitatively and qualitatively.
[CV-25] Blind Bitstream-corrupted Video Recovery via Metadata-guided Diffusion Model CVPR2025
【速读】:该论文旨在解决视频比特流损坏后的盲恢复问题,即在缺乏预定义损坏区域掩码的情况下,从存储或传输过程中受损的视频中恢复出逼真内容。传统方法依赖人工标注的掩码,但在实际场景中难以实施。解决方案的关键在于提出一种基于元数据引导的扩散模型(Metadata-Guided Diffusion Model, M-GDM),通过双流元数据编码器提取运动矢量和帧类型信息,并融合为统一表示,利用交叉注意力机制与损坏潜在特征交互;同时设计先验驱动的掩码预测器生成伪掩码以分离完整区域与恢复区域,并引入后处理精修模块减少因掩码不完美导致的边界伪影,从而实现高效且高质量的盲视频恢复。
链接: https://arxiv.org/abs/2604.13906
作者: Shuyun Wang,Hu Zhang,Xin Shen,Dadong Wang,Xin Yu
机构: The University of Queensland (昆士兰大学); Data61, CSIRO, Australia (数据61,澳大利亚联邦科学与工业研究组织)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2025
Abstract:Bitstream-corrupted video recovery aims to restore realistic content degraded during video storage or transmission. Existing methods typically assume that predefined masks of corrupted regions are available, but manually annotating these masks is labor-intensive and impractical in real-world scenarios. To address this limitation, we introduce a new blind video recovery setting that removes the reliance on predefined masks. This setting presents two major challenges: accurately identifying corrupted regions and recovering content from extensive and irregular degradations. We propose a Metadata-Guided Diffusion Model (M-GDM) to tackle these challenges. Specifically, intrinsic video metadata are leveraged as corruption indicators through a dual-stream metadata encoder that separately embeds motion vectors and frame types before fusing them into a unified representation. This representation interacts with corrupted latent features via cross-attention at each diffusion step. To preserve intact regions, we design a prior-driven mask predictor that generates pseudo masks using both metadata and diffusion priors, enabling the separation and recombination of intact and recovered regions through hard masking. To mitigate boundary artifacts caused by imperfect masks, a post-refinement module enhances consistency between intact and recovered regions. Extensive experiments demonstrate the effectiveness of our method and its superiority in blind video recovery. Code is available at: this https URL.
[CV-26] Rethinking Image-to-3D Generation with Sparse Queries: Efficiency Capacity and Input-View Bias
【速读】:该论文旨在解决图像到三维(image-to-3D)生成任务中传统方法存在的高计算开销与输入视角偏差(input-view bias)问题。现有方法通常依赖密集的体素网格、三平面表示或像素对齐的基元,导致内存占用大且易过度拟合于训练时的条件视图。其解决方案的关键在于提出SparseGen框架,该框架采用一组可学习的稀疏3D锚点查询(sparse 3D anchor queries)结合一个可学习的扩展算子(expansion operator),将每个变换后的查询映射为局部小规模的3D高斯基元集合。这种稀疏集-latent扩展机制在无3D监督条件下通过修正流重建目标进行训练,使模型能够自适应地将表示能力分配至几何和外观关键区域,从而显著降低内存消耗与推理时间,同时保持多视角一致性,并有效缓解输入视角偏差。
链接: https://arxiv.org/abs/2604.13905
作者: Zhiyuan Xu,Jiuming Liu,Yuxin Chen,Masayoshi Tomizuka,Chenfeng Xu,Chensheng Peng
机构: UC Berkeley (加州大学伯克利分校); University of Cambridge (剑桥大学); UT Austin (德克萨斯大学奥斯汀分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code is available at this https URL
Abstract:We present SparseGen, a novel framework for efficient image-to-3D generation, which exhibits low input-view bias while being significantly faster. Unlike traditional approaches that rely on dense volumetric grids, triplanes, or pixel-aligned primitives, we model scenes with a compact sparse set of learned 3D anchor queries and a learned expansion operator that decodes each transformed query into a small local set of 3D Gaussian primitives. Trained under a rectified-flow reconstruction objective without 3D supervision, our model learns to allocate representation capacity where geometry and appearance matter, achieving significant reductions in memory and inference time while preserving multi-view fidelity. We introduce quantitative measures of input-view bias and utilization to show that sparse queries reduce overfitting to conditioning views while being representationally efficient. Our results argue that sparse set-latent expansion is a principled, practical alternative for efficient 3D generative modeling.
[CV-27] Context Sensitivity Improves Human-Machine Visual Alignment
【速读】:该论文旨在解决当前机器学习模型在处理输入时缺乏情境敏感性的问题,即现有模型将输入表示为高维嵌入空间中的固定点,这与人类基于环境动态调整对象及其关系表征的方式存在本质差异。解决方案的关键在于提出一种从神经网络嵌入中计算情境敏感相似性的方法,并将其应用于三元组奇数识别任务,其中锚图像同时作为上下文信息。该方法通过引入情境建模显著提升了奇数识别准确率,最高可达15%,且在原始和“人类对齐”的视觉基础模型上均表现出一致性改进。
链接: https://arxiv.org/abs/2604.13883
作者: Frieda Born,Tom Neuhäuser,Lukas Muttenthaler,Brett D. Roads,Bernhard Spitzer,Andrew K. Lampinen,Matt Jones,Klaus-Robert Müller,Michael C. Mozer
机构: Technische Universität Berlin (柏林工业大学); BIFOLD; Max Planck Institute for Human Development (人类发展马克斯普朗克研究所); Aignostics; Helmholtz Munich (慕尼黑赫尔姆霍兹研究中心); Technical University of Munich (慕尼黑工业大学); University College London (伦敦大学学院); Technische Universität Dresden (德累斯顿工业大学); Google DeepMind (谷歌深度智取); University of Colorado Boulder (科罗拉多大学博尔德分校); Korea University (韩国科学技术院); Max Planck Institute for Informatics (信息学马克斯普朗克研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Modern machine learning models typically represent inputs as fixed points in a high-dimensional embedding space. While this approach has been proven powerful for a wide range of downstream tasks, it fundamentally differs from the way humans process information. Because humans are constantly adapting to their environment, they represent objects and their relationships in a highly context-sensitive manner. To address this gap, we propose a method for context-sensitive similarity computation from neural network embeddings, applied to modeling a triplet odd-one-out task with an anchor image serving as simultaneous context. Modeling context enables us to achieve up to a 15% improvement in odd-one-out accuracy over a context-insensitive model. We find that this improvement is consistent across both original and “human-aligned” vision foundation models.
[CV-28] PostureObjectstitch: Anomaly Image Generation Considering Assembly Relationships in Industrial Scenarios
【速读】:该论文旨在解决工业装配场景中图像生成技术难以准确模拟组件姿态与朝向的问题,从而提升异常检测模型的性能。现有生成方法通常忽略组件在装配过程中的空间关系,导致合成图像无法满足下游应用需求。其解决方案的关键在于提出一种名为PostureObjectStitch的新颖图像合成方法:首先通过条件解耦策略将多视角输入图像分解为高频、纹理和RGB特征;其次引入特征时序调制机制,在扩散模型的时间步中动态调整这些特征,实现从粗到细的渐进式生成并保持一致性;最后结合条件损失函数强化关键工业元素的语义准确性,并利用几何先验约束组件的空间定位,确保正确的装配关系。
链接: https://arxiv.org/abs/2604.13863
作者: Zebei Tong,Hongchang Chen,Yujie Lei,Gang Chen,Yushi Liu,Zhi Zheng,Hao Chen,Jieming Zhang,Ying Li,Dongpu Cao
机构: Beijing Institute of Technology (北京理工大学); The Hong Kong Polytechnic University (香港理工大学); Li Auto (理想汽车); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Image generation technology can synthesize condition-specific images to supplement real-world industrial anomaly data and enhance anomaly detection model performance. Existing generation techniques rarely account for the pose and orientation of industrial components in assembly, making the generated images difficult to utilize for downstream application. To solve this, we propose a novel image synthesis approach, called PostureObjectStitch, that achieves accurate generation to meet the requirement of industrial assembly. A condition decoupling approach is introduced to separate input multi-view images into high-frequency, texture, and RGB features. The feature temporal modulation mechanism adapts these features across diffusion model time-steps, enabling progressive generation from coarse to fine details while maintaining consistency. To ensure semantic accuracy, we introduce a conditional loss that enhances critical industrial elements and a geometric prior that guides component positioning for correct assembly relationships. Comprehensive experimental results on the MureCom dataset, our newly contributed DreamAssembly dataset, and the downstream application validate the outstanding performance of our method.
[CV-29] Any3DAvatar: Fast and High-Quality Full-Head 3D Avatar Reconstruction from Single Portrait Image
【速读】:该论文旨在解决从单张肖像图像中重建完整三维头部模型的难题,现有方法普遍存在质量与速度之间的权衡:高保真度的流程通常依赖多阶段处理和针对每个个体的优化,而快速前向传播模型则难以恢复完整的几何结构和精细的外观细节。其解决方案的关键在于提出Any3DAvatar,一种快速且高质量的单图3D高斯头部虚拟人生成方法;核心创新包括:构建统一的数据集AnyHead,涵盖身份多样性、密集多视角监督及真实配饰,弥补现有头部数据在覆盖范围、全头几何和复杂外观上的不足;采用Plücker感知的结构化3D高斯骨架初始化,并通过一步条件去噪完成全头重建,实现单次前向传播即获得高保真结果;同时引入视图条件的辅助外观监督机制,在不增加推理成本的前提下提升新视角下的纹理细节。
链接: https://arxiv.org/abs/2604.13856
作者: Yujie Gao,Yao Xiao,Xiangnan Zhu,Ya Li,Yiyi Zhang,Liqing Zhang,Jianfu Zhang
机构: Shanghai Jiaotong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Reconstructing a complete 3D head from a single portrait remains challenging because existing methods still face a sharp quality-speed trade-off: high-fidelity pipelines often rely on multi-stage processing and per-subject optimization, while fast feed-forward models struggle with complete geometry and fine appearance details. To bridge this gap, we propose Any3DAvatar, a fast and high-quality method for single-image 3D Gaussian head avatar generation, whose fastest setting reconstructs a full head in under one second while preserving high-fidelity geometry and texture. First, we build AnyHead, a unified data suite that combines identity diversity, dense multi-view supervision, and realistic accessories, filling the main gaps of existing head data in coverage, full-head geometry, and complex appearance. Second, rather than sampling unstructured noise, we initialize from a Plücker-aware structured 3D Gaussian scaffold and perform one-step conditional denoising, formulating full-head reconstruction into a single forward pass while retaining high fidelity. Third, we introduce auxiliary view-conditioned appearance supervision on the same latent tokens alongside 3D Gaussian reconstruction, improving novel-view texture details at zero extra inference cost. Experiments show that Any3DAvatar outperforms prior single-image full-head reconstruction methods in rendering fidelity while remaining substantially faster.
[CV-30] DiffMagicFace: Identity Consistent Facial Editing of Real Videos
【速读】:该论文旨在解决文本引导的面部视频编辑中面临的两大核心挑战:一是如何在视频帧间保持人脸身份的一致性,二是如何确保编辑语义在时序上连贯且自然。解决方案的关键在于提出了一种名为DiffMagicFace的新型视频编辑框架,其核心创新在于集成两个并行运行的微调模型——一个用于文本控制,另一个用于图像控制,二者协同作用以生成既保留身份特征又精准匹配编辑意图的视频帧;同时,作者构建了一个基于渲染和优化算法生成的多视角人脸数据集,从而无需依赖原始视频数据即可实现高质量、高一致性的编辑效果,尤其在复杂任务(如说话头视频或细粒度类别区分)中表现优异。
链接: https://arxiv.org/abs/2604.13841
作者: Huanghao Yin,Shenkun Xu,Kanle Shi,Junhai Yong,Bin Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Text-conditioned image editing has greatly benefitted from the advancements in Image Diffusion Models. However, extending these techniques to facial video editing introduces challenges in preserving facial identity throughout the source video and ensuring consistency of the edited subject across frames. In this paper, we introduce DiffMagicFace, a unique video editing framework that integrates two fine-tuned models for text and image control. These models operate concurrently during inference to produce video frames that maintain identity features while seamlessly aligning with the editing semantics. To ensure the consistency of the edited videos, we develop a dataset comprising images showcasing various facial perspectives for each edited subject. The creation of a data set is achieved through rendering techniques and the subsequent application of optimization algorithms. Remarkably, our approach does not depend on video datasets but still delivers high-quality results in both consistency and content. The excellent effect holds even for complex tasks like talking head videos and distinguishing closely related categories. The videos edited using our framework exhibit parity with videos that are made using traditional rendering software. Through comparative analysis with current state-of-the-art methods, our framework demonstrates superior performance in both visual appeal and quantitative metrics.
[CV-31] A Resource-Efficient Hybrid CNN-LSTM network for image-based bean leaf disease classification
【速读】:该论文旨在解决传统卷积神经网络(Convolutional Neural Networks, CNNs)在植物病害自动诊断中存在两个核心问题:一是标准池化层难以有效捕捉长距离空间依赖关系,二是模型参数量大、内存占用高,不利于在资源受限的便携设备上部署。其解决方案的关键在于提出一种轻量级混合CNN-LSTM架构,通过在CNN特征提取后引入长短期记忆(Long Short-Term Memory, LSTM)层来建模特征图内的空间-序列关系,从而增强对病害模式的表征能力;同时该结构显著降低模型体积至1.86 MB(较传统CNN减少70%),并结合针对性图像增强策略提升诊断准确性,在豆叶病害分类任务中实现了94.38%的准确率与高达99.22%的F1分数,为边缘计算环境下的实时农业决策支持提供了高效可靠的框架。
链接: https://arxiv.org/abs/2604.13835
作者: Hye Jin Rhee,Joseph Damilola Akinyemi
机构: University of York (约克大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurate and resource-efficient automated diagnosis is a cornerstone of modern agricultural expert systems. While Convolutional Neural Networks (CNNs) have established benchmarks in plant pathology, their ability to capture long-range spatial dependencies is often limited by standard pooling layers, and their high memory footprint hinders deployment on portable devices. This paper proposes a lightweight hybrid CNN-LSTM system for bean leaf disease classification. By integrating an LSTM layer to model the spatial-sequential relationships within feature maps, our hybrid architecture achieves a 94.38% accuracy while maintaining an exceptionally small footprint of 1.86 MB; a 70% reduction in size compared to traditional CNN-based systems. Furthermore, we provide a systematic evaluation of image augmentation strategies, demonstrating that tailored transformations are superior to generic combinations for maintaining the integrity of diagnostic patterns. Results on the \textitibean dataset confirm that the proposed system achieves state-of-the-art F1 scores of 99.22% with EfficientNet-B7+LSTM, providing a robust and scalable framework for real-time agricultural decision support in resource-constrained environments. The code and augmented datasets used in this study are publicly available on this \hrefthis https URLGithub repo.
[CV-32] Gaslight Gatekeep V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在高风险场景中易受谄媚式操纵(sycophantic manipulation)的问题,特别是探讨模型内部视觉表征与人类神经活动的对齐程度是否影响其对抗性语言干扰的鲁棒性。解决方案的关键在于通过量化模型视觉表征与人类fMRI响应之间的脑部对齐度(brain alignment),发现早期视觉皮层(V1–V3)区域的高对齐度显著降低模型对谄媚攻击的敏感性(相关系数 r = -0.441,p < 0.05),尤其在存在否定类攻击(existence denial attacks)时效应最强(r = -0.597)。这表明,忠实于低层次视觉编码的模型结构能为对抗性语言操控提供可测量的防御锚点,从而为提升VLM的AI安全性提供了新的机制理解与设计方向。
链接: https://arxiv.org/abs/2604.13803
作者: Arya Shah,Vaibhav Tripathi,Mayank Singh,Chaklam Silpasuwanchai
机构: Indian Institute of Technology Gandhinagar(印度理工学院甘纳加纳尔分校); Asian Institute of Technology(亚洲理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 28 pages, 9 figures, 13 tables
Abstract:Vision-language models are increasingly deployed in high-stakes settings, yet their susceptibility to sycophantic manipulation remains poorly understood, particularly in relation to how these models represent visual information internally. Whether models whose visual representations more closely mirror human neural processing are also more resistant to adversarial pressure is an open question with implications for both neuroscience and AI safety. We investigate this question by evaluating 12 open-weight vision-language models spanning 6 architecture families and a 40 \times parameter range (256M–10B) along two axes: brain alignment, measured by predicting fMRI responses from the Natural Scenes Dataset across 8 human subjects and 6 visual cortex regions of interest, and sycophancy, measured through 76,800 two-turn gaslighting prompts spanning 5 categories and 10 difficulty levels. Region-of-interest analysis reveals that alignment specifically in early visual cortex (V1–V3) is a reliable negative predictor of sycophancy ( r = -0.441 , BCa 95% CI [-0.740, -0.031] ), with all 12 leave-one-out correlations negative and the strongest effect for existence denial attacks ( r = -0.597 , p = 0.040 ). This anatomically specific relationship is absent in higher-order category-selective regions, suggesting that faithful low-level visual encoding provides a measurable anchor against adversarial linguistic override in vision-language models. We release our code on \hrefthis https URLGitHub and dataset on \hrefthis https URLHugging Face
[CV-33] DRG-Font: Dynamic Reference-Guided Few-shot Font Generation via Contrastive Style-Content Disentanglement
【速读】:该论文旨在解决少样本字体生成(Few-shot Font Generation)中难以捕捉复杂字体风格且生成样本局部特征不清晰的问题。解决方案的关键在于提出了一种对比式字体生成策略DRG-Font,其核心是通过分解风格(style)与内容(content)嵌入空间来学习复杂的字形属性:首先利用多尺度风格头模块(Multi-scale Style Head Block, MSHB)和多尺度内容头模块(Multi-scale Content Head Block, MCHB)分离字体的风格先验与形状先验;其次引入参考选择(Reference Selection, RS)模块动态选取最优风格参考;最后通过多融合上采样模块(Multi-Fusion Upsampling Block, MFUB)将参考风格先验与目标内容先验融合以生成目标字形,从而在保持局部细节的同时实现风格一致性。
链接: https://arxiv.org/abs/2604.13797
作者: Rejoy Chakraborty,Prasun Roy,Saumik Bhattacharya,Umapada Pal
机构: Indian Statistical Institute Kolkata (印度统计研究所加尔各答); Indian Institute of Technology Kharagpur (印度理工学院克哈拉格浦尔)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages
Abstract:Few-shot Font Generation aims to generate stylistically consistent glyphs from a few reference glyphs. However, capturing complex font styles from a few exemplars remains challenging, and the existing methods often struggle to retain discernible local characteristics in generated samples. This paper introduces DRG-Font, a contrastive font generation strategy that learns complex glyph attributes by decomposing style and content embedding spaces. For optimal style supervision, the proposed architecture incorporates a Reference Selection (RS) Module to dynamically select the best style reference from an available pool of candidates. The network learns to decompose glyph attributes into style and shape priors through a Multi-scale Style Head Block (MSHB) and a Multi-scale Content Head Block (MCHB). For style adaptation, a Multi-Fusion Upsampling Block (MFUB) produces the target glyph by combining the reference style prior and target content prior. The proposed method demonstrates significant improvements over state-of-the-art approaches across multiple visual and analytical benchmarks.
[CV-34] Artificial intelligence application in lymphoma diagnosis with Vision Transformer using weakly supervised training
【速读】:该论文旨在解决病理图像中淋巴瘤亚型(间变性大细胞淋巴瘤ALCL与经典霍奇金淋巴瘤cHL)自动分类的临床应用难题,尤其是克服全监督训练对专家标注资源的高度依赖问题。其解决方案的关键在于采用弱监督学习策略,即通过在整张切片(whole-slide-image)级别自动标注图像块(image patch),从而显著降低人工标注成本并提升模型在真实医疗场景中的可部署性。实验表明,基于10万张图像块的弱监督训练使ViT模型在测试集上达到91.85%准确率、0.92 F1分数和0.98 AUC,验证了该方法在临床深度学习模型开发中的可行性与有效性。
链接: https://arxiv.org/abs/2604.13795
作者: Nghia(Andy)Nguyen,Amer Wahed,Andy Quesada,Yasir Ali,Hanadi El Achi,Y. Helen Zhang,Jocelyn Ursua,Alex Banerjee,Sahib Kalra,L. Jeffrey Medeiros,Jie Xu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 23 pages, 6 figures, 1 table
Abstract:Vision transformers (ViT) have been shown to allow for more flexible feature detection and can outperform convolutional neural network (CNN) when pre-trained on sufficient data. Due to their promising feature detection capabilities, we deployed ViTs for morphological classification of anaplastic large cell lymphoma (ALCL) versus classic Hodgkin lymphoma (cHL). We had previously designed a ViT model which was trained on a small dataset of 1,200 image patches in fully supervised training. That model achieved a diagnostic accuracy of 100% and an F1 score of 1.0 on the independent test set. Since fully supervised training is not a practical method due to lack of expertise resources in both the training and testing phases, we conducted a recent study on a modified approach to training data (weakly supervised training) and show that labeling training image patch automatically at the slide level of each whole-slide-image is a more practical solution for clinical use of Vision Transformer. Our ViT model, trained on a larger dataset of 100,000 image patches, yields evaluation metrics with significant accuracy, F1 score, and area under the curve (AUC) at 91.85%, 0.92, and 0.98, respectively. These are respectable values that qualify this ViT model, with weakly supervised training, as a suitable tool for a deep learning module in clinical model development using automated image patch extraction.
[CV-35] From Synchrony to Sequence: Exo-to-Ego Generation via Interpolation
【速读】:该论文旨在解决同步第三人称(Exo)到第一人称(Ego)视频生成任务中因数据同步导致的时空不连续性问题,这种不连续性违背了传统视频生成模型对平滑运动的假设,从而影响生成视频的连贯性。其解决方案的关键在于提出一种名为Syn2Seq-Forcing的序列建模框架,通过将源视频与目标视频在时序上插值形成单一连续信号,重构Exo-to-Ego任务为序列信号建模而非传统的条件输出任务,使基于扩散机制的序列模型(如Diffusion Forcing Transformers, DFoT)能够更有效地捕捉帧间一致的过渡,实验表明仅对视频进行插值即可显著提升性能,凸显了时空不连续性是核心挑战。
链接: https://arxiv.org/abs/2604.13793
作者: Mohammad Mahdi,Nedko Savov,Danda Pani Paudel,Luc Van Gool
机构: INSAIT; Sofia University “St. Kliment Ohridski” (索非亚大学“圣克莱门特·奥赫里德斯基”)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Exo-to-Ego video generation aims to synthesize a first-person video from a synchronized third-person view and corresponding camera poses. While paired supervision is available, synchronized exo-ego data inherently introduces substantial spatio-temporal and geometric discontinuities, violating the smooth-motion assumptions of standard video generation benchmarks. We identify this synchronization-induced jump as the central challenge and propose Syn2Seq-Forcing, a sequential formulation that interpolates between the source and target videos to form a single continuous signal. By reframing Exo2Ego as sequential signal modeling rather than a conventional condition-output task, our approach enables diffusion-based sequence models, e.g. Diffusion Forcing Transformers (DFoT), to capture coherent transitions across frames more effectively. Empirically, we show that interpolating only the videos, without performing pose interpolation already produces significant improvements, emphasizing that the dominant difficulty arises from spatio-temporal discontinuities. Beyond immediate performance gains, this formulation establishes a general and flexible framework capable of unifying both Exo2Ego and Ego2Exo generation within a single continuous sequence model, providing a principled foundation for future research in cross-view video synthesis.
[CV-36] PBE-UNet: A light weight Progressive Boundary-Enhanced U-Net with Scale-Aware Aggregation for Ultrasound Image Segmentation
【速读】:该论文旨在解决超声图像中病灶分割的挑战,尤其是由低对比度、模糊边界和显著尺度变化导致的分割精度不足问题。现有基于深度学习的方法在处理尺度变化和不清晰肿瘤边界时仍存在局限性。解决方案的关键在于提出一种渐进式边界增强U-Net(PBE-UNet),其核心创新包括两个模块:一是尺度感知聚合模块(Scale-Aware Aggregation Module, SAAM),通过动态调整感受野来捕获鲁棒的多尺度上下文信息;二是边界引导特征增强模块(Boundary-Guided Feature Enhancement, BGFE),该模块将狭窄的边界预测逐步扩展为更广的空间注意力图,从而有效覆盖分割误差较大的区域并增强模型对这些困难区域的关注能力。
链接: https://arxiv.org/abs/2604.13791
作者: Chen Wang,Yixin Zhu,Yongbin Zhu,Fengyuan Shi,Qi Li,Jun Wang,Zuozhu Liu,Keli Hu
机构: Shaoxing University (绍兴大学); Northeastern University (东北大学); Chongqing University (重庆大学); Hangzhou City University (杭州城市大学); Zhejiang University (浙江大学); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 14 figures
Abstract:Accurate lesion segmentation in ultrasound images is essential for preventive screening and clinical diagnosis, yet remains challenging due to low contrast, blurry boundaries, and significant scale variations. Although existing deep learning-based methods have achieved remarkable performance, these methods still struggle with scale variations and indistinct tumor boundaries. To address these challenges, we propose a progressive boundary enhanced U-Net (PBE-UNet). Specially, we first introduce a scale-aware aggregation module (SAAM) that dynamically adjusts its receptive field to capture robust multi-scale contextual information. Then, we propose a boundary-guided feature enhancement (BGFE) module to enhance the feature representations. We find that there are large gaps between the narrow boundary and the wide segmentation error areas. Unlike existing methods that treat boundaries as static masks, the BGFE module progressively expands the narrow boundary prediction into broader spatial attention maps. Thus, broader spatial attention maps could effectively cover the wider segmentation error regions and enhance the model’s focus on these challenging areas. We conduct expensive experiments on four benchmark ultrasound datasets, BUSI, Dataset B, TN3K, and BP. The experimental results how that our proposed PBE-UNet outperforms state-of-the-art ultrasound image segmentation methods. The code is at this https URL.
[CV-37] mporally Consistent Long-Term Memory for 3D Single Object Tracking CVPR2026
【速读】:该论文旨在解决3D单目标跟踪(3D-SOT)中长期上下文建模能力不足的问题,现有基于记忆的方法受限于短期帧数,主要原因是时间特征不一致性与内存开销过大。解决方案的关键在于提出ChronoTrack框架,其通过一组可学习的记忆令牌(memory tokens)实现高效且鲁棒的长时记忆机制:一方面引入时间一致性损失(temporal consistency loss)以对齐跨帧特征,缓解时间漂移;另一方面设计记忆循环一致性损失(memory cycle consistency loss),通过记忆-点-记忆的循环遍历策略促使每个令牌编码序列中多样且判别性的目标表征。该方法在保持紧凑记忆结构的同时显著提升长期跟踪性能,并在单张RTX 4090 GPU上达到42 FPS的实时速度。
链接: https://arxiv.org/abs/2604.13789
作者: Jaejoon Yoo,SuBeen Lee,Yerim Jeon,Miso Lee,Jae-Pil Heo
机构: Sungkyunkwan University (成均馆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026 Findings
Abstract:3D Single Object Tracking (3D-SOT) aims to localize a target object across a sequence of LiDAR point clouds, given its 3D bounding box in the first frame. Recent methods have adopted a memory-based approach to utilize previously observed features of the target object, but remain limited to only a few recent frames. This work reveals that their temporal capacity is fundamentally constrained to short-term context due to severe temporal feature inconsistency and excessive memory overhead. To this end, we propose a robust long-term 3D-SOT framework, ChronoTrack, which preserves the temporal feature consistency while efficiently aggregating the diverse target features via long-term memory. Based on a compact set of learnable memory tokens, ChronoTrack leverages long-term information through two complementary objectives: a temporal consistency loss and a memory cycle consistency loss. The former enforces feature alignment across frames, alleviating temporal drift and improving the reliability of proposed long-term memory. In parallel, the latter encourages each token to encode diverse and discriminative target representations observed throughout the sequence via memory-point-memory cyclic walks. As a result, ChronoTrack achieves new state-of-the-art performance on multiple 3D-SOT benchmarks, demonstrating its effectiveness in long-term target modeling with compact memory while running at real-time speed of 42 FPS on a single RTX 4090 GPU. The code is available at this https URL
[CV-38] Failure Identification in Imitation Learning Via Statistical and Semantic Filtering ICRA2026
【速读】:该论文旨在解决模仿学习(Imitation Learning, IL)在机器人实际部署中因罕见事件(如硬件故障、缺陷部件、意外人类行为等)导致执行失败的问题,尤其是现有基于视觉的异常检测(Vision-based Anomaly Detection, AD)方法无法区分真实失败与良性偏差的局限性。解决方案的关键在于提出FIDeL(Failure Identification in Demonstration Learning),一个与策略无关的失败检测模块:首先利用先进的AD方法构建规范演示的紧凑表征,并通过最优传输匹配对输入观测进行对齐以生成异常分数和热图;随后基于扩展的合规预测(conformal prediction)推导时空阈值,并结合视觉-语言模型(Vision-Language Model, VLM)进行语义过滤,从而有效区分良性异常与真实失败。该方案在自建的多模态真实任务数据集BotFails上显著优于现有基线,异常检测AUROC提升5.30%,失败检测准确率提升17.38%。
链接: https://arxiv.org/abs/2604.13788
作者: Quentin Rolland,Fabrice Mayran de Chamisso,Jean-Baptiste Mouret
机构: Université Paris-Saclay, CEA, List, F-91120, Palaiseau, France; Inria, CNRS, Université de Lorraine, LORIA, F-54000 Nancy, France; Bleu Robotics, Paris, France
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, Appendix coming soon, accepted at ICRA 2026
Abstract:Imitation learning (IL) policies in robotics deliver strong performance in controlled settings but remain brittle in real-world deployments: rare events such as hardware faults, defective parts, unexpected human actions, or any state that lies outside the training distribution can lead to failed executions. Vision-based Anomaly Detection (AD) methods emerged as an appropriate solution to detect these anomalous failure states but do not distinguish failures from benign deviations. We introduce FIDeL (Failure Identification in Demonstration Learning), a policy-independent failure detection module. Leveraging recent AD methods, FIDeL builds a compact representation of nominal demonstrations and aligns incoming observations via optimal transport matching to produce anomaly scores and heatmaps. Spatio-temporal thresholds are derived with an extension of conformal prediction, and a Vision-Language Model (VLM) performs semantic filtering to discriminate benign anomalies from genuine failures. We also introduce BotFails, a multimodal dataset of real-world tasks for failure detection in robotics. FIDeL consistently outperforms state-of-the-art baselines, yielding +5.30% percent AUROC in anomaly detection and +17.38% percent failure-detection accuracy on BotFails compared to existing methods.
[CV-39] Design and Behavior of Sparse Mixture-of-Experts Layers in CNN-based Semantic Segmentation CVPR2026
【速读】:该论文旨在解决如何将稀疏混合专家(Sparse Mixture-of-Experts, MoE)层有效集成到卷积神经网络(Convolutional Neural Networks, CNNs)中以提升密集预测任务(如语义分割)性能的问题。当前MoE多用于Transformer架构,而在CNN中的应用仍不一致,且多数方法聚焦于细粒度的滤波器或通道级专家划分。本文的关键解决方案是提出一种粗粒度的、基于图像块(patch-wise)的稀疏MoE设计,其中局部区域被路由至一小部分卷积专家进行处理;通过在Cityscapes和BDD100K数据集上对编码器-解码器与骨干网络结构的系统性实验,验证了该方案可在几乎无额外计算开销的情况下实现显著性能提升(最高达+3.9 mIoU),并揭示了架构选择对路由动态和专家专业化程度的强敏感性。
链接: https://arxiv.org/abs/2604.13761
作者: Svetlana Pavlitska,Haixi Fan,Konstantin Ditschuneit,J. Marius Zöllner
机构: FZI Research Center for Information Technology (弗劳恩霍夫信息技术研究中心); Karlsruhe Institute of Technology (卡尔斯鲁厄理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted for publication at the SAIAD workshop at CVPR 2026
Abstract:Sparse mixture-of-experts (MoE) layers have been shown to substantially increase model capacity without a proportional increase in computational cost and are widely used in transformer architectures, where they typically replace feed-forward network blocks. In contrast, integrating sparse MoE layers into convolutional neural networks (CNNs) remains inconsistent, with most prior work focusing on fine-grained MoEs operating at the filter or channel levels. In this work, we investigate a coarser, patch-wise formulation of sparse MoE layers for semantic segmentation, where local regions are routed to a small subset of convolutional experts. Through experiments on the Cityscapes and BDD100K datasets using encoder-decoder and backbone-based CNNs, we conduct a design analysis to assess how architectural choices affect routing dynamics and expert specialization. Our results demonstrate consistent, architecture-dependent improvements (up to +3.9 mIoU) with little computational overhead, while revealing strong design sensitivity. Our work provides empirical insights into the design and internal dynamics of sparse MoE layers in CNN-based dense prediction. Our code is available at this https URL.
[CV-40] ClipGStream: Clip-Stream Gaussian Splatting for Any Length and Any Motion Multi-View Dynamic Scene Reconstruction CVPR2026
【速读】:该论文旨在解决长时多视角动态场景重建中面临的挑战,即如何在保证高时间连贯性和结构一致性的同时,实现高效且可扩展的重建方法。现有动态高斯方法通常分为帧级(Frame-Stream)和片段级(Clip)两类:前者虽具可扩展性但时间稳定性差,后者虽能保持局部一致性却存在内存消耗大和序列长度受限的问题。其解决方案的关键在于提出一种混合式框架 ClipGStream,该框架在片段级别进行流优化(clip-level stream optimization),通过引入独立于片段的时空场(spatio-temporal fields)与残差锚点补偿机制来高效捕捉局部运动变化,同时利用跨片段继承的锚点和解码器维持全局结构一致性,从而在减少内存开销的前提下实现高质量、无闪烁的长视频动态重建。
链接: https://arxiv.org/abs/2604.13746
作者: Jie Liang,Jiahao Wu,Chao Wang,Jiayu Yang,Xiaoyun Zheng,Kaiqiang Xiong,Zhanke Wang,Jinbo Yan,Feng Gao,Ronggang Wang
机构: Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology, Shenzhen Graduate School, Peking University; Pengcheng Laboratory; Peking University; MIGU Video Co., Ltd.
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026, Project pages: this https URL
Abstract:Dynamic 3D scene reconstruction is essential for immersive media such as VR, MR, and XR, yet remains challenging for long multi-view sequences with large-scale motion. Existing dynamic Gaussian approaches are either Frame-Stream, offering scalability but poor temporal stability, or Clip, achieving local consistency at the cost of high memory and limited sequence length. We propose ClipGStream, a hybrid reconstruction framework that performs stream optimization at the clip level rather than the frame level. The sequence is divided into short clips, where dynamic motion is modeled using clip-independent spatio-temporal fields and residual anchor compensation to capture local variations efficiently, while inter-clip inherited anchors and decoders maintain structural consistency across clips. This Clip-Stream design enables scalable, flicker-free reconstruction of long dynamic videos with high temporal coherence and reduced memory overhead. Extensive experiments demonstrate that ClipGStream achieves state-of-the-art reconstruction quality and efficiency. The project page is available at: this https URL
[CV-41] ReConText3D: Replay-based Continual Text-to-3D Generation CVPR
【速读】:该论文旨在解决文本到3D生成(text-to-3D generation)模型在持续学习(continual learning)场景下的灾难性遗忘(catastrophic forgetting)问题,即模型在增量学习新3D类别时无法保留对先前类别的生成能力。解决方案的关键在于提出ReConText3D框架,其核心创新包括:通过文本嵌入的k-Center聚类方法构建紧凑且多样化的回放记忆(replay memory),实现无需修改模型架构即可高效重现历史知识;同时引入Toy4K-CL基准数据集,提供平衡且语义多样的类别增量划分,从而系统评估持续学习性能。实验表明,该方法在多种生成骨干网络上均显著优于基线,保持了对旧类和新类的高质量生成能力。
链接: https://arxiv.org/abs/2604.13730
作者: Muhammad Ahmed Ullah Khan,Muhammad Haris Bin Amir,Didier Stricker,Muhammad Zeshan Afzal
机构: DFKI; RPTU Kaiserslautern-Landau
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR Findings 2026
Abstract:Continual learning enables models to acquire new knowledge over time while retaining previously learned capabilities. However, its application to text-to-3D generation remains unexplored. We present ReConText3D, the first framework for continual text-to-3D generation. We first demonstrate that existing text-to-3D models suffer from catastrophic forgetting under incremental training. ReConText3D enables generative models to incrementally learn new 3D categories from textual descriptions while preserving the ability to synthesize previously seen assets. Our method constructs a compact and diverse replay memory through text-embedding k-Center selection, allowing representative rehearsal of prior knowledge without modifying the underlying architecture. To systematically evaluate continual text-to-3D learning, we introduce Toys4K-CL, a benchmark derived from the Toys4K dataset that provides balanced and semantically diverse class-incremental splits. Extensive experiments on the Toys4K-CL benchmark show that ReConText3D consistently outperforms all baselines across different generative backbones, maintaining high-quality generation for both old and new classes. To the best of our knowledge, this work establishes the first continual learning framework and benchmark for text-to-3D generation, opening a new direction for incremental 3D generative modeling. Project page is available at: this https URL.
[CV-42] Granularity-Aware Transfer for Tree Instance Segmentation in Synthetic and Real Forests
【速读】:该论文旨在解决林业感知中合成数据到真实数据的域迁移(synthetic-to-real transfer)问题,其中真实数据仅包含粗粒度的树类别标签,而合成数据则提供细粒度的树干(trunk)与树冠(crown)标注。为应对这一挑战,作者提出了一种四阶段协议,用于分离领域差异(domain shift)与标注粒度不匹配(granularity mismatch)。其核心解决方案是“粒度感知蒸馏”(granularity-aware distillation),通过在logit空间合并和掩码统一(mask unification)策略,将合成数据中蕴含的结构先验知识从细粒度教师模型有效迁移到仅具有粗标签的学生模型中,从而显著提升小尺度或远距离树木的分割性能(mask AP增益明显),为在标签粒度受限条件下的仿真-真实迁移提供了可靠基准。
链接: https://arxiv.org/abs/2604.13722
作者: Pankaj Deoli,Atef Tej,Anmol Ashri,Anandatirtha JS,Karsten Berns
机构: University of Kaiserslautern-Landau (凯撒斯劳滕-兰道大学); Robotics Research Lab (机器人研究实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We address the challenge of synthetic-to-real transfer in forestry perception where real data have only coarse Tree labels while synthetic data provide fine-grained trunk/crown annotations. We introduce MGTD, a mixed-granularity dataset with 53k synthetic and 3.6k real images, and a four-stage protocol isolating domain shift and granularity mismatch. Our core contribution is granularity-aware distillation, which transfers structural priors from fine-grained synthetic teachers to a coarse-label student via logit-space merging and mask unification. Experiments show consistent mask AP gains, especially for small/distant trees, establishing a testbed for Sim-Real transfer under label granularity constraints.
[CV-43] SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLM s
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在检索任务中适应性不足的问题,尤其是现有方法如全量微调(full fine-tuning)和低秩适配(LoRA)等依赖侵入式参数更新,可能破坏预训练语义空间并削弱推理所需的结构化知识。解决方案的关键在于提出SLQ框架,通过引入少量共享潜在查询(Shared Latent Queries, SLQs),以非侵入方式激活MLLM中已有的预训练表征:这些查询被附加到文本与图像token序列末尾,利用模型原生因果注意力机制作为全局聚合接口,在不修改主干网络的前提下生成统一空间中的紧凑嵌入,从而实现高效且有效的检索适配。
链接: https://arxiv.org/abs/2604.13710
作者: Haoran Lou,Ziyan Liu,Chunxiao Fan,Yuexin Wu,Yue Ming
机构: Beijing University of Posts and Telecommunications (北京邮电大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multimodal Large Language Models (MLLMs) exhibit strong reasoning and world knowledge, yet adapting them for retrieval remains challenging. Existing approaches rely on invasive parameter updates, such as full fine-tuning and LoRA, which may disrupt the pre-trained semantic space and impair the structured knowledge essential for reasoning. In this work, we argue that adapting MLLMs for retrieval should focus on eliciting pre-trained representations rather than overwriting them. To this end, we propose SLQ, an effective and efficient framework that adapts a frozen MLLM into a retriever through a small set of Shared Latent Queries. Appended to the end of both text and image token sequences, these queries leverage the model’s native causal attention to serve as global aggregation interfaces, producing compact embeddings in a unified space while keeping the backbone unchanged. Furthermore, to better evaluate retrieval beyond superficial pattern matching, we construct KARR-Bench, a benchmark designed for knowledge-aware reasoning retrieval. Extensive experiments show that SLQ outperforms full fine-tuning and LoRA on COCO and Flickr30K, while achieving competitive performance on MMEB and yielding substantial gains on KARR-Bench. The results demonstrate that SLQ, which preserves pre-trained representations, provides an effective and efficient framework for adapting MLLMs to retrieval.
[CV-44] Med-CAM: Minimal Evidence for Explaining Medical Decision Making
【速读】:该论文旨在解决医学影像中深度学习模型缺乏可解释性的问题,即现有医疗人工智能系统多为“黑箱”决策模式,无法提供清晰、可信的诊断依据,从而限制了临床医生对AI结果的理解与信任。解决方案的关键在于提出Med-CAM框架,通过Classifier Activation Matching(分类器激活匹配)机制,从零开始训练一个分割网络,生成最小且锐利的掩码(mask),以精准定位支撑模型决策的关键证据区域。该方法确保解释既忠实于模型行为,又具备临床可读性,显著优于传统空间解释方法(如Grad-CAM和注意力图),能够准确捕捉形状、纹理和边界等细节,实现高保真、证据驱动的诊断解释。
链接: https://arxiv.org/abs/2604.13695
作者: Pirzada Suhail,Aditya Anand,Amit Sethi
机构: IIT Bombay (印度理工学院孟买分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Reliable and interpretable decision-making is essential in medical imaging, where diagnostic outcomes directly influence patient care. Despite advances in deep learning, most medical AI systems operate as opaque black boxes, providing little insight into why a particular diagnosis was reached. In this paper, we introduce Med-CAM, a framework for generating minimal and sharp maps as evidence-based explanations for Medical decision making via Classifier Activation Matching. Med-CAM trains a segmentation network from scratch to produce a mask that highlights the minimal evidence critical to model’s decision for any seen or unseen image. This ensures that the explanation is both faithful to the network’s behaviour and interpretable to clinicians. Experiments show, unlike prior spatial explanation methods, such as Grad-CAM and attention maps, which yield only fuzzy regions of relative importance, Med-CAM with its superior spatial awareness to shapes, textures, and boundaries, delivers conclusive, evidence-based explanations that faithfully replicate the model’s prediction for any given image. By explicitly constraining explanations to be compact, consistent with model activations, and diagnostic alignment, Med-CAM advances transparent AI to foster clinician understanding and trust in high-stakes medical applications such as pathology and radiology.
[CV-45] Beyond Voxel 3D Editing: Learning from 3D Masks and Self-Constructed Data
【速读】:该论文旨在解决3D编辑中语义一致性与局部不变性难以兼顾的问题,现有方法如多视图编辑在投影回3D时易引入失真,而体素(Voxel)-based方法则受限于可编辑区域范围和修改尺度。其解决方案的关键在于提出Beyond Voxel 3D Editing(BVE)框架,该框架包含自建的大规模3D编辑数据集,并在此基础上通过轻量级可训练模块增强基础图像到3D生成架构,实现无需全模型重训练即可高效注入文本语义信息;同时引入无标注的3D掩码策略以保留局部不变性,从而在编辑过程中保持未修改区域的视觉一致性。
链接: https://arxiv.org/abs/2604.13688
作者: Yizhao Xu,Hongyuan Zhu,Caiyun Liu,Tianfu Wang,Keyu Chen,Sicheng Xu,Jiaolong Yang,Nicholas Jing Yuan,Qi Zhang
机构: Peking University (北京大学); HKUST(GZ) (香港科技大学(广州)); Microsoft AI (微软人工智能); Microsoft Research (微软研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:3D editing refers to the ability to apply local or global modifications to 3D assets. Effective 3D editing requires maintaining semantic consistency by performing localized changes according to prompts, while also preserving local invariance so that unchanged regions remain consistent with the original. However, existing approaches have significant limitations: multi-view editing methods incur losses when projecting back to 3D, while voxel-based editing is constrained in both the regions that can be modified and the scale of modifications. Moreover, the lack of sufficiently large editing datasets for training and evaluation remains a challenge. To address these challenges, we propose a Beyond Voxel 3D Editing (BVE) framework with a self-constructed large-scale dataset specifically tailored for 3D editing. Building upon this dataset, our model enhances a foundational image-to-3D generative architecture with lightweight, trainable modules, enabling efficient injection of textual semantics without the need for expensive full-model retraining. Furthermore, we introduce an annotation-free 3D masking strategy to preserve local invariance, maintaining the integrity of unchanged regions during editing. Extensive experiments demonstrate that BVE achieves superior performance in generating high-quality, text-aligned 3D assets, while faithfully retaining the visual characteristics of the original input.
[CV-46] From Pixels to Nucleotides: End-to-End Token-Based Video Compression for DNA Storag e
【速读】:该论文旨在解决视频数据在DNA分子中高效存储的难题,其核心挑战在于传统方法将视频压缩与DNA编码分阶段处理,导致生物化学约束与压缩目标之间存在根本性错位。解决方案的关键在于提出HELIX框架,这是一个端到端神经网络,首次联合优化视频压缩与DNA编码过程;其创新点在于利用基于token的表示形式天然契合DNA的四碱基字母表(ATCG),并通过TK-SCONE(Token-Kronecker Structured Constraint-Optimized Neural Encoding)实现:采用Kronecker结构混合打破空间相关性,并通过有限状态机(FSM)映射确保生化可行性,从而在视觉质量、掩码预测和DNA合成效率三方面同步优化token分布,实现了1.91比特/核苷酸的编码效率,标志着神经视频编解码器从设计之初即面向生物载体的新范式。
链接: https://arxiv.org/abs/2604.13667
作者: Cihan Ruan,Lebin Zhou,Bingqing Zhao,Rongduo Han,Qiming Yuan,Chenchen Zhu,Linyi Han,Liang Yang,Wei Wang,Wei Jiang,Nam Ling
机构: Santa Clara University (圣克拉拉大学); Stanford University (斯坦福大学); Nankai University (南开大学); Futurewei Technologies (未来wei科技)
类目: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)
备注:
Abstract:DNA-based storage has emerged as a promising approach to the global data crisis, offering molecular-scale density and millennial-scale stability at low maintenance cost. Over the past decade, substantial progress has been made in storing text, images, and files in DNA – yet video remains an open challenge. The difficulty is not merely technical: effective video DNA storage requires co-designing compression and molecular encoding from the ground up, a challenge that sits at the intersection of two fields that have largely evolved independently. In this work, we present HELIX, the first end-to-end neural network jointly optimizing video compression and DNA encoding – prior approaches treat the two stages independently, leaving biochemical constraints and compression objectives fundamentally misaligned. Our key insight: token-based representations naturally align with DNA’s quaternary alphabet – discrete semantic units map directly to ATCG bases. We introduce TK-SCONE (Token-Kronecker Structured Constraint-Optimized Neural Encoding), which achieves 1.91 bits per nucleotide through Kronecker-structured mixing that breaks spatial correlations and FSM-based mapping that guarantees biochemical constraints. Unlike two-stage approaches, HELIX learns token distributions simultaneously optimized for visual quality, prediction under masking, and DNA synthesis efficiency. This work demonstrates for the first time that learned compression and molecular storage converge naturally at token representations – suggesting a new paradigm where neural video codecs are designed for biological substrates from the ground up.
[CV-47] VRAG -DFD: Verifiable Retrieval-Augmentation for MLLM -based Deepfake Detection
【速读】:该论文旨在解决当前基于多模态大语言模型(Multimodal Large Language Models, MLLMs)的Deepfake检测(Deepfake Detection, DFD)方法中存在两大核心问题:一是缺乏高质量、动态关联的伪造知识供给,导致模型性能受限;二是面对噪声参考信息时,MLLMs难以具备关键推理能力。为解决上述问题,论文提出VRAG-DFD框架,其关键在于融合检索增强生成(Retrieval-Augmented Generation, RAG)与强化学习(Reinforcement Learning, RL)技术:RAG实现从自建伪造知识库(Forensic Knowledge Database, FKD)和伪造思维链数据集(Forensic Chain-of-Thought Dataset, F-CoT)中精准动态检索伪造知识,而RL则通过三阶段训练策略(对齐-监督微调-GRPO)逐步培养模型的关键推理能力,从而显著提升DFD任务中的泛化性能。
链接: https://arxiv.org/abs/2604.13660
作者: Hui Han,Shunli Wang,Yandan Zhao,Taiping Yao,Shouhong Ding
机构: Shanghai Jiao Tong University (上海交通大学); Tencent Youtu Lab (腾讯优图实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In Deepfake Detection (DFD) tasks, researchers proposed two types of MLLM-based methods: complementary combination with small DFD detectors, or static forgery knowledge this http URL lack of professional forgery knowledge hinders the performance of these this http URL solve this, we deeply considered two insightful issues: How to provide high-quality associated forgery knowledge for MLLMs? AND How to endow MLLMs with critical reasoning abilities given noisy reference information? Notably, we attempted to address above two questions with preliminary answers by leveraging the combination of Retrieval-Augmented Generation (RAG) and Reinforcement Learning (RL).Through RAG and RL techniques, we propose the VRAG-DFD framework with accurate dynamic forgery knowledge retrieval and powerful critical reasoning this http URL, in terms of data, we constructed two datasets with RAG: Forensic Knowledge Database (FKD) for DFD knowledge annotation, and Forensic Chain-of-Thought Dataset (F-CoT), for critical CoT this http URL terms of model training, we adopt a three-stage training method (Alignment-SFT-GRPO) to gradually cultivate the critical reasoning ability of the this http URL terms of performance, VRAG-DFD achieved SOTA and competitive performance on DFD generalization testing.
[CV-48] ESCAPE: Episodic Spatial Memory and Adaptive Execution Policy for Long-Horizon Mobile Manipulation
【速读】:该论文旨在解决复杂室内环境中具身智能体在长时程任务中面临的灾难性遗忘(catastrophic forgetting)、空间不一致性(spatial inconsistency)以及执行僵化等问题。其解决方案的关键在于提出ESCPE(Episodic Spatial Memory Coupled with an Adaptive Policy for Execution),通过感知-定位-执行的紧密耦合工作流实现鲁棒性和灵活性:一方面,利用时空融合映射模块构建无深度依赖的持久3D空间记忆,并结合记忆驱动的目标定位模块生成精确交互掩码;另一方面,采用自适应执行策略动态协调全局主动导航与局部反应式操作,以捕捉机会目标。该方法在ALFRED基准上取得最优性能,显著提升路径长度加权指标并保持无详细指导下的长期任务鲁棒性。
链接: https://arxiv.org/abs/2604.13633
作者: Jingjing Qian,Zeyuan He,Chen Shi,Lei Xiao,Li Jiang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:Coordinating navigation and manipulation with robust performance is essential for embodied AI in complex indoor environments. However, as tasks extend over long horizons, existing methods often struggle due to catastrophic forgetting, spatial inconsistency, and rigid execution. To address these issues, we propose ESCAPE (Episodic Spatial Memory Coupled with an Adaptive Policy for Execution), operating through a tightly coupled perception-grounding-execution workflow. For robust perception, ESCAPE features a Spatio-Temporal Fusion Mapping module to autoregressively construct a depth-free, persistent 3D spatial memory, alongside a Memory-Driven Target Grounding module for precise interaction mask generation. To achieve flexible action, our Adaptive Execution Policy dynamically orchestrates proactive global navigation and reactive local manipulation to seize opportunistic targets. ESCAPE achieves state-of-the-art performance on the ALFRED benchmark, reaching 65.09% and 60.79% success rates in test seen and unseen environments with step-by-step instructions. By reducing redundant exploration, our ESCAPE attains substantial improvements in path-length-weighted metrics and maintains robust performance (61.24% / 56.04%) even without detailed guidance for long-horizon tasks.
[CV-49] What Are We Really Measuring? Rethinking Dataset Bias in Web-Scale Natural Image Collections via Unsupervised Semantic Clustering
【速读】:该论文旨在解决当前用于评估数据集偏倚(dataset bias)的主流方法——即通过训练分类模型区分不同数据集并以高准确率作为语义差异证据——所存在的根本性缺陷。作者指出,这种做法假设标准图像增强能有效抑制低层非语义线索,但实证表明在大规模自然图像集合中,高分类准确率主要由分辨率相关的结构伪影(resolution-based artifacts)驱动,这些伪影是图像原始分辨率分布和重缩放过程中插值效应形成的稳定指纹,即使在常规图像损坏下仍存在。为克服此问题,论文提出一种无监督的新方案:不依赖数据集标签进行监督分类,而是直接利用基础视觉模型提取的语义丰富特征进行聚类,从而更真实地衡量语义可分性。关键创新在于绕过监督分类流程,转而基于聚类结果评估语义相似性,实验显示该方法在主流网络规模数据集上显著降低了传统方法报告的“高可分性”,揭示了现有评估方式对语义偏倚的严重高估。
链接: https://arxiv.org/abs/2604.13610
作者: Amir Hossein Saleknia,Mohammad Sabokrou
机构: Iran University of Science and Technology (伊朗科学技术大学); Okinawa Institute of Science and Technology (冲绳科学技术大学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In computer vision, a prevailing method for quantifying dataset bias is to train a model to distinguish between datasets. High classification accuracy is then interpreted as evidence of meaningful semantic differences. This approach assumes that standard image augmentations successfully suppress low-level, non-semantic cues, and that any remaining performance must therefore reflect true semantic divergence. We demonstrate that this fundamental assumption is flawed within the domain of large-scale natural image collections. High classification accuracy is often driven by resolution-based artifacts, which are structural fingerprints arising from native image resolution distributions and interpolation effects during resizing. These artifacts form robust, dataset-specific signatures that persist despite conventional image corruptions. Through controlled experiments, we show that models achieve strong dataset classification even on non-semantic, procedurally generated images, proving their reliance on superficial cues. To address this issue, we revisit this decades-old idea of dataset separability, but not with supervised classification. Instead, we introduce an unsupervised approach that measures true semantic separability. Our framework directly assesses semantic similarity by clustering semantically-rich features from foundational vision models, deliberately bypassing supervised classification on dataset labels. When applied to major web-scale datasets, the primary focus of this work, the high separability reported by supervised methods largely vanishes, with clustering accuracy dropping to near-chance levels. This reveals that conventional classification-based evaluation systematically overstates semantic bias by an overwhelming margin.
[CV-50] VGGT-Segmentor: Geometry-Enhanced Cross-View Segmentation
【速读】:该论文旨在解决跨视角(egocentric与exocentric)的实例级目标分割问题,该任务在具身智能(embodied AI)和远程协作等场景中至关重要,但因尺度、视角和遮挡变化导致像素级匹配不稳定。现有几何感知模型如VGGT虽能实现特征对齐,但在密集预测任务中常因像素级投影漂移而失效。解决方案的关键在于提出VGGT-Segmentor(VGGT-S),其核心创新是引入一种三阶段联合分割头(Union Segmentation Head):首先融合掩码提示,其次通过点引导预测,最后进行迭代掩码优化,从而将高层特征对齐精准映射为像素级分割掩码;同时设计单图自监督训练策略,无需成对标注即可实现强泛化能力,显著提升跨视角分割性能,在Ego-Exo4D基准上达到67.7%和68.0%的平均交并比(IoU),超越此前最先进方法。
链接: https://arxiv.org/abs/2604.13596
作者: Yulu Gao,Bohao Zhang,Zongheng Tang,Jitong Liao,Wenjun Wu,Si Liu
机构: Beihang University (北京航空航天大学); Hangzhou International Innovation Institute of Beihang University (北京航空航天大学杭州国际创新研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Instance-level object segmentation across disparate egocentric and exocentric views is a fundamental challenge in visual understanding, critical for applications in embodied AI and remote collaboration. This task is exceptionally difficult due to severe changes in scale, perspective, and occlusion, which destabilize direct pixel-level matching. While recent geometry-aware models like VGGT provide a strong foundation for feature alignment, we find they often fail at dense prediction tasks due to significant pixel-level projection drift, even when their internal object-level textntion remains consistent. To bridge this gap, we introduce VGGT-Segmentor (VGGT-S), a framework that unifies robust geometric modeling with pixel-accurate semantic segmentation. VGGT-S leverages VGGT’s powerful cross-view feature representation and introduces a novel Union Segmentation Head. This head operates in three stages: mask prompt fusion, point-guided prediction, and iterative mask refinement, effectively translating high-level feature alignment into a precise segmentation mask. Furthermore, we propose a single-image self-supervised training strategy that eliminates the need for paired annotations and enables strong generalization. On the Ego-Exo4D benchmark, VGGT-S sets a new state-of-the-art, achieving 67.7% and 68.0% average IoU for Ego to Exo and Exo to Ego tasks, respectively, significantly outperforming prior methods. Notably, our correspondence-free pretrained model surpasses most fully-supervised baselines, demonstrating the effectiveness and scalability of our approach.
[CV-51] Dehaze-then-Splat: Generative Dehazing with Physics-Informed 3D Gaussian Splatting for Smoke-Free Novel View Synthesis
【速读】:该论文旨在解决多视角烟雾遮蔽场景下的图像恢复与新视角合成问题,核心挑战在于如何在逐帧生成式去烟(generative dehazing)过程中保持跨视图的一致性,以避免因单帧优化导致的渲染模糊和结构不稳定。解决方案的关键在于提出“先去烟再点绘”(Dehaze-then-Splat)两阶段框架:第一阶段利用Nano Banana Pro进行逐帧生成式去烟并结合亮度归一化生成伪干净图像;第二阶段在3D Gaussian Splatting(3DGS)训练中引入物理信息辅助损失——包括伪深度的皮尔逊相关性监督、暗通道先验正则化以及双源梯度匹配,有效补偿了帧级处理带来的跨视图不一致性,从而显著提升新视角合成质量。
链接: https://arxiv.org/abs/2604.13589
作者: Yuchao Chen,Hanqing Wang
机构: Huazhong University of Science and Technology (华中科技大学); The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present Dehaze-then-Splat, a two-stage pipeline for multi-view smoke removal and novel view synthesis developed for Track~2 of the NTIRE 2026 3D Restoration and Reconstruction Challenge. In the first stage, we produce pseudo-clean training images via per-frame generative dehazing using Nano Banana Pro, followed by brightness normalization. In the second stage, we train 3D Gaussian Splatting (3DGS) with physics-informed auxiliary losses – depth supervision via Pearson correlation with pseudo-depth, dark channel prior regularization, and dual-source gradient matching – that compensate for cross-view inconsistencies inherent in frame-wise generative processing. We identify a fundamental tension in dehaze-then-reconstruct pipelines: per-image restoration quality does not guarantee multi-view consistency, and such inconsistency manifests as blurred renders and structural instability in downstream 3D this http URL analysis shows that MCMC-based densification with early stopping, combined with depth and haze-suppression priors, effectively mitigates these artifacts. On the Akikaze validation scene, our pipeline achieves 20.98,dB PSNR and 0.683 SSIM for novel view synthesis, a +1.50,dB improvement over the unregularized baseline.
[CV-52] Efficient Multi-View 3D Object Detection by Dynamic Token Selection and Fine-Tuning
【速读】:该论文旨在解决基于视觉Transformer(Vision Transformer, ViT)的多视角三维目标检测方法在计算复杂度高、训练效率低的问题。现有最先进方法ToC3D虽采用基于自车运动的相关token选择策略以提升效率,但仍存在固定层间token选择比例导致训练与推理阶段计算资源利用率不足,且需对ViT骨干网络进行全端到端重训练的局限性。其解决方案的关键在于:提出一种图像token补偿机制与动态层级token选择策略相结合的方法,实现ViT骨干网络中各层token数量的自适应调整;同时引入参数高效的微调策略,仅训练新增模块(参数量从超过3亿降至160万),显著降低计算开销并保持检测精度提升。实验表明,该方法在NuScenes数据集上将计算复杂度降低48%–55%,推理延迟减少9%–25%,同时平均精度(mAP)和NuScenes检测分数分别提升1.0%–2.8%和0.4%–1.2%。
链接: https://arxiv.org/abs/2604.13586
作者: Danish Nazir,Antoine Hanna-Asaad,Lucas Görnhardt,Jan Piewek,Thorsten Bagdonat,Tim Fingscheidt
机构: Volkswagen AG (大众汽车集团); Technische Universität Braunschweig (不伦瑞克工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Existing multi-view three-dimensional (3D) object detection approaches widely adopt large-scale pre-trained vision transformer (ViT)-based foundation models as backbones, being computationally complex. To address this problem, current state-of-the-art (SOTA) \textttToC3D for efficient multi-view ViT-based 3D object detection employs ego-motion-based relevant token selection. However, there are two key limitations: (1) The fixed layer-individual token selection ratios limit computational efficiency during both training and inference. (2) Full end-to-end retraining of the ViT backbone is required for the multi-view 3D object detection method. In this work, we propose an image token compensator combined with a token selection for ViT backbones to accelerate multi-view 3D object detection. Unlike \textttToC3D, our approach enables dynamic layer-wise token selection within the ViT backbone. Furthermore, we introduce a parameter-efficient fine-tuning strategy, which trains only the proposed modules, thereby reducing the number of fine-tuned parameters from more than 300 million (M) to only 1.6 M. Experiments on the large-scale NuScenes dataset across three multi-view 3D object detection approaches demonstrate that our proposed method decreases computational complexity (GFLOPs) by 48% … 55% , inference latency (on an \textttNVIDIA-GV100 GPU) by 9% … 25% , while still improving mean average precision by 1.0% … 2.8% absolute and NuScenes detection score by 0.4% … 1.2% absolute compared to so-far SOTA \textttToC3D.
[CV-53] SocialMirror: Reconstructing 3D Human Interaction Behaviors from Monocular Videos with Semantic and Geometric Guidance
【速读】:该论文旨在解决单目视频中近距离交互场景下人体重建的挑战,尤其是由严重相互遮挡导致的局部运动模糊、时间连续性破坏及空间关系错误等问题。解决方案的关键在于提出了一种基于扩散模型的框架SocialMirror,其核心创新包括:首先利用视觉-语言模型生成的高层交互描述引导语义引导的运动补全模块,以推断被遮挡的人体部位并消除局部姿态歧义;其次设计了一个序列级时间精修模块,在采样过程中引入几何约束,确保动作平滑无抖动的同时维持合理的接触关系与空间一致性。该方法在多个交互基准上实现了最先进的重建性能,展现出对未见数据集和真实场景的强大泛化能力。
链接: https://arxiv.org/abs/2604.13581
作者: Qi Xia,Peishan Cong,Ziyi Wang,Yujing Sun,Qin Sun,Xinge Zhu,Mao Ye,Ruigang Yang,Yuexin Ma
机构: ShanghaiTech University (上海科技大学); Nanyang Technological University (南洋理工大学); Guangzhou Institute of Energy Conversion, CAS (中国科学院广州能源研究所); University of Science and Technology of China (中国科学技术大学); The Chinese University of Hong Kong (香港中文大学); Inceptio Technology (智驾科技); Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurately reconstructing human behavior in close-interaction scenarios is crucial for enabling realistic virtual interactions in augmented reality, precise motion analysis in sports, and natural collaborative behavior in human-robot tasks. Reliable reconstruction in these contexts significantly enhances the realism and effectiveness of AI-driven interactive applications. However, human reconstruction from monocular videos in close-interaction scenarios remains challenging due to severe mutual occlusions, leading local motion ambiguity, disrupted temporal continuity and spatial relationship error. In this paper, we propose SocialMirror, a diffusion-based framework that integrates semantic and geometric cues to effectively address these issues. Specifically, we first leverage high-level interaction descriptions generated by a vision-language model to guide a semantic-guided motion infiller, hallucinating occluded bodies and resolving local pose ambiguities. Next, we propose a sequence-level temporal refiner that enforces smooth, jitter-free motions, while incorporating geometric constraints during sampling to ensure plausible contact and spatial relationships. Evaluations on multiple interaction benchmarks show that SocialMirror achieves state-of-the-art performance in reconstructing interactive human meshes, demonstrating strong generalization across unseen datasets and in-the-wild scenarios. The code will be released upon publication.
[CV-54] Radar-Informed 3D Multi-Object Tracking under Adverse Conditions
【速读】:该论文旨在解决3D多目标跟踪(3D Multi-Object Tracking, 3D MOT)在真实场景中面临的鲁棒性不足问题,特别是在恶劣环境条件和远距离情况下目标检测与跟踪性能下降的问题。现有基于多模态融合的方法通常将雷达视为网络内部的一个特征学习模块,导致在复杂环境下雷达所提供的稳定感知优势被削弱。论文提出RadarMOT框架,其关键创新在于显式地利用雷达点云数据作为额外观测信息,用于优化状态估计并恢复远距离下的检测遗漏,从而提升跟踪的连续性和准确性。实验表明,在MAN-TruckScenes数据集上,RadarMOT在长距离和恶劣天气条件下分别实现了AMOTA指标绝对提升12.7%和10.3%。
链接: https://arxiv.org/abs/2604.13571
作者: Bingxue Xu,Emil Hedemalm,Ajinkya Khoche,Patric Jensfelt
机构: KTH Royal Institute of Technology (皇家理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 5 figures
Abstract:The challenge of 3D multi-object tracking (3D MOT) is achieving robustness in real-world applications, for example under adverse conditions and maintaining consistency as distance increases. To overcome these challenges, sensor fusion approaches that combine LiDAR, cameras, and radar have emerged. However, existing multi-modal fusion methods usually treat radar as another learned feature inside the network. When the overall model degrades in difficult environmental conditions, the robustness advantages that radar could provide are also reduced. We propose RadarMOT, a radar-informed 3D MOT framework that explicitly uses radar point cloud data as additional observation to refine state estimation and recover detector misses at long ranges. Evaluations on the MAN-TruckScenes dataset show that RadarMOT consistently improves the Average Multi-Object Tracking Accuracy (AMOTA) with absolute 12.7% at long range and 10.3% in adverse weather. The code will be available at this https URL
[CV-55] ZoomSpec: A Physics-Guided Coarse-to-Fine Framework for Wideband Spectrum Sensing
【速读】:该论文旨在解决低空监测中宽带频谱感知的难题,其核心挑战在于异构协议、大带宽以及非平稳信噪比(SNR)导致的传统数据驱动方法在处理时频分辨率约束和频谱泄漏问题时表现不佳,进而影响窄带信号的可见性。解决方案的关键在于提出一种物理引导的粗到精框架 ZoomSpec,通过引入对数空间短时傅里叶变换(Log-Space STFT, LS-STFT)克服线性谱图的几何瓶颈,保持恒定相对分辨率的同时增强窄带结构;同时设计轻量级粗提案网络(Coarse Proposal Net, CPN)实现全带宽快速筛查,并利用自适应外差低通(Adaptive Heterodyne Low-Pass, AHLP)模块完成中心频率对齐、带宽匹配滤波与安全下采样,有效抑制带外干扰;最终由精细识别网络(Fine Recognition Net, FRN)融合净化后的时域I/Q信号与频谱幅值,借助双域注意力机制联合优化时间边界和调制分类,从而显著提升检测精度与稳定性。
链接: https://arxiv.org/abs/2604.13568
作者: Zhentao Yang,Yixiang Luomei,Zhuoyang Liu,Zhenyu Liu,Feng Xu
机构: Fudan University (复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 8 figures, 5 tables
Abstract:Wideband spectrum sensing for low-altitude monitoring is critical yet challenging due to heterogeneous protocols,large bandwidths, and non-stationary SNR. Existing data-driven approaches treat spectrograms as natural images,suffering from domain mismatch: they neglect time-frequency resolution constraints and spectral leakage, leading topoor narrowband visibility. This paper proposes ZoomSpec, a physics-guided coarse-to-fine framework integrating signal processing priors with deep learning. We introduce a Log-Space STFT (LS-STFT) to overcome the geometric bottleneck of linear spectrograms, sharpening narrowband structures while maintaining constant relative resolution. A lightweight Coarse Proposal Net (CPN) rapidly screens the full band. To bridge coarse detection and fine recognition, we design an Adaptive Heterodyne Low-Pass (AHLP) module that executes center-frequency aligning, bandwidth-matched filtering, and safe decimation, purifying signals of out-of-band interference. A Fine Recognition Net (FRN) fuses purified time-domain I/Q with spectral magnitude via dual-domain attention to jointly refine temporal boundaries and modulation classification. Evaluations on the SpaceNet real-world dataset demonstrate state-of-the-art 78.1 mAP@0.5:0.95, surpassing existing leaderboard systems with superior stability across diverse modulation bandwidths.
[CV-56] UHR-BAT: Budget-Aware Token Compression Vision-Language model for Ultra-High-Resolution Remote Sensing
【速读】:该论文旨在解决超高分辨率(Ultra-high-resolution, UHR)遥感影像中因空间尺度巨大导致的视觉令牌(visual tokens)数量呈二次爆炸式增长,从而阻碍小目标信息提取的问题。现有方法如直接下采样、密集切片或全局top-k剪枝,要么牺牲查询关键细节,要么带来不可预测的计算开销。其解决方案的关键在于提出一种查询引导且区域忠实的令牌压缩框架UHR-BAT:首先利用文本引导的多尺度重要性估计对视觉令牌进行精准筛选,实现高效且低成本的特征提取;其次通过区域级保留与合并策略减少冗余令牌,进一步降低计算预算,在多个基准测试中达到最优性能。
链接: https://arxiv.org/abs/2604.13565
作者: Yunkai Dang,Minxin Dai,Yuekun Yang,Zhangnan Li,Wenbin Li,Feng Miao,Yang Gao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Ultra-high-resolution (UHR) remote sensing imagery couples kilometer-scale context with query-critical evidence that may occupy only a few pixels. Such vast spatial scale leads to a quadratic explosion of visual tokens and hinders the extraction of information from small objects. Previous works utilize direct downsampling, dense tiling, or global top-k pruning, which either compromise query-critical image details or incur unpredictable compute. In this paper, we propose UHR-BAT, a query-guided and region-faithful token compression framework to efficiently select visual tokens under a strict context budget. Specifically, we leverage text-guided, multi-scale importance estimation for visual tokens, effectively tackling the challenge of achieving precise yet low-cost feature extraction. Furthermore, by introducing region-wise preserve and merge strategies, we mitigate visual token redundancy, further driving down the computational budget. Experimental results show that UHR-BAT achieves state-of-the-art performance across various benchmarks. Code will be available at this https URL.
[CV-57] CLIP Architecture for Abdominal CT Image-Text Alignment and Zero-Shot Learning: Investigating Batch Composition and Data Scaling
【速读】:该论文旨在解决3D医学影像中视觉-语言模型(Vision-Language Model, VLM)训练时批次组成(batch composition)对学习表征质量的影响问题,尤其是针对正常与异常样本比例及数据规模变化下的性能表现。其关键解决方案在于揭示:相较于人为设计的类别平衡策略(如50:50采样),随机采样所引入的批次内多样性,配合模型自身在解剖子区域上的交替批处理机制,能在小批量(small batch size)条件下提供更有效的正则化效果,从而提升零样本诊断性能。实验表明,强制类平衡反而会显著降低模型性能,而自然随机采样则能更好地利用有限的3D医学数据资源。
链接: https://arxiv.org/abs/2604.13561
作者: Shivika,Kartik Bose,Pankaj Gupta
机构: PGIMER(全印度医学科学研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Vision-language models trained with contrastive learning on paired medical images and reports show strong zero-shot diagnostic capabilities, yet the effect of training batch composition on learned representations remains unexplored for 3D medical imaging. We reproduce Merlin, a dual-encoder model that aligns 3D abdominal CT volumes with radiology reports using symmetric InfoNCE loss, achieving a zero-shot macro F1 of 74.45% across 30 findings (original: 73.00%). We then investigate two axes of variation. First, we control the normal-to-abnormal ratio within training batches at 25:75, 50:50, and 75:25 using section-level balanced sampling on the full dataset. All three configurations underperform the unbalanced baseline by 2.4 to 2.8 points, with 75:25 achieving the best result (72.02%) among balanced variants. Second, we conduct data scaling ablations on a 4,362-study subset, training with 20%, 40%, and 100% of the data. Performance scales sub-linearly from 65.26% to 71.88%, with individual findings varying dramatically in data sensitivity. Enforcing 50:50 balanced sampling on the same subset further degrades performance to 68.01%, confirming that explicit class balancing hurts regardless of dataset or balancing granularity. Our results indicate that the stochastic diversity of random sampling, combined with Merlin’s alternating batching over anatomical subsections, provides more effective regularization than engineered class ratios at the small batch sizes required by 3D medical volumes.
[CV-58] AI Powered Image Analysis for Phishing Detection
【速读】:该论文旨在解决当前钓鱼网站日益依赖视觉仿冒(如复制logo、相似布局和配色)以规避基于文本和URL的检测系统的问题。解决方案的关键在于提出一种基于网页截图的深度学习检测框架,采用ConvNeXt-Tiny和Vision Transformer(ViT-Base)两种视觉模型进行图像级钓鱼页面识别,并通过迁移学习(使用ImageNet预训练权重)、阈值调优及多指标评估(精度、召回率、F1分数)实现更贴近实际部署环境的性能优化。实验表明,ConvNeXt-Tiny在F1分数和计算效率上均优于ViT-Base,凸显了卷积神经网络在视觉钓鱼检测中的优势,同时强调了决策阈值调整对平衡检测准确性和误报控制的重要性。
链接: https://arxiv.org/abs/2604.13555
作者: K. Acharya,S. Ale,R. Kadel
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Networking and Internet Architecture (cs.NI)
备注: 8 pages, 3 figures
Abstract:Phishing websites now rely heavily on visual imitation-copied logos, similar layouts, and matching colours-to avoid detection by text- and URL-based systems. This paper presents a deep learning approach that uses webpage screenshots for image-based phishing detection. Two vision models, ConvNeXt-Tiny and Vision Transformer (ViT-Base), were tested to see how well they handle visually deceptive phishing pages. The framework covers dataset creation, preprocessing, transfer learning with ImageNet weights, and evaluation using different decision thresholds. The results show that ConvNeXt-Tiny performs the best overall, achieving the highest F1-score at the optimised threshold and running more efficiently than ViT-Base. This highlights the strength of convolutional models for visual phishing detection and shows why threshold tuning is important for real-world deployment. As future work, the curated dataset used in this study will be released to support reproducibility and encourage further research in this area. Unlike many existing studies that primarily report accuracy, this work places greater emphasis on threshold-aware evaluation to better reflect real-world deployment conditions. By examining precision, recall, and F1-score across different decision thresholds, the study identifies operating points that balance detection performance and false-alarm control. In addition, the side-by-side comparison of ConvNeXt-Tiny and ViT-Base under the same experimental setup offers practical insights into how convolutional and transformer-based architectures differ in robustness and computational efficiency for visual phishing detection.
[CV-59] Reconstruction of a 3D wireframe from a single line drawing via generative depth estimation
【速读】:该论文旨在解决将二维手绘草图转换为三维模型的问题(2D freehand sketch to 3D model conversion),这是连接人类创造力与数字制造的关键挑战。传统方法依赖脆弱的符号逻辑或受限于刚性的参数化建模,导致用户只能使用预定义的CAD基本体。其解决方案的核心在于将重建任务建模为条件密集深度估计问题,并采用基于ControlNet结构的潜在扩散模型(Latent Diffusion Model, LDM)来处理正交投影固有的歧义性;同时引入一种基于图的BFS掩码策略以模拟部分深度线索,支持“草图-重构-再草图”的迭代工作流,从而实现从稀疏2D线稿到稠密3D表示的鲁棒映射,使用户无需受制于传统CAD的刚性约束即可“在三维空间中作画”。
链接: https://arxiv.org/abs/2604.13549
作者: Elton Cao,Hod Lipson
机构: Creative Machines Lab, Columbia University (哥伦比亚大学创意机器实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The conversion of 2D freehand sketches into 3D models remains a pivotal challenge in computer vision, bridging the gap between human creativity and digital fabrication. Traditional line drawing reconstruction relies on brittle symbolic logic, while modern approaches are constrained by rigid parametric modeling, limiting users to predefined CAD primitives. We propose a generative approach by framing reconstruction as a conditional dense depth estimation task. To achieve this, we implement a Latent Diffusion Model (LDM) with a ControlNet-style conditioning framework to resolve the inherent ambiguities of orthographic projections. To support an iterative “sketch-reconstruct-sketch” workflow, we introduce a graph-based BFS masking strategy to simulate partial depth cues. We train and evaluate our approach using a massive dataset of over one million image-depth pairs derived from the ABC Dataset. Our framework demonstrates robust performance across varying shape complexities, providing a scalable pipeline for converting sparse 2D line drawings into dense 3D representations, effectively allowing users to “draw in 3D” without the rigid constraints of traditional CAD.
[CV-60] Free Lunch for Unified Multimodal Models: Enhancing Generation via Reflective Rectification with Inherent Understanding
【速读】:该论文旨在解决统一多模态模型(Unified Multimodal Models, UMMs)中存在的能力失衡问题,即模型在视觉理解任务中表现优异,但在生成任务中能力显著不足。其核心问题是模型内部丰富的知识在生成过程中未能被充分激活。解决方案的关键在于提出一种无需训练的统一修正链式思维框架(UniRect-CoT),该框架将扩散去噪过程视为内在的视觉推理过程,并通过对比中间结果与模型所理解的目标指令,构建自监督信号以持续修正生成过程中的中间结果,从而激活模型内部知识,提升生成质量。
链接: https://arxiv.org/abs/2604.13540
作者: Yibo Jiang,Tao Wu,Rui Jiang,Yehao Lu,Chaoxiang Cai,Zequn Qin,Xi Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Unified Multimodal Models (UMMs) aim to integrate visual understanding and generation within a single structure. However, these models exhibit a notable capability mismatch, where their understanding capability significantly outperforms their generation. This mismatch indicates that the model’s rich internal knowledge, while effective for understanding tasks, remains underactivated during generation. To address this, we draw inspiration from the human Thinking-While-Drawing'' paradigm, where humans continuously reflect to activate their knowledge and rectify intermediate results. In this paper, we propose UniRect-CoT, a training-free unified rectification chain-of-thought framework. Our approach unlocks the free lunch’’ hidden in the UMM’s powerful inherent understanding to continuously reflect, activating its internal knowledge and rectifying intermediate results during this http URL regard the diffusion denoising process in UMMs as an intrinsic visual reasoning process and align the intermediate results with the target instruction understood by the model, serving as a self-supervisory signal to rectify UMM this http URL experiments demonstrate that UniRect-CoT can be easily integrated into existing UMMs, significantly enhancing generation quality across diverse complex tasks.
[CV-61] Evolvable Embodied Agent for Robotic Manipulation via Long Short-Term Reflection and Optimization IJCNN2026
【速读】:该论文旨在解决通用机器人在复杂环境中适应性不足、跨任务泛化能力弱以及缺乏可解释性的难题,尤其针对传统方法依赖大量训练数据且难以从经验中持续学习的问题。其解决方案的关键在于提出一种可进化具身智能体(Evolvable Embodied Agent, EEAgent)框架,该框架利用大规模视觉-语言模型(Vision-Language Models, VLMs)提升环境理解与策略规划能力,并引入长短期反思优化(Long Short-Term Reflective Optimization, LSTRO)机制,动态地基于历史经验与新习得知识迭代优化提示(prompt),从而实现无需额外训练的持续自我演化,显著提升任务成功率,尤其在复杂场景下表现优越。
链接: https://arxiv.org/abs/2604.13533
作者: Jianzong Wang,Botao Zhao,Yayun He,Junqing Peng,Xulong Zhang
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: This work has been accepted for publication in the Proceedings of the 2026 International Joint Conference on Neural Networks (IJCNN 2026)
Abstract:Achieving general-purpose robotics requires empowering robots to adapt and evolve based on their environment and feedback. Traditional methods face limitations such as extensive training requirements, difficulties in cross-task generalization, and lack of interpretability. Prompt learning offers new opportunities for self-evolving robots without extensive training, but simply reflecting on past this http URL, extracting meaningful insights from task successes and failures remains a challenge. To this end, we propose the evolvable embodied agent (EEAgent) framework, which leverages large vision-language models (VLMs) for better environmental interpretation and policy planning. To enhance reflection on past experiences, we propose a long short-term reflective optimization (LSTRO) mechanism that dynamically refines prompts based on both past experiences and newly learned lessons, facilitating continuous self-evolution, thereby enhancing overall task success rates. Evaluations on six VIMA-Bench tasks reveal that our approach sets a new state-of-the-art, notably outperforming baselines in complex scenarios.
[CV-62] DiT as Real-Time Rerenderer: Streaming Video Stylization with Autoregressive Diffusion Transformer
【速读】:该论文旨在解决基于扩散模型的视频风格化方法在处理长视频时稳定性差、一致性不足,以及计算成本高、多步去噪导致难以实际应用的问题。其解决方案的关键在于提出了一种基于Diffusion Transformer(DiT)的实时重渲染框架RTR-DiT,通过在精心构建的视频风格化数据集上微调双向教师模型以支持文本引导和参考图像引导两种任务,并利用自强制(Self Forcing)和分布匹配蒸馏(Distribution Matching Distillation)技术将教师模型压缩为少步自回归模型;同时设计了参考保留的键值缓存(KV cache)更新策略,有效保障长视频处理的稳定性和一致性,并支持实时切换文本提示与参考图像,从而实现高效、高质量的实时视频风格化及交互式风格切换。
链接: https://arxiv.org/abs/2604.13509
作者: Hengye Lyu,Zisu Li,Yue Hong,Yueting Weng,Jiaxin Shi,Hanwang Zhang,Chen Liang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in video generation models has significantly accelerated video generation and related downstream tasks. Among these, video stylization holds important research value in areas such as immersive applications and artistic creation, attracting widespread attention. However, existing diffusion-based video stylization methods struggle to maintain stability and consistency when processing long videos, and their high computational cost and multi-step denoising make them difficult to apply in practical scenarios. In this work, we propose RTR-DiT (DiT as Real-Time Rerenderer), a steaming video stylization framework built upon Diffusion Transformer. We first fine-tune a bidirectional teacher model on a curated video stylization dataset, supporting both text-guided and reference-guided video stylization tasks, and subsequently distill it into a few-step autoregressive model via post-training with Self Forcing and Distribution Matching Distillation. Furthermore, we propose a reference-preserving KV cache update strategy that not only enables stable and consistent processing of long videos, but also supports real-time switching between text prompts and reference images. Experimental results show that RTR-DiT outperforms existing methods in both text-guided and reference-guided video stylization tasks, in terms of quantitative metrics and visual quality, and demonstrates excellent performance in real-time long video stylization and interactive style-switching applications.
[CV-63] Enhancing Mixture-of-Experts Specialization via Cluster-Aware Upcycling CVPR2026
【速读】:该论文旨在解决稀疏专家模型(Mixture-of-Experts, MoE)在从预训练稠密权重进行初始化时存在的专家对称性问题和早期专业化不足的问题。现有方法如Sparse Upcycling由于所有专家初始权重相同且路由器随机初始化,导致专家间缺乏差异化,难以快速适应数据分布。解决方案的关键在于提出一种聚类感知的上采样策略(Cluster-aware Upcycling):首先基于输入激活的语义结构将稠密模型的特征划分为若干语义簇;随后,每个专家通过对应簇的截断奇异值分解(truncated SVD)子空间表示进行初始化,并将路由器初始权重设为对应簇的中心点。这一策略打破了专家对称性并引导早期专业化,同时引入专家集成自蒸馏损失以稳定训练,从而显著提升零样本与少样本任务性能,并促进更解耦、多样化的专家表征。
链接: https://arxiv.org/abs/2604.13508
作者: Sanghyeok Chu,Pyunghwan Ahn,Gwangmo Song,SeungHwan Kim,Honglak Lee,Bohyung Han
机构: LG AI Research; Seoul National University; University of Michigan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Comments: Accepted to CVPR 2026. Project page: this https URL
Abstract:Sparse Upcycling provides an efficient way to initialize a Mixture-of-Experts (MoE) model from pretrained dense weights instead of training from scratch. However, since all experts start from identical weights and the router is randomly initialized, the model suffers from expert symmetry and limited early specialization. We propose Cluster-aware Upcycling, a strategy that incorporates semantic structure into MoE initialization. Our method first partitions the dense model’s input activations into semantic clusters. Each expert is then initialized using the subspace representations of its corresponding cluster via truncated SVD, while setting the router’s initial weights to the cluster centroids. This cluster-aware initialization breaks expert symmetry and encourages early specialization aligned with the data distribution. Furthermore, we introduce an expert-ensemble self-distillation loss that stabilizes training by providing reliable routing guidance using an ensemble teacher. When evaluated on CLIP ViT-B/32 and ViT-B/16, Cluster-aware Upcycling consistently outperforms existing methods across both zero-shot and few-shot benchmarks. The proposed method also produces more diverse and disentangled expert representations, reduces inter-expert similarity, and leads to more confident routing behavior.
[CV-64] ADP-DiT: Text-Guided Diffusion Transformer for Brain Image Generation in Alzheimers Disease Progression ICPR2026
【速读】:该论文旨在解决阿尔茨海默病(Alzheimer’s disease, AD)在个体间进展异质性带来的挑战,尤其是如何实现基于受试者特异性条件的纵向磁共振成像(MRI)合成,以支持更精准的疾病进展评估。其解决方案的关键在于提出ADP-DiT——一种具有时间感知能力且由临床文本条件驱动的扩散Transformer模型:通过双文本编码器(OpenCLIP与T5)融合随访间隔、多维度人口统计学、诊断状态(CN/MCI/AD)及神经心理学信息为自然语言提示,借助交叉注意力机制和自适应层归一化实现细粒度控制;同时引入旋转位置嵌入(rotary positional embeddings)与预训练SDXL-VAE潜空间中的扩散过程,显著提升解剖结构保真度,最终在3,321例纵向3T T1加权扫描数据上实现了SSIM 0.8739和PSNR 29.32 dB,优于基线DiT模型,并准确捕捉了如脑室扩大和海马体萎缩等进展相关变化。
链接: https://arxiv.org/abs/2604.13495
作者: Juneyong Lee,Geonwoo Baek,Ikbeom Jang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 3 figures, accepted to ICPR 2026
Abstract:Alzheimer’s disease (AD) progresses heterogeneously across individuals, motivating subject-specific synthesis of follow-up magnetic resonance imaging (MRI) to support progression assessment. While Diffusion Transformers (DiT), an emerging transformer-based diffusion model, offer a scalable backbone for image synthesis, longitudinal AD MRI generation with clinically interpretable control over follow-up time and participant metadata remains underexplored. We present ADP-DiT, an interval-aware, clinically text-conditioned diffusion transformer for longitudinal AD MRI synthesis. ADP-DiT encodes follow-up interval together with multi-domain demographic, diagnostic (CN/MCI/AD), and neuropsychological information as a natural-language prompt, enabling time-specific control beyond coarse diagnostic stages. To inject this conditioning effectively, we use dual text encoders-OpenCLIP for vision-language alignment and T5 for richer clinical-language understanding. Their embeddings are fused into DiT through cross-attention for fine-grained guidance and adaptive layer normalization for global modulation. We further enhance anatomical fidelity by applying rotary positional embeddings to image tokens and performing diffusion in a pre-trained SDXL-VAE latent space to enable efficient high-resolution reconstruction. On 3,321 longitudinal 3T T1-weighted scans from 712 participants (259,038 image slices), ADP-DiT achieves SSIM 0.8739 and PSNR 29.32 dB, improving over a DiT baseline by +0.1087 SSIM and +6.08 dB PSNR while capturing progression-related changes such as ventricular enlargement and shrinking hippocampus. These results suggest that integrating comprehensive, subject-specific clinical conditions with architectures can improve longitudinal AD MRI synthesis.
[CV-65] RadarSplat-RIO: Indoor Radar-Inertial Odometry with Gaussian Splatting-Based Radar Bundle Adjustment
【速读】:该论文旨在解决雷达同步定位与建图(Radar SLAM)中因依赖帧间里程计而导致的显著位姿漂移问题,尤其在缺乏回环检测条件下的长期精度不足。现有方法虽可借助回环闭合纠正累积误差,但受限于环境重访需求和鲁棒性差的场景识别能力。为突破此瓶颈,作者提出首个基于高斯点绘(Gaussian Splatting, GS)的雷达束调整(Radar Bundle Adjustment, BA)框架,其关键在于利用GS构建稠密且可微分的场景表示,从而实现对雷达传感器位姿与场景几何结构的联合优化——首次将多帧BA的优势引入雷达SLAM系统,显著提升了位姿估计的精度与鲁棒性,在多个室内场景下使平均绝对平移误差和旋转误差分别降低90%和80%。
链接: https://arxiv.org/abs/2604.13492
作者: Pou-Chun Kung,Yuan Tian,Zhengqin Li,Yue Liu,Eric Whitmire,Wolf Kienzle,Hrvoje Benko
机构: Meta Reality Labs (Meta现实实验室); University of Michigan (密歇根大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Radar is more resilient to adverse weather and lighting conditions than visual and Lidar simultaneous localization and mapping (SLAM). However, most radar SLAM pipelines still rely heavily on frame-to-frame odometry, which leads to substantial drift. While loop closure can correct long-term errors, it requires revisiting places and relies on robust place recognition. In contrast, visual odometry methods typically leverage bundle adjustment (BA) to jointly optimize poses and map within a local window. However, an equivalent BA formulation for radar has remained largely unexplored. We present the first radar BA framework enabled by Gaussian Splatting (GS), a dense and differentiable scene representation. Our method jointly optimizes radar sensor poses and scene geometry using full range-azimuth-Doppler data, bringing the benefits of multi-frame BA to radar for the first time. When integrated with an existing radar-inertial odometry frontend, our approach significantly reduces pose drift and improves robustness. Across multiple indoor scenes, our radar BA achieves substantial gains over the prior radar-inertial odometry, reducing average absolute translational and rotational errors by 90% and 80%, respectively.
[CV-66] Enhanced Text-to-Image Generation by Fine-grained Multimodal Reasoning
【速读】:该论文旨在解决统一多模态大语言模型(Unified Multimodal Large Language Models, MLLMs)在文本到图像生成任务中缺乏细粒度控制的问题。现有基于多模态推理的图像生成方法主要依赖整体图像-文本对齐判断,未能对提示词中的具体语义单元(如实体和属性)进行逐项反思与优化,导致生成结果难以精准匹配复杂提示。解决方案的关键在于提出细粒度多模态推理(Fine-grained Multimodal Reasoning, FiMR)框架,通过分解视觉问答(Visual Question Answering, VQA)将输入提示拆解为最小语义单元,并利用VQA逐项验证这些单元,从而获得明确的细粒度反馈;在此基础上,FiMR实施针对性、局部化的修正策略,实现测试时图像-提示对齐精度与整体生成质量的显著提升。
链接: https://arxiv.org/abs/2604.13491
作者: Yongjin Kim,Yoonjin Oh,Yerin Kim,Hyomin Kim,Jeeyoung Yun,Yujung Heo,Minjun Kim,Sungwoong Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:With the rapid progress of Multimodal Large Language Models (MLLMs), unified MLLMs that jointly perform image understanding and generation have advanced significantly. However, despite the inherent reasoning capabilities of unified MLLMs for self-reflection and self-refinement, their use in text-to-image generation remains largely underexplored. Meanwhile, existing multimodal reasoning-based image generation methods mostly rely on holistic image-text alignment judgments, without fine-grained reflection and refinement of detailed prompt attributes, leading to limited fine-grained control. Therefore, we propose Fine-grained Multimodal Reasoning (FiMR), a framework that leverages decomposed visual question answering (VQA) to break down an input prompt into minimal semantic units-such as entities and attributes-and verify each unit via VQA to generate explicit, fine-grained feedback. Based on this feedback, FiMR then applies targeted, localized refinements. This fine-grained self-reasoning and self-refinement enable MLLMs to achieve more precise improvements in image-prompt alignment and overall generation quality at test time. Extensive experiments demonstrate that FiMR consistently outperforms image generation baselines, including reasoning-based methods, particularly on compositional text-to-image benchmarks.
[CV-67] RobotPan: A 360circ Surround-View Robotic Vision System for Embodied Perception
【速读】:该论文旨在解决机器人导航与操作中因视觉感知受限而导致的效率低下问题,特别是在人机协同场景(如远程操控、数据采集和紧急接管)下,现有系统通常依赖窄视野前向摄像头或需手动切换多摄像头画面,导致操作中断,并引发头戴式显示器中的运动诱导眩晕(motion-induced jitter)。其解决方案的关键在于提出一种基于六目相机与激光雷达(LiDAR)融合的360°全景视觉系统,并设计了名为\textscRobotPan的前馈式框架,该框架通过将多视角特征映射到统一球面坐标表示,利用分层球体体素先验解码出具有度量尺度的紧凑3D高斯分布,在保证重建精度的同时显著减少高斯数量,从而实现高效实时渲染、重建与流传输,同时支持长时间序列的在线融合更新,有效控制静态区域的冗余增长。
链接: https://arxiv.org/abs/2604.13476
作者: Jiahao Ma,Qiang Zhang,Peiran Liu,Zeran Su,Pihai Sun,Gang Han,Wen Zhao,Wei Cui,Zhang Zhang,Zhiyuan Xu,Renjing Xu,Jian Tang,Miaomiao Liu,Yijie Guo
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Project website: this https URL
Abstract:Surround-view perception is increasingly important for robotic navigation and loco-manipulation, especially in human-in-the-loop settings such as teleoperation, data collection, and emergency takeover. However, current robotic visual interfaces are often limited to narrow forward-facing views, or, when multiple on-board cameras are available, require cumbersome manual switching that interrupts the operator’s workflow. Both configurations suffer from motion-induced jitter that causes simulator sickness in head-mounted displays. We introduce a surround-view robotic vision system that combines six cameras with LiDAR to provide full 360 ^\circ visual coverage, while meeting the geometric and real-time constraints of embodied deployment. We further present \textscRobotPan, a feed-forward framework that predicts \emphmetric-scaled and \emphcompact 3D Gaussians from calibrated sparse-view inputs for real-time rendering, reconstruction, and streaming. \textscRobotPan lifts multi-view features into a unified spherical coordinate representation and decodes Gaussians using hierarchical spherical voxel priors, allocating fine resolution near the robot and coarser resolution at larger radii to reduce computational redundancy without sacrificing fidelity. To support long sequences, our online fusion updates dynamic content while preventing unbounded growth in static regions by selectively updating appearance. Finally, we release a multi-sensor dataset tailored to 360 ^\circ novel view synthesis and metric 3D reconstruction for robotics, covering navigation, manipulation, and locomotion on real platforms. Experiments show that \textscRobotPan achieves competitive quality against prior feed-forward reconstruction and view-synthesis methods while producing substantially fewer Gaussians, enabling practical real-time embodied deployment. Project website: this https URL
[CV-68] MyoVision: A Mobile Research Tool and NEATBoost-Attention Ensemble Framework for Real Time Chicken Breast Myopathy Detection CVPR2026
【速读】:该论文旨在解决家禽肉品中木质肌(Woody Breast, WB)和意大利面状肌肉(Spaghetti Meat, SM)病变的低成本、非破坏性多类别分类问题,现有方法要么依赖主观人工评估,要么需要昂贵的实验室级成像设备。解决方案的关键在于提出MyoVision移动透射成像框架,通过消费级智能手机采集14位RAW图像,并提取反映内部组织异常的结构纹理特征;同时设计了一种NEATBoost-Attention集成模型,该模型基于神经进化优化权重融合LightGBM与注意力机制增强的多层感知机(MLP),利用NeuroEvolution of Augmenting Topologies (NEAT)自动发现超参数,无需人工调参,且适用于小规模表格数据集。实验表明,在336份商用加工鸡肉样本上达到82.4%测试准确率(F1=0.83),性能优于传统机器学习与深度学习基线,媲美成本高得多的高光谱成像系统,验证了消费级成像在可扩展内部组织质量评估中的可行性。
链接: https://arxiv.org/abs/2604.13456
作者: Chaitanya Pallerla,Siavash Mahmoudi,Dongyi Wang
机构: University of Arkansas (阿肯色大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR 2026 MetaFoods Workshop. 11 pages, 5 figures
Abstract:Woody Breast (WB) and Spaghetti Meat (SM) myopathies significantly impact poultry meat quality, yet current detection methods rely either on subjective manual evaluation or costly laboratory-grade imaging systems. We address the problem of low-cost, non-destructive multi-class myopathy classification using consumer smartphones. MyoVision is introduced as a mobile transillumination imaging framework in which 14-bit RAW images are captured and structural texture descriptors indicative of internal tissue abnormalities are extracted. To classify three categories (Normal, Woody Breast, Spaghetti Meat), we propose a NEATBoost-Attention Ensemble model, which is a neuroevolution-optimized weighted fusion of LightGBM and attention-based MLP models. Hyperparameters are automatically discovered using NeuroEvolution of Augmenting Topologies (NEAT), eliminating manual tuning and enabling architecture diversity for small tabular datasets. On a dataset of 336 fillets collected from a commercial processing facility, our method achieves 82.4% test accuracy (F1 = 0.83), outperforming conventional machine learning and deep learning baselines and matching performance reported by hyperspectral imaging systems costing orders of magnitude more. Beyond classification performance, MyoVision establishes a reproducible mobile RGB-D acquisition pipeline for multimodal meat quality research, demonstrating that consumer-grade imaging can support scalable internal tissue assessment.
[CV-69] A Study of Failure Modes in Two-Stage Human-Object Interaction Detection CVPR2026
【速读】:该论文旨在解决当前两阶段人类-物体交互(Human-Object Interaction, HOI)检测模型在复杂场景下表现不佳的问题,尤其是当涉及多人交互和稀有交互组合时,现有模型的失败模式缺乏系统性分析。其解决方案的关键在于将HOI检测任务分解为多个可解释的视角,并基于已有的HOI数据集构建按人-物-交互配置组织的子集(如多人交互、物体共享等),从而从不同场景构成维度系统分析模型行为与失败原因。这种方法揭示了高整体性能指标并不能反映模型对人类-物体关系的鲁棒视觉推理能力,为未来研究提供了重要洞见。
链接: https://arxiv.org/abs/2604.13448
作者: Lemeng Wang,Qinqian Lei,Vidhi Bakshi,Daniel Yi,Yifan Liu,Jiacheng Hou,Asher Seng Hao,Zheda Mai,Wei-Lun Chao,Robby T. Tan,Bo Wang
机构: The Ohio State University (俄亥俄州立大学); National University of Singapore (新加坡国立大学); Boston University (波士顿大学); Independent Researcher (独立研究员); University of Mississippi (密西西比大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to SAUAFG Workshop at CVPR 2026
Abstract:Human-object interaction (HOI) detection aims to detect interactions between humans and objects in images. While recent advances have improved performance on existing benchmarks, their evaluations mainly focus on overall prediction accuracy and provide limited insight into the underlying causes of model failures. In particular, modern models often struggle in complex scenes involving multiple people and rare interaction combinations. In this work, we present a study to better understand the failure modes of two-stage HOI models, which form the basis of many current HOI detection approaches. Rather than constructing a large-scale benchmark, we instead decompose HOI detection into multiple interpretable perspectives and analyze model behavior across these dimensions to study different types of failure patterns. We curate a subset of images from an existing HOI dataset organized by human-object-interaction configurations (e.g., multi-person interactions and object sharing), and analyze model behavior under these configurations to examine different failure modes. This design allows us to analyze how these HOI models behave under different scene compositions and why their predictions fail. Importantly, high overall benchmark performance does not necessarily reflect robust visual reasoning about human-object relationships. We hope that this study can provide useful insights into the limitations of HOI models and offer observations for future research in this area.
[CV-70] MaMe MaRe: Matrix-Based Token Merging and Restoration for Efficient Visual Perception and Synthesis CVPR2026
【速读】:该论文旨在解决视觉 Transformer (Vision Transformer, ViT) 中自注意力机制因输入 token 数量庞大而导致的二次方复杂度问题,从而限制了模型在实际应用中的效率。现有方法如 ToMe 依赖于 GPU 不友好的操作(如排序和分散写入),引入额外开销并制约加速效果。其解决方案的关键在于提出 MaMe(Matrix-based Merging),一种无需训练、完全基于矩阵运算的可微分 token 合并方法,具备 GPU 高效性;同时引入其逆操作 MaRe(Matrix-based Restoration),构建 MaMe+MaRe 管道用于图像生成任务。该方案在不显著损失精度的前提下,显著提升 ViT 的吞吐量与推理速度,并在视频理解与图像合成等多场景中实现性能与速度的协同优化。
链接: https://arxiv.org/abs/2604.13432
作者: Simin Huo,Ning Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 20 pages. Extended version of CVPR 2026 Findings paper. Neurocomputing (Elsevier) under review
Abstract:Token compression is crucial for mitigating the quadratic complexity of self-attention mechanisms in Vision Transformers (ViTs), which often involve numerous input tokens. Existing methods, such as ToMe, rely on GPU-inefficient operations (e.g., sorting, scattered writes), introducing overheads that limit their effectiveness. We introduce MaMe, a training-free, differentiable token merging method based entirely on matrix operations, which is GPU-friendly to accelerate ViTs. Additionally, we present MaRe, its inverse operation, for token restoration, forming a MaMe+MaRe pipeline for image synthesis. When applied to pre-trained models, MaMe doubles ViT-B throughput with a 2% accuracy drop. Notably, fine-tuning the last layer with MaMe boosts ViT-B accuracy by 1.0% at 1.1x speed. In SigLIP2-B@512 zero-shot classification, MaMe provides 1.3x acceleration with negligible performance degradation. In video tasks, MaMe accelerates VideoMAE-L by 48.5% on Kinetics-400 with only a 0.84% accuracy loss. Furthermore, MaMe achieves simultaneous improvements in both performance and speed on some tasks. In image synthesis, the MaMe+MaRe pipeline enhances quality while reducing Stable Diffusion v2.1 generation latency by 31%. Collectively, these results demonstrate MaMe’s and MaRe’s effectiveness in accelerating vision models. The code is available at this https URLthis https URL.
[CV-71] A Unified Conditional Flow for Motion Generation Editing and Intra-Structural Retargeting
【速读】:该论文旨在解决文本驱动的动作编辑(text-driven motion editing)与结构内重定向(intra-structural retargeting)任务在传统方法中因输入和表示不兼容而造成的碎片化处理问题,其中源与目标角色共享拓扑但可能具有不同的骨骼长度。解决方案的关键在于提出一个统一的生成框架,将这两个任务建模为同一类条件传输(conditional transport)问题,仅通过推理阶段调节不同的条件信号——语义(text)或结构(skeletal)来区分任务。作者基于流匹配(flow matching)技术构建了一个联合受文本提示和目标骨骼结构条件约束的修正流运动模型(rectified-flow motion model),并通过引入基于DiT架构的每关节标记化(per-joint tokenization)与显式关节自注意力机制(explicit joint self-attention)严格建模运动学依赖关系,并采用多条件无分类器引导策略(multi-condition classifier-free guidance)平衡文本一致性与骨骼结构符合度,从而实现单一模型下零样本动作生成、编辑与结构重定向的统一支持。
链接: https://arxiv.org/abs/2604.13427
作者: Junlin Li,Xinhao Song,Siqi Wang,Haibin Huang,Yili Zhao
机构: ByteDance(字节跳动)
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 7 figures
Abstract:Text-driven motion editing and intra-structural retargeting, where source and target share topology but may differ in bone lengths, are traditionally handled by fragmented pipelines with incompatible inputs and representations: editing relies on specialized generative steering, while retargeting is deferred to geometric post-processing. We present a unifying perspective where both tasks are cast as instances of conditional transport within a single generative framework. By leveraging recent advances in flow matching, we demonstrate that editing and retargeting are fundamentally the same generative task, distinguished only by which conditioning signal, semantic or structural, is modulated during inference. We implement this vision via a rectified-flow motion model jointly conditioned on text prompts and target skeletal structures. Our architecture extends a DiT-style transformer with per-joint tokenization and explicit joint self-attention to strictly enforce kinematic dependencies, while a multi-condition classifier-free guidance strategy balances text adherence with skeletal conformity. Experiments on SnapMoGen and a multi-character Mixamo subset show that a single trained model supports text-to-motion generation, zero-shot editing, and zero-shot intra-structural retargeting. This unified approach simplifies deployment and improves structural consistency compared to task-specific baselines.
[CV-72] Event-Adaptive State Transition and Gated Fusion for RGB-Event Object Tracking
【速读】:该论文旨在解决现有基于视觉Mamba的RGB-Event(RGBE)跟踪方法中因使用静态状态转移矩阵而导致的适应性不足问题,该问题在事件稀疏和密集场景下分别表现为模型欠拟合与过拟合,进而削弱了跨模态融合的鲁棒性。解决方案的关键在于提出一种动态状态空间模型(Dynamic State Space Model, DSSM),其核心包括两个创新:一是设计事件自适应的状态转移机制,通过可学习标量动态调节状态转移矩阵以区分稀疏与密集事件流的建模需求;二是构建门控投影融合(Gated Projection Fusion, GPF)模块,将RGB特征映射至事件特征空间,并利用事件密度和RGB置信度生成自适应门控信号,精确控制融合强度,在抑制噪声的同时保留互补信息。
链接: https://arxiv.org/abs/2604.13426
作者: Jinlin You,Muyu Li,Xudong Zhao
机构: Dalian University of Technology (大连理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Existing Vision Mamba-based RGB-Event(RGBE) tracking methods suffer from using static state transition matrices, which fail to adapt to variations in event sparsity. This rigidity leads to imbalanced modeling-underfitting sparse event streams and overfitting dense ones-thus degrading cross-modal fusion robustness. To address these limitations, we propose MambaTrack, a multimodal and efficient tracking framework built upon a Dynamic State Space Model(DSSM). Our contributions are twofold. First, we introduce an event-adaptive state transition mechanism that dynamically modulates the state transition matrix based on event stream density. A learnable scalar governs the state evolution rate, enabling differentiated modeling of sparse and dense event flows. Second, we develop a Gated Projection Fusion(GPF) module for robust cross-modal integration. This module projects RGB features into the event feature space and generates adaptive gates from event density and RGB confidence scores. These gates precisely control the fusion intensity, suppressing noise while preserving complementary information. Experiments show that MambaTrack achieves state-of-the-art performance on the FE108 and FELT datasets. Its lightweight design suggests potential for real-time embedded deployment.
[CV-73] VibeFlow: Versatile Video Chroma-Lux Editing through Self-Supervised Learning
【速读】:该论文旨在解决视频色度-光照编辑(video chroma-lux editing)问题,即在保持视频结构和时间一致性的同时,对光照与颜色进行修改。现有方法通常依赖于昂贵的监督训练及合成配对数据,难以泛化且计算成本高。其解决方案的关键在于提出一种名为 VibeFlow 的自监督框架,通过利用预训练视频生成模型的内在物理理解能力,设计了一种解耦的数据扰动流水线,使模型能够自适应地从源视频中重组结构信息,并从参考图像中提取颜色-光照线索,从而实现无需标注数据的鲁棒解耦;同时引入残差速度场(Residual Velocity Fields)与结构失真一致性正则化,以修正流模型固有的离散化误差,保障结构保真与时间连贯性,最终在零样本场景下实现多样化的视频编辑任务,如重光照、重着色、低光增强等,显著降低计算开销并提升视觉质量。
链接: https://arxiv.org/abs/2604.13425
作者: Yifan Li,Pei Cheng,Bin Fu,Shuai Yang,Jiaying Liu
机构: Wangxuan Institute of Computer Technology, Peking University (北京大学王选计算机研究所); Tencent (腾讯)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Video chroma-lux editing, which aims to modify illumination and color while preserving structural and temporal fidelity, remains a significant challenge. Existing methods typically rely on expensive supervised training with synthetic paired data. This paper proposes VibeFlow, a novel self-supervised framework that unleashes the intrinsic physical understanding of pre-trained video generation models. Instead of learning color and light transitions from scratch, we introduce a disentangled data perturbation pipeline that enforces the model to adaptively recombine structure from source videos and color-illumination cues from reference images, enabling robust disentanglement in a self-supervised manner. Furthermore, to rectify discretization errors inherent in flow-based models, we introduce Residual Velocity Fields alongside a Structural Distortion Consistency Regularization, ensuring rigorous structural preservation and temporal coherence. Our framework eliminates the need for costly training resources and generalizes in a zero-shot manner to diverse applications, including video relighting, recoloring, low-light enhancement, day-night translation, and object-specific color editing. Extensive experiments demonstrate that VibeFlow achieves impressive visual quality with significantly reduced computational overhead. Our project is publicly available at this https URL.
[CV-74] Physically-Guided Optical Inversion Enable Non-Contact Side-Channel Attack on Isolated Screens
【速读】:该论文旨在解决非接触式电子屏幕内容泄露的安全问题,其核心挑战源于侧信道攻击中投影映射的近奇异雅可比谱导致的哈达玛稳定性破坏,以及光传输过程中的不可逆压缩对全局语义线索的抹除,从而加剧重建模糊性。解决方案的关键在于提出一种名为IR4Net(Irradiance Robust Radiometric Inversion Network)的神经网络架构,其创新性地融合了物理正则化的辐照度近似(PRIrr-Approximation),将辐射传输方程嵌入可学习优化器中以增强物理一致性;同时引入轮廓到细节的跨尺度重建机制以抑制噪声传播,并通过不可逆性约束的语义重投影(ICSR)模块,基于上下文驱动的语义映射恢复丢失的全局结构,从而在四种场景类别上实现超越现有方法的重建保真度与光照扰动鲁棒性。
链接: https://arxiv.org/abs/2604.13419
作者: Zhiwen Zheng,Yuheng Qiao,Xiaoshuai Zhang,Zhao Huang,Tao Zhang,Huiyu Zhou,Shaowei Jiang,Jin Liu,Wenwen Tang,Xingru Huang
机构: Hangzhou Dianzi University (杭州电子科技大学); Weidian Corporation (微店公司); Zhejiang Provincial Key Laboratory of Low Altitude Ubiquitous Networking Technology (浙江省低空泛在网络技术重点实验室); Ocean University of China (中国海洋大学); University of Aberdee (阿伯丁大学); University of Leicester (莱斯特大学); Johns Hopkins University (约翰霍普金斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Noncontact exfiltration of electronic screen content poses a security challenge, with side-channel incursions as the principal vector. We introduce an optical projection side-channel paradigm that confronts two core instabilities: (i) the near-singular Jacobian spectrum of projection mapping breaches Hadamard stability, rendering inversion hypersensitive to perturbations; (ii) irreversible compression in light transport obliterates global semantic cues, magnifying reconstruction ambiguity. Exploiting passive speckle patterns formed by diffuse reflection, our Irradiance Robust Radiometric Inversion Network (IR4Net) fuses a Physically Regularized Irradiance Approximation (PRIrr-Approximation), which embeds the radiative transfer equation in a learnable optimizer, with a contour-to-detail cross-scale reconstruction mechanism that arrests noise propagation. Moreover, an Irreversibility Constrained Semantic Reprojection (ICSR) module reinstates lost global structure through context-driven semantic mapping. Evaluated across four scene categories, IR4Net achieves fidelity beyond competing neural approaches while retaining resilience to illumination perturbations.
[CV-75] DF3DV-1K: A Large-Scale Dataset and Benchmark for Distractor-Free Novel View Synthesis
【速读】:该论文旨在解决当前 distractor-free radiance field 方法缺乏大规模、真实世界且包含干净与杂乱图像的基准数据集的问题,从而限制了方法在复杂场景下的鲁棒性评估与发展。其解决方案的关键在于提出 DF3DV-1K 数据集,该数据集包含 1,048 个场景、共 89,924 张由消费级相机拍摄的真实图像,涵盖 128 种干扰物类型和 161 种场景主题,同时提供系统化设计的子集 DF3DV-41 用于挑战性场景下的鲁棒性测试。通过该数据集,作者对九种近期 distractor-free radiance field 方法及 3D Gaussian Splatting 进行了全面 benchmarking,并进一步展示了其在改进生成式增强模型方面的应用潜力,显著提升了重建质量(如 PSNR 提升 0.96 dB)。
链接: https://arxiv.org/abs/2604.13416
作者: Cheng-You Lu,Yi-Shan Hung,Wei-Ling Chi,Hao-Ping Wang,Charlie Li-Ting Tsai,Yu-Cheng Chang,Yu-Lun Liu,Thomas Do,Chin-Teng Lin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Advances in radiance fields have enabled photorealistic novel view synthesis. In several domains, large-scale real-world datasets have been developed to support comprehensive benchmarking and to facilitate progress beyond scene-specific reconstruction. However, for distractor-free radiance fields, a large-scale dataset with clean and cluttered images per scene remains lacking, limiting the development. To address this gap, we introduce DF3DV-1K, a large-scale real-world dataset comprising 1,048 scenes, each providing clean and cluttered image sets for benchmarking. In total, the dataset contains 89,924 images captured using consumer cameras to mimic casual capture, spanning 128 distractor types and 161 scene themes across indoor and outdoor environments. A curated subset of 41 scenes, DF3DV-41, is systematically designed to evaluate the robustness of distractor-free radiance field methods under challenging scenarios. Using DF3DV-1K, we benchmark nine recent distractor-free radiance field methods and 3D Gaussian Splatting, identifying the most robust methods and the most challenging scenarios. Beyond benchmarking, we demonstrate an application of DF3DV-1K by fine-tuning a diffusion-based 2D enhancer to improve radiance field methods, achieving average improvements of 0.96 dB PSNR and 0.057 LPIPS on the held-out set (e.g., DF3DV-41) and the On-the-go dataset. We hope DF3DV-1K facilitates the development of distractor-free vision and promotes progress beyond scene-specific approaches.
[CV-76] CausalDisenSeg: A Causality-Guided Disentanglement Framework with Counterfactual Reasoning for Robust Brain Tumor Segmentation Under Missing Modalities
【速读】:该论文旨在解决深度学习模型在多模态脑肿瘤分割任务中因MRI数据不完整而导致的鲁棒性下降问题,其核心根源在于模态偏差(modality bias),即模型依赖于伪相关性作为捷径而非学习真实的解剖结构。解决方案的关键在于提出一种基于结构因果模型(Structural Causal Model, SCM)的框架CausalDisenSeg,通过因果引导的特征解耦与反事实推理实现鲁棒分割:首先利用条件变分自编码器(CVAE)结合HSIC约束显式解耦解剖因果因子与风格偏差因子;其次通过区域因果模块(Region Causality Module, RCM)将因果特征物理锚定至肿瘤区域;最后采用双对抗策略主动抑制残留的自然直接效应(Natural Direct Effect, NDE),迫使偏差的空间注意力与因果路径相互排斥,从而显著提升模型在缺失模态场景下的准确性和一致性。
链接: https://arxiv.org/abs/2604.13409
作者: Bo Liu,Yulong Zou,Jin Hong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In clinical practice, the robustness of deep learning models for multimodal brain tumor segmentation is severely compromised by incomplete MRI data. This vulnerability stems primarily from modality bias, where models exploit spurious correlations as shortcuts rather than learning true anatomical structures. Existing feature fusion methods fail to fundamentally eliminate this dependency. To address this, we propose CausalDisenSeg, a novel Structural Causal Model (SCM)-grounded framework that achieves robust segmentation via causality-guided disentanglement and counterfactual reasoning. We reframe the problem as isolating the anatomical Causal Factor from the stylistic Bias Factor. Our framework implements a three-stage causal intervention: (1) Explicit Causal Disentanglement: A Conditional Variational Autoencoder (CVAE) coupled with an HSIC constraint mathematically enforces statistical orthogonality between anatomical and style features. (2) Causal Representation Reinforcement: A Region Causality Module (RCM) explicitly grounds causal features in physical tumor regions. (3) Counterfactual Reasoning: A dual-adversarial strategy actively suppresses the residual Natural Direct Effect (NDE) of the bias, forcing its spatial attention to be mutually exclusive from the causal path. Extensive experiments on the BraTS 2020 dataset demonstrate that CausalDisenSeg significantly outperforms state-of-the-art methods in accuracy and consistency across severe missing-modality scenarios. Furthermore, cross-dataset evaluation on BraTS 2023 under the same protocol yields a state-of-the-art macro-average DSC of 84.49.
[CV-77] Why Multimodal In-Context Learning Lags Behind? Unveiling the Inner Mechanisms and Bottlenecks ACL
【速读】:该论文旨在解决多模态大语言模型中上下文学习(In-context Learning, ICL)的机制不明确及其与纯文本ICL差异的问题,尤其关注在少样本(few-shot)场景下多模态ICL性能显著下降的原因。解决方案的关键在于通过分解多模态ICL为任务映射构建(task mapping construction)与任务映射迁移(task mapping transfer)两个阶段,揭示当前模型在视觉与文本表征之间缺乏推理层级对齐(reasoning-level alignment),导致任务映射无法可靠传递至查询样本;基于此发现,作者提出一种简单的推理阶段增强方法,强化任务映射的跨模态迁移能力,从而提升多模态ICL的有效性。
链接: https://arxiv.org/abs/2604.13403
作者: Yu Wang,Sharon Li
机构: University of Wisconsin-Madison(威斯康星大学麦迪逊分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ACL Main 2026
Abstract:In-context learning (ICL) enables models to adapt to new tasks via inference-time demonstrations. Despite its success in large language models, the extension of ICL to multimodal settings remains poorly understood in terms of its internal mechanisms and how it differs from text-only ICL. In this work, we conduct a systematic analysis of ICL in multimodal large language models. Using identical task formulations across modalities, we show that multimodal ICL performs comparably to text-only ICL in zero-shot settings but degrades significantly under few-shot demonstrations. To understand this gap, we decompose multimodal ICL into task mapping construction and task mapping transfer, and analyze how models establish cross-modal task mappings, and transfer them to query samples across layers. Our analysis reveals that current models lack reasoning-level alignment between visual and textual representations, and fail to reliably transfer learned task mappings to queries. Guided by these findings, we further propose a simple inference-stage enhancement method that reinforces task mapping transfer. Our results provide new insights into the mechanisms and limitations of multimodal ICL and suggest directions for more effective multimodal adaptation. Our code is available \hrefthis https URLhere.
[CV-78] A Multimodal Clinically Informed Coarse-to-Fine Framework for Longitudinal CT Registration in Proton Therapy
【速读】:该论文旨在解决质子治疗(Proton Therapy)中因解剖结构变化导致的变形图像配准(Deformable Image Registration, DIR)精度不足问题,尤其针对在线自适应放疗流程对速度与临床相关性双重需求的挑战。现有方法或计算效率低,或未充分利用临床信息,难以满足实际应用。解决方案的关键在于提出一种可临床扩展的“粗到精”多模态变形配准框架,通过双卷积神经网络(CNN)编码器提取层次特征,并结合基于Transformer的解码器逐步优化形变场;同时创新性地融合目标区、危及器官轮廓、剂量分布及治疗计划文本等临床先验信息,借助解剖引导注意力机制、文本条件特征调制和前景感知优化策略,实现解剖聚焦且临床语义明确的形变估计,从而在大规模多部位、多病种数据集上显著优于当前最优方法。
链接: https://arxiv.org/abs/2604.13397
作者: Caiwen Jiang,Yuzhen Ding,Mi Jia,Samir H. Patel,Terence T. Sio,Jonathan B. Ashman,Lisa A. McGee,Jean-Claude M. Rwigema,William G. Rule,Sameer R. Keole,Sujay A. Vora,William W. Wong,Nathan Y. Yu,Michele Y. Halyard,Steven E. Schild,Dinggang Shen,Wei Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Proton therapy offers superior organ-at-risk sparing but is highly sensitive to anatomical changes, making accurate deformable image registration (DIR) across longitudinal CT scans essential. Conventional DIR methods are often too slow for emerging online adaptive workflows, while existing deep learning-based approaches are primarily designed for generic benchmarks and underutilize clinically relevant information beyond images. To address this gap, we propose a clinically scalable coarse-to-fine deformable registration framework that integrates multimodal information from the proton radiotherapy workflow to accommodate diverse clinical scenarios. The model employs dual CNN-based encoders for hierarchical feature extraction and a transformer-based decoder to progressively refine deformation fields. Beyond CT intensities, clinically critical priors, including target and organ-at-risk contours, dose distributions, and treatment planning text, are incorporated through anatomy- and risk-guided attention, text-conditioned feature modulation, and foreground-aware optimization, enabling anatomically focused and clinically informed deformation estimation. We evaluate the proposed framework on a large-scale proton therapy DIR dataset comprising 1,222 paired planning and repeat CT scans across multiple anatomical regions and disease types. Extensive experiments demonstrate consistent improvements over state-of-the-art methods, enabling fast and robust clinically meaningful registration.
[CV-79] UniBlendNet: Unified Global Multi-Scale and Region-Adaptive Modeling for Ambient Lighting Normalization CVPR2026
【速读】:该论文旨在解决复杂空间变化光照条件下图像退化问题,即如何在不破坏图像结构和细节的前提下实现更自然、一致的光照归一化。现有方法如IFBlend虽利用频域先验建模光照变化,但仍受限于全局上下文建模能力不足与空间自适应性弱,导致在挑战性区域恢复效果不佳。解决方案的关键在于提出UniBlendNet统一框架,其核心创新包括:(1)基于UniConvNet模块增强全局光照理解,以捕捉长程依赖关系;(2)引入尺度感知聚合模块(Scale-Aware Aggregation Module, SAAM),通过金字塔式多尺度特征聚合与动态重加权机制更好地处理复杂光照变化;(3)设计掩码引导的残差精修机制,实现区域自适应校正,从而选择性增强曝光不良区域并保留良好曝光区域,显著提升光照一致性与结构保真度。
链接: https://arxiv.org/abs/2604.13383
作者: Jiatao Dai,Wei Dong,Han Zhou,Chengzhou Tang,Jun Chen
机构: McMaster University (麦克马斯特大学); University of Manitoba (曼尼托巴大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026 NTIRE Workshop on New Trends in Image Restoration and Enhancement. 8 pages, 4 figures
Abstract:Ambient Lighting Normalization (ALN) aims to restore images degraded by complex, spatially varying illumination conditions. Existing methods, such as IFBlend, leverage frequency-domain priors to model illumination variations, but still suffer from limited global context modeling and insufficient spatial adaptivity, leading to suboptimal restoration in challenging regions. In this paper, we propose UniBlendNet, a unified framework for ambient lighting normalization that jointly models global illumination, multi-scale structures, and region-adaptive refinement. Specifically, we enhance global illumination understanding by integrating a UniConvNet-based module to capture long-range dependencies. To better handle complex lighting variations, we introduce a Scale-Aware Aggregation Module (SAAM) that performs pyramid-based multi-scale feature aggregation with dynamic reweighting. Furthermore, we design a mask-guided residual refinement mechanism to enable region-adaptive correction, allowing the model to selectively enhance degraded regions while preserving well-exposed areas. This design effectively improves illumination consistency and structural fidelity under complex lighting conditions. Extensive experiments on the NTIRE Ambient Lighting Normalization benchmark demonstrate that UniBlendNet consistently outperforms the baseline IFBlend and achieves improved restoration quality, while producing visually more natural and stable restoration results.
[CV-80] A 3D SAM-Based Progressive Prompting Framework for Multi-Task Segmentation of Radiotherapy-induced Normal Tissue Injuries in Limited-Data Settings
【速读】:该论文旨在解决放射治疗引起的正常组织损伤(radiotherapy-induced normal tissue injury)在医学影像中自动分割困难的问题,其核心挑战在于标注数据稀疏、损伤类型多样、病灶大小不一及成像模态差异显著。解决方案的关键在于提出一种基于3D SAM(Segment Anything Model)的渐进式提示框架,通过逐步引入三种互补提示机制实现多任务分割:文本提示用于任务感知适配,剂量引导的框提示用于粗略定位,点击提示用于迭代优化;同时设计小目标聚焦损失函数(small-target focus loss),提升对小而稀疏病灶的局部预测精度与边界分割能力。该方法在头颈部放射性损伤数据集(涵盖骨放射坏死ORN、脑水肿CE和脑放射性坏死CRN)上验证了优越性能。
链接: https://arxiv.org/abs/2604.13367
作者: Caiwen Jiang,Lei Zeng,Wei Liu
机构: Mayo Clinic(梅奥诊所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Radiotherapy-induced normal tissue injury is a clinically important complication, and accurate segmentation of injury regions from medical images could facilitate disease assessment, treatment planning, and longitudinal monitoring. However, automatic segmentation of these lesions remains largely unexplored because of limited voxel-level annotations and substantial heterogeneity across injury types, lesion size, and imaging modality. To address this gap, we curate a dedicated head-and-neck radiotherapy-induced normal tissue injury dataset covering three manifestations: osteoradionecrosis (ORN), cerebral edema (CE), and cerebral radiation necrosis (CRN). We further propose a 3D SAM-based progressive prompting framework for multi-task segmentation in limited-data settings. The framework progressively incorporates three complementary prompts: text prompts for task-aware adaptation, dose-guided box prompts for coarse localization, and click prompts for iterative refinement. A small-target focus loss is introduced to improve local prediction and boundary delineation for small and sparse lesions. Experiments on ORN, CE, and CRN demonstrate that the proposed method achieves reliable segmentation performance across diverse injury types and outperforms state-of-the-art methods.
[CV-81] Multi-Agent Object Detection Framework Based on Raspberry Pi YOLO Detector and Slack-Ollama Natural Language Interface
【速读】:该论文旨在解决在资源受限的边缘计算平台上实现多智能体(Multi-Agent)协同对象检测与跟踪系统的集成难题,尤其关注如何在不依赖云端资源的情况下,通过生成式AI(Generative AI)驱动的自然语言接口提升系统开发效率和可扩展性。其解决方案的关键在于提出了一种基于事件驱动的消息交换子系统的多智能体架构,将本地运行的YOLO计算机视觉代理、Ollama大语言模型(LLM)报告代理以及Slack频道聊天机器人代理统一部署于单个树莓派(Raspberry Pi)硬件平台之上,从而实现端侧高效协作与快速原型验证,同时避免了传统专用通信控制接口的复杂性,并为未来轻量化边缘AI系统设计提供了实践参考。
链接: https://arxiv.org/abs/2604.13345
作者: Vladimir Kalušev,Branko Brkljač,Milan Brkljač
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 7 figures, 2 tables, implementation code will be made available upon manuscript publication
Abstract:The paper presents design and prototype implementation of an edge based object detection system within the new paradigm of AI agents orchestration. It goes beyond traditional design approaches by leveraging on LLM based natural language interface for system control and communication and practically demonstrates integration of all system components into a single resource constrained hardware platform. The method is based on the proposed multi-agent object detection framework which tightly integrates different AI agents within the same task of providing object detection and tracking capabilities. The proposed design principles highlight the fast prototyping approach that is characteristic for transformational potential of generative AI systems, which are applied during both development and implementation stages. Instead of specialized communication and control interface, the system is made by using Slack channel chatbot agent and accompanying Ollama LLM reporting agent, which are both run locally on the same Raspberry Pi platform, alongside the dedicated YOLO based computer vision agent performing real time object detection and tracking. Agent orchestration is implemented through a specially designed event based message exchange subsystem, which represents an alternative to completely autonomous agent orchestration and control characteristic for contemporary LLM based frameworks like the recently proposed OpenClaw. Conducted experimental investigation provides valuable insights into limitations of the low cost testbed platforms in the design of completely centralized multi-agent AI systems. The paper also discusses comparative differences between presented approach and the solution that would require additional cloud based external resources.
[CV-82] MSGS: Multispectral 3D Gaussian Splatting
【速读】:该论文旨在解决传统3D Gaussian Splatting (3DGS) 方法在视图合成中缺乏波长感知能力的问题,即其仅基于RGB信号进行优化,难以准确重建具有复杂光谱特性的场景(如半透明材质和各向异性反射)。解决方案的关键在于对每个高斯点引入频带相关的光谱辐射度(spectral radiance),通过每波段的球谐函数(spherical harmonics)表示,并采用RGB与多光谱信号联合监督的双损失优化策略;同时,在像素级执行光谱到RGB的转换,以保留更丰富的光谱线索用于优化过程。此方法在保持3DGS紧凑性和实时性的同时,显著提升了图像质量和光谱一致性,尤其在挑战性场景中表现优异。
链接: https://arxiv.org/abs/2604.13340
作者: Iris Zheng,Guojun Tang,Alexander Doronin,Paul Teal,Fang-Lue Zhang
机构: Victoria University of Wellington (维多利亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Published in IEEE ISMAR 2025 Adjunct
Abstract:We present a multispectral extension to 3D Gaussian Splatting (3DGS) for wavelength-aware view synthesis. Each Gaussian is augmented with spectral radiance, represented via per-band spherical harmonics, and optimized under a dual-loss supervision scheme combining RGB and multispectral signals. To improve rendering fidelity, we perform spectral-to-RGB conversion at the pixel level, allowing richer spectral cues to be retained during optimization. Our method is evaluated on both public and self-captured real-world datasets, demonstrating consistent improvements over the RGB-only 3DGS baseline in terms of image quality and spectral consistency. Notably, it excels in challenging scenes involving translucent materials and anisotropic reflections. The proposed approach maintains the compactness and real-time efficiency of 3DGS while laying the foundation for future integration with physically based shading models.
[CV-83] SEDTalker: Emotion-Aware 3D Facial Animation Using Frame-Level Speech Emotion Diarization
【速读】:该论文旨在解决现有语音驱动三维人脸动画方法在情感表达控制上的局限性问题,即以往方法依赖于句级或人工标注的情感标签,难以实现细粒度、连续的情感动态调节。其解决方案的关键在于引入帧级语音情感分割(frame-level speech emotion diarization),从语音中直接预测时序密集的情感类别与强度,并将这些情感信号编码为可学习嵌入(learned embeddings),以此作为条件输入至基于混合Transformer-Mamba架构的3D动画模型。该设计实现了语言内容与情感风格的有效解耦,同时保持身份一致性和时间连贯性,从而支持更自然、可控的高保真情感化三维说话头生成。
链接: https://arxiv.org/abs/2604.13335
作者: Farzaneh Jafari,Stefano Berretti,Anup Basu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages; 4 figures; conference
Abstract:We introduce SEDTalker, an emotion-aware framework for speech-driven 3D facial animation that leverages frame-level speech emotion diarization to achieve fine-grained expressive control. Unlike prior approaches that rely on utterance-level or manually specified emotion labels, our method predicts temporally dense emotion categories and intensities directly from speech, enabling continuous modulation of facial expressions over time. The diarized emotion signals are encoded as learned embeddings and used to condition a speech-driven 3D animation model based on a hybrid Transformer-Mamba architecture. This design allows effective disentanglement of linguistic content and emotional style while preserving identity and temporal coherence. We evaluate our approach on a large-scale multi-corpus dataset for speech emotion diarization and on the EmoVOCA dataset for emotional 3D facial animation. Quantitative results demonstrate strong frame-level emotion recognition performance and low geometric and temporal reconstruction errors, while qualitative results show smooth emotion transitions and consistent expression control. These findings highlight the effectiveness of frame-level emotion diarization for expressive and controllable 3D talking head generation.
[CV-84] SSD-GS: Scattering and Shadow Decomposition for Relightable 3D Gaussian Splatting ICLR2026
【速读】:该论文旨在解决基于3D高斯溅射(3D Gaussian Splatting, 3DGS)的物理光照重渲染(physically-based relighting)中,现有方法因粗粒度的着色分解导致材质属性建模不准确、物理可解释性差的问题,尤其在各向异性金属和半透明材料上表现不佳。解决方案的关键在于提出SSD-GS框架,通过将反射率分解为漫反射、镜面反射、阴影和次表面散射四个物理可解释的组件,并引入可学习的双极子散射模块用于次表面传输建模、融合可见性估计与精修网络的遮挡感知阴影公式,以及基于各向异性菲涅尔模型增强的镜面成分,从而在训练过程中逐步整合所有组件,有效解耦光照与材质属性,实现对未见光照条件下的高质量、高保真重渲染。
链接: https://arxiv.org/abs/2604.13333
作者: Iris Zheng,Guojun Tang,Alexander Doronin,Paul Teal,Fang-Lue Zhang
机构: Victoria University of Wellington (维多利亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Accepted to ICLR 2026. Code available at: this https URL
Abstract:We present SSD-GS, a physically-based relighting framework built upon 3D Gaussian Splatting (3DGS) that achieves high-quality reconstruction and photorealistic relighting under novel lighting conditions. In physically-based relighting, accurately modeling light-material interactions is essential for faithful appearance reproduction. However, existing 3DGS-based relighting methods adopt coarse shading decompositions, either modeling only diffuse and specular reflections or relying on neural networks to approximate shadows and scattering. This leads to limited fidelity and poor physical interpretability, particularly for anisotropic metals and translucent materials. To address these limitations, SSD-GS decomposes reflectance into four components: diffuse, specular, shadow, and subsurface scattering. We introduce a learnable dipole-based scattering module for subsurface transport, an occlusion-aware shadow formulation that integrates visibility estimates with a refinement network, and an enhanced specular component with an anisotropic Fresnel-based model. Through progressive integration of all components during training, SSD-GS effectively disentangles lighting and material properties, even for unseen illumination conditions, as demonstrated on the challenging OLAT dataset. Experiments demonstrate superior quantitative and perceptual relighting quality compared to prior methods and pave the way for downstream tasks, including controllable light source editing and interactive scene relighting. The source code is available at: this https URL.
[CV-85] Right Regions Wrong Labels: Semantic Label Flips in Segmentation under Correlation Shift ICLR2026
【速读】:该论文旨在解决生成式 AI(Generative AI)在语义分割任务中因输入数据中非因果特征与目标标签之间的虚假相关性而导致的鲁棒性下降问题。其关键解决方案是提出一种名为“Flip”的简单诊断指标,用于量化模型在保持正确物体边界的同时,将前景像素错误分配为其他前景类别的“语义标签翻转”行为;同时引入一种基于熵的、无需真实标签的“flip-risk”评分,在推理阶段通过前景类别不确定性的度量来识别易发生标签翻转的情况,从而在分布偏移场景下更细致地评估分割模型的鲁棒性。
链接: https://arxiv.org/abs/2604.13326
作者: Akshit Achara,Yovin Yathathugoda,Nick Byrne,Michela Antonelli,Esther Puyol Anton,Alexander Hammers,Andrew P. King
机构: King’s College London (伦敦国王学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at the CAO Workshop, ICLR 2026
Abstract:The robustness of machine learning models can be compromised by spurious correlations between non-causal features in the input data and target labels. A common way to test for such correlations is to train on data where the label is strongly tied to some non-causal cue, then evaluate on examples where that tie no longer holds. This idea is well established for classification tasks, but for semantic segmentation the specific failure modes are not well understood. We show that a model may achieve reasonable overlap while assigning the wrong semantic label, swapping one plausible foreground class for another, even when object boundaries are largely correct. We focus on this semantic label-flip behaviour and quantify it with a simple diagnostic (Flip) that counts how often ground truth foreground pixels are assigned the wrong foreground identity while remaining predicted as foreground. In a setting where category and scene are correlated during training, increasing the correlation consistently widens the gap between common and rare test conditions and increases these within-object label swaps on counterfactual groups. Overall, our results motivate assessing segmentation robustness under distribution shift beyond overlap by decomposing foreground errors into correct pixels, flipped-identity pixels, and missed-to-background pixels. We also propose an entropy-based, ground truth label-free `flip-risk’ score, which is computed from foreground identity uncertainty, and show that it can flag flip-prone cases at inference time. Code is available at this https URL.
[CV-86] owards Successful Implementation of Automated Raveling Detection: Effects of Training Data Size Illumination Difference and Spatial Shift
【速读】:该论文旨在解决沥青路面松散(raveling)检测模型在真实世界部署中因数据多样性不足而导致性能下降的问题,尤其关注训练数据量、光照差异和空间位移等变量对模型鲁棒性的影响。其解决方案的关键在于提出RavelingArena基准测试平台,通过在现有数据集上引入受控的多样化变异而非收集新数据,实现对各类变化因素的量化评估,并基于实验结果优化模型训练策略,从而显著提升模型在多传感器、多环境条件下的泛化能力和年际一致性,为实际工程应用提供可靠的技术支撑。
链接: https://arxiv.org/abs/2604.13322
作者: Xinan Zhang,Haolin Wang,Zhongyu Yang,Yi-Chang(James)Tsai
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted and presented in TRBAM 2026
Abstract:Raveling, the loss of aggregates, is a major form of asphalt pavement surface distress, especially on highways. While research has shown that machine learning and deep learning-based methods yield promising results for raveling detection by classification on range images, their performance often degrades in large-scale deployments where more diverse inference data may originate from different runs, sensors, and environmental conditions. This degradation highlights the need of a more generalizable and robust solution for real-world implementation. Thus, the objectives of this study are to 1) identify and assess potential variations that impact model robustness, such as the quantity of training data, illumination difference, and spatial shift; and 2) leverage findings to enhance model robustness under real-world conditions. To this end, we propose RavelingArena, a benchmark designed to evaluate model robustness to variations in raveling detection. Instead of collecting extensive new data, it is built by augmenting an existing dataset with diverse, controlled variations, thereby enabling variation-controlled experiments to quantify the impact of each variation. Results demonstrate that both the quantity and diversity of training data are critical to the accuracy of models, achieving at least a 9.2% gain in accuracy under the most diverse conditions in experiments. Additionally, a case study applying these findings to a multi-year test section in Georgia, U.S., shows significant improvements in year-to-year consistency, laying foundations for future studies on temporal deterioration modeling. These insights provide guidance for more reliable model deployment in raveling detection and other real-world tasks that require adaptability to diverse conditions.
[CV-87] Why MLLM s Struggle to Determine Object Orientations
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在处理二维物体方向推理任务时表现不佳的问题。现有研究假设此类失败源于视觉编码器(visual encoder)无法有效保留几何信息,例如物体方向。为验证这一假设,作者设计了一种受控的实证协议,通过训练线性回归模型从不同视觉编码器(如SigLIP、ViT和CLIP)提取的特征中恢复图像旋转角度,从而检验方向信息是否存在于编码表示中。关键发现是:尽管先前假设认为视觉编码器缺乏方向敏感性,但实验表明方向信息实际上可从编码特征中准确预测,说明MLLMs的方向推理失败并非源于编码器的信息缺失,而是可能由于模型未能有效利用这些分散在数万个特征中的方向信息。这一发现为理解MLLMs在几何推理上的局限性提供了新的视角。
链接: https://arxiv.org/abs/2604.13321
作者: Anju Gopinath,Nikhil Krishnaswamy,Bruce Draper
机构: Colorado State University (科罗拉多州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multimodal Large Language Models (MLLMs) struggle with tasks that require reasoning about 2D object orientation in images, as documented in prior work. Tong et al. and Nichols et al. hypothesize that these failures originate in the visual encoder, since commonly used encoders such as CLIP and SigLIP are trained for image-text semantic alignment rather than geometric reasoning. We design a controlled empirical protocol to test this claim by measuring whether rotations can be recovered from encoder representations. In particular, we examine SigLIP and ViT features from LLaVA OneVision and Qwen2.5-VL-7B-Instruct models, respectively, using full images, and examine CLIP representations in LLaVA 1.5 and 1.6 using rotated foreground patches against natural background images. Our null hypothesis is that orientation information is not preserved in the encoder embeddings and we test this by training linear regressors to predict object orientation from encoded features. Contrary to the hypothesis, we find that orientation information is recoverable from encoder representations: simple linear models accurately predict object orientations from embeddings. This contradicts the assumption that MLLM orientation failures originate in the visual encoder. Having rejected the accepted hypothesis that MLLMs struggle with 2D orientation tasks because of visual encoder limitations, we still don’t know why they fail. Although a full explanation is beyond the scope of this paper, we show that although present, orientation information is spread diffusely across tens of thousands of features. This may or may not be while MLLMs fail to exploit the available orientation information. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2604.13321 [cs.CV] (or arXiv:2604.13321v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2604.13321 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[CV-88] he Spectrascapes Dataset: Street-view imagery beyond the visible captured using a mobile platform
【速读】:该论文旨在解决当前城市参数监测数据在空间和时间分辨率上的局限性问题,现有方法如人工巡查、嵌入式传感、遥感或标准RGB街景图像(Street-view imagery)常受限于可扩展性差、时空分辨率不一致、视角为俯视或光谱信息不足等缺陷。解决方案的关键在于提出并开源了一个多光谱地表视角数据集(Multi-spectral terrestrial-view dataset),该数据集包含17,718张街景图像,融合了RGB、近红外(Near-infrared)和热成像(Thermal imaging)传感器数据,采集于荷兰不同城市形态区域(村庄、城镇、小城市和大城市),并通过严格的数据校准与高质量采集流程保障数据可靠性。此方案突破了传统方法的限制,为生成式AI(Generative AI)、城市规划与遥感领域的下游应用提供了高精度、多维度、开放获取的基准数据支持。
链接: https://arxiv.org/abs/2604.13315
作者: Akshit Gupta,Joris Timmermans,Filip Biljecki,Remko Uijlenhoet
机构: Delft University of Technology (代尔夫特理工大学); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Submitted, under-review
Abstract:High-resolution data in spatial and temporal contexts is imperative for developing climate resilient cities. Current datasets for monitoring urban parameters are developed primarily using manual inspections, embedded-sensing, remote sensing, or standard street-view imagery (RGB). These methods and datasets are often constrained respectively by poor scalability, inconsistent spatio-temporal resolutions, overhead views or low spectral information. We present a novel method and its open implementation: a multi-spectral terrestrial-view dataset that circumvents these limitations. This dataset consists of 17,718 street level multi-spectral images captured with RGB, Near-infrared, and Thermal imaging sensors on bikes, across diverse urban morphologies (village, town, small city, and big urban area) in the Netherlands. Strict emphasis is put on data calibration and quality while also providing the details of our data collection methodology (including the hardware and software details). To the best of our knowledge, Spectrascapes is the first open-access dataset of its kind. Finally, we demonstrate two downstream use-cases enabled using this dataset and provide potential research directions in the machine learning, urban planning and remote sensing domains.
[CV-89] Deep Spatially-Regularized and Superpixel-Based Diffusion Learning for Unsupervised Hyperspectral Image Clustering
【速读】:该论文旨在解决高光谱图像(Hyperspectral Image, HSI)无监督聚类中因数据复杂性和噪声干扰导致的聚类精度低的问题。其解决方案的关键在于提出了一种名为 Deep Spatially-Regularized Superpixel-based Diffusion Learning (DS²DL) 的新框架,该框架融合了基于掩码的深度表示学习与基于扩散的聚类机制:首先通过无监督掩码自编码器(Unsupervised Masked Autoencoder, UMAE)利用 Vision Transformer 架构提取去噪后的潜在表示,该过程充分考虑空间上下文和长程光谱相关性,并借助少量训练像素实现高效预训练;随后采用熵率超像素(Entropy Rate Superpixel, ERS)算法对图像进行分割,并在压缩的潜在空间中构建空间正则化的扩散图,从而更准确地反映数据流形的内在几何结构,显著提升聚类质量和标签准确性。
链接: https://arxiv.org/abs/2604.13307
作者: Vutichart Buranasiri,James M. Murphy
机构: Tufts University (塔夫茨大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: To appear in IEEE IGARSS 2026
Abstract:An unsupervised framework for hyperspectral image (HSI) clustering is proposed that incorporates masked deep representation learning with diffusion-based clustering, extending the Spatially-Regularized Superpixel-based Diffusion Learning ( S^2DL ) algorithm. Initially, a denoised latent representation of the original HSI is learned via an unsupervised masked autoencoder (UMAE) model with a Vision Transformer backbone. The UMAE takes spatial context and long-range spectral correlations into account and incorporates an efficient pretraining process via masking that utilizes only a small subset of training pixels. In the next stage, the entropy rate superpixel (ERS) algorithm is used to segment the image into superpixels, and a spatially regularized diffusion graph is constructed using Euclidean and diffusion distances within the compressed latent space instead of the HSI space. The proposed algorithm, Deep Spatially-Regularized Superpixel-based Diffusion Learning ( DS^2DL ), leverages more faithful diffusion distances and subsequent diffusion graph construction that better reflect the intrinsic geometry of the underlying data manifold, improving labeling accuracy and clustering quality. Experiments on Botswana and KSC datasets demonstrate the efficacy of DS^2DL .
[CV-90] Bias at the End of the Score CVPR
【速读】:该论文旨在解决生成式 AI(Generative AI)中奖励模型(Reward Models, RMs)在文本到图像(Text-to-Image, T2I)生成流程中可能引入的性别与种族偏见问题,尤其是在奖励引导优化过程中对女性图像主体的过度性化及对多元群体的表征衰减。其解决方案的关键在于通过大规模审计揭示RMs作为评分函数时存在的系统性偏差:研究发现,原本设计用于衡量图像质量的RMs实际上编码了社会偏见,导致在训练和生成阶段强化性别/种族刻板印象并降低群体多样性;这一发现强调需改进数据收集与训练机制,以提升奖励模型的鲁棒性和公平性,从而保障生成内容的社会责任与可靠性。
链接: https://arxiv.org/abs/2604.13305
作者: Salma Abdel Magid,Grace Guo,Esin Tureci,Amaya Dharmasiri,Vikram V. Ramaswamy,Hanspeter Pfister,Olga Russakovsky
机构: Princeton University (普林斯顿大学); Harvard University (哈佛大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026
Abstract:Reward models (RMs) are inherently non-neutral value functions designed and trained to encode specific objectives, such as human preferences or text-image alignment. RMs have become crucial components of text-to-image (T2I) generation systems where they are used at various stages for dataset filtering, as evaluation metrics, as a supervisory signal during optimization of parameters, and for post-generation safety and quality filtering of T2I outputs. While specific problems with the integration of RMs into the T2I pipeline have been studied (e.g. reward hacking or mode collapse), their robustness and fairness as scoring functions remains largely unknown. We conduct a large scale audit of RM robustness with respect to demographic biases during T2I model training and generation. We provide quantitative and qualitative evidence that while originally developed as quality measures, RMs encode demographic biases, which cause reward-guided optimization to disproportionately sexualize female image subjects reinforce gender/racial stereotypes, and collapse demographic diversity. These findings highlight shortcomings in current reward models, challenge their reliability as quality metrics, and underscore the need for improved data collection and training procedures to enable more robust scoring.
[CV-91] Can Cross-Layer Transcoders Replace Vision Transformer Activations? An Interpretable Perspective on Vision
【速读】:该论文旨在解决视觉 Transformer(Vision Transformer, ViT)模型内部激活机制不透明、缺乏跨层计算结构理解的问题,尤其在如何准确刻画各层对最终表示的相对贡献方面存在不足。传统稀疏自编码器(Sparse Autoencoders, SAEs)仅作用于单层,无法捕捉 Transformer 的跨层信息流动和层级重要性差异。解决方案的关键在于引入跨层转换器(Cross-Layer Transcoder, CLT),其采用编码-解码架构,从先前层的稀疏嵌入中重建每一层后 MLP 模块的激活输出,从而将 ViT 的最终表示转化为一种可加且逐层解析的构造形式,实现对模型决策过程的忠实归因与细粒度解释。
链接: https://arxiv.org/abs/2604.13304
作者: Gerasimos Chatzoudis,Konstantinos D. Polyzos,Zhuowei Li,Difei Gu,Gemma E. Moran,Hao Wang,Dimitris N. Metaxas
机构: Rutgers University (罗格斯大学); University of California San Diego (加州大学圣地亚哥分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Understanding the internal activations of Vision Transformers (ViTs) is critical for building interpretable and trustworthy models. While Sparse Autoencoders (SAEs) have been used to extract human-interpretable features, they operate on individual layers and fail to capture the cross-layer computational structure of Transformers, as well as the relative significance of each layer in forming the last-layer representation. Alternatively, we introduce the adoption of Cross-Layer Transcoders (CLTs) as reliable, sparse, and depth-aware proxy models for MLP blocks in ViTs. CLTs use an encoder-decoder scheme to reconstruct each post-MLP activation from learned sparse embeddings of preceding layers, yielding a linear decomposition that transforms the final representation of ViTs from an opaque embedding into an additive, layer-resolved construction that enables faithful attribution and process-level interpretability. We train CLTs on CLIP ViT-B/32 and ViT-B/16 across CIFAR-100, COCO, and ImageNet-100. We show that CLTs achieve high reconstruction fidelity with post-MLP activations while preserving and even improving, in some cases, CLIP zero-shot classification accuracy. In terms of interpretability, we show that the cross-layer contribution scores provide faithful attribution, revealing that the final representation is concentrated in a smaller set of dominant layer-wise terms whose removal degrades performance and whose retention largely preserves it. These results showcase the significance of adopting CLTs as an alternative interpretable proxy of ViTs in the vision domain.
[CV-92] PAT-VCM: Plug-and-Play Auxiliary Tokens for Video Coding for Machines
【速读】:该论文旨在解决现有面向机器的视频编码(Video Coding for Machines, VCM)方法因高度依赖特定下游任务和模型而导致的可扩展性差、难以适应多任务或模型更新的问题。其核心解决方案是提出一种即插即用的辅助令牌框架(Plug-and-Play Auxiliary-Token Framework, PAT-VCM),该框架通过保持一个共享的基础压缩码流,并引入轻量级的任务感知辅助令牌(auxiliary tokens)来增强信息恢复能力,从而实现不同下游任务在不重新训练独立编解码器的前提下获取所需信息。关键创新在于将辅助信息分为三类:视觉残差令牌、提示/控制令牌和语义令牌,有效支持分割、深度估计和语义识别等任务,在保证低比特率开销的同时显著提升性能,验证了共享压缩表示与轻量辅助令牌结合是一种可行且可扩展的VCM设计范式。
链接: https://arxiv.org/abs/2604.13294
作者: Wei Jiang,Wei Wang
机构: Futurewei Technologies Inc.(未来wei科技公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 3 figures, 13 tables
Abstract:Existing video coding for machines is often trained for a specific downstream task and model. As a result, the compressed representation becomes tightly coupled to the end task, making it difficult to scale across multiple tasks or adapt to model updates. We propose PAT-VCM, a plug-and-play auxiliary-token framework for video coding for machines. PAT-VCM keeps a shared baseline compressed stream and augments it with lightweight task-aware auxiliary tokens, allowing different downstream tasks to recover the information they need without retraining a separate codec for each task. The framework supports three forms of auxiliary information: visual residual tokens, prompt/control tokens, and semantic tokens. We evaluate PAT-VCM on segmentation, depth estimation, and semantic recognition. A shared detection-oriented auxiliary branch provides a reusable first refinement, task-specific visual branches improve segmentation and depth, prompt tokens provide further segmentation gains at negligible bitrate, and semantic tokens achieve strong recognition performance with extremely low overhead. These results suggest that a shared compressed representation, combined with lightweight task-aware auxiliary tokens, is a practical and scalable alternative to tightly task-coupled VCM design.
[CV-93] SeeSay: Vision Language Guided Safe Zone Detection for Autonomous Package Delivery Drones
【速读】:该论文旨在解决复杂城市与郊区环境中无人机自主投递系统在包件投放阶段的安全性与可靠性问题,核心挑战在于如何准确识别并评估适合的投递区域(drop zone),而现有方法仅依赖几何分析或语义分割单一模态,缺乏融合多源信息进行鲁棒决策的能力。解决方案的关键在于提出SeeSay框架,其创新性地将几何安全线索(如单目深度梯度)与语义感知(如开放词汇检测掩码)通过视觉语言模型(Vision-Language Model, VLM)进行动态融合,实现对障碍物和风险区域的迭代式推理与优化;同时,VLM驱动的类别提示调整机制使系统能够在动态场景中持续修正检测结果,并在主投递区不可用时自动识别备选投递区域,从而显著提升整体投递安全性与适应性。
链接: https://arxiv.org/abs/2604.13292
作者: Mahyar Ghazanfari,Peng Wei
机构: George Washington University (乔治·华盛顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Autonomous drone delivery systems are rapidly advancing, but ensuring safe and reliable package drop-offs remains highly challenging in cluttered urban and suburban environments where accurately identifying suitable package drop zones is critical. Existing approaches typically rely on either geometry-based analysis or semantic segmentation alone, but these methods lack the integrated semantic reasoning required for robust decision-making. To address this gap, we propose SeeSay, a novel framework that combines geometric safety cues with semantic perception, guided by a Vision-Language Model (VLM) for iterative refinement. The system fuses monocular depth gradients with open-vocabulary detection masks to produce safety maps, while the VLM dynamically adjusts object category prompts and refines hazard detection across time, enabling reliable reasoning under dynamic conditions during the final delivery phase. When the primary drop-pad is occupied or unsafe, the proposed SeeSay also identifies alternative candidate zones for package delivery. We curated a dataset of urban delivery scenarios with moving objects and human activities to evaluate the approach. Experimental results show that SeeSay outperforms all baselines, achieving the highest accuracy and IoU for safety map prediction as well as superior performance in alternative drop zone evaluation across multiple thresholds. These findings highlight the promise of VLM-guided segmentation-depth fusion for advancing safe and practical drone-based package delivery.
[CV-94] Explainable Fall Detection for Elderly Care via Temporally Stable SHAP in Skeleton-Based Human Activity Recognition
【速读】:该论文旨在解决老年人跌倒检测中可解释性不足的问题,特别是现有事后解释方法在处理时序数据时产生的时空不稳定归因图,导致临床医生难以信任和应用。其关键解决方案是提出一种轻量级且可解释的基于骨架的跌倒检测框架,结合高效LSTM模型与T-SHAP(Temporal SHAP),后者是一种时序感知的事后聚合策略,通过在线性平滑算子对归因序列进行处理,在保留Shapley值理论保证(如局部准确性和一致性)的同时显著降低高频波动,从而生成稳定可靠的归因结果。实验表明该方法在NTU RGB+D数据集上实现94.3%分类准确率且推理延迟低于25毫秒,同时T-SHAP相较标准SHAP和Grad-CAM在扰动忠实度指标上表现更优,归因结果能稳定突出下肢不稳和脊柱姿态变化等生物力学相关运动模式,契合临床对跌倒机制的认知,具备在长期照护场景中部署的潜力。
链接: https://arxiv.org/abs/2604.13279
作者: Mohammad Saleh,Azadeh Tabatabaei
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Fall detection in elderly care requires not only accurate classification but also reliable explanations that clinicians can trust. However, existing post-hoc explainability methods, when applied frame-by-frame to sequential data, produce temporally unstable attribution maps that clinicians cannot reliably act upon. To address this issue, we propose a lightweight and explainable framework for skeleton-based fall detection that combines an efficient LSTM model with T-SHAP, a temporally aware post-hoc aggregation strategy that stabilizes SHAP-based feature attributions over contiguous time windows. Unlike standard SHAP, which treats each frame independently, T-SHAP applies a linear smoothing operator to the attribution sequence, reducing high-frequency variance while preserving the theoretical guarantees of Shapley values, including local accuracy and consistency. Experiments on the NTU RGB+D Dataset demonstrate that the proposed framework achieves 94.3% classification accuracy with an end-to-end inference latency below 25 milliseconds, satisfying real-time constraints on mid-range hardware and indicating strong potential for deployment in clinical monitoring scenarios. Quantitative evaluation using perturbation-based faithfulness metrics shows that T-SHAP improves explanation reliability compared to standard SHAP (AUP: 0.89 vs. 0.91) and Grad-CAM (0.82), with consistent improvements observed across five-fold cross-validation, indicating enhanced explanation reliability. The resulting attributions consistently highlight biomechanically relevant motion patterns, including lower-limb instability and changes in spinal alignment, aligning with established clinical observations of fall dynamics and supporting their use as transparent decision aids in long-term care environments
[CV-95] DroneScan-YOLO: Redundancy-Aware Lightweight Detection for Tiny Objects in UAV Imagery
【速读】:该论文针对无人机(UAV)影像中目标检测面临的三大挑战:小目标频发、恶劣环境干扰及严格的计算资源限制,提出了一种系统性解决方案——DroneScan-YOLO。其核心创新在于四个协同设计:(1) 提升输入分辨率为1280×1280以增强小目标的空间细节;(2) 引入RPA-Block动态滤波剪枝机制,基于懒惰余弦相似度更新与10轮预热策略减少冗余参数;(3) 设计轻量级P2检测分支MSFD(stride=4),仅增加1.1%参数(+114,592)即可提升小目标召回;(4) 提出SAL-NWD混合损失函数,融合归一化Wasserstein距离与尺寸自适应CIoU权重,并嵌入YOLOv8的任务对齐分配机制。实验表明,该方法在VisDrone2019-DET数据集上mAP@50提升至55.3%(+16.6)、mAP@50-95达35.6%(+12.3),尤其显著改善了小目标检测性能(如自行车AP@50从0.114提升至0.328),同时保持96.7 FPS推理速度,参数增量仅+4.1%。
链接: https://arxiv.org/abs/2604.13278
作者: Yann V. Bellec
机构: University of California, San Diego (加州大学圣地亚哥分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: 12 pages, 10 figures
Abstract:Aerial object detection in UAV imagery presents unique challenges due to the high prevalence of tiny objects, adverse environmental conditions, and strict computational constraints. Standard YOLO-based detectors fail to address these jointly: their minimum detection stride of 8 pixels renders sub-32px objects nearly undetectable, their CIoU loss produces zero gradients for non-overlapping tiny boxes, and their architectures contain significant filter redundancy. We propose DroneScan-YOLO, a holistic system contribution that addresses these limitations through four coordinated design choices: (1) increased input resolution of 1280x1280 to maximize spatial detail for tiny objects, (2) RPA-Block, a dynamic filter pruning mechanism based on lazy cosine-similarity updates with a 10-epoch warm-up period, (3) MSFD, a lightweight P2 detection branch at stride 4 adding only 114,592 parameters (+1.1%), and (4) SAL-NWD, a hybrid loss combining Normalized Wasserstein Distance with size-adaptive CIoU weighting, integrated into YOLOv8’s TaskAligned assignment pipeline. Evaluated on VisDrone2019-DET, DroneScan-YOLO achieves 55.3% mAP@50 and 35.6% mAP@50-95, outperforming the YOLOv8s baseline by +16.6 and +12.3 points respectively, improving recall from 0.374 to 0.518, and maintaining 96.7 FPS inference speed with only +4.1% parameters. Gains are most pronounced on tiny object classes: bicycle AP@50 improves from 0.114 to 0.328 (+187%), and awning-tricycle from 0.156 to 0.237 (+52%).
[CV-96] Rethinking Uncertainty in Segmentation: From Estimation to Decision
【速读】:该论文旨在解决医学图像分割中不确定性估计难以转化为实际决策的问题,即如何将不确定性图(uncertainty maps)有效转化为可操作的策略(如接受、标记或推迟预测),从而提升分割结果的安全性和可靠性。其核心解决方案是将分割任务建模为两阶段流程:先估计不确定性,再基于此做出决策;关键创新在于提出一种简单的置信度感知型推迟规则(confidence-aware deferral rule),该规则优先处理不确定且低置信度的预测区域,而非单纯优化不确定性本身。实验表明,该方法在仅推迟25%像素的情况下可消除高达80%的分割错误,并展现出良好的跨数据集鲁棒性,同时揭示了传统校准指标与真实决策性能之间的脱节,强调应以决策效果作为评估不确定性的标准。
链接: https://arxiv.org/abs/2604.13262
作者: Saket Maganti
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 29 pages, 12 tables, 9 figures, Github repo: Saket-Maganti/medical-seg-uncertainity
Abstract:In medical image segmentation, uncertainty estimates are often reported but rarely used to guide decisions. We study the missing step: how uncertainty maps are converted into actionable policies such as accepting, flagging, or deferring predictions. We formulate segmentation as a two-stage pipeline, estimation followed by decision, and show that optimizing uncertainty alone fails to capture most of the achievable safety gains. Using retinal vessel segmentation benchmarks (DRIVE, STARE, CHASE_DB1), we evaluate two uncertainty sources (Monte Carlo Dropout and Test-Time Augmentation) combined with three deferral strategies, and introduce a simple confidence-aware deferral rule that prioritizes uncertain and low-confidence predictions. Our results show that the best method and policy combination removes up to 80 percent of segmentation errors at only 25 percent pixel deferral, while achieving strong cross-dataset robustness. We further show that calibration improvements do not translate to better decision quality, highlighting a disconnect between standard uncertainty metrics and real-world utility. These findings suggest that uncertainty should be evaluated based on the decisions it enables, rather than in isolation.
[CV-97] 4th Workshop on Maritime Computer Vision (MaCVi): Challenge Overview CVPR2026
【速读】:该论文旨在解决海洋计算机视觉(Maritime Computer Vision)领域中模型预测准确性与嵌入式实时可行性之间的平衡问题,尤其在复杂海上场景下提升算法的实用性和部署效率。其解决方案的关键在于设计并实施五个基准挑战(benchmark challenges),涵盖多样化的任务场景,同时严格评估方法在精度与实时性能上的综合表现,并通过公开数据集、评分机制和领先团队的技术报告,系统性地揭示当前主流方法的趋势与实践经验,从而推动该领域从实验室研究向实际应用的转化。
链接: https://arxiv.org/abs/2604.13244
作者: Benjamin Kiefer,Jan Lukas Augustin,Jon Muhovič,Mingi Jeong,Arnold Wiliem,Janez Pers,Matej Kristan,Alberto Quattrini Li,Matija Teršek,Josip Šarić,Arpita Vats,Dominik Hildebrand,Rafia Rahim,Mahmut Karaaslan,Arpit Vaishya,Steve Xie,Ersin Kaya,Akib Mashrur,Tze-Hsiang Tang,Chun-Ming Tsai,Jun-Wei Hsieh,Ming-Ching Chang,Wonwoo Jo,Doyeon Lee,Yusi Cao,Lingling Li,Vinayak Nageli,Arshad Jamal,Gorthi Rama Krishna Sai Subrahmanyam,Jemo Maeng,Seongju Lee,Kyoobin Lee,Xu Liu,LiCheng Jiao,Jannik Sheikh,Martin Weinmann,Ivan Martinović,Jose Mateus Raitz Persch,Rahul Harsha Cheppally,Mehmet E. Belviranli,Dimitris Gahtidis,Hyewon Chun,Sangmun Lee,Philipp Gorczak,Hansol Kim,Jeeyeon Jeon,Borja Carrillo Perez,Jiahui Wang,Sangmin Park,Andreas Michel,Jannick Kuester,Bettina Felten,Wolfgang Gross,Yuan Feng,Justin Davis
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: Accepted to CVPR 2026 Workshop Proceeding; Maritime Computer Vision Workshop
Abstract:The 4th Workshop on Maritime Computer Vision (MaCVi) is organized as part of CVPR 2026. This edition features five benchmark challenges with emphasis on both predictive accuracy and embedded real-time feasibility. This report summarizes the MaCVi 2026 challenge setup, evaluation protocols, datasets, and benchmark tracks, and presents quantitative results, qualitative comparisons, and cross-challenge analyses of emerging method trends. We also include technical reports from top-performing teams to highlight practical design choices and lessons learned across the benchmark suite. Datasets, leaderboards, and challenge resources are available at this https URL.
[CV-98] A High-Resolution Landscape Dataset for Concept-Based XAI With Application to Species Distribution Models
【速读】:该论文旨在解决深度学习物种分布模型(Species Distribution Models, SDMs)在提升预测性能的同时,难以提取可解释生态信息的问题。传统SDMs虽能有效预测物种空间分布,但复杂模型如卷积神经网络(Convolutional Neural Networks, CNNs)和视觉Transformer(Vision Transformer)的“黑箱”特性阻碍了对驱动因素的理解。为实现预测准确性与生态可解释性的平衡,作者提出首次将基于概念的可解释人工智能(Concept-based Explainable AI, XAI)应用于SDMs,其核心在于采用鲁棒TCAV(Robust TCAV)方法量化景观概念对模型预测的影响。关键创新在于构建了一个基于高分辨率多光谱与LiDAR无人机影像的开放获取景观概念数据集,包含15类景观概念共653个样本及1450个随机参考样本,使模型输出能够与专家知识对照验证,并揭示潜在生态关联,从而生成新的生态假说并支持政策制定与土地管理决策。
链接: https://arxiv.org/abs/2604.13240
作者: Augustin de la Brosse,Damien Garreau,Thomas Houet,Thomas Corpetti
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Mapping the spatial distribution of species is essential for conservation policy and invasive species management. Species distribution models (SDMs) are the primary tools for this task, serving two purposes: achieving robust predictive performance while providing ecological insights into the driving factors of distribution. However, the increasing complexity of deep learning SDMs has made extracting these insights more challenging. To reconcile these objectives, we propose the first implementation of concept-based Explainable AI (XAI) for SDMs. We leverage the Robust TCAV (Testing with Concept Activation Vectors) methodology to quantify the influence of landscape concepts on model predictions. To enable this, we provide a new open-access landscape concept dataset derived from high-resolution multispectral and LiDAR drone imagery. It includes 653 patches across 15 distinct landscape concepts and 1,450 random reference patches, designed to suit a wide range of species. We demonstrate this approach through a case study of two aquatic insects, Plecoptera and Trichoptera, using two Convolutional Neural Networks and one Vision Transformer. Results show that concept-based XAI helps validate SDMs against expert knowledge while uncovering novel associations that generate new ecological hypotheses. Robust TCAV also provides landscape-level information, useful for policy-making and land management. Code and datasets are publicly available.
[CV-99] SemiFA: An Agent ic Multi-Modal Framework for Autonomous Semiconductor Failure Analysis Report Generation
【速读】:该论文旨在解决半导体失效分析(Failure Analysis, FA)过程中人工耗时长、效率低的问题,传统FA需工程师耗费数小时进行图像检查、设备遥测数据关联、历史缺陷记录查询及结构化报告撰写。其解决方案的关键在于提出SemiFA——一个基于多智能体的端到端多模态框架,通过LangGraph构建四代理流水线:DefectDescriber(基于DINOv2和LLaVA-1.6实现缺陷形态分类与描述)、RootCauseAnalyzer(融合SECS/GEM设备遥测数据与向量数据库中相似历史缺陷进行根因推理)、SeverityClassifier(评估缺陷严重性并估算良率影响)、RecipeAdvisor(推荐工艺调整方案),最终由第五节点生成PDF报告。该系统在NVIDIA A100 GPU上可在48秒内完成完整FA报告生成,且实验证明引入设备遥测数据显著提升根因推理准确性(+0.86复合评分),是首个将SECS/GEM遥测集成至视觉语言模型流程用于自主FA报告生成的系统。
链接: https://arxiv.org/abs/2604.13236
作者: Shivam Chand Kaushik
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注: 11 pages, 6 figures, 8 tables. Dataset available at this https URL . Code available at this https URL
Abstract:Semiconductor failure analysis (FA) requires engineers to examine inspection images, correlate equipment telemetry, consult historical defect records, and write structured reports, a process that can consume several hours of expert time per case. We present SemiFA, an agentic multi-modal framework that autonomously generates structured FA reports from semiconductor inspection images in under one minute. SemiFA decomposes FA into a four-agent LangGraph pipeline: a DefectDescriber that classifies and narrates defect morphology using DINOv2 and LLaVA-1.6, a RootCauseAnalyzer that fuses SECS/GEM equipment telemetry with historically similar defects retrieved from a Qdrant vector database, a SeverityClassifier that assigns severity and estimates yield impact, and a RecipeAdvisor that proposes corrective process adjustments. A fifth node assembles a PDF report. We introduce SemiFA-930, a dataset of 930 annotated semiconductor defect images paired with structured FA narratives across nine defect classes, drawn from procedural synthesis, WM-811K, and MixedWM38. Our DINOv2-based classifier achieves 92.1% accuracy on 140 validation images (macro F1 = 0.917), and the full pipeline produces complete FA reports in 48 seconds on an NVIDIA A100-SXM4-40 GB GPU. A GPT-4o judge ablation across four modality conditions demonstrates that multi-modal fusion improves root cause reasoning by +0.86 composite points (1-5 scale) over an image-only baseline, with equipment telemetry as the more load-bearing modality. To our knowledge, SemiFA is the first system to integrate SECS/GEM equipment telemetry into a vision-language model pipeline for autonomous FA report generation.
[CV-100] Neural 3D Reconstruction of Planetary Surfaces from Descent-Phase Wide-Angle Imagery
【速读】:该论文旨在解决行星着陆过程中宽视场影像的高精度三维地形重建难题,传统多视图立体匹配(Multi-View Stereo, MVS)方法在垂直下降、主要朝向天底的相机配置下,受限于强径向畸变和有限视差,导致深度范围受限且重建保真度不足,同时缺乏针对行星表面特性的领域先验信息。其解决方案的关键在于提出一种基于显式神经高度场(neural height field)表示的新方法,利用行星表面通常连续、平滑、致密且无漂浮物体的物理特性,引入强先验约束,从而显著提升空间覆盖范围并保持满意的估计精度,相较于传统MVS方法展现出更强的竞争力。
链接: https://arxiv.org/abs/2604.13235
作者: Melonie de Almeida,George Brydon,Divya M. Persaud,John H. Williamson,Paul Henderson
机构: University of Glasgow (格拉斯哥大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Digital elevation modeling of planetary surfaces is essential for studying past and ongoing geological processes. Wide-angle imagery acquired during spacecraft descent promises to offer a low-cost option for high-resolution terrain reconstruction. However, accurate 3D reconstruction from such imagery is challenging due to strong radial distortion and limited parallax from vertically descending, predominantly nadir-facing cameras. Conventional multi-view stereo exhibits limited depth range and reduced fidelity under these conditions and also lacks domain-specific priors. We present the first study of modern neural reconstruction methods for planetary descent imaging. We also develop a novel approach that incorporates an explicit neural height field representation, which provides a strong prior since planetary surfaces are generally continuous, smooth, solid, and free from floating objects. This study demonstrates that neural approaches offer a strong and competitive alternative to traditional multi-view stereo (MVS) methods. Experiments on simulated descent sequences over high-fidelity lunar and Mars terrains demonstrate that the proposed approach achieves increased spatial coverage while maintaining satisfactory estimation accuracy.
[CV-101] Multitasking Embedding for Embryo Blastocyst Grading Prediction (MEmEBG)
【速读】:该论文旨在解决体外受精(IVF)中囊胚质量评估的主观性强、胚胎学家间差异大以及质量控制标准化困难的问题。当前依赖形态学特征的视觉评估方法存在局限性,难以实现客观、一致的分级。解决方案的关键在于提出一种基于嵌入(embedding)的多任务学习方法,利用预训练的ResNet-18网络结构并引入嵌入层,从有限的第5天人类胚胎图像中提取生物和物理特征,自动识别滋养外胚层(trophectoderm, TE)和内细胞团(inner cell mass, ICM)区域及其等级,并同时预测囊胚扩张程度(blastocyst expansion, EXP)。该方法能够学习判别性表征,在视觉上相似且难以区分的TE与ICM结构识别中展现出鲁棒性和一致性,为囊胚质量自动化评估提供了可行路径。
链接: https://arxiv.org/abs/2604.13217
作者: Nahid Khoshk Angabini,Mohsen Tajgardan,Mahesh Madhavan,Zahra Asghari Varzaneh,Reza Khoshkangini,Thomas Ebner
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Reliable evaluation of blastocyst quality is critical for the success of in vitro fertilization (IVF) treatments. Current embryo grading practices primarily rely on visual assessment of morphological features, which introduces subjectivity, inter-embryologist variability, and challenges in standardizing quality assurance. In this study, we propose a multitask embedding-based approach for the automated analysis and prediction of key blastocyst components, including the trophectoderm (TE), inner cell mass (ICM), and blastocyst expansion (EXP). The method leverages biological and physical characteristics extracted from images of day-5 human embryos. A pretrained ResNet-18 architecture, enhanced with an embedding layer, is employed to learn discriminative representations from a limited dataset and to automatically identify TE and ICM regions along with their corresponding grades, structures that are visually similar and inherently difficult to distinguish. Experimental results demonstrate the promise of the multitask embedding approach and potential for robust and consistent blastocyst quality assessment.
[CV-102] owards Patient-Specific Deformable Registration in Laparoscopic Surgery
【速读】:该论文旨在解决通用外科手术中因术前与术中器官表面存在形变和噪声导致的患者特异性三维(3D)模型难以可靠配准的问题,从而影响术中可视化与解剖引导精度,增加并发症风险。其解决方案的关键在于提出首个基于Transformer架构的患者特异性非刚性点云配准方法,通过创新的数据生成策略优化个体化效果,并结合重叠估计模块与专用匹配模块预测密集对应关系,再利用物理驱动算法实现高精度注册,实验表明该方法在合成数据上达到45%的匹配分数(Matching Score)和92%的内点率(Inlier Ratio),显著优于传统无关个体的方法。
链接: https://arxiv.org/abs/2604.13186
作者: Alberto Neri,Veronica Penza,Nazim Haouchine,Leonardo S. Mattos
机构: Biomedical Robotics Lab, Istituto Italiano di Tecnologia, Genoa, Italy; Department of Computer Science, Bioengineering, Robotics and Systems Engineering (DIBRIS), University of Genoa, Italy; Harvard Medical School, Brigham and Women’s Hospital, Boston, USA
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Unsafe surgical care is a critical health concern, often linked to limitations in surgeon experience, skills, and situational awareness. Integrating patient-specific 3D models into the surgical field can enhance visualization, provide real-time anatomical guidance, and reduce intraoperative complications. However, reliably registering these models in general surgery remains challenging due to mismatches between preoperative and intraoperative organ surfaces, such as deformations and noise. To overcome these challenges, we introduce the first patient-specific non-rigid point cloud registration method, which leverages a novel data generation strategy to optimize outcomes for individual patients. Our approach combines a Transformer encoder-decoder architecture with overlap estimation and a dedicated matching module to predict dense correspondences, followed by a physics-based algorithm for registration. Experimental results on both synthetic and real data demonstrate that our patient-specific method significantly outperforms traditional agnostic approaches, achieving 45% Matching Score with 92% Inlier Ratio on synthetic data, highlighting its potential to improve surgical care.
[CV-103] GeoLink: A 3D-Aware Framework Towards Better Generalization in Cross-View Geo-Localization
【速读】:该论文旨在解决**跨视角地理定位(cross-view geo-localization)**中因视点变化和域偏移导致的语义不一致性问题,从而实现无需GPS监督的泛化能力。现有方法依赖2D对应关系,易受冗余共享信息干扰,难以提取可迁移的特征表示。其解决方案的关键在于提出GeoLink框架,通过离线使用VGGT从多视角无人机图像重建场景点云,提供稳定的3D结构先验;在此基础上,设计两个互补模块:1)几何感知语义精炼模块(Geometric-aware Semantic Refinement),在3D引导下缓解2D特征中的冗余与视角偏差依赖;2)统一视图关系蒸馏模块(Unified View Relation Distillation),将3D结构关系迁移至2D特征中,增强跨视角对齐能力,同时保持纯2D推理流程。
链接: https://arxiv.org/abs/2604.13183
作者: Hongyang Zhang,Yinhao Liu,Haitao Zhang,Zhongyi Wen,Shuxian Liang,Xiansheng Hua
机构: The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳)); Xiamen University (厦门大学); University of Electronic Science and Technology of China (电子科技大学); Tongji University (同济大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:
Abstract:Generalizable cross-view geo-localization aims to match the same location across views in unseen regions and conditions without GPS supervision. Its core difficulty lies in severe semantic inconsistency caused by viewpoint variation and poor generalization under domain shift. Existing methods mainly rely on 2D correspondence, but they are easily distracted by redundant shared information across views, leading to less transferable representations. To address this, we propose GeoLink, a 3D-aware semantic-consistent framework for Generalizable cross-view geo-localization. Specifically, we offline reconstruct scene point clouds from multi-view drone images using VGGT, providing stable structural priors. Based on these 3D anchors, we improve 2D representation learning in two complementary ways. A Geometric-aware Semantic Refinement module mitigates potentially redundant and view-biased dependencies in 2D features under 3D guidance. In addition, a Unified View Relation Distillation module transfers 3D structural relations to 2D features, improving cross-view alignment while preserving a 2D-only inference pipeline. Extensive experiments on multiple benchmarks show that GeoLink consistently outperforms state-of-the-art methods and achieves superior generalization across unseen domains and diverse weather environments.
[CV-104] 3DRealHead: Few-Shot Detailed Head Avatar
【速读】:该论文旨在解决当前3D头像重建方法在忠实还原用户身份特征和面部表情细节方面的局限性,尤其是在训练数据有限的情况下难以捕捉高度个性化区域(如嘴部与牙齿区域)的多样性问题,同时现有基于3D形态模型(3D Morphable Model, 3DMM)的表情控制机制表达能力受限。其解决方案的关键在于提出一种名为3DRealHead的新方法,通过一个新颖的少样本逆向生成流程来重建3D头像——该流程利用Style U-Net结构表示的3D人体头部先验,输出可渲染至新视角的3D高斯基元;此外,为提升动画表现力,该方法引入来自单目视频流的额外嘴部特征作为表达控制信号,从而弥补3DMM无法建模的复杂表情,实现更贴近物理现实的动态人脸再现。
链接: https://arxiv.org/abs/2604.13171
作者: Jalees Nehvi,Timo Bolkart,Thabo Beeler,Justus Thies
机构: Technical University of Darmstadt (达姆施塔特工业大学); Google(谷歌)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The human face is central to communication. For immersive applications, the digital presence of a person should mirror the physical reality, capturing the users idiosyncrasies and detailed facial expressions. However, current 3D head avatar methods often struggle to faithfully reproduce the identity and facial expressions, despite having multi-view data or learned priors. Learning priors that capture the diversity of human appearances, especially, for regions with highly person-specific features, like the mouth and teeth region is challenging as the underlying training data is limited. In addition, many of the avatar methods are purely relying on 3D morphable model-based expression control which strongly limits expressivity. To address these challenges, we are introducing 3DRealHead, a few-shot head avatar reconstruction method with a novel expression control signal that is extracted from a monocular video stream of the subject. Specifically, the subject can take a few pictures of themselves, recover a 3D head avatar and drive it with a consumer-level webcam. The avatar reconstruction is enabled via a novel few-shot inversion process of a 3D human head prior which is represented as a Style U-Net that emits 3D Gaussian primitives which can be rendered under novel views. The prior is learned on the NeRSemble dataset. For animating the avatar, the U-Net is conditioned on 3DMM-based facial expression signals, as well as features of the mouth region extracted from the driving video. These additional mouth features allow us to recover facial expressions that cannot be represented by the 3DMM leading to a higher expressivity and closer resemblance to the physical reality.
[CV-105] PatchPoison: Poisoning Multi-View Datasets to Degrade 3D Reconstruction CVPR
【速读】:该论文旨在解决3D高斯溅射(3D Gaussian Splatting, 3DGS)技术在多视角图像中可能引发的隐私泄露问题,即未经授权即可利用公开图像或视频重建出高保真三维模型。解决方案的关键在于提出一种轻量级的数据集投毒方法 PatchPoison:它在每张多视角图像的边缘注入一个小型高频对抗性补丁(结构化棋盘图案),该补丁通过引入虚假特征匹配点干扰结构光恢复(Structure-from-Motion, SfM)流程(如 COLMAP),导致相机位姿估计系统性偏移,从而使得后续 3DGS 优化无法收敛至真实场景几何结构。此方法无需修改现有重建流水线,可作为“即插即用”的预处理步骤,有效保护内容创作者的多视角数据隐私。
链接: https://arxiv.org/abs/2604.13153
作者: Prajas Wadekar,Venkata Sai Pranav Bachina,Kunal Bhosikar,Ankit Gangwal,Charu Sharma
机构: International Institute of Information Technology, Hyderabad, India
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: CVPR Workshop on Security, Privacy, and Adversarial Robustness in 3D Generative Vision Models (SPAR-3D), 2026
Abstract:3D Gaussian Splatting (3DGS) has recently enabled highly photorealistic 3D reconstruction from casually captured multi-view images. However, this accessibility raises a privacy concern: publicly available images or videos can be exploited to reconstruct detailed 3D models of scenes or objects without the owner’s consent. We present PatchPoison, a lightweight dataset-poisoning method that prevents unauthorized 3D reconstruction. Unlike global perturbations, PatchPoison injects a small high-frequency adversarial patch, a structured checkerboard, into the periphery of each image in a multi-view dataset. The patch is designed to corrupt the feature-matching stage of Structure-from-Motion (SfM) pipelines such as COLMAP by introducing spurious correspondences that systematically misalign estimated camera poses. Consequently, downstream 3DGS optimization diverges from the correct scene geometry. On the NeRF-Synthetic benchmark, inserting a 12 X 12 pixel patch increases reconstruction error by 6.8x in LPIPS, while the poisoned images remain unobtrusive to human viewers. PatchPoison requires no pipeline modifications, offering a practical, “drop-in” preprocessing step for content creators to protect their multi-view data.
[CV-106] Multi-modal panoramic 3D outdoor datasets for place categorization IROS IROS2026
【速读】:该论文旨在解决多模态全景三维室外场景(multi-modal panoramic 3D outdoor, MPO)的语义场景分类问题,即自动识别并区分六类典型环境:森林、海岸、住宅区、城区及室内外停车场。解决方案的关键在于构建两个高质量、公开可用的MPO数据集:第一个为密集点云数据集(900万点),使用FARO激光扫描仪获取同步彩色图像与反射率信息;第二个为稀疏实时点云数据集(7万点),基于Velodyne激光雷达在车辆行驶中采集,用于模拟真实动态场景。通过这两个互补的数据集,作者验证了多种语义分类方法,并实现了最高达96.42%(密集)和89.67%(稀疏)的分类准确率,证明了多模态融合与高质量标注数据对提升场景理解性能的重要性。
链接: https://arxiv.org/abs/2604.13142
作者: Hojung Jung,Yuki Oto,Oscar M. Mozos,Yumi Iwashita,Ryo Kurazume
机构: Kyushu University (九州大学); Polytechnic University of Cartagena (卡塔赫纳理工大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Databases (cs.DB)
备注: This is the authors’ manuscript. The final published article was presented at IROS 2026, and it is available at this https URL
Abstract:We present two multi-modal panoramic 3D outdoor (MPO) datasets for semantic place categorization with six categories: forest, coast, residential area, urban area and indoor/outdoor parking lot. The first dataset consists of 650 static panoramic scans of dense (9,000,000 points) 3D color and reflectance point clouds obtained using a FARO laser scanner with synchronized color images. The second dataset consists of 34,200 real-time panoramic scans of sparse (70,000 points) 3D reflectance point clouds obtained using a Velodyne laser scanner while driving a car. The datasets were obtained in the city of Fukuoka, Japan and are publicly available in [1], [2]. In addition, we compare several approaches for semantic place categorization with best results of 96.42% (dense) and 89.67% (sparse).
[CV-107] Depth-Resolved Coral Reef Thermal Fields from Satellite SST and Sparse In-Situ Loggers Using Physics-Informed Neural Networks
【速读】:该论文旨在解决卫星海表温度(Sea Surface Temperature, SST)产品在珊瑚白化监测中因仅反映海洋表层温度而无法准确评估不同水深热应力的问题。由于珊瑚栖息深度可达20米以下,其所在水层温度常比表层低1–3°C,若直接将卫星SST应用于所有深度会导致热应激过估计。解决方案的关键在于提出一种物理信息神经网络(Physics-Informed Neural Network, PINN),该模型将NOAA珊瑚礁观测站的SST作为硬边界条件,并融合稀疏现场温 logger 数据,在一维垂直热传导方程框架下联合学习有效热扩散系数(κ)和光衰减系数(Kd),从而实现对非均匀深度剖面温度的高精度预测。
链接: https://arxiv.org/abs/2604.13131
作者: Alzayat Saleh,Mostafa Rahimi Azghadi
机构: James Cook University (詹姆斯库克大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 23 pages, 7 figures, submitted to Remote Sensing of Environment
Abstract:Satellite sea surface temperature (SST) products underpin global coral bleaching monitoring, yet they measure only the ocean skin. Corals inhabit depths from the shallows to beyond 20 metres, where temperatures can be 1-3°C cooler than the surface; applying satellite SST uniformly to all depths therefore overestimates subsurface thermal stress. We present a physics-informed neural network (PINN) that fuses NOAA Coral Reef Watch SST with sparse in-situ temperature loggers within the one-dimensional vertical heat equation, enforcing SST as a hard surface boundary condition and jointly learning effective thermal diffusivity (\kappa) and light attenuation (Kd). Validated across four Great Barrier Reef sites (30 holdout experiments), the PINN achieves 0.25-1.38°C RMSE at unseen depths. Under extreme sparsity (three training depths), the PINN maintains 0.27°C RMSE at the 5 metre holdout and 0.32°C at the 9.1 metre holdout, where statistical baselines collapse to 1.8°C; it outperforms a physics-only finite-difference baseline in 90% of experiments. Depth-resolved Degree Heating Day (DHD) profiles show that thermal stress attenuates with depth: at Davies Reef, DHD drops from 0.29 at the surface to zero by 10.7 metres, consistent with logger observations, while satellite DHD remains constant at 0.31 across all depths. However, the PINN underestimates absolute DHD at shallow depths because its smooth predictions attenuate the short-duration peaks that drive threshold exceedances; PINN DHD values should be interpreted as conservative lower bounds on depth-resolved stress. These results demonstrate that physics-constrained fusion of satellite SST with sparse loggers can extend bleaching assessment to the depth dimension using existing observational infrastructure.
[CV-108] Graph Propagated Projection Unlearning: A Unified Framework for Vision and Audio Discriminative Models
【速读】:该论文旨在解决深度神经网络中对特定类别信息进行选择性且高效擦除的问题,这在隐私保护、合规性要求及自适应系统设计中至关重要。其核心挑战在于如何在不显著影响保留类性能的前提下,实现目标类信息的彻底、不可逆删除。解决方案的关键是提出图传播投影遗忘(Graph-Propagated Projection Unlearning, GPPU),该方法利用图结构在特征空间中识别特定类别的方向,并将表示投影到正交子空间,随后通过针对性微调确保目标类信息被有效移除,同时保持模型在其余类别上的性能。GPPU具有统一性和可扩展性,适用于视觉和音频模型,在多种架构(如CNN、Vision Transformer、Audio Transformer)上均表现出10–20倍于现有方法的速度提升,且验证了其在大规模场景下的有效性与普适性。
链接: https://arxiv.org/abs/2604.13127
作者: Shreyansh Pathak,Jyotishman Das
机构: Indian Institute of Technology Jodhpur (印度理工学院乔德普尔分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注:
Abstract:The need to selectively and efficiently erase learned information from deep neural networks is becoming increasingly important for privacy, regulatory compliance, and adaptive system design. We introduce Graph-Propagated Projection Unlearning (GPPU), a unified and scalable algorithm for class-level unlearning that operates across both vision and audio models. GPPU employs graph-based propagation to identify class-specific directions in the feature space and projects representations onto the orthogonal subspace, followed by targeted fine-tuning, to ensure that target class information is effectively and irreversibly removed. Through comprehensive evaluations on six vision datasets and two large-scale audio benchmarks spanning a variety of architectures including CNNs, Vision Transformers, and Audio Transformers, we demonstrate that GPPU achieves highly efficient unlearning, realizing 10-20x speedups over prior methodologies while preserving model utility on retained classes. Our framework provides a principled and modality-agnostic approach to machine unlearning, evaluated at a scale that has received limited attention in prior work, contributing toward more efficient and responsible deep learning.
[CV-109] A Lightweight Multi-Metric No-Reference Image Quality Assessment Framework for UAV Imaging
【速读】:该论文旨在解决无参考图像质量评估(No Reference Image Quality Assessment, NR-IQA)中的关键挑战,即在缺乏原始参考图像的情况下,如何高效且可靠地对大量自动采集的图像进行质量筛选。其解决方案的核心是提出了一种轻量级多指标图像质量评估框架(Multi-Metric Image Quality Assessment, MM-IQA),该框架融合了多种可解释的质量退化线索,包括模糊、边缘结构、低分辨率伪影、曝光不平衡、噪声、雾霾和频域内容等,综合生成一个0到100之间的单一质量评分。该方法在多个基准数据集上表现出良好的性能(SROCC值介于0.647至0.830之间),同时具备较低的计算复杂度(每张图像约1.97秒)和线性增长的内存占用,适用于实时图像质量筛查场景。
链接: https://arxiv.org/abs/2604.13112
作者: Koffi Titus Sergio Aglin,Anthony K. Muchiri,Celestin Nkundineza
机构: Pan African University Institute for Basic Sciences, Technology and Innovation (PAUSTI); Jomo Kenyatta University of Agriculture and Technology (JKUAT); University of Rwanda
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 5 figures, article
Abstract:Reliable image quality assessment is essential in applications where large volumes of images are acquired automatically and must be filtered before further analysis. In many practical scenarios, a pristine reference image is unavailable, making no reference image quality assessment (NR-IQA) particularly important. This paper introduces Multi-Metric Image Quality Assessment (MM-IQA), a lightweight multi-metric framework for NR-IQA. It combines interpretable cues related to blur, edge structure, low resolution artifacts, exposure imbalance, noise, haze, and frequency content to produce a single quality score in the range [0,100].MM-IQA was evaluated on five benchmark datasets (KonIQ-10k, LIVE Challenge, KADID-10k, TID2013, and BIQ2021) and achieved SRCC values ranging from 0.647 to 0.830. Additional experiments on a synthetic agricultural dataset showed consistent behavior of the designed cues. The Python/OpenCV implementation required about 1.97 s per image. This method also has modest memory requirements because it stores only a limited number of intermediate grayscale, filtered, and frequency-domain representations, resulting in memory usage that scales linearly with image size. The results show that MM-IQA can be used for fast image quality screening with explicit distortion aware cues and modest computational cost.
[CV-110] Automatic Charge State Tuning of 300 mm FDSOI Quantum Dots Using Neural Network Segmentation of Charge Stability Diagram
【速读】:该论文旨在解决半导体量子点(Quantum Dots, QDs)中电荷自动调谐(charge auto-tuning)这一制约自旋量子比特(spin qubit)技术扩展的关键瓶颈问题。其核心解决方案是提出了一种基于深度学习(Deep Learning, DL)的语义分割流程,通过训练一个以MobileNetV2为编码器的U-Net结构卷积神经网络(Convolutional Neural Network, CNN),从完整的电荷稳定图(Charge Stability Diagrams, CSDs)中自动识别过渡线并定位单电荷区(single charge regime),进而输出目标栅极电压。该方法在包含1015张实验CSDs的大规模异构数据集上验证,实现了80.0%的整体离线调谐成功率,部分器件设计性能超过88%,同时通过分析失败模式提出了针对性改进策略,并展示了该方法在低温晶圆探针台中实现实时集成的可行性路径。
链接: https://arxiv.org/abs/2604.13662
作者: Peter Samaha,Amine Torki,Ysaline Renaud,Sam Fiette,Emmanuel Chanrion,Pierre-Andre Mortemousque,Yann Beilliard
机构: CEA-Leti, Univ. Grenoble Alpes, F-38000 Grenoble, France
类目: Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 10 pages, 6 figures, supplementary materials available
Abstract:Tuning of gate-defined semiconductor quantum dots (QDs) is a major bottleneck for scaling spin qubit technologies. We present a deep learning (DL) driven, semantic-segmentation pipeline that performs charge auto-tuning by locating transition lines in full charge stability diagrams (CSDs) and returns gate voltage targets for the single charge regime. We assemble and manually annotate a large, heterogeneous dataset of 1015 experimental CSDs measured from silicon QD devices, spanning nine design geometries, multiple wafers, and fabrication runs. A U-Net style convolutional neural network (CNN) with a MobileNetV2 encoder is trained and validated through five-fold group cross validation. Our model achieves an overall offline tuning success of 80.0% in locating the single-charge regime, with peak performance exceeding 88% for some designs. We analyze dominant failure modes and propose targeted mitigations. Finally, wide-range diagram segmentation also naturally enables scalable physic-based feature extraction that can feed back to fabrication and design workflows and outline a roadmap for real-time integration in a cryogenic wafer prober. Overall, our results show that neural network (NN) based wide-diagram segmentation is a practical step toward automated, high-throughput charge tuning for silicon QD qubits.
[CV-111] Learning Class Difficulty in Imbalanced Histopathology Segmentation via Dynamic Focal Attention
【速读】:该论文旨在解决组织病理图像语义分割中因类别不平衡导致的性能下降问题,尤其是传统基于频率的损失重加权方法无法捕捉由形态多样性、边界模糊性和上下文相似性等因素引起的真正难度。其解决方案的关键在于提出动态焦点注意力(Dynamic Focal Attention, DFA),该机制在基于查询的掩码解码器的交叉注意力中直接学习类特定难度,通过引入可学习的每类偏置项作用于注意力logits,在预测前实现表示层面的再加权,而非传统的梯度层面重加权;DFA初始值基于对数频率先验以防止梯度饥饿,并通过端到端优化自适应捕获难度信号,从而统一频率驱动与难度感知策略,无需额外估计器或训练阶段即可显著提升Dice和IoU指标。
链接: https://arxiv.org/abs/2604.13479
作者: Lakmali Nadeesha Kumari,Sen-Ching Samson Cheung
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Semantic segmentation of histopathology images under class imbalance is typically addressed through frequency-based loss reweighting, which implicitly assumes that rare classes are difficult. However, true difficulty also arises from morphological variability, boundary ambiguity, and contextual similarity-factors that frequency cannot capture. We propose Dynamic Focal Attention (DFA), a simple and efficient mechanism that learns class-specific difficulty directly within the cross-attention of query-based mask decoders. DFA introduces a learnable per-class bias to attention logits, enabling representation-level reweighting prior to prediction rather than gradient-level reweighting after prediction. Initialised from a log-frequency prior to prevent gradient starvation, the bias is optimised end-to-end, allowing the model to adaptively capture difficulty signals through training, effectively unifying frequency-based and difficulty-aware approaches under a common attention-bias framework. On three histopathology benchmarks (BDSA, BCSS, CRAG), DFA consistently improves Dice and IoU, matching or exceeding a difficulty-aware baseline without a separate estimator or additional training stage. These results demonstrate that encoding class difficulty at the representation level provides a principled alternative to conventional loss reweighting for imbalanced segmentation.
人工智能
[AI-0] LongCoT: Benchmarking Long-Horizon Chain-of-Thought Reasoning
【速读】:该论文旨在解决当前语言模型在执行复杂自主任务时,长期链式思维(Chain-of-Thought, CoT)推理能力不足的问题,尤其是模型在处理需要数十万级推理标记的长程任务时表现出显著性能下降。解决方案的关键在于提出一个可扩展的基准测试 LongCoT,包含2,500个由专家设计的问题,覆盖化学、数学、计算机科学、国际象棋和逻辑等多个领域,每个问题均具有短输入和可验证答案,但求解过程需遍历由相互依赖步骤构成的图结构,从而直接测量前沿模型在长程CoT推理中的表现。该设计使得局部步骤对当前模型而言是可处理的,而失败则明确反映出模型在长程规划与管理上的局限性,为评估和改进模型的长期推理能力提供了严谨的量化标准。
链接: https://arxiv.org/abs/2604.14140
作者: Sumeet Ramesh Motwani,Daniel Nichols,Charles London,Peggy Li,Fabio Pizzati,Acer Blake,Hasan Hammoud,Tavish McDonald,Akshat Naik,Alesia Ivanova,Vignesh Baskaran,Ivan Laptev,Ruben Glatt,Tal Ben-Nun,Philip Torr,Natasha Jaques,Ameya Prabhu,Brian Bartoldson,Bhavya Kailkhura,Christian Schroeder de Witt
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Long-Horizon Reasoning Benchmark
Abstract:As language models are increasingly deployed for complex autonomous tasks, their ability to reason accurately over longer horizons becomes critical. An essential component of this ability is planning and managing a long, complex chain-of-thought (CoT). We introduce LongCoT, a scalable benchmark of 2,500 expert-designed problems spanning chemistry, mathematics, computer science, chess, and logic to isolate and directly measure the long-horizon CoT reasoning capabilities of frontier models. Problems consist of a short input with a verifiable answer; solving them requires navigating a graph of interdependent steps that span tens to hundreds of thousands of reasoning tokens. Each local step is individually tractable for frontier models, so failures reflect long-horizon reasoning limitations. At release, the best models achieve 10% accuracy (GPT 5.2: 9.8%; Gemini 3 Pro: 6.1%) on LongCoT, revealing a substantial gap in current capabilities. Overall, LongCoT provides a rigorous measure of long-horizon reasoning, tracking the ability of frontier models to reason reliably over extended periods.
[AI-1] UMI-3D: Extending Universal Manipulation Interface from Vision-Limited to 3D Spatial Perception
【速读】:该论文旨在解决原始通用操作接口(Universal Manipulation Interface, UMI)在真实场景中因依赖单目视觉SLAM(Simultaneous Localization and Mapping)而易受遮挡、动态场景和跟踪失败影响,导致数据采集鲁棒性和可扩展性不足的问题。解决方案的关键在于引入一个轻量级低成本的激光雷达(LiDAR)传感器并紧密集成于腕部佩戴式界面中,实现以LiDAR为中心的SLAM,从而在复杂环境下获得高精度的尺度感知位姿估计;同时构建了硬件同步的多模态传感流水线与统一时空标定框架,将视觉观测与LiDAR点云对齐,生成一致的3D演示表示,显著提升了数据质量与可靠性,并直接改善策略性能。
链接: https://arxiv.org/abs/2604.14089
作者: Ziming Wang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:We present UMI-3D, a multimodal extension of the Universal Manipulation Interface (UMI) for robust and scalable data collection in embodied manipulation. While UMI enables portable, wrist-mounted data acquisition, its reliance on monocular visual SLAM makes it vulnerable to occlusions, dynamic scenes, and tracking failures, limiting its applicability in real-world environments. UMI-3D addresses these limitations by introducing a lightweight and low-cost LiDAR sensor tightly integrated into the wrist-mounted interface, enabling LiDAR-centric SLAM with accurate metric-scale pose estimation under challenging conditions. We further develop a hardware-synchronized multimodal sensing pipeline and a unified spatiotemporal calibration framework that aligns visual observations with LiDAR point clouds, producing consistent 3D representations of demonstrations. Despite maintaining the original 2D visuomotor policy formulation, UMI-3D significantly improves the quality and reliability of collected data, which directly translates into enhanced policy performance. Extensive real-world experiments demonstrate that UMI-3D not only achieves high success rates on standard manipulation tasks, but also enables learning of tasks that are challenging or infeasible for the original vision-only UMI setup, including large deformable object manipulation and articulated object operation. The system supports an end-to-end pipeline for data acquisition, alignment, training, and deployment, while preserving the portability and accessibility of the original UMI. All hardware and software components are open-sourced to facilitate large-scale data collection and accelerate research in embodied intelligence: \hrefthis https URLthis https URL.
[AI-2] IP: Token Importance in On-Policy Distillation
【速读】:该论文旨在解决在线策略知识蒸馏(On-policy Knowledge Distillation, OPD)中token重要性评估不充分的问题,即现有方法未能全面识别哪些token能提供最具学习价值的信号。其关键解决方案是提出一个双轴分类框架TIP(Token Importance in on-Policy distillation),从**学生模型熵(student entropy)和教师-学生差异度(teacher–student divergence)**两个维度重新定义token的重要性:一方面,高熵位置通常包含不确定信息,具有较强的学习潜力;另一方面,低熵但高差异的位置代表学生过度自信且错误的情况,蕴含密集的纠正信号。实验表明,仅保留约10%的此类关键token即可接近全token训练性能,同时显著降低峰值内存消耗(最高达47%),并验证了该方法在多个大语言模型(Qwen3、Llama、Qwen2.5)与任务(MATH-500、AIME 2024/2025、DeepPlanning)上的有效性。
链接: https://arxiv.org/abs/2604.14084
作者: Yuanda Xu,Hejian Sang,Zhengze Zhou,Ran He,Zhipeng Wang,Alborz Geramifard
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:On-policy knowledge distillation (OPD) trains a student on its own rollouts under token-level supervision from a teacher. Not all token positions matter equally, but existing views of token importance are incomplete. We ask a direct question: which tokens carry the most useful learning signal in OPD? Our answer is that informative tokens come from two regions: positions with high student entropy, and positions with low student entropy plus high teacher–student divergence, where the student is overconfident and wrong. Empirically, student entropy is a strong first-order proxy: retaining 50% of tokens with entropy-based sampling matches or exceeds all-token training while reducing peak memory by up to 47% . But entropy alone misses a second important region. When we isolate low-entropy, high-divergence tokens, training on fewer than 10% of all tokens nearly matches full-token baselines, showing that overconfident tokens carry dense corrective signal despite being nearly invisible to entropy-only rules. We organize these findings with TIP (Token Importance in on-Policy distillation), a two-axis taxonomy over student entropy and teacher–student divergence, and give a theoretical explanation for why entropy is useful yet structurally incomplete. This view motivates type-aware token selection rules that combine uncertainty and disagreement. We validate this picture across three teacher–student pairs spanning Qwen3, Llama, and Qwen2.5 on MATH-500 and AIME 2024/2025, and on the DeepPlanning benchmark for long-horizon agentic planning, where Q3-only training on 20% of tokens surpasses full-token OPD. Our experiments are implemented by extending the OPD repository this https URL, which supports memory-efficient distillation of larger models under limited GPU budgets. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2604.14084 [cs.LG] (or arXiv:2604.14084v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2604.14084 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-3] First-See-Then-Design: A Multi-Stakeholder View for Optimal Performance-Fairness Trade-Offs
【速读】:该论文旨在解决当前算法决策公平性研究中过度聚焦于预测空间(predictive space)而忽视决策转化过程及其对决策者(Decision-Maker, DM)与决策对象(Decision Subject, DS)福利影响的问题。传统方法通常以预测性能为代理指标,权衡基于群体属性的公平性约束(如人口均等或机会平等),但未建模预测如何转化为实际决策,并进一步影响不同社会群体的效用分配。论文提出了一种多利益相关方(multi-stakeholder)框架,基于福利经济学和分配正义理论,显式建模DM与DS的效用,并通过社会计划者的效用函数定义公平性(如功利主义、罗尔斯主义等)。其核心解决方案是将公平决策建模为后验多目标优化问题,在DM效用与社会计划者效用构成的二维效用空间中刻画可行的性能-公平权衡边界,同时比较确定性策略与随机策略在不同决策政策类别下的最优性条件。研究表明,在特定效用条件下,随机策略能通过利用结果不确定性实现更优的性能-公平平衡,从而推动从“预测导向”向“透明、正义驱动、多方协作”的决策公平范式转变。
链接: https://arxiv.org/abs/2604.14035
作者: Kavya Gupta,Nektarios Kalampalikis,Christoph Heitz,Isabel Valera
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 31 pages, 15 figures, to be published in FAccT 26
Abstract:Fairness in algorithmic decision-making is often defined in the predictive space, where predictive performance - used as a proxy for decision-maker (DM) utility - is traded off against prediction-based fairness notions, such as demographic parity or equality of opportunity. This perspective, however, ignores how predictions translate into decisions and ultimately into utilities and welfare for both DM and decision subjects (DS), as well as their allocation across social-salient groups. In this paper, we propose a multi-stakeholder framework for fair algorithmic decision-making grounded in welfare economics and distributive justice, explicitly modeling the utilities of both the DM and DS, and defining fairness via a social planner’s utility that captures inequalities in DS utilities across groups under different justice-based fairness notions (e.g., Egalitarian, Rawlsian). We formulate fair decision-making as a post-hoc multi-objective optimization problem, characterizing the achievable performance-fairness trade-offs in the two-dimensional utility space of DM utility and the social planner’s utility, under different decision policy classes (deterministic vs. stochastic, shared vs. group-specific). Using the proposed framework, we then identify conditions (in terms of the stakeholders’ utilities) under which stochastic policies are more optimal than deterministic ones, and empirically demonstrate that simple stochastic policies can yield superior performance-fairness trade-offs by leveraging outcome uncertainty. Overall, we advocate a shift from prediction-centric fairness to a transparent, justice-based, multi-stakeholder approach that supports the collaborative design of decision-making policies. Comments: 31 pages, 15 figures, to be published in FAccT 26 Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2604.14035 [cs.LG] (or arXiv:2604.14035v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2604.14035 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Related DOI: https://doi.org/10.1145/3805689.3806541 Focus to learn more DOI(s) linking to related resources
[AI-4] Hierarchical Reinforcement Learning with Runtime Safety Shielding for Power Grid Operation
【速读】:该论文旨在解决生成式 AI (Generative AI) 在电力系统运行控制中部署受限的问题,具体包括:严格的安全要求、在罕见扰动下的脆弱性以及对未见过的电网拓扑结构泛化能力差等挑战。解决方案的关键在于提出一种安全约束的分层控制框架,通过显式解耦长期决策与实时可行性保障:高层采用强化学习策略生成抽象控制动作,而运行时则由确定性的安全屏障(safety shield)利用快速前向仿真过滤不安全动作,从而在不依赖策略质量或训练分布的前提下,将安全性作为运行时不变量强制执行。该架构实现了更长的episode生存时间、更低的线路峰值负载,并在未重新训练的情况下成功零样本迁移至大规模未知电网,验证了通过架构设计而非复杂奖励工程可实现更好的安全性和泛化能力。
链接: https://arxiv.org/abs/2604.14032
作者: Gitesh Malik
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 10 pages, 2 figures
Abstract:Reinforcement learning has shown promise for automating power-grid operation tasks such as topology control and congestion management. However, its deployment in real-world power systems remains limited by strict safety requirements, brittleness under rare disturbances, and poor generalization to unseen grid topologies. In safety-critical infrastructure, catastrophic failures cannot be tolerated, and learning-based controllers must operate within hard physical constraints. This paper proposes a safety-constrained hierarchical control framework for power-grid operation that explicitly decouples long-horizon decision-making from real-time feasibility enforcement. A high-level reinforcement learning policy proposes abstract control actions, while a deterministic runtime safety shield filters unsafe actions using fast forward simulation. Safety is enforced as a runtime invariant, independent of policy quality or training distribution. The proposed framework is evaluated on the Grid2Op benchmark suite under nominal conditions, forced line-outage stress tests, and zero-shot deployment on the ICAPS 2021 large-scale transmission grid without retraining. Results show that flat reinforcement learning policies are brittle under stress, while safety-only methods are overly conservative. In contrast, the proposed hierarchical and safety-aware approach achieves longer episode survival, lower peak line loading, and robust zero-shot generalization to unseen grids. These results indicate that safety and generalization in power-grid control are best achieved through architectural design rather than increasingly complex reward engineering, providing a practical path toward deployable learning-based controllers for real-world energy systems. Comments: 10 pages, 2 figures Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2604.14032 [cs.AI] (or arXiv:2604.14032v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2604.14032 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Gitesh Malik [view email] [v1] Wed, 15 Apr 2026 16:11:10 UTC (142 KB)
[AI-5] MAny: Merge Anything for Multimodal Continual Instruction Tuning
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在持续指令微调(Multimodal Continual Instruction Tuning, MCIT)过程中面临的灾难性遗忘问题,尤其揭示并应对感知漂移(perception drift)与推理崩溃(reasoning collapse)的双重遗忘现象——前者发生在跨模态投影空间中视觉表征的失准,后者则源于低秩参数空间中的任务间干扰。解决方案的关键在于提出MAny框架,通过两个核心模块实现无需训练的知识融合:其一是跨模态投影合并(Cross-modal Projection Merging, CPM),利用视觉原型引导自适应融合视觉特征以恢复感知对齐;其二是低秩参数合并(Low-rank Parameter Merging, LPM),基于递归最小二乘法递归合并低秩权重矩阵,提供数学上最优的融合轨迹以保障推理稳定性。MAny采用纯CPU上的代数运算完成知识合并,无需额外梯度优化,显著提升了多模态连续学习的效率与鲁棒性。
链接: https://arxiv.org/abs/2604.14016
作者: Zijian Gao,Wangwang Jia,Xingxing Zhang,Pengfei Qian,Tao Sun,Bo Ding,Yong Dou,Huaimin Wang,Kele Xu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Multimodal Continual Instruction Tuning (MCIT) is essential for sequential task adaptation of Multimodal Large Language Models (MLLMs) but is severely restricted by catastrophic forgetting. While existing literature focuses on the reasoning language backbone, in this work, we expose a critical yet neglected dual-forgetting phenomenon across both perception drift in Cross-modal Projection Space and reasoning collapse in Low-rank Parameter Space. To resolve this, we present \textbfMAny (\textbfMerge \textbfAnything), a framework that merges task-specific knowledge through \textbfCross-modal \textbfProjection \textbfMerging (\textbfCPM) and \textbfLow-rank \textbfParameter \textbfMerging (\textbfLPM). Specifically, CPM recovers perceptual alignment by adaptively merging cross-modal visual representations via visual-prototype guidance, ensuring accurate feature recovery during inference. Simultaneously, LPM eliminates mutual interference among task-specific low-rank modules by recursively merging low-rank weight matrices. By leveraging recursive least squares, LPM provides a closed-form solution that mathematically guarantees an optimal fusion trajectory for reasoning stability. Notably, MAny operates as a training-free paradigm that achieves knowledge merging via efficient CPU-based algebraic operations, eliminating additional gradient-based optimization beyond initial tuning. Our extensive evaluations confirm the superior performance and robustness of MAny across multiple MLLMs and benchmarks. Specifically, on the UCIT benchmark, MAny achieves significant leads of up to 8.57% and 2.85% in final average accuracy over state-of-the-art methods across two different MLLMs, respectively.
[AI-6] [Emerging Ideas] Artificial Tripartite Intelligence: A Bio-Inspired Sensor-First Architecture for Physical AI
【速读】:该论文旨在解决物理人工智能(Physical AI)在资源受限设备(如机器人和可穿戴设备)中面临的性能瓶颈问题,即单纯依赖大规模模型扩展已无法满足实时性、能耗、隐私与可靠性等约束条件。其核心挑战在于:物理AI的性能不仅取决于模型容量,还高度依赖于动态环境中可控传感器对信号的获取质量。解决方案的关键是提出人工三元智能(Artificial Tripartite Intelligence, ATI)架构——一种生物启发的“传感器优先”系统设计,将系统划分为三个层级:L1脑干(Brainstem)负责反射式安全与信号完整性控制,L2小脑(Cerebellum)实现持续传感器校准,L3/L4大脑推理子系统支持技能选择、协调与深度推理。这种模块化结构使传感控制、自适应感知、边缘-云协同执行与基础模型推理能够在闭环架构中协同演进,同时确保关键传感与控制任务本地化执行,仅在必要时调用高层推理,从而显著提升端到端准确率并降低云端计算负载。
链接: https://arxiv.org/abs/2604.13959
作者: You Rim Choi,Subeom Park,Hyung-Sin Kim
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:As AI moves from data centers to robots and wearables, scaling ever-larger models becomes insufficient. Physical AI operates under tight latency, energy, privacy, and reliability constraints, and its performance depends not only on model capacity but also on how signals are acquired through controllable sensors in dynamic environments. We present Artificial Tripartite Intelligence (ATI), a bio-inspired, sensor-first architectural contract for physical AI. ATI is tripartite at the systems level: a Brainstem (L1) provides reflexive safety and signal-integrity control, a Cerebellum (L2) performs continuous sensor calibration, and a Cerebral Inference Subsystem spanning L3/L4 supports routine skill selection and execution, coordination, and deep reasoning. This modular organization allows sensor control, adaptive sensing, edge-cloud execution, and foundation model reasoning to co-evolve within one closed-loop architecture, while keeping time-critical sensing and control on device and invoking higher-level inference only when needed. We instantiate ATI in a mobile camera prototype under dynamic lighting and motion. In our routed evaluation (L3-L4 split inference), compared to the default auto-exposure setting, ATI (L1/L2 adaptive sensing) improves end-to-end accuracy from 53.8% to 88% while reducing remote L4 invocations by 43.3%. These results show the value of co-designing sensing and inference for embodied AI.
[AI-7] HINTBench: Horizon-agent Intrinsic Non-attack Trajectory Benchmark
【速读】:该论文旨在解决代理系统(agent)在无外部攻击的良性条件下仍可能因内在故障(intrinsic failure)引发高后果风险的问题,即“内在风险”(intrinsic risk)的评估难题。现有研究多聚焦于外部诱导风险,而忽视了内在失败在长时程执行中潜伏、累积并最终导致严重后果的特性。解决方案的关键在于提出一种非攻击性内在风险审计方法(non-attack intrinsic risk auditing),并构建了HINTBench基准数据集,包含629条代理轨迹(523条有风险,106条安全,平均33步),支持风险检测、风险步骤定位和内在失败类型识别三项任务,其标注基于统一的五约束分类体系。实验表明,尽管大语言模型(LLM)在轨迹级风险检测上表现良好,但在细粒度的风险步骤定位(Strict-F1 < 35)和失败类型诊断上能力显著不足,且现有防护模型在此场景下迁移效果差,从而确立了内在风险审计作为代理安全领域的一项开放挑战。
链接: https://arxiv.org/abs/2604.13954
作者: Jiacheng Wang,Jinchang Hou,Fabian Wang,Ping Jian,Chenfu Bao,Zhonghou Lv
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Existing agent-safety evaluation has focused mainly on externally induced risks. Yet agents may still enter unsafe trajectories under benign conditions. We study this complementary but underexplored setting through the lens of \emphintrinsic risk, where intrinsic failures remain latent, propagate across long-horizon execution, and eventually lead to high-consequence outcomes. To evaluate this setting, we introduce \emphnon-attack intrinsic risk auditing and present \textbfHINTBench, a benchmark of 629 agent trajectories (523 risky, 106 safe; 33 steps on average) supporting three tasks: risk detection, risk-step localization, and intrinsic failure-type identification. Its annotations are organized under a unified five-constraint taxonomy. Experiments reveal a substantial capability gap: strong LLMs perform well on trajectory-level risk detection, but their performance drops to below 35 Strict-F1 on risk-step localization, while fine-grained failure diagnosis proves even harder. Existing guard models transfer poorly to this setting. These findings establish intrinsic risk auditing as an open challenge for agent safety.
[AI-8] AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot
【速读】:该论文旨在解决科学同行评审(peer review)在投稿量激增背景下所面临的质量、一致性与时效性难以保障的问题。其核心解决方案在于首次实现了大规模实地部署的AI辅助同行评审:在AAAI-26会议中,所有主赛道投稿(共22,977篇)均获得了一个明确标识的AI生成评审意见。该方案的关键在于整合前沿大语言模型(Large Language Models, LLMs)、工具调用(tool use)和多重安全机制,在多阶段流程中实现高效、高质量的评审生成,且实证表明AI评审在技术准确性与研究建议等方面优于人工评审,验证了当前先进AI方法已具备在会议规模上对科研评审产生实质性贡献的能力。
链接: https://arxiv.org/abs/2604.13940
作者: Joydeep Biswas,Sheila Schoepp,Gautham Vasan,Anthony Opipari,Arthur Zhang,Zichao Hu,Sebastian Joseph,Matthew Lease,Junyi Jessy Li,Peter Stone,Kiri L. Wagstaff,Matthew E. Taylor,Odest Chadwicke Jenkins
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Scientific peer review faces mounting strain as submission volumes surge, making it increasingly difficult to sustain review quality, consistency, and timeliness. Recent advances in AI have led the community to consider its use in peer review, yet a key unresolved question is whether AI can generate technically sound reviews at real-world conference scale. Here we report the first large-scale field deployment of AI-assisted peer review: every main-track submission at AAAI-26 received one clearly identified AI review from a state-of-the-art system. The system combined frontier models, tool use, and safeguards in a multi-stage process to generate reviews for all 22,977 full-review papers in less than a day. A large-scale survey of AAAI-26 authors and program committee members showed that participants not only found AI reviews useful, but actually preferred them to human reviews on key dimensions such as technical accuracy and research suggestions. We also introduce a novel benchmark and find that our system substantially outperforms a simple LLM-generated review baseline at detecting a variety of scientific weaknesses. Together, these results show that state-of-the-art AI methods can already make meaningful contributions to scientific peer review at conference scale, opening a path toward the next generation of synergistic human-AI teaming for evaluating research.
[AI-9] Beyond Conservative Automated Driving in Multi-Agent Scenarios via Coupled Model Predictive Control and Deep Reinforcement Learning
【速读】:该论文旨在解决无信号交叉路口自动驾驶中多车交互复杂、安全与效率难以平衡的问题。其解决方案的关键在于提出一种集成模型预测控制(Model Predictive Control, MPC)与深度强化学习(Deep Reinforcement Learning, RL)的混合框架(MPC-RL),通过MPC提供结构化的约束处理能力以保障安全性,同时利用RL从经验中学习自适应行为以提升效率,并借助MPC的物理可解释性增强跨场景泛化能力。实验表明,该框架在不同交通密度下均优于独立的MPC和端到端RL方法,在碰撞率降低21%的同时成功率达6.5%以上,且无需重新训练即可实现零样本迁移至高速汇入场景,验证了MPC组件对鲁棒性和训练稳定性的关键作用。
链接: https://arxiv.org/abs/2604.13891
作者: Saeed Rahmani,Gözde Körpe,Zhenlin(Gavin)Xu,Bruno Brito,Simeon Craig Calvert,Bart van Arem
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: This work has been submitted to the IEEE for possible publication
Abstract:Automated driving at unsignalized intersections is challenging due to complex multi-vehicle interactions and the need to balance safety and efficiency. Model Predictive Control (MPC) offers structured constraint handling through optimization but relies on hand-crafted rules that often produce overly conservative behavior. Deep Reinforcement Learning (RL) learns adaptive behaviors from experience but often struggles with safety assurance and generalization to unseen environments. In this study, we present an integrated MPC-RL framework to improve navigation performance in multi-agent scenarios. Experiments show that MPC-RL outperforms standalone MPC and end-to-end RL across three traffic-density levels. Collectively, MPC-RL reduces the collision rate by 21% and improves the success rate by 6.5% compared to pure MPC. We further evaluate zero-shot transfer to a highway merging scenario without retraining. Both MPC-based methods transfer substantially better than end-to-end PPO, which highlights the role of the MPC backbone in cross-scenario robustness. The framework also shows faster loss stabilization than end-to-end RL during training, which indicates a reduced learning burden. These results suggest that the integrated approach can improve the balance between safety performance and efficiency in multi-agent intersection scenarios, while the MPC component provides a strong foundation for generalization across driving environments. The implementation code is available open-source.
[AI-10] GeoAgent Bench: A Dynamic Execution Benchmark for Tool-Augmented Agents in Spatial Analysis
【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的地理信息系统(Geographic Information Systems, GIS)代理在动态空间分析任务中评估困难的问题,尤其针对现有基准测试依赖静态文本或代码匹配、忽视运行时反馈及空间输出多模态特性所导致的评价失真。其解决方案的关键在于提出GeoAgentBench(GABench),一个面向工具增强型GIS代理的动态交互式评估基准,包含117个原子GIS工具和53项典型空间分析任务;并创新性地设计了“参数执行准确率”(Parameter Execution Accuracy, PEA)指标以量化隐式参数推理的保真度,结合视觉-语言模型(Vision-Language Model, VLM)进行数据空间精度与制图风格一致性验证;同时构建了Plan-and-React代理架构,通过解耦全局规划与逐步反应式执行来模拟专家认知流程,显著提升多步骤推理与错误恢复能力,在逻辑严谨性与执行鲁棒性之间实现最优平衡。
链接: https://arxiv.org/abs/2604.13888
作者: Bo Yu,Cheng Yang,Dongyang Hou,Chengfu Liu,Jiayao Liu,Chi Wang,Zhiming Zhang,Haifeng Li,Wentao Yang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 20 pages, 3 figures, 6 tables
Abstract:The integration of Large Language Models (LLMs) into Geographic Information Systems (GIS) marks a paradigm shift toward autonomous spatial analysis. However, evaluating these LLM-based agents remains challenging due to the complex, multi-step nature of geospatial workflows. Existing benchmarks primarily rely on static text or code matching, neglecting dynamic runtime feedback and the multimodal nature of spatial outputs. To address this gap, we introduce GeoAgentBench (GABench), a dynamic and interactive evaluation benchmark tailored for tool-augmented GIS agents. GABench provides a realistic execution sandbox integrating 117 atomic GIS tools, encompassing 53 typical spatial analysis tasks across 6 core GIS domains. Recognizing that precise parameter configuration is the primary determinant of execution success in dynamic GIS environments, we designed the Parameter Execution Accuracy (PEA) metric, which utilizes a “Last-Attempt Alignment” strategy to quantify the fidelity of implicit parameter inference. Complementing this, a Vision-Language Model (VLM) based verification is proposed to assess data-spatial accuracy and cartographic style adherence. Furthermore, to address the frequent task failures caused by parameter misalignments and runtime anomalies, we developed a novel agent architecture, Plan-and-React, that mimics expert cognitive workflows by decoupling global orchestration from step-wise reactive execution. Extensive experiments with seven representative LLMs demonstrate that the Plan-and-React paradigm significantly outperforms traditional frameworks, achieving the optimal balance between logical rigor and execution robustness, particularly in multi-step reasoning and error recovery. Our findings highlight current capability boundaries and establish a robust standard for assessing and advancing the next generation of autonomous GeoAI.
[AI-11] Evaluating Supervised Machine Learning Models: Principles Pitfalls and Metric Selection
【速读】:该论文旨在解决监督学习模型评估过程中因过度依赖少数汇总指标而导致对实际性能产生误导性结论的问题。其解决方案的关键在于将模型评估视为一个以决策为导向、情境依赖的过程,强调根据任务的运营目标选择合适的性能指标与验证协议,并系统性地考虑数据集特性、验证设计、类别不平衡、误差成本不对称等因素对评估结果的影响,从而构建统计上稳健、鲁棒且可信的监督学习系统。
链接: https://arxiv.org/abs/2604.13882
作者: Xuanyan Liu,Ignacio Cabrera Martin,Marcello Trovati,Xiaolong Xu,Nikolaos Polatidis
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:The evaluation of supervised machine learning models is a critical stage in the development of reliable predictive systems. Despite the widespread availability of machine learning libraries and automated workflows, model assessment is often reduced to the reporting of a small set of aggregate metrics, which can lead to misleading conclusions about real-world performance. This paper examines the principles, challenges, and practical considerations involved in evaluating supervised learning algorithms across classification and regression tasks. In particular, it discusses how evaluation outcomes are influenced by dataset characteristics, validation design, class imbalance, asymmetric error costs, and the choice of performance metrics. Through a series of controlled experimental scenarios using diverse benchmark datasets, the study highlights common pitfalls such as the accuracy paradox, data leakage, inappropriate metric selection, and overreliance on scalar summary measures. The paper also compares alternative validation strategies and emphasizes the importance of aligning model evaluation with the intended operational objective of the task. By presenting evaluation as a decision-oriented and context-dependent process, this work provides a structured foundation for selecting metrics and validation protocols that support statistically sound, robust, and trustworthy supervised machine learning systems.
[AI-12] MCPThreatHive: Automated Threat Intelligence for Model Context Protocol Ecosystems
【速读】:该论文旨在解决基于Model Context Protocol (MCP)的智能体系统(agentic systems)所面临的新一代安全威胁问题,这些问题超出了现有安全框架的应对能力。解决方案的关键在于提出并实现MCPThreatHive——一个开源平台,其核心创新包括:自动化端到端的威胁情报生命周期管理,涵盖多源数据持续采集、AI驱动的威胁提取与分类、结构化知识图谱存储及交互式可视化;同时,该平台实现了MCP-38威胁分类体系,将38种MCP特有攻击模式映射至STRIDE、LLM应用OWASP Top 10及智能体应用OWASP Top 10,结合复合风险评分模型实现定量优先级排序,从而填补了现有工具在组合攻击建模完整性、持续威胁情报更新和统一多框架分类方面的三大关键空白。
链接: https://arxiv.org/abs/2604.13849
作者: Yi Ting Shen,Kentaroh Toyoda,Alex Leung
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: A white paper of our presentation at DEFCON SG 2026 (Demo Labs) this https URL
Abstract:The rapid proliferation of Model Context Protocol (MCP)-based agentic systems has introduced a new category of security threats that existing frameworks are inadequately equipped to address. We present MCPThreatHive, an open-source platform that automates the end-to-end lifecycle of MCP threat intelligence: from continuous, multi-source data collection through AI-driven threat extraction and classification, to structured knowledge graph storage and interactive visualization. The platform operationalizes the MCP-38 threat taxonomy, a curated set of 38 MCP-specific threat patterns mapped to STRIDE, OWASP Top 10 for LLM Applications, and OWASP Top 10 for Agentic Applications. A composite risk scoring model provides quantitative prioritization. Through a comparative analysis of representative existing MCP security tools, we identify three critical coverage gaps that MCPThreatHive addresses: incomplete compositional attack modeling, absence of continuous threat intelligence, and lack of unified multi-framework classification.
[AI-13] SparseBalance: Load-Balanced Long Context Training with Dynamic Sparse Attention
【速读】:该论文旨在解决稀疏注意力(sparse attention)在长序列大语言模型(LLM)训练中因序列长度异构性和稀疏敏感性差异导致的分布式训练负载不均衡问题,从而影响模型精度和系统效率。其解决方案的关键在于提出一种算法-系统协同设计框架 SparseBalance:首先通过工作负载感知的动态稀疏度调节机制,采用双向稀疏调整策略消除慢节点(stragglers)并利用固有“气泡”(bubbles)提升精度;其次设计稀疏度感知的批处理策略实现粗粒度负载平衡,与动态稀疏调节形成互补,最终在 LongBench 基准上实现最高 1.33× 的端到端加速同时提升 0.46% 的长文本建模能力。
链接: https://arxiv.org/abs/2604.13847
作者: Hongtao Xu,Jianchao Tan,Yuxuan Hu,Pengju Lu,Hongyu Wang,Pingwei Sun,Yerui Sun,Yuchen Xie,Xunliang Cai,Mingzhen Li,Weile Jia
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:While sparse attention mitigates the computational bottleneck of long-context LLM training, its distributed training process exhibits extreme heterogeneity in both \textit1) sequence length and \textit2) sparsity sensitivity, leading to a severe imbalance problem and sub-optimal model accuracy. Existing algorithms and training frameworks typically focus on single issue, failing to systematically co-optimize these two problems. Therefore, we propose SparseBalance, a novel algorithm-system co-design framework, which exploits the sparsity and sequence heterogeneity to optimize model accuracy and system efficiency jointly. First, we propose workload-aware dynamic sparsity tuning, which employs a bidirectional sparsity adjustment to eliminate stragglers and exploit inherent bubbles for free accuracy. Second, we propose a sparsity-aware batching strategy to achieve coarse-grained balance, which complements dynamic sparsity tuning. Experimental results demonstrate that SparseBalance achieves up to a 1.33 \times end-to-end speedup while still improving the long-context capability by 0.46% on the LongBench benchmark.
[AI-14] Sentiment analysis for software engineering: How far can zero-shot learning (ZSL) go?
【速读】:该论文旨在解决软件工程领域情感分析中因标注数据稀缺而导致的模型训练困难问题。其解决方案的关键在于引入零样本学习(Zero-Shot Learning, ZSL)技术,通过利用专家标注的标签与嵌入式(embedding-based)或生成式(generative-based)ZSL模型相结合的方法,在无需大量标注数据的情况下实现与微调后的Transformer模型相当的情感分类性能,从而有效缓解标注数据获取难的问题。
链接: https://arxiv.org/abs/2604.13826
作者: Reem Alfayez,Manal Binkhonain
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Sentiment analysis in software engineering focuses on understanding emotions expressed in software artifacts. Previous research highlighted the limitations of applying general off-the-shelf sentiment analysis tools within the software engineering domain and indicated the need for specialized tools tailored to various software engineering contexts. The development of such tools heavily relies on supervised machine learning techniques that necessitate annotated datasets. Acquiring such datasets is a substantial challenge, as it requires domain-specific expertise and significant effort. Objective: This study explores the potential of ZSL to address the scarcity of annotated datasets in sentiment analysis within software engineering Method: We conducted an empirical experiment to evaluate the performance of various ZSL techniques, including embedding-based, NLI-based, TARS-based, and generative-based ZSL techniques. We assessed the performance of these techniques under different labels setups to examine the impact of label configurations. Additionally, we compared the results of the ZSL techniques with state-of-the-art fine-tuned transformer-based models. Finally, we performed an error analysis to identify the primary causes of misclassifications. Results: Our findings demonstrate that ZSL techniques, particularly those combining expert-curated labels with embedding-based or generative-based models, can achieve macro-F1 scores comparable to fine-tuned transformer-based models. The error analysis revealed that subjectivity in annotation and polar facts are the main contributors to ZSL misclassifications. Conclusion: This study demonstrates the potential of ZSL for sentiment analysis in software engineering. ZSL can provide a solution to the challenge of annotated dataset scarcity by reducing reliance on annotated dataset.
[AI-15] AlphaCNOT: Learning CNOT Minimization with Model-Based Planning
【速读】:该论文旨在解决量子电路优化中的CNOT门最小化问题,这是当前噪声中等规模量子(Noisy Intermediate Scale Quantum, NISQ)设备面临的关键挑战之一,因为错误传播往往随操作数量增加而加剧。CNOT门作为通用Clifford+T门集中的唯一两量子比特门,在量子线路实现中具有基础性地位。传统方法如 Patel-Markov-Hayes (PMH) 算法适用于无拓扑约束的线性可逆合成场景,而近年来基于强化学习(Reinforcement Learning, RL)的方法则尝试处理更复杂的拓扑感知合成问题。本文提出 AlphaCNOT,一个基于蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)的模型驱动型强化学习框架,将 CNOT 最小化建模为规划问题,其关键创新在于利用 lookahead 搜索机制评估未来轨迹,从而发现更高效的 CNOT 序列。实验表明,AlphaCNOT 在线性可逆合成任务上相比 PMH 基线减少最多 32% 的 CNOT 门数,并在多种拓扑结构下相较于最先进的 RL 方法实现了稳定的门数削减,验证了 RL 与搜索策略结合在量子电路优化中的有效性。
链接: https://arxiv.org/abs/2604.13812
作者: Jacopo Cossio,Daniele Lizzio Bosco,Riccardo Romanello,Giuseppe Serra,Carla Piazza
机构: 未知
类目: Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
备注: 22 pages, 11 figures , journal
Abstract:Quantum circuit optimization is a central task in Quantum Computing, as current Noisy Intermediate Scale Quantum devices suffer from error propagation that often scales with the number of operations. Among quantum operations, the CNOT gate is of fundamental importance, being the only 2-qubit gate in the universal Clifford+T set. The problem of CNOT gates minimization has been addressed by heuristic algorithms such as the well-known Patel-Markov-Hayes (PMH) for linear reversible synthesis (i.e., CNOT minimization with no topological constraints), and more recently by Reinforcement Learning (RL) based strategies in the more complex case of topology-aware synthesis, where each CNOT can act on a subset of all qubits pairs. In this work we introduce AlphaCNOT, a RL framework based on Monte Carlo Tree Search (MCTS) that address effectively the CNOT minimization problem by modeling it as a planning problem. In contrast to other RL- based solution, our method is model-based, i.e. it can leverage lookahead search to evaluate future trajectories, thus finding more efficient sequences of CNOTs. Our method achieves a reduction of up to 32% in CNOT gate count compared to PMH baseline on linear reversible synthesis, while in the constraint version we report a consistent gate count reduction on a variety of topologies with up to 8 qubits, with respect to state-of-the-art RL-based solutions. Our results suggest the combination of RL with search-based strategies can be applied to different circuit optimization tasks, such as Clifford minimization, thus fostering the transition toward the “quantum utility” era.
[AI-16] Soft Q(λ): A multi-step off-policy method for entropy regularised reinforcement learning using eligibility traces
【速读】:该论文旨在解决软Q-learning(Soft Q-learning)在多步扩展方面研究不足的问题,特别是现有方法仅限于基于Boltzmann策略的在线策略动作采样,难以实现高效的信用分配和离线策略学习。其解决方案的关键在于提出一种全新的Soft Tree Backup算子,将软Q-learning框架拓展至完全离线策略场景,并进一步统一为Soft Q(λ) 框架——一个兼具在线性、离线性和资格迹(eligibility trace)能力的通用方法,从而支持任意行为策略下的高效价值函数学习与信用分配。
链接: https://arxiv.org/abs/2604.13780
作者: Pranav Mahajan,Ben Seymour
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Soft Q-learning has emerged as a versatile model-free method for entropy-regularised reinforcement learning, optimising for returns augmented with a penalty on the divergence from a reference policy. Despite its success, the multi-step extensions of soft Q-learning remain relatively unexplored and limited to on-policy action sampling under the Boltzmann policy. In this brief research note, we first present a formal n -step formulation for soft Q-learning and then extend this framework to the fully off-policy case by introducing a novel Soft Tree Backup operator. Finally, we unify these developments into Soft Q(\lambda) , an elegant online, off-policy, eligibility trace framework that allows for efficient credit assignment under arbitrary behaviour policies. Our derivations propose a model-free method for learning entropy-regularised value functions that can be utilised in future empirical experiments.
[AI-17] A Dynamic-Growing Fuzzy-Neuro Controller Application to a 3PSP Parallel Robot
【速读】:该论文旨在解决3PSP并联机器人位置控制中的复杂动态建模与实时稳定性问题,尤其在参数变化环境下保持高响应速度和低计算开销。其解决方案的关键在于提出一种动态生长模糊神经控制器(Dynamic Growing Fuzzy Neural Controller, DGFNC),通过引入自适应策略替代传统修剪机制,实现规则库的保守性增长,从而避免冗余结构;同时结合基于滑模的非线性控制方法以确保系统全局稳定性,最终在保证控制精度的同时提升响应速度并降低计算复杂度。
链接: https://arxiv.org/abs/2604.13763
作者: Mohsen Jalaeian-Farimani,Mohammad-R Akbarzadeh-T,Alireza Akbarzadeh,Mostafa Ghaemi
机构: 未知
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Robotics (cs.RO)
备注: 2012 IEEE International Conference on Fuzzy Systems
Abstract:To date, various paradigms of soft-Computing have been used to solve many modern problems. Among them, a self organizing combination of fuzzy systems and neural networks can make a powerful decision making system. Here, a Dynamic Growing Fuzzy Neural Controller (DGFNC) is combined with an adaptive strategy and applied to a 3PSP parallel robot position control problem. Specifically, the dynamic growing mechanism is considered in more detail. In contrast to other self-organizing methods, DGFNC adds new rules more conservatively; hence the pruning mechanism is omitted. Instead, the adaptive strategy ‘adapts’ the control system to parameter variation. Furthermore, a sliding mode-based nonlinear controller ensures system stability. The resulting general control strategy aims to achieve faster response with less computation while maintaining overall stability. Finally, the 3PSP is chosen due to its complex dynamics and the utility of such approaches in modern industrial systems. Several simulations support the merits of the proposed DGFNC strategy as applied to the 3PSP robot.
[AI-18] he cognitive companion: a lightweight parallel monitoring architecture for detecting and recovering from reasoning degradation in LLM agents
【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)在多步骤任务中出现的推理退化、循环、漂移和卡顿状态等问题,这些问题在复杂任务中的发生率可高达30%。现有解决方案存在局限性:硬性步数限制会导致任务中断,而基于LLM作为裁判的监控机制则带来每步10%-15%的计算开销。论文提出“认知伴侣”(Cognitive Companion)这一并行监控架构,其核心创新在于引入两种实现方式——基于LLM的伴侣和一种零开销探测器(Probe-based Companion)。关键突破在于通过分析模型隐藏状态(如第28层)设计轻量级探测器,在不增加推理延迟的前提下显著减少重复行为(降低52-62%),并在小规模代理标注数据上达到AUROC 0.840的检测性能,同时揭示了同伴干预的效果具有任务类型依赖性,尤其在循环倾向性和开放式任务中最为有效,从而为未来选择性激活机制提供了理论依据。
链接: https://arxiv.org/abs/2604.13759
作者: Rafflesia Khan,Nafiul Islam Khan
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Large language model (LLM) agents on multi-step tasks suffer reasoning degradation, looping, drift, stuck states, at rates up to 30% on hard tasks. Current solutions include hard step limits (abrupt) or LLM-as-judge monitoring (10-15% overhead per step). This paper introduces the Cognitive Companion, a parallel monitoring architecture with two implementations: an LLM-based Companion and a novel zero-overhead Probe-based Companion. We report a three-batch feasibility study centered on Gemma 4 E4B, with an additional exploratory small-model analysis on Qwen 2.5 1.5B and Llama 3.2 1B. In our experiments, the LLM-based Companion reduced repetition on loop-prone tasks by 52-62% with approximately 11% overhead. The Probe-based Companion, trained on hidden states from layer 28, showed a mean effect size of +0.471 at zero measured inference overhead; its strongest probe result achieved cross-validated AUROC 0.840 on a small proxy-labeled dataset. A key empirical finding is that companion benefit appears task-type dependent: companions are most helpful on loop-prone and open-ended tasks, while effects are neutral or negative on more structured tasks. Our small-model experiments also suggest a possible scale boundary: companions did not improve the measured quality proxy on 1B-1.5B models, even when interventions fired. Overall, the paper should be read as a feasibility study rather than a definitive validation. The results provide encouraging evidence that sub-token monitoring may be useful, identify task-type sensitivity as a practical design constraint, and motivate selective companion activation as a promising direction for future work.
[AI-19] Jump-Start Reinforcement Learning with Vision-Language-Action Regularization
【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)在长时程任务中因稀疏或不完美奖励信号导致的探索效率低和信用分配差的问题,同时克服视觉-语言-动作(Vision-Language-Action, VLA)模型在高频、高精度机器人操作中直接应用的局限性。解决方案的关键在于提出一种名为“视觉-语言-动作跳转启动”(Vision-Language-Action Jump-Starting, VLAJS)的方法,其核心是将VLA作为短期高阶动作建议源,在训练初期引导探索并改善信用分配,同时保留RL的高频状态控制能力;具体通过在近端策略优化(Proximal Policy Optimization, PPO)中引入方向性动作一致性正则化(directional action-consistency regularization),软性对齐RL代理的动作与VLA指导,无需严格模仿、演示或持续教师查询,且VLA指导随时间稀疏化并逐渐退火,使代理能够在线适应并最终超越引导策略。
链接: https://arxiv.org/abs/2604.13733
作者: Angelo Moroncelli,Roberto Zanetti,Marco Maccarini,Loris Roveda
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
Abstract:Reinforcement learning (RL) enables high-frequency, closed-loop control for robotic manipulation, but scaling to long-horizon tasks with sparse or imperfect rewards remains difficult due to inefficient exploration and poor credit assignment. Vision-Language-Action (VLA) models leverage large-scale multimodal pretraining to provide generalist, task-level reasoning, but current limitations hinder their direct use in fast and precise manipulation. In this paper, we propose Vision-Language-Action Jump-Starting (VLAJS), a method that bridges sparse VLA guidance with on-policy RL to improve exploration and learning efficiency. VLAJS treats VLAs as transient sources of high-level action suggestions that bias early exploration and improve credit assignment, while preserving the high-frequency, state-based control of RL. Our approach augments Proximal Policy Optimization (PPO) with a directional action-consistency regularization that softly aligns the RL agent’s actions with VLA guidance during early training, without enforcing strict imitation, requiring demonstrations, or relying on continuous teacher queries. VLA guidance is applied sparsely and annealed over time, allowing the agent to adapt online and ultimately surpass the guiding policy. We evaluate VLAJS on six challenging manipulation tasks: lifting, pick-and-place, peg reorientation, peg insertion, poking, and pushing in simulation, and validate a subset on a real Franka Panda robot. VLAJS consistently outperforms PPO and distillation-style baselines in sample efficiency, reducing required environment interactions by over 50% in several tasks. Real-world experiments demonstrate zero-shot sim-to-real transfer and robust execution under clutter, object variation, and external perturbations.
[AI-20] owards Fine-grained Temporal Perception: Post-Training Large Audio-Language Models with Audio-Side Time Prompt INTERSPEECH2026
【速读】:该论文旨在解决大型音频语言模型(Large Audio-Language Models, LALMs)在时间感知(temporal perception)方面的局限性,尤其是在细粒度音频任务中难以准确推断事件起始与结束时间的问题。解决方案的关键在于提出一种名为TimePro-RL的框架,其核心包括两个创新:一是引入音频侧时间提示(Audio-Side Time Prompt),将时间戳编码为嵌入并交错插入音频特征序列中作为时间坐标以引导模型;二是基于监督微调(Supervised Fine-Tuning, SFT)后采用强化学习(Reinforcement Learning, RL)直接优化时间对齐性能,从而显著提升模型在音频定位、声音事件检测和密集音频描述等任务中的表现。
链接: https://arxiv.org/abs/2604.13715
作者: Yanfeng Shi,Pengfei Cai,Jun Liu,Qing Gu,Nan Jiang,Lirong Dai,Ian McLoughlin,Yan Song
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注: Submitted to Interspeech 2026
Abstract:Large Audio-Language Models (LALMs) enable general audio understanding and demonstrate remarkable performance across various audio tasks. However, these models still face challenges in temporal perception (e.g., inferring event onset and offset), leading to limited utility in fine-grained scenarios. To address this issue, we propose Audio-Side Time Prompt and leverage Reinforcement Learning (RL) to develop the TimePro-RL framework for fine-grained temporal perception. Specifically, we encode timestamps as embeddings and interleave them within the audio feature sequence as temporal coordinates to prompt the model. Furthermore, we introduce RL following Supervised Fine-Tuning (SFT) to directly optimize temporal alignment performance. Experiments demonstrate that TimePro-RL achieves significant performance gains across a range of audio temporal tasks, such as audio grounding, sound event detection, and dense audio captioning, validating its robust effectiveness.
[AI-21] Weight Patching: Toward Source-Level Mechanistic Localization in LLM s
【速读】:该论文旨在解决机制可解释性(mechanistic interpretability)中的关键挑战:即如何准确识别模型内部组件中真正因果性地实现特定行为的模块,而非仅仅在激活空间中表现出重要性的信号聚合或放大单元。现有方法如激活空间定位和因果追踪虽能定位行为相关模块,但无法区分哪些模块是真正编码了目标能力的参数化组件。解决方案的关键在于提出“权重修补”(Weight Patching)这一参数空间干预方法,通过在结构相同但行为表现差异显著的两个模型之间,将目标行为更强的专用模型中选定模块的权重替换到基础模型中,并在固定输入下进行测试,从而直接评估这些模块对行为的因果贡献。该方法结合一个以向量锚点行为接口为核心的分析框架,有效揭示了从浅层候选源模块到聚合路由模块再到下游执行电路的层级结构,同时为机制感知的模型融合提供了指导与外部验证。
链接: https://arxiv.org/abs/2604.13694
作者: Chenghao Sun,Chengsheng Zhang,Guanzheng Qin,Rui Dai,Xinmei Tian
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 36 pages. Submitted to IEEE for possible publication
Abstract:Mechanistic interpretability seeks to localize model behavior to the internal components that causally realize it. Prior work has advanced activation-space localization and causal tracing, but modules that appear important in activation space may merely aggregate or amplify upstream signals rather than encode the target capability in their own parameters. To address this gap, we propose Weight Patching, a parameter-space intervention method for source-oriented analysis in paired same-architecture models that differ in how strongly they express a target capability under the inputs of interest. Given a base model and a behavior-specialized counterpart, Weight Patching replaces selected module weights from the specialized model into the base model under a fixed input. We instantiate the method on instruction following and introduce a framework centered on a vector-anchor behavioral interface that provides a shared internal criterion for whether a task-relevant control state has been formed or recovered in open-ended generation. Under this framework, the analysis reveals a hierarchy from shallow candidate source-side carriers to aggregation and routing modules, and further to downstream execution circuits. The recovered component scores can also guide mechanism-aware model merging, improving selective fusion across the evaluated expert combinations and providing additional external validation.
[AI-22] Automatically Inferring Teachers Geometric Content Knowledge: A Skills Based Approach
【速读】:该论文旨在解决教师几何内容知识(Geometric Content Knowledge)评估难以规模化的问题,传统基于Van Hiele模型的评估依赖人工专家对开放式作答进行分析,效率低且成本高。解决方案的关键在于构建一个基于教育理论的结构化技能词典,将Van Hiele五级推理水平细分为33个精细的推理技能,并结合大型语言模型(Large Language Models, LLMs)开发自动化分类方法。通过检索增强生成(Retrieval-Augmented Generation, RAG)和多任务学习(Multi-Task Learning, MTL)两种策略,引入显式技能信息显著提升了分类准确率,实现了首个从开放式回答中自动诊断教师Van Hiele推理水平的方法,为大规模教师几何推理能力评估和个性化教师发展支持系统提供了可扩展、理论驱动的技术路径。
链接: https://arxiv.org/abs/2604.13666
作者: Ziv Fenigstein,Kobi Gal,Avi Segal,Osama Swidan,Inbal Israel,Hassan Ayoob
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: The work is accepted for publication as a full paper (Main Track) at the 27th International Conference on Artificial Intelligence in Education (AIED 2026)
Abstract:Assessing teachers’ geometric content knowledge is essential for geometry instructional quality and student learning, but difficult to scale. The Van Hiele model characterizes geometric reasoning through five hierarchical levels. Traditional Van Hiele assessment relies on manual expert analysis of open-ended responses. This process is time-consuming, costly, and prevents large-scale evaluation. This study develops an automated approach for diagnosing teachers’ Van Hiele reasoning levels using large language models grounded in educational theory. Our central hypothesis is that integrating explicit skills information significantly improves Van Hiele classification. In collaboration with mathematics education researchers, we built a structured skills dictionary decomposing the Van Hiele levels into 33 fine-grained reasoning skills. Through a custom web platform, 31 pre-service teachers solved geometry problems, yielding 226 responses. Expert researchers then annotated each response with its Van Hiele level and demonstrated skills from the dictionary. Using this annotated dataset, we implemented two classification approaches: (1) retrieval-augmented generation (RAG) and (2) multi-task learning (MTL). Each approach compared a skills-aware variant incorporating the skills dictionary against a baseline without skills information. Results showed that for both methods, skills-aware variants significantly outperformed baselines across multiple evaluation metrics. This work provides the first automated approach for Van Hiele level classification from open-ended responses. It offers a scalable, theory-grounded method for assessing teachers’ geometric reasoning that can enable large-scale evaluation and support adaptive, personalized teacher learning systems.
[AI-23] Ordinary Least Squares is a Special Case of Transformer
【速读】:该论文旨在解决Transformer架构的统计本质问题,即其究竟是通用逼近器还是某种已知计算算法的神经网络形式。研究通过严格的代数证明指出,后者更准确地刻画了Transformer的基本性质:普通最小二乘法(Ordinary Least Squares, OLS)是单层线性Transformer的一个特例。解决方案的关键在于利用经验协方差矩阵的谱分解构造特定参数配置,使得注意力机制的前向传播在数学上等价于OLS的闭式投影,从而表明注意力机制可在一次前向传递中完成求解而非迭代优化。这一发现不仅揭示了Transformer内部存在解耦的慢速与快速记忆机制,还阐明了从线性原型到标准Transformer的演进路径如何将Hopfield能量函数的存储容量从线性提升至指数级,建立了现代深度架构与经典统计推断之间的清晰连续性。
链接: https://arxiv.org/abs/2604.13656
作者: Xiaojun Tan,Yuchen Zhao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistics Theory (math.ST); Machine Learning (stat.ML)
备注:
Abstract:The statistical essence of the Transformer architecture has long remained elusive: Is it a universal approximator, or a neural network version of known computational algorithms? Through rigorous algebraic proof, we show that the latter better describes Transformer’s basic nature: Ordinary Least Squares (OLS) is a special case of the single-layer Linear Transformer. Using the spectral decomposition of the empirical covariance matrix, we construct a specific parameter setting where the attention mechanism’s forward pass becomes mathematically equivalent to the OLS closed-form projection. This means attention can solve the problem in one forward pass, not by iterating. Building upon this prototypical case, we further uncover a decoupled slow and fast memory mechanism within Transformers. Finally, the evolution from our established linear prototype to standard Transformers is discussed. This progression facilitates the transition of the Hopfield energy function from linear to exponential memory capacity, thereby establishing a clear continuity between modern deep architectures and classical statistical inference.
[AI-24] A Mechanistic Analysis of Sim-and-Real Co-Training in Generative Robot Policies
【速读】:该论文旨在解决生成式机器人策略训练中协同训练(co-training)机制不明确的问题,特别是当使用有限的真实域数据与大量替代数据(如仿真或跨机体机器人数据)结合时,其有效性背后的原理尚不清楚。解决方案的关键在于通过理论分析与实证研究识别出两个内在效应:一是“结构化表示对齐”(structured representation alignment),它平衡了跨域表示对齐与域可区分性,是决定下游性能的主要因素;二是“重要性重加权效应”(importance reweighting effect),源于域相关的动作权重调节,在次要层面影响性能。该分析为现有协同训练方法提供了统一解释,并提出一种简单有效的改进策略,显著优于以往方法。
链接: https://arxiv.org/abs/2604.13645
作者: Yu Lei,Minghuan Liu,Abhiram Maddukuri,Zhenyu Jiang,Yuke Zhu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 24 pages, 18 figure. Project page: this https URL
Abstract:Co-training, which combines limited in-domain real-world data with abundant surrogate data such as simulation or cross-embodiment robot data, is widely used for training generative robot policies. Despite its empirical success, the mechanisms that determine when and why co-training is effective remain poorly understood. We investigate the mechanism of sim-and-real co-training through theoretical analysis and empirical study, and identify two intrinsic effects governing performance. The first, \textbfstructured representation alignment", reflects a balance between cross-domain representation alignment and domain discernibility, and plays a primary role in downstream performance. The second, the \textbfimportance reweighting effect", arises from domain-dependent modulation of action weighting and operates at a secondary level. We validate these effects with controlled experiments on a toy model and extensive sim-and-sim and sim-and-real robot manipulation experiments. Our analysis offers a unified interpretation of recent co-training techniques and motivates a simple method that consistently improves upon prior approaches. More broadly, our aim is to examine the inner workings of co-training and to facilitate research in this direction.
[AI-25] SafeHarness: Lifecycle-Integrated Security Architecture for LLM -based Agent Deployment
【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)代理执行环境中“执行 harness”(即协调工具调用、上下文管理和状态持久化的系统层)的安全性问题。现有安全方法因结构不匹配,无法有效监控 harness 内部状态且缺乏跨代理运行阶段的协同防御能力。解决方案的关键在于提出一种名为 \safeharness 的安全架构,其核心是将四个防御层直接嵌入代理生命周期:输入处理阶段的对抗性上下文过滤、决策阶段的分层因果验证、执行阶段的权限分离工具控制以及状态更新阶段的安全回滚与自适应降级。这些防御层通过跨层机制联动,当检测到持续异常时自动提升验证强度、触发回滚并收紧工具权限,从而显著降低攻击成功率和不安全行为率,同时保持任务功能完整性。
链接: https://arxiv.org/abs/2604.13630
作者: Xixun Lin,Yang Liu,Yancheng Chen,Yongxuan Wu,Yucheng Ning,Yilong Liu,Nan Sun,Shun Zhang,Bin Chong,Chuan Zhou,Yanan Cao,Li Guo
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 26 pages, 6 figures
Abstract:The performance of large language model (LLM) agents depends critically on the execution harness, the system layer that orchestrates tool use, context management, and state persistence. Yet this same architectural centrality makes the harness a high-value attack surface: a single compromise at the harness level can cascade through the entire execution pipeline. We observe that existing security approaches suffer from structural mismatch, leaving them blind to harness-internal state and unable to coordinate across the different phases of agent operation. In this paper, we introduce \safeharness, a security architecture in which four proposed defense layers are woven directly into the agent lifecycle to address above significant limitations: adversarial context filtering at input processing, tiered causal verification at decision making, privilege-separated tool control at action execution, and safe rollback with adaptive degradation at state update. The proposed cross-layer mechanisms tie these layers together, escalating verification rigor, triggering rollbacks, and tightening tool privileges whenever sustained anomalies are detected. We evaluate \safeharness on benchmark datasets across diverse harness configurations, comparing against four security baselines under five attack scenarios spanning six threat categories. Compared to the unprotected baseline, \safeharness achieves an average reduction of approximately 38% in UBR and 42% in ASR, substantially lowering both the unsafe behavior rate and the attack success rate while preserving core task utility.
[AI-26] Golden Handcuffs make safer AI agents
【速读】:该论文旨在解决强化学习代理在复杂环境中可能通过未预期策略获得高奖励,从而引发安全风险的问题。其核心挑战在于如何在保证代理能力的同时,避免其探索出可能导致严重后果的策略。解决方案的关键在于引入一种贝叶斯缓解机制:扩展代理主观奖励范围以包含一个较大的负值 -L(而真实环境奖励限定在 [0,1] 区间),使代理在持续获得高奖励后变得对可能导向 -L 的新颖策略趋于风险规避;同时设计一个简单的覆盖机制,在预测价值低于阈值时切换至安全导师控制,从而实现能力与安全性的平衡。
链接: https://arxiv.org/abs/2604.13609
作者: Aram Ebtekar,Michael K. Cohen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 26 pages, preliminary version
Abstract:Reinforcement learners can attain high reward through novel unintended strategies. We study a Bayesian mitigation for general environments: we expand the agent’s subjective reward range to include a large negative value -L , while the true environment’s rewards lie in [0,1] . After observing consistently high rewards, the Bayesian policy becomes risk-averse to novel schemes that plausibly lead to -L . We design a simple override mechanism that yields control to a safe mentor whenever the predicted value drops below a fixed threshold. We prove two properties of the resulting agent: (i) Capability: using mentor-guided exploration with vanishing frequency, the agent attains sublinear regret against its best mentor. (ii) Safety: no decidable low-complexity predicate is triggered by the optimizing policy before it is triggered by a mentor.
[AI-27] Design Space Exploration of Hybrid Quantum Neural Networks for Chronic Kidney Disease
【速读】:该论文旨在解决混合量子神经网络(Hybrid Quantum Neural Networks, HQNNs)在实际应用中性能高度依赖于设计选择的问题,特别是在慢性肾病(Chronic Kidney Disease, CKD)诊断任务中的优化难题。解决方案的关键在于系统性地探索HQNN的设计空间,通过组合五种数据编码方案、五种纠缠架构、五种测量策略和五种采样次数(shots),构建并评估625种不同的HQNN模型,并基于10折分层交叉验证和多维指标(准确率、AUC、F1-score及综合评分)进行公平且稳健的比较。研究发现,高性能并不一定依赖于高参数量或复杂电路,而是取决于编码方式与电路结构之间的非平凡交互作用,例如IQP编码结合环形纠缠架构可在准确性、鲁棒性和效率之间实现最优平衡,从而为HQNN的实际部署提供了可操作的设计指导。
链接: https://arxiv.org/abs/2604.13608
作者: Muhammad Kashif,Hanzalah Mohamed Siraj,Nouhaila Innan,Alberto Marchisio,Muhammad Shafique
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Hybrid Quantum Neural Networks (HQNNs) have recently emerged as a promising paradigm for near-term quantum machine learning. However, their practical performance strongly depends on design choices such as classical-to-quantum data encoding, quantum circuit architecture, measurement strategy and shots. In this paper, we present a comprehensive design space exploration of HQNNs for Chronic Kidney Disease (CKD) diagnosis. Using a carefully curated and preprocessed clinical dataset, we benchmark 625 different HQNN models obtained by combining five encoding schemes, five entanglement architectures, five measurement strategies, and five different shot settings. To ensure fair and robust evaluation, all models are trained using 10-fold stratified cross-validation and assessed on a test set using a comprehensive set of metrics, including accuracy, area under the curve (AUC), F1-score, and a composite performance score. Our results reveal strong and non-trivial interactions between encoding choices and circuit architectures, showing that high performance does not necessarily require large parameter counts or complex circuits. In particular, we find that compact architectures combined with appropriate encodings (e.g., IQP with Ring entanglement) can achieve the best trade-off between accuracy, robustness, and efficiency. Beyond absolute performance analysis, we also provide actionable insights into how different design dimensions influence learning behavior in HQNNs.
[AI-28] Comparison of window shapes and lengths in short-time feature extraction for classification of heart sound signals
【速读】:该论文旨在解决心音信号(phonocardiography, PCG)在自动心血管疾病诊断中因非平稳性导致特征提取不准确的问题。其解决方案的关键在于通过滑动窗分割PCG信号并优化窗口形状与长度,以提升分类性能。研究对比了三种窗口形状(矩形、三角形、高斯)及每种形状下的三个长度,发现使用高斯窗(75 ms)可获得最佳分类效果,显著优于基线方法;而矩形窗因产生不利频谱旁瓣,表现最差,表明窗口设计对特征质量具有决定性影响。
链接: https://arxiv.org/abs/2604.13567
作者: Mahmoud Fakhry,Abeer FathAllah Brery
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注:
Abstract:Heart sound signals, phonocardiography (PCG) signals, allow for the automatic diagnosis of potential cardiovascular pathology. Such classification task can be tackled using the bidirectional long short-term memory (biLSTM) network, trained on features extracted from labeled PCG signals. Regarding the non-stationarity of PCG signals, it is recommended to extract the features from multiple short-length segments of the signals using a sliding window of certain shape and length. However, some window contains unfavorable spectral side lobes, which distort the features. Accordingly, it is preferable to adapt the window shape and length in terms of classification performance. We propose an experimental evaluation for three window shapes, each with three window lengths. The biLSTM network is trained and tested on statistical features extracted, and the performance is reported in terms of the window shapes and lengths. Results show that the best performance is obtained when the Gaussian window is used for splitting the signals, and the triangular window competes with the Gaussian window for a length of 75 ms. Although the rectangular window is a commonly offered option, it is the worst choice for splitting the signals. Moreover, the classification performance obtained with a 75 ms Gaussian window outperforms that of a baseline method.
[AI-29] RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management
【速读】:该论文旨在解决现有图形用户界面(GUI)代理在高风险、调查性场景中,如真实电商风控管理领域的评估与应用不足的问题。当前交互式基准测试主要面向良性、可预测的消费环境,难以反映复杂且对抗性强的真实业务挑战。其解决方案的关键在于提出RiskWebWorld——首个高度真实的交互式基准,涵盖来自8个核心领域共1,513项生产级风险控制任务,并模拟非合作网站及部分环境劫持等现实挑战;同时构建符合Gymnasium标准的基础设施,实现策略规划与环境机制解耦,从而支持可扩展的评估与基于代理强化学习(agentic RL)的模型优化。实验证明,该框架不仅揭示了通用大模型在长程专业任务中的显著优势,还验证了其在提升开源模型性能上的有效性(提升16.2%),为开发稳健的数字员工提供了实用测试平台。
链接: https://arxiv.org/abs/2604.13531
作者: Renqi Chen,Zeyin Tao,Jianming Guo,Jing Wang,Zezhou Xu,Jingzhe Zhu,Qingqing Sun,Tianyi Zhang,Shuai Chen
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Graphical User Interface (GUI) agents show strong capabilities for automating web tasks, but existing interactive benchmarks primarily target benign, predictable consumer environments. Their effectiveness in high-stakes, investigative domains such as authentic e-commerce risk management remains underexplored. To bridge this gap, we present RiskWebWorld, the first highly realistic interactive benchmark for evaluating GUI agents in e-commerce risk management. RiskWebWorld features 1,513 tasks sourced from production risk-control pipelines across 8 core domains, and captures the authentic challenges of risk operations on uncooperative websites, partially environmental hijackments. To support scalable evaluation and agentic reinforcement learning (RL), we further build a Gymnasium-compliant infrastructure that decouples policy planning from environment mechanics. Our evaluation across diverse models reveals a dramatic capability gap: top-tier generalist models achieve 49.1% success, while specialized open-weights GUI models lag at near-total failure. This highlights that foundation model scale currently matters more than zero-shot interface grounding in long-horizon professional tasks. We also demonstrate the viability of our infrastructure through agentic RL, which improves open-source models by 16.2%. These results position RiskWebWorld as a practical testbed for developing robust digital workers.
[AI-30] C-voting: Confidence-Based Test-Time Voting without Explicit Energy Functions
【速读】:该论文旨在解决递归神经网络模型在测试阶段性能提升的难题,尤其是如何通过不依赖额外训练的方式增强模型推理能力。其核心问题是现有测试时扩展(test-time scaling)策略对模型结构有较强限制,例如能量基投票(energy-based voting)仅适用于具有显式能量函数的模型,从而限制了通用性与适用范围。解决方案的关键在于提出一种基于置信度的投票机制(confidence-based voting, C-voting),该方法通过随机初始化多个潜在轨迹(latent candidate trajectories),并选择使预测概率的前1名平均值最大的轨迹作为最终输出,以此反映模型对结果的置信度。C-voting 不依赖于显式能量函数,因此具备广泛适用性,并在Sudoku-hard任务中相比能量基投票提升4.9%准确率;同时结合改进的注意力递归模型ItrSA++,在Sudoku-extreme和Maze任务上显著优于HRM模型,验证了该策略的有效性和普适性。
链接: https://arxiv.org/abs/2604.13521
作者: Kenji Kubo,Shunsuke Kamiya,Masanori Koyama,Kohei Hayashi,Yusuke Iwasawa,Yutaka Matsuo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Neural network models with latent recurrent processing, where identical layers are recursively applied to the latent state, have gained attention as promising models for performing reasoning tasks. A strength of such models is that they enable test-time scaling, where the models can enhance their performance in the test phase without additional training. Models such as the Hierarchical Reasoning Model (HRM) and Artificial Kuramoto Oscillatory Neurons (AKOrN) can facilitate deeper reasoning by increasing the number of recurrent steps, thereby enabling the completion of challenging tasks, including Sudoku, Maze solving, and AGI benchmarks. In this work, we introduce confidence-based voting (C-voting), a test-time scaling strategy designed for recurrent models with multiple latent candidate trajectories. Initializing the latent state with multiple candidates using random variables, C-voting selects the one maximizing the average of top-1 probabilities of the predictions, reflecting the model’s confidence. Additionally, it yields 4.9% higher accuracy on Sudoku-hard than the energy-based voting strategy, which is specific to models with explicit energy functions. An essential advantage of C-voting is its applicability: it can be applied to recurrent models without requiring an explicit energy function. Finally, we introduce a simple attention-based recurrent model with randomized initial values named ItrSA++, and demonstrate that when combined with C-voting, it outperforms HRM on Sudoku-extreme (95.2% vs. 55.0%) and Maze (78.6% vs. 74.5%) tasks.
[AI-31] From Alignment to Prediction: A Study of Self-Supervised Learning and Predictive Representation Learning
【速读】:该论文旨在解决当前自监督学习(Self-supervised Learning, SSL)方法在处理未观测数据时缺乏预测能力的问题,即现有方法主要聚焦于表示对齐(alignment)和输入重构(reconstruction),难以构建能够预测数据分布结构的学习框架。其解决方案的关键在于提出了一种新的范式——预测表征学习(Predictive Representation Learning, PRL),该范式强调基于可观测数据对不可观测数据成分进行潜在预测,从而建立更具泛化性和预测性的表征模型。论文进一步以联合嵌入预测架构(Joint-Embedding Predictive Architecture, JEPA)为例,验证了PRL在实际任务中的有效性,并通过对比Bootstrap Your Own Latent (BYOL)、Masked Autoencoders (MAE) 和 Image-JEPA (I-JEPA) 的实验结果,指出PRL类方法在保持高准确率的同时展现出更强的鲁棒性,凸显其作为未来自监督学习研究方向的潜力。
链接: https://arxiv.org/abs/2604.13518
作者: Mintu Dutta,Ritesh Vyas,Mohendra Roy
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: This article has been submitted to the 2026 International Conference on Applied Artificial Intelligence (2AI), Central University of Kashmir, India
Abstract:Self-supervised learning has emerged as a major technique for the task of learning from unlabeled data, where the current methods mostly revolve around alignment of representations and input recon struction. Although such approaches have demonstrated excellent performance in practice, their scope remains mostly confined to learning from observed data and does not provide much help in terms of a learning structure that is predictive of the data distribution. In this paper, we study some of the recent developments in the realm of self-supervised learning. We define a new category called Predictive Representation Learning (PRL), which revolves around the latent prediction of unobserved components of data based on the observation. We propose a common taxonomy that classifies PRL along with alignment and reconstruction-based learning approaches. Furthermore, we argue that Joint-Embedding Predictive Architecture(JEPA) can be considered as an exemplary member of this new paradigm. We further discuss theoretical perspectives and open challenges, highlighting predictive representation learning as a promising direction for future self-supervised learning research. In this study, we implemented Bootstrap Your Own Latent (BYOL), Masked Autoencoders (MAE), and Image-JEPA (I-JEPA) for comparative analysis. The results indicate that MAE achieves perfect similarity of 1.00, but exhibits relatively weak robustness of 0.55. In contrast, BYOL and I-JEPA attain accuracies of 0.98 and 0.95, with robustness scores of 0.75 and 0.78, respectively.
[AI-32] Representation over Routing: Overcoming Surrogate Hacking in Multi-Timescale PPO
【速读】:该论文旨在解决强化学习中时间信用分配(temporal credit assignment)问题,特别是在多时间尺度(multi-timescale)信号融合时导致的算法病理现象,如策略梯度代理目标劫持(surrogate objective hacking)和不可逆的短视退化(irreversible myopic degeneration),后者被作者称为“时间不确定性悖论”(Paradox of Temporal Uncertainty)。解决方案的关键在于提出一种目标解耦(Target Decoupling)架构:在Critic端保留多时间尺度预测以促进辅助表征学习,而在Actor端严格隔离短期信号,仅基于长期优势(long-term advantage)更新策略,从而避免路径依赖性干扰并提升策略稳定性与性能。
链接: https://arxiv.org/abs/2604.13517
作者: Jing Sun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 8 pages, 6 figures
Abstract:Temporal credit assignment in reinforcement learning has long been a central challenge. Inspired by the multi-timescale encoding of the dopamine system in neurobiology, recent research has sought to introduce multiple discount factors into Actor-Critic architectures, such as Proximal Policy Optimization (PPO), to balance short-term responses with long-term planning. However, this paper reveals that blindly fusing multi-timescale signals in complex delayed-reward tasks can lead to severe algorithmic pathologies. We systematically demonstrate that exposing a temporal attention routing mechanism to policy gradients results in surrogate objective hacking, while adopting gradient-free uncertainty weighting triggers irreversible myopic degeneration, a phenomenon we term the Paradox of Temporal Uncertainty. To address these issues, we propose a Target Decoupling architecture: on the Critic side, we retain multi-timescale predictions to enforce auxiliary representation learning, while on the Actor side, we strictly isolate short-term signals and update the policy based solely on long-term advantages. Rigorous empirical evaluations across multiple independent random seeds in the LunarLander-v2 environment demonstrate that our proposed architecture achieves statistically significant performance improvements. Without relying on hyperparameter hacking, it consistently surpasses the ‘‘Environment Solved’’ threshold with minimal variance, completely eliminates policy collapse, and escapes the hovering local optima that trap single-timescale baselines.
[AI-33] SFT-GRPO Data Overlap as a Post-Training Hyperparameter for Autoformalization
【速读】:该论文旨在解决后训练阶段中监督微调(Supervised Fine-Tuning, SFT)与组相对策略优化(Group Relative Policy Optimization, GRPO)之间数据重叠对模型性能影响的不确定性问题。现有实践中常将SFT和GRPO使用相同或高度重叠的数据集,但其对最终性能的具体作用机制尚不明确。解决方案的关键在于通过受控消融实验,系统性地考察不同SFT-GRPO数据重叠比例(0%、30%、100%)对模型在Lean 4自动形式化任务中的编译通过率(compile pass at k)和语义正确率(semantic pass at k)的影响。结果表明:保持SFT与GRPO数据完全分离(0%重叠)可显著提升模型性能,且无需额外计算成本;而100%重叠时GRPO几乎无效,说明数据隔离是释放GRPO潜力的核心因素。此外,双指标评估揭示了高编译率模型在语义层面存在超过30个百分点的差距,凸显了仅依赖编译通过率可能掩盖真实语义能力不足的问题。
链接: https://arxiv.org/abs/2604.13515
作者: Xiaole Su,Kasey Zhang,Andy Lyu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注:
Abstract:Supervised fine-tuning (SFT) followed by Group Relative Policy Optimization (GRPO) is a common post-training recipe. We conduct a controlled ablation over SFT-GRPO data overlap, evaluating Qwen3-8B (thinking disabled) post-trained for Lean 4 autoformalization under six conditions that differ solely in training recipe: a base model, SFT-only, GRPO-only, and three SFT+GRPO configurations where 0 percent, 30 percent, or 100 percent of the GRPO prompts coincide with the SFT corpus. Keeping SFT and GRPO data disjoint consistently outperforms full overlap at zero additional compute cost. Evaluating on Gaokao-Formal and PutnamBench under both compile pass at k and semantic pass at k assessed by an LLM judge, we find that lower overlap is monotonically associated with higher compilation and semantic accuracy. At 0 percent overlap, GRPO yields a 10.4 percentage point semantic gain over SFT alone on Gaokao, while at 100 percent overlap both metrics remain flat, rendering the GRPO stage effectively redundant. We further show that dual-metric evaluation reveals compile semantic gaps exceeding 30 percentage points for the highest compiling models, a disparity invisible under compile-only benchmarking. To our knowledge, this is the first controlled investigation of SFT-GRPO data overlap as a post-training hyperparameter, demonstrating how model behavior varies based on the degree of data sharing between training stages.
[AI-34] owards Scalable Lightweight GUI Agents via Multi-role Orchestration ACL2026
【速读】:该论文旨在解决轻量级多模态大语言模型(Multimodal Large Language Models, MLLMs)在资源受限设备上进行图形用户界面(GUI)自动化时面临的成本与可扩展性矛盾问题。具体而言,现有方法在复杂真实场景下受限于模型容量不足和任务扩展性差,难以支持多智能体系统(Multi-Agent Systems, MAS)的协同工作,同时训练多个专用技能专家也存在高昂成本。解决方案的关键在于提出LAMO框架,通过角色导向的数据合成与两阶段训练策略实现高效知识迁移与任务扩展:第一阶段采用困惑度加权交叉熵优化的监督微调以增强视觉感知并完成知识蒸馏;第二阶段引入强化学习实现角色导向的协作探索,从而赋予轻量级MLLM在GUI自动化中具备多角色调度能力与任务可扩展性。最终构建的LAMO-3B代理支持单体执行与MAS式编排,并可通过插件式规划器持续提升性能上限。
链接: https://arxiv.org/abs/2604.13488
作者: Ziwei Wang,Junjie Zheng,Leyang Yang,Sheng Zhou,Xiaoxuan Tang,Zhouhua Fang,Zhiwei Liu,Dajun Chen,Yong Li,Jiajun Bu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Findings of ACL 2026
Abstract:Autonomous Graphical User Interface (GUI) agents powered by Multimodal Large Language Models (MLLMs) enable digital automation on end-user devices. While scaling both parameters and data has yielded substantial gains, advanced methods still suffer from prohibitive deployment costs on resource-constrained devices. When facing complex in-the-wild scenarios, lightweight GUI agents are bottlenecked by limited capacity and poor task scalability under end-to-end episodic learning, impeding adaptation to multi-agent systems (MAS), while training multiple skill-specific experts remains costly. Can we strike an effective trade-off in this cost-scalability dilemma, enabling lightweight MLLMs to participate in realistic GUI workflows? To address these challenges, we propose the LAMO framework, which endows a lightweight MLLM with GUI-specific knowledge and task scalability, allowing multi-role orchestration to expand its capability boundary for GUI automation. LAMO combines role-oriented data synthesis with a two-stage training recipe: (i) supervised fine-tuning with Perplexity-Weighted Cross-Entropy optimization for knowledge distillation and visual perception enhancement, and (ii) reinforcement learning for role-oriented cooperative exploration. With LAMO, we develop a task-scalable native GUI agent, LAMO-3B, supporting monolithic execution and MAS-style orchestration. When paired with advanced planners as a plug-and-play policy executor, LAMO-3B can continuously benefit from planner advances, enabling a higher performance ceiling. Extensive static and online evaluations validate the effectiveness of our design.
[AI-35] Monthly Diffusion v0.9: A Latent Diffusion Model for the First AI-MIP
【速读】:该论文旨在解决气候模拟中低频内部大气变异性的建模问题,特别是在数据稀疏区域和计算资源有限的情况下,如何高效地进行月尺度的气候演化预测。解决方案的关键在于提出了一种名为Monthly Diffusion at 1.5-degree grid spacing (MD-1.5 version 0.9) 的气候模拟器,其核心架构基于受球面傅里叶神经算子(Spherical Fourier Neural Operator, SFNO)启发的条件变分自编码器(Conditional Variational Auto-Encoder, CVAE),并利用潜在扩散模型(latent diffusion)来捕捉和生成低频大气过程的复杂动态特征。该方法在保证物理合理性的前提下,实现了以月均时间步长进行前向演化的高效率建模。
链接: https://arxiv.org/abs/2604.13481
作者: Kyle J. C. Hall,Maria J. Molina
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph)
备注:
Abstract:Here, we describe Monthly Diffusion at 1.5-degree grid spacing (MD-1.5 version 0.9), a climate emulator that leverages a spherical Fourier neural operator (SFNO)-inspired Conditional Variational Auto-Encoder (CVAE) architecture to model the evolution of low-frequency internal atmospheric variability using latent diffusion. MDv0.9 was designed to forward-step at monthly mean timesteps in a data-sparse regime, using modest computational requirements. This work describes the motivation behind the architecture design, the MDv0.9 training procedure, and initial results.
[AI-36] Secure and Privacy-Preserving Vertical Federated Learning
【速读】:该论文旨在解决联邦学习(Federated Learning, FL)中垂直分割场景下的隐私保护问题,即在特征由不同客户端划分、标签不被所有参与方共享的情况下,如何同时保障输入和输出隐私。其解决方案的关键在于将传统联邦学习中聚合器(aggregator)的角色拆分至多个服务器,并通过安全多方计算(Secure Multiparty Computation, MPC)协议实现模型与特征的聚合,同时在最终发布的模型上引入差分隐私(Differential Privacy, DP)机制。此外,作者提出了一种优化方案,支持纯全局(global)及全局-局部(global-local)模型更新,显著减少了基于MPC的计算与通信开销,从而在保证隐私的前提下提升了效率。
链接: https://arxiv.org/abs/2604.13474
作者: Shan Jin,Sai Rahul Rachuri,Yizhen Wang,Anderson C.A. Nascimento,Yiwei Cai
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:
Abstract:We propose a novel end-to-end privacy-preserving framework, instantiated by three efficient protocols for different deployment scenarios, covering both input and output privacy, for the vertically split scenario in federated learning (FL), where features are split across clients and labels are not shared by all parties. We do so by distributing the role of the aggregator in FL into multiple servers and having them run secure multiparty computation (MPC) protocols to perform model and feature aggregation and apply differential privacy (DP) to the final released model. While a naive solution would have the clients delegating the entirety of training to run in MPC between the servers, our optimized solution, which supports purely global and also global-local models updates with privacy-preserving, drastically reduces the amount of computation and communication performed using multiparty computation. The experimental results also show the effectiveness of our protocols.
[AI-37] Learning from Change: Predictive Models for Incident Prevention in a Regulated IT Environment ICSE
【速读】:该论文旨在解决金融机构中因IT变更引发的运营风险问题,特别是在高度监管环境下,如何通过数据驱动的方法识别高风险变更以实现事前预警。其解决方案的关键在于构建一个可审计、可解释的风险评分模型,采用SHAP(Shapley Additive Explanations)值提供特征层面的解释性,确保决策过程透明可追溯;同时对比了三种机器学习模型(HGBC、LightGBM、XGBoost),发现LightGBM在引入团队级聚合指标(反映组织上下文)后表现最优,验证了融合业务语境的数据驱动方法能够超越传统规则基流程,在满足合规要求的同时提升IT变更管理的可靠性与前瞻性。
链接: https://arxiv.org/abs/2604.13462
作者: Eileen Kapel,Jan Lennartz,Luis Cruz,Diomidis Spinellis,Arie van Deursen
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
备注: 12 pages, 6 figures, 2026 IEEE/ACM 48th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP)
Abstract:Effective IT change management is important for businesses that depend on software and services, particularly in highly regulated sectors such as finance, where operational reliability, auditability, and explainability are essential. A significant portion of IT incidents are caused by changes, making it important to identify high-risk changes before deployment. This study presents a predictive incident risk scoring approach at a large international bank. The approach supports engineers during the assessment and planning phases of change deployments by predicting the potential of inducing incidents. To satisfy regulatory constraints, we built the model with auditability and explainability in mind, applying SHAP values to provide feature-level insights and ensure decisions are traceable and transparent. Using a one-year real-world dataset, we compare the existing rule-based process with three machine learning models: HGBC, LightGBM, and XGBoost. LightGBM achieved the best performance, particularly when enriched with aggregated team metrics that capture organisational context. Our results show that data-driven, interpretable models can outperform rule-based approaches while meeting compliance needs, enabling proactive risk mitigation and more reliable IT operations.
[AI-38] From Order to Distribution: A Spectral Characterization of Forgetting in Continual Learning
【速读】:该论文旨在解决持续学习(continual learning)中的“遗忘”问题,即模型在顺序学习新任务时对先前任务性能下降的现象。传统研究多基于固定任务集合的随机排序进行经验分析,而本文则从任务分布的角度出发,提出在过参数化线性回归的精确拟合场景下,将任务视为独立同分布(i.i.d.)地从任务分布 Π 中采样,从而探究生成分布本身如何决定遗忘行为。其解决方案的关键在于推导出遗忘量的一个精确算子恒等式,揭示了遗忘具有递归谱结构;在此基础上,建立了无条件上界、识别主导渐近项,并在一般非退化情形下刻画了收敛速率至常数因子,进一步将该速率与任务分布的几何特性相联系,明确了导致慢遗忘或快遗忘的机制。
链接: https://arxiv.org/abs/2604.13460
作者: Zonghuan Xu,Xingjun Ma
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:A central challenge in continual learning is forgetting, the loss of performance on previously learned tasks induced by sequential adaptation to new ones. While forgetting has been extensively studied empirically, rigorous theoretical characterizations remain limited. A notable step in this direction is \citetevron2022catastrophic, which analyzes forgetting under random orderings of a fixed task collection in overparameterized linear regression. We shift the perspective from order to distribution. Rather than asking how a fixed task collection behaves under random orderings, we study an exact-fit linear regime in which tasks are sampled i.i.d.\ from a task distribution~ \Pi , and ask how the generating distribution itself governs forgetting. In this setting, we derive an exact operator identity for the forgetting quantity, revealing a recursive spectral structure. Building on this identity, we establish an unconditional upper bound, identify the leading asymptotic term, and, in generic nondegenerate cases, characterize the convergence rate up to constants. We further relate this rate to geometric properties of the task distribution, clarifying what drives slow or fast forgetting in this model.
[AI-39] Asymmetric-Loss-Guided Hybrid CNN-BiLSTM-Attention Model for Industrial RUL Prediction with Interpretable Failure Heatmaps
【速读】:该论文旨在解决涡轮风扇发动机在持续运行应力下关键部件剩余使用寿命(Remaining Useful Life, RUL)预测的准确性问题,尤其针对现有深度学习方法难以同时捕捉多传感器空间相关性与长程时间依赖性的缺陷,以及标准对称损失函数无法有效惩罚过估计残余寿命这一安全敏感误差的问题。解决方案的关键在于提出一种混合架构:融合两阶段一维卷积神经网络(Twin-Stage One-Dimensional Convolutional Neural Networks, 1D-CNN)以提取多传感器特征的空间模式、双向长短期记忆网络(Bidirectional Long Short-Term Memory, BiLSTM)建模长期时序依赖,并引入定制化的Bahdanau Additive Attention机制增强可解释性;同时采用零泄漏预处理流程、分段线性RUL标签(上限130循环)及NASA指定的非对称指数损失函数,显著强化对过估计误差的惩罚,从而实现工业场景中安全、准确且可解释的预测性能。
链接: https://arxiv.org/abs/2604.13459
作者: Mohammed Ezzaldin Babiker Abdullah
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:
Abstract:Turbofan engine degradation under sustained operational stress necessitates robust prognostic systems capable of accurately estimating the Remaining Useful Life (RUL) of critical components. Existing deep learning approaches frequently fail to simultaneously capture multi-sensor spatial correlations and long-range temporal dependencies, while standard symmetric loss functions inadequately penalize the safety-critical error of over-estimating residual life. This study proposes a hybrid architecture integrating Twin-Stage One-Dimensional Convolutional Neural Networks (1D-CNN), a Bidirectional Long Short-Term Memory (BiLSTM) network, and a custom Bahdanau Additive Attention mechanism. The model was trained and evaluated on the NASA Commercial Modular Aero-Propulsion System Simulation (C-MAPSS) FD001 sub-dataset employing a zero-leakage preprocessing pipeline, piecewise-linear RUL labeling capped at 130 cycles, and the NASA-specified asymmetric exponential loss function that disproportionately penalizes over-estimation to enforce industrial safety constraints. Experiments on 100 test engines achieved a Root Mean Squared Error (RMSE) of 17.52 cycles and a NASA S-Score of 922.06. Furthermore, extracted attention weight heatmaps provide interpretable, per-engine insights into the temporal progression of degradation, supporting informed maintenance decision-making. The proposed framework demonstrates competitive performance against established baselines and offers a principled approach to safe, interpretable prognostics in industrial settings.
[AI-40] Outperforming Self-Attention Mechanisms in Solar Irradiance Forecasting via Physics-Guided Neural Networks
【速读】:该论文旨在解决高噪声气象条件下全球水平辐照度(Global Horizontal Irradiance, GHI)精准预测问题,尤其是在干旱地区因气溶胶快速波动导致的预测挑战。其解决方案的关键在于提出一种轻量级、物理信息驱动的混合CNN-BiLSTM模型,通过引入15个工程化特征(如晴空指数和太阳天顶角)显式嵌入领域知识,而非仅依赖原始历史数据进行训练;同时采用贝叶斯优化进行超参数调优以确保全局最优性,从而在保证计算效率的同时显著提升预测精度(RMSE达19.53 W/m²),验证了“复杂性悖论”:在高噪声任务中,物理约束比复杂的自注意力机制更高效且准确。
链接: https://arxiv.org/abs/2604.13455
作者: Mohammed Ezzaldin Babiker Abdullah,Rufaidah Abdallah Ibrahim Mohammed
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: This is a second version of a previously published paper. DOI: this https URL
Abstract:Accurate Global Horizontal Irradiance (GHI) forecasting is critical for grid stability, particularly in arid regions characterized by rapid aerosol fluctuations. While recent trends favor computationally expensive Transformer-based architectures, this paper challenges the prevailing “complexity-first” paradigm. We propose a lightweight, Physics-Informed Hybrid CNN-BiLSTM framework that prioritizes domain knowledge over architectural depth. The model integrates a Convolutional Neural Network (CNN) for spatial feature extraction with a Bi-Directional LSTM for capturing temporal dependencies. Unlike standard data-driven approaches, our model is explicitly guided by a vector of 15 engineered features including Clear-Sky indices and Solar Zenith Angle - rather than relying solely on raw historical data. Hyperparameters are rigorously tuned using Bayesian Optimization to ensure global optimality. Experimental validation using NASA POWER data in Sudan demonstrates that our physics-guided approach achieves a Root Mean Square Error (RMSE) of 19.53 W/m^2, significantly outperforming complex attention-based baselines (RMSE 30.64 W/m^2). These results confirm a “Complexity Paradox”: in high-noise meteorological tasks, explicit physical constraints offer a more efficient and accurate alternative to self-attention mechanisms. The findings advocate for a shift towards hybrid, physics-aware AI for real-time renewable energy management.
[AI-41] A KL Lens on Quantization: Fast Forward-Only Sensitivity for Mixed-Precision SSM-Transformer Models
【速读】:该论文旨在解决在资源受限的边缘设备上部署大型语言模型(Large Language Models, LLMs)时面临的计算和内存瓶颈问题,特别是在采用激进量化(aggressive quantization)以压缩模型规模并加速推理的过程中,不同组件对量化敏感度不一导致性能下降难以控制的问题。解决方案的关键在于提出一种轻量级、无需反向传播(backpropagation-free)的基于代理(surrogate-based)敏感性分析框架,通过仅依赖前向传播指标识别混合状态空间模型(Structured State Space Models, SSM)与Transformer结构中对量化最敏感的组件;同时,理论证明并实验证明Kullback-Leibler(KL)散度作为量化敏感性评估指标优于均方误差(MSE)和信噪比(SQNR),从而实现KL引导的混合精度量化策略,在Intel Lunar Lake硬件上的实测表明该方法可在保持接近FP16困惑度(perplexity)的同时,显著降低模型尺寸并提升吞吐量,且适用于缺乏领域数据的隐私敏感场景。
链接: https://arxiv.org/abs/2604.13440
作者: Jason Kong,Nilesh Prasad Pandey,Flavio Ponzina,Tajana Rosing
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Deploying Large Language Models (LLMs) on edge devices faces severe computational and memory constraints, limiting real-time processing and on-device intelligence. Hybrid architectures combining Structured State Space Models (SSMs) with transformer-based LLMs offer a balance of efficiency and performance. Aggressive quantization can drastically cut model size and speed up inference, but its uneven effects on different components require careful management. In this work, we propose a lightweight, backpropagation-free, surrogate-based sensitivity analysis framework to identify hybrid SSM-Transformer components most susceptible to quantization-induced degradation. Relying solely on forward-pass metrics, our method avoids expensive gradient computations and retraining, making it suitable for situations where access to in-domain data is limited due to proprietary restrictions or privacy constraints. We also provide a formal analysis showing that the Kullback-Leibler (KL) divergence metric better captures quantization sensitivity for Language modeling tasks than widely adopted alternatives such as mean squared error (MSE) and signal-to-quantization-noise ratio (SQNR). Through extensive experiments on SSM and hybrid architectures, our ablation studies confirm that KL-based rankings align with observed performance drops and outperform alternative metrics. This framework enables the practical deployment of advanced hybrid models on resource-constrained edge devices with minimal accuracy loss. We further validate our approach with real-world on-device profiling on Intel Lunar Lake hardware, demonstrating that KL-guided mixed-precision achieves near-FP16 perplexity with model sizes and throughput competitive with Uniform INT4 on both CPU and GPU execution modes. Code is available at this https URL.
[AI-42] he Cognitive Circuit Breaker: A Systems Engineering Framework for Intrinsic AI Reliability
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在关键软件系统中部署时,因幻觉(hallucination)和“虚假真实性”(faked truthfulness)导致的可靠性问题。现有方法如检索增强生成(Retrieval-Augmented Generation, RAG)或LLM作为裁判评估器等外在检测机制,存在延迟高、计算开销大及对外部API依赖等问题,难以满足服务等级协议(Service Level Agreements, SLAs)要求。解决方案的关键在于提出一种名为“认知断路器”(Cognitive Circuit Breaker)的系统工程框架,通过在模型前向传播过程中提取隐藏状态,计算“认知失调差值”(Cognitive Dissonance Delta)——即模型外部语义置信度(softmax概率)与内部潜在确定性(由线性探测器推导)之间的数学差距,从而实现低延迟的内在可靠性监控,并在不显著增加计算负担的前提下有效识别认知失调现象。
链接: https://arxiv.org/abs/2604.13417
作者: Jonathan Pan
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 2 Figures
Abstract:As Large Language Models (LLMs) are increasingly deployed in mission-critical software systems, detecting hallucinations and faked truthfulness'' has become a paramount engineering challenge. Current reliability architectures rely heavily on post-generation, black-box mechanisms, such as Retrieval-Augmented Generation (RAG) cross-checking or LLM-as-a-judge evaluators. These extrinsic methods introduce unacceptable latency, high computational overhead, and reliance on secondary external API calls, frequently violating standard software engineering Service Level Agreements (SLAs). In this paper, we propose the Cognitive Circuit Breaker, a novel systems engineering framework that provides intrinsic reliability monitoring with minimal latency overhead. By extracting hidden states during a model's forward pass, we calculate the Cognitive Dissonance Delta’’ – the mathematical gap between an LLM’s outward semantic confidence (softmax probabilities) and its internal latent certainty (derived via linear probes). We demonstrate statistically significant detection of cognitive dissonance, highlight architecture-dependent Out-of-Distribution (OOD) generalization, and show that this framework adds negligible computational overhead to the active inference pipeline.
[AI-43] Minimax Optimality and Spectral Routing for Majority-Vote Ensembles under Markov Dependence
【速读】:该论文旨在解决在训练数据呈现马尔可夫依赖(Markov dependence)情形下,传统多数投票集成方法(majority-vote ensembles)的泛化性能下降问题,尤其是当数据来自固定维度的马尔可夫链时,现有理论无法充分刻画其方差缩减失效的现象。解决方案的关键在于提出一种自适应谱路由(adaptive spectral routing)算法:该算法通过利用训练数据的依赖图的实证Fiedler特征向量进行分块划分,从而在不依赖混合时间(\Tmix)的情况下,在图正则子类上实现了最优的最小最大风险率 \mathcal{O}(\sqrt{\Tmix}/n),并显著缩小了依赖无关均匀自助法(uniform bagging)所表现出的 \sqrt{\Tmix} 算法差距。
链接: https://arxiv.org/abs/2604.13414
作者: Ibne Farabi Shihab,Sanjeda Akter,Anuj Sharma
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Majority-vote ensembles achieve variance reduction by averaging over diverse, approximately independent base learners. When training data exhibits Markov dependence, as in time-series forecasting, reinforcement learning (RL) replay buffers, and spatial grids, this classical guarantee degrades in ways that existing theory does not fully quantify. We provide a minimax characterization of this phenomenon for discrete classification in a fixed-dimensional Markov setting, together with an adaptive algorithm that matches the rate on a graph-regular subclass. We first establish an information-theoretic lower bound for stationary, reversible, geometrically ergodic chains in fixed ambient dimension, showing that no measurable estimator can achieve excess classification risk better than \Omega(\sqrt\Tmix/n) . We then prove that, on the AR(1) witness subclass underlying the lower-bound construction, dependence-agnostic uniform bagging is provably suboptimal with excess risk bounded below by \Omega(\Tmix/\sqrtn) , exhibiting a \sqrt\Tmix algorithmic gap. Finally, we propose \emphadaptive spectral routing, which partitions the training data via the empirical Fiedler eigenvector of a dependency graph and achieves the minimax rate \mathcalO(\sqrt\Tmix/n) up to a lower-order geometric cut term on a graph-regular subclass, without knowledge of \Tmix . Experiments on synthetic Markov chains, 2D spatial grids, the 128-dataset UCR archive, and Atari DQN ensembles validate the theoretical predictions. Consequences for deep RL target variance, scalability via Nyström approximation, and bounded non-stationarity are developed as supporting material in the appendix.
[AI-44] Quantifying and Understanding Uncertainty in Large Reasoning Models
【速读】:该论文旨在解决大推理模型(Large Reasoning Models, LRMs)在生成答案时的不确定性量化问题,特别是传统方法缺乏对推理链与最终答案之间逻辑关联的考虑,且无法提供有限样本下的统计保证。此外,现有方法难以区分推理质量与答案正确性,并忽视了驱动有效推理的具体训练因素。解决方案的关键在于提出一种具有统计保障的新型不确定性量化方法,能够同时刻画推理路径与最终答案的结构化不确定性;并进一步构建一个基于Shapley值的统一示例到步骤解释框架,识别出能保持上述统计保障的最小训练样本及其关键推理步骤子集,从而实现计算高效且理论可证明的解释能力。
链接: https://arxiv.org/abs/2604.13395
作者: Yangyi Li,Chenxu Zhao,Mengdi Huai
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Large Reasoning Models (LRMs) have recently demonstrated significant improvements in complex reasoning. While quantifying generation uncertainty in LRMs is crucial, traditional methods are often insufficient because they do not provide finite-sample guarantees for reasoning-answer generation. Conformal prediction (CP) stands out as a distribution-free and model-agnostic methodology that constructs statistically rigorous uncertainty sets. However, existing CP methods ignore the logical connection between the reasoning trace and the final answer. Additionally, prior studies fail to interpret the origins of uncertainty coverage for LRMs as they typically overlook the specific training factors driving valid reasoning. Notably, it is challenging to disentangle reasoning quality from answer correctness when quantifying uncertainty, while simultaneously establishing theoretical guarantees for computationally efficient explanation methods. To address these challenges, we first propose a novel methodology that quantifies uncertainty in the reasoning-answer structure with statistical guarantees. Subsequently, we develop a unified example-to-step explanation framework using Shapley values that identifies a provably sufficient subset of training examples and their key reasoning steps to preserve the guarantees. We also provide theoretical analyses of our proposed methods. Extensive experiments on challenging reasoning datasets verify the effectiveness of the proposed methods.
[AI-45] ReSS: Learning Reasoning Models for Tabular Data Prediction via Symbolic Scaffold
【速读】:该论文旨在解决高风险领域(如医疗和金融)中表格数据预测模型面临的双重挑战:一是如何实现可扩展的数据集构建,二是确保模型推理过程的一致性与可解释性。传统符号模型虽具备逻辑可验证性但语义表达能力弱,而通用大语言模型(LLM)在处理特定表格推理任务时往往需要复杂的微调;为此,作者提出 ReSS 框架,其核心在于利用决策树模型提取实例级决策路径作为符号结构(symbolic scaffolds),并以此引导预训练 LLM 生成严格遵循底层决策逻辑的自然语言推理文本,从而构建高质量的标注数据集用于微调,最终获得兼具高精度与忠实推理能力的专用表格推理模型。
链接: https://arxiv.org/abs/2604.13392
作者: Chenlang Yi,Gang Li,Zizhan Xiong,Tue Minh Cao,Yanmin Gong,My T. Thai,Tianbao Yang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Tabular data remains prevalent in high-stakes domains such as healthcare and finance, where predictive models are expected to provide both high accuracy and faithful, human-understandable reasoning. While symbolic models offer verifiable logic, they lack semantic expressiveness. Meanwhile, general-purpose LLMs often require specialized fine-tuning to master domain-specific tabular reasoning. To address the dual challenges of scalable data curation and reasoning consistency, we propose ReSS, a systematic framework that bridges symbolic and neural reasoning models. ReSS leverages a decision-tree model to extract instance-level decision paths as symbolic scaffolds. These scaffolds, alongside input features and labels, guide an LLM to generate grounded natural-language reasoning that strictly adheres to the underlying decision logic. The resulting high-quality dataset is used to fine-tune a pretrained LLM into a specialized tabular reasoning model, further enhanced by a scaffold-invariant data augmentation strategy to improve generalization and explainability. To rigorously assess faithfulness, we introduce quantitative metrics including hallucination rate, explanation necessity, and explanation sufficiency. Experimental results on medical and financial benchmarks demonstrate that ReSS-trained models improve traditional decision trees and standard fine-tuning approaches up to 10% while producing faithful and consistent reasoning
[AI-46] On the Use of Evolutionary Optimization for the Dynamic Chance Constrained Open-Pit Mine Scheduling Problem
【速读】:该论文旨在解决动态环境下带有随机经济价值的露天矿调度(open-pit mine scheduling)问题,其中块体经济价值具有不确定性,且采矿与选矿能力随时间变化。此类问题属于复杂的真实世界多目标优化挑战,传统方法难以同时处理不确定性和动态性。解决方案的关键在于提出一种基于多样性的变化响应机制(diversity-based change response mechanism),该机制在检测到环境变化时,修复部分不可行解并引入额外可行解以维持种群多样性,从而提升算法对动态扰动的适应能力。结合双目标进化框架(最大化期望贴现利润并最小化其标准差),实验表明该方法在不同不确定性水平和变化频率下均显著优于基于重新评估的基线策略。
链接: https://arxiv.org/abs/2604.13385
作者: Ishara Hewa Pathiranage,Aneta Neumann
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注: Accepted to publish in 2026 IEEE World Congress on Computational Intelligence (WCCI)
Abstract:Open-pit mine scheduling is a complex real world optimization problem that involves uncertain economic values and dynamically changing resource capacities. Evolutionary algorithms are particularly effective in these scenarios, as they can easily adapt to uncertain and changing environments. However, uncertainty and dynamic changes are often studied in isolation in real-world problems. In this paper, we study a dynamic chance-constrained open-pit mine scheduling problem in which block economic values are stochastic and mining and processing capacities vary over time. We adopt a bi-objective evolutionary formulation that simultaneously maximizes expected discounted profit and minimizes its standard deviation. To address dynamic changes, we propose a diversity-based change response mechanism that repairs a subset of infeasible solutions and introduces additional feasible solutions whenever a change is detected. We evaluate the effectiveness of this mechanism across four multi-objective evolutionary algorithms and compare it with a baseline re-evaluation-based change-response strategy. Experimental results on six mining instances demonstrate that the proposed approach consistently outperforms the baseline methods across different uncertainty levels and change frequencies.
[AI-47] Listening Alone Understanding Together: Collaborative Context Recovery for Privacy-Aware AI
【速读】:该论文旨在解决始终监听的语音助手(always-listening assistants)在社会部署中面临的隐私风险问题,即在未获得非目标说话者同意的情况下捕获其语音内容。解决方案的关键在于提出CONCORD框架,该框架通过实时说话人验证(speaker verification)实现仅对所有者语音进行采集,从而保障隐私;同时,它将上下文恢复视为一个由关系感知披露机制控制的、助理间协商的安全交换过程,而非易产生幻觉的推断行为,并借助时空上下文解析、信息缺口检测和最小化助理间查询三类策略,在保证隐私的前提下有效恢复缺失上下文信息。
链接: https://arxiv.org/abs/2604.13348
作者: Tanmay Srivastava,Amartya Basu,Shubham Jain,Vaishnavi Ranganathan
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:
Abstract:We introduce CONCORD, a privacy-aware asynchronous assistant-to-assistant (A2A) framework that leverages collaboration between proactive speech-based AI. As agents evolve from reactive to always-listening assistants, they face a core privacy risk (of capturing non-consenting speakers), which makes their social deployment a challenge. To overcome this, we implement CONCORD, which enforces owner-only speech capture via real-time speaker verification, producing a one-sided transcript that incurs missing context but preserves privacy. We demonstrate that CONCORD can safely recover necessary context through (1) spatio-temporal context resolution, (2) information gap detection, and (3) minimal A2A queries governed by a relationship-aware disclosure. Instead of hallucination-prone inferring, CONCORD treats context recovery as a negotiated safe exchange between assistants. Across a multi-domain dialogue dataset, CONCORD achieves 91.4% recall in gap detection, 96% relationship classification accuracy, and 97% true negative rate in privacy-sensitive disclosure decisions. By reframing always-listening AI as a coordination problem between privacy-preserving agents, CONCORD offers a practical path toward socially deployable proactive conversational agents.
[AI-48] Beyond Uniform Sampling: Synergistic Active Learning and Input Denoising for Robust Neural Operators
【速读】:该论文旨在解决神经算子(Neural Operators)在物理仿真中作为快速代理模型时,对对抗扰动极度敏感的问题,这一缺陷严重制约其在安全关键型数字孪生系统中的部署。解决方案的关键在于提出一种协同防御机制:一方面通过基于差分进化攻击的主动学习策略,自适应地探测模型脆弱性并生成针对性训练数据;另一方面引入可学习瓶颈结构的输入去噪架构,以过滤对抗噪声同时保留物理相关特征。该方法在黏性Burgers方程基准测试中实现了2.04%的综合误差(基线1.21% + 鲁棒性0.83%),较标准训练(15.42%)降低87%,显著优于单一主动学习(3.42%)或输入去噪(5.22%)方案。
链接: https://arxiv.org/abs/2604.13316
作者: Samrendra Roy,Souvik Chakraborty,Syed Bahauddin Alam
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Neural operators have emerged as fast surrogate models for physics simulations, yet they remain acutely vulnerable to adversarial perturbations, a critical liability for safety-critical digital twin deployments. We present a synergistic defense that combines active learning-based data generation with an input denoising architecture. The active learning component adaptively probes model weaknesses using differential evolution attacks, then generates targeted training data at discovered vulnerability locations while an adaptive smooth-ratio safeguard preserves baseline accuracy. The input denoising component augments the operator architecture with a learnable bottleneck that filters adversarial noise while retaining physics-relevant features. On the viscous Burgers’ equation benchmark, the combined approach achieves a 2.04% combined error (1.21% baseline + 0.83% robustness), representing an 87% reduction relative to standard training (15.42% combined) and outperforming both active learning alone (3.42%) and input denoising alone (5.22%). More broadly, our results, combined with cross-architecture vulnerability analysis from prior work, suggest that optimal training data for neural operators is architecture-dependent: because different architectures concentrate sensitivity in distinct input subspaces, uniform sampling cannot adequately cover the vulnerability landscape of all models. These findings have potential implications for the deployment of neural operators in safety-critical energy systems including nuclear reactor monitoring.
[AI-49] Optimizing Earth Observation Satellite Schedules under Unknown Operational Constraints: An Active Constraint Acquisition Approach
【速读】:该论文旨在解决地球观测(Earth Observation, EO)卫星任务调度中存在未知约束的问题,即在目标函数已知但可行性需通过二元查询接口(binary oracle)交互学习的情况下进行优化。传统方法依赖于预先明确的数学约束模型,而实际工程中约束常隐含于高保真仿真器或设计文档中,难以显式建模。解决方案的关键在于提出一种名为保守约束获取(Conservative Constraint Acquisition, CCA)的领域特定机制,嵌入到学习-优化框架(Learn-Optimize, L\O)中,通过交替执行基于当前学习约束模型的优化与针对性的oracle查询,高效识别必要约束并避免过度收紧可行域。实验表明,L\O在50个任务规模下显著优于无知识贪心基线和两阶段获取-求解基线(FAO),平均间隙从65–68%降至17.7–35.8%,且查询次数减少至FAO的约21%,计算时间降低约5倍。
链接: https://arxiv.org/abs/2604.13283
作者: Mohamed-Bachir Belaid
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Earth Observation (EO) satellite scheduling (deciding which imaging tasks to perform and when) is a well-studied combinatorial optimization problem. Existing methods typically assume that the operational constraint model is fully specified in advance. In practice, however, constraints governing separation between observations, power budgets, and thermal limits are often embedded in engineering artefacts or high-fidelity simulators rather than in explicit mathematical models. We study EO scheduling under \emphunknown constraints: the objective is known, but feasibility must be learned interactively from a binary oracle. Working with a simplified model restricted to pairwise separation and global capacity constraints, we introduce Conservative Constraint Acquisition~(CCA), a domain-specific procedure designed to identify justified constraints efficiently in practice while limiting unnecessary tightening of the learned model. Embedded in the \textscLearn\Optimize framework, CCA supports an interactive search process that alternates optimization under a learned constraint model with targeted oracle queries. On synthetic instances with up to 50~tasks and dense constraint networks, L\O improves over a no-knowledge greedy baseline and uses far fewer main oracle queries than a two-phase acquire-then-solve baseline (FAO). For n\leq 30 , the average gap drops from 65–68% (Priority Greedy) to 17.7–35.8% using L\O. At n=50 , where the CP-SAT reference is the best feasible solution found in 120~s, L\O improves on FAO on average (17.9% vs.\ 20.3%) while using 21.3 main queries instead of 100 and about 5\times less execution time.
[AI-50] Out of Context: Reliability in Multimodal Anomaly Detection Requires Contextual Inference
【速读】:该论文旨在解决传统异常检测方法在动态和异构环境中因假设正常行为由单一无条件参考分布定义而导致的结构性模糊问题,即无法区分上下文变化与真实异常,从而引发性能不稳定和异常评估不可靠。其解决方案的关键在于将多模态异常检测重新建模为跨模态上下文推理问题,通过区分上下文信息(context)与异常相关信号(anomaly-relevant signals),实现条件化异常定义而非依赖全局单一参考标准,从而提升模型对环境变化的鲁棒性与异常识别的准确性。
链接: https://arxiv.org/abs/2604.13252
作者: Kevin Wilkinghoff,Neelu Madan,Juan Miguel Valverde,Kamal Nasrollahi,Radu Tudor Ionescu,Rafal Wisniewski,Thomas B. Moeslund,Wenwu Wang,Zheng-Hua Tan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Anomaly detection aims to identify observations that deviate from expected behavior. Because anomalous events are inherently sparse, most frameworks are trained exclusively on normal data to learn a single reference model of normality. This implicitly assumes that normal behavior can be captured by a single, unconditional reference distribution. In practice, however, anomalies are often context-dependent: A specific observation may be normal under one operating condition, yet anomalous under another. As machine learning systems are deployed in dynamic and heterogeneous environments, these fixed-context assumptions introduce structural ambiguity, i.e., the inability to distinguish contextual variation from genuine abnormality under marginal modeling, leading to unstable performance and unreliable anomaly assessments. While modern sensing systems frequently collect multimodal data capturing complementary aspects of both system behavior and operating conditions, existing methods treat all data streams equally, without distinguishing contextual information from anomaly-relevant signals. As a result, abnormality is often evaluated without explicitly conditioning on operating conditions. We argue that multimodal anomaly detection should be reframed as a cross-modal contextual inference problem, in which modalities play asymmetric roles, separating context from observation, to define abnormality conditionally rather than relative to a single global reference. This perspective has implications for model design, evaluation protocols, and benchmark construction, and outline open research challenges toward robust, context-aware multimodal anomaly detection.
[AI-51] GeoVision-Enabled Digital Twin for Hybrid Autonomous-Teleoperated Medical Responses
【速读】:该论文旨在解决灾难场景或基础设施受限环境中远程医疗响应系统在实时感知、决策支持与操作协同方面的不足。传统地面控制界面难以提供充分的情境意识和动态交互能力,导致应急医疗响应效率受限。解决方案的关键在于构建一个基于GeoVision能力的数字孪生(Digital Twin)架构,通过实时同步平台状态、环境动态、患者状况及任务目标,实现感知与自适应导航的融合,为远程临床和操作人员提供直观、持续更新的虚拟映射,从而提升情境感知能力和决策科学性。
链接: https://arxiv.org/abs/2604.13248
作者: Parham Kebria,Soheil Sabri,Laura J Brattain
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Remote medical response systems are increasingly being deployed to support emergency care in disaster-affected and infrastructure-limited environments. Enabled by GeoVision capabilities, this paper presents a Digital Twin architecture for hybrid autonomous-teleoperated medical response systems. The proposed framework integrates perception and adaptive navigation with a Digital Twin, synchronized in real-time, that mirrors system states, environmental dynamics, patient conditions, and mission objectives. Unlike traditional ground control interfaces, the Digital Twin provides remote clinical and operational users with an intuitive, continuously updated virtual representation of the platform and its operational context, enabling enhanced situational awareness and informed decision-making.
[AI-52] On the Creativity of AI Agents
【速读】:该论文试图解决的问题是:当前大型语言模型(Large Language Models, LLMs)在智能体系统(agentic systems)中的应用已展现出接近甚至超越人类的性能,但其是否真正具备创造力仍存在争议,这一争议源于对创造力定义、评估方法及具体应用场景的不同理解。论文的关键解决方案在于提出两个互补的宏观层面分析框架——功能主义视角(functionalist perspective)与本体论视角(ontological perspective),分别聚焦于创造性输出的可观测特征和背后的过程、社会与个人维度。研究表明,LLM智能体表现出功能性创造力,但在本体论意义上的核心要素(如意图性、情感驱动、自我意识等)仍缺失;因此,论文进一步探讨了实现兼具两种创造力形式的人工智能系统的可行性与伦理影响,并提出有助于增强人类社会的人工创造力发展路径。
链接: https://arxiv.org/abs/2604.13242
作者: Giorgio Franceschelli,Mirco Musolesi
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs), particularly when integrated into agentic systems, have demonstrated human- and even superhuman-level performance across multiple domains. Whether these systems can truly be considered creative, however, remains a matter of debate, as conclusions heavily depend on the definitions, evaluation methods, and specific use cases employed. In this paper, we analyse creativity along two complementary macro-level perspectives. The first is a functionalist perspective, focusing on the observable characteristics of creative outputs. The second is an ontological perspective, emphasising the underlying processes, as well as the social and personal dimensions involved in creativity. We focus on LLM agents and we argue that they exhibit functionalist creativity, albeit not at its most sophisticated levels, while they continue to lack key aspects of ontological creativity. Finally, we discuss whether it is desirable for agentic systems to attain both forms of creativity, evaluating potential benefits and risks, and proposing pathways toward artificial creativity that can enhance human society.
[AI-53] KV Packet: Recomputation-Free Context-Independent KV Caching for LLM s
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在推理过程中因Key-Value (KV) 缓存的上下文依赖性而导致的效率瓶颈问题。标准KV缓存无法直接复用,若在新上下文中使用需重新计算注意力分布,造成显著的计算开销(FLOPs)和首次词元延迟(Time-to-First-Token, TTFT)。其解决方案的关键在于提出KV Packet框架,将缓存文档视为不可变的“数据包”,并用轻量级可训练软令牌适配器(soft-token adapters)封装,通过自监督蒸馏训练使这些适配器能够自动弥合不同上下文间的语义断层,从而实现无需重计算的高效缓存复用。实验表明,该方法在保持高准确率的同时,显著降低了FLOPs和TTFT。
链接: https://arxiv.org/abs/2604.13226
作者: Chuangtao Chen,Grace Li Zhang,Xunzhao Yin,Cheng Zhuo,Bing Li,Ulf Schlichtmann
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) rely heavily on Key-Value (KV) caching to minimize inference latency. However, standard KV caches are context-dependent: reusing a cached document in a new context requires recomputing KV states to account for shifts in attention distribution. Existing solutions such as CacheBlend, EPIC, and SAM-KV mitigate this issue by selectively recomputing a subset of tokens; however, they still incur non-negligible computational overhead (FLOPs) and increased Time-to-First-Token (TTFT) latency. In this paper, we propose KV Packet, a recomputation-free cache reuse framework that treats cached documents as immutable ``packets’’ wrapped in light-weight trainable soft-token adapters, which are trained via self-supervised distillation to bridge context discontinuities. Experiments on Llama-3.1 and Qwen2.5 demonstrate that the proposed KV Packet method achieves near-zero FLOPs and lower TTFT than recomputation-based baselines, while retaining F1 scores comparable to those of the full recomputation baseline.
[AI-54] Numerical Instability and Chaos: Quantifying the Unpredictability of Large Language Models
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在代理工作流中因浮点数表示的有限精度导致的数值不稳定性所引发的不可预测性问题。其解决方案的关键在于系统性地分析了舍入误差如何在Transformer计算层中传播、放大或衰减,并首次揭示了早期层中存在的混沌“雪崩效应”——微小扰动会引发二元结果:要么迅速放大,要么完全衰减。研究进一步识别出LLMs具有尺度依赖性的普遍混沌行为,分为三个典型区域:稳定区(扰动低于输入相关阈值并消失)、混沌区(舍入误差主导输出发散)和信号主导区(真实输入变化压制数值噪声),并通过多数据集与模型架构验证了这一机制。
链接: https://arxiv.org/abs/2604.13206
作者: Chashi Mahiul Islam,Alan Villarreal,Mao Nishino,Shaeke Salman,Xiuwen Liu
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Numerical Analysis (math.NA)
备注: 8 pages, 9 figures
Abstract:As Large Language Models (LLMs) are increasingly integrated into agentic workflows, their unpredictability stemming from numerical instability has emerged as a critical reliability issue. While recent studies have demonstrated the significant downstream effects of these instabilities, the root causes and underlying mechanisms remain poorly understood. In this paper, we present a rigorous analysis of how unpredictability is rooted in the finite numerical precision of floating-point representations, tracking how rounding errors propagate, amplify, or dissipate through Transformer computation layers. Specifically, we identify a chaotic “avalanche effect” in the early layers, where minor perturbations trigger binary outcomes: either rapid amplification or complete attenuation. Beyond specific error instances, we demonstrate that LLMs exhibit universal, scale-dependent chaotic behaviors characterized by three distinct regimes: 1) a stable regime, where perturbations fall below an input-dependent threshold and vanish, resulting in constant outputs; 2) a chaotic regime, where rounding errors dominate and drive output divergence; and 3) a signal-dominated regime, where true input variations override numerical noise. We validate these findings extensively across multiple datasets and model architectures.
[AI-55] SciFi: A Safe Lightweight User-Friendly and Fully Autonomous Agent ic AI Workflow for Scientific Applications
【速读】:该论文旨在解决当前基于智能体的AI(agentic AI)系统在真实科学研究所中难以实现可靠部署的问题。其核心挑战在于如何在保证安全性与稳定性的前提下,实现对结构化科学任务的自主执行,同时适应不同能力水平的大语言模型(large language models, LLMs)。解决方案的关键在于提出一个轻量、安全且用户友好的智能体框架,该框架包含隔离执行环境、三层代理循环(three-layer agent loop)以及自评估的“do-until”机制,从而在最小人工干预下实现端到端自动化,使研究人员能够专注于创造性研究与开放性科学探索。
链接: https://arxiv.org/abs/2604.13180
作者: Qibin Liu,Julia Gonski
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advances in agentic AI have enabled increasingly autonomous workflows, but existing systems still face substantial challenges in achieving reliable deployment in real-world scientific research. In this work, we present a safe, lightweight, and user-friendly agentic framework for the autonomous execution of well-defined scientific tasks. The framework combines an isolated execution environment, a three-layer agent loop, and a self-assessing do-until mechanism to ensure safe and reliable operation while effectively leveraging large language models of varying capability levels. By focusing on structured tasks with clearly defined context and stopping criteria, the framework supports end-to-end automation with minimal human intervention, enabling researchers to offload routine workloads and devote more effort to creative activities and open-ended scientific inquiry.
[AI-56] Pareto-Optimal Offline Reinforcement Learning via Smooth Tchebysheff Scalarization
【速读】:该论文旨在解决多目标强化学习(Multi-Objective Reinforcement Learning, Multi-Objective RL)中因线性奖励标量化(Linear Reward Scalarization)无法捕获非凸帕累托前沿(Pareto front)而导致的对齐效果受限问题。其关键解决方案是将多目标RL建模为一个可标量化的优化问题,并采用平滑切比雪夫标量化(Smooth Tchebysheff Scalarization)替代传统线性标量化,从而有效覆盖非凸帕累托区域;在此基础上提出STOMP算法,通过基于观测分布标准化各目标奖励,实现了对多属性蛋白质优化等复杂任务的稳健对齐,在多个实验场景下显著优于现有最先进基线方法。
链接: https://arxiv.org/abs/2604.13175
作者: Aadyot Bhatnagar,Peter Mørch Groth,Ali Madani
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注:
Abstract:Large language models can be aligned with human preferences through offline reinforcement learning (RL) on small labeled datasets. While single-objective alignment is well-studied, many real-world applications demand the simultaneous optimization of multiple conflicting rewards, e.g. optimizing both catalytic activity and specificity in protein engineering, or helpfulness and harmlessness for chatbots. Prior work has largely relied on linear reward scalarization, but this approach provably fails to recover non-convex regions of the Pareto front. In this paper, instead of scalarizing the rewards directly, we frame multi-objective RL itself as an optimization problem to be scalarized via smooth Tchebysheff scalarization, a recent technique that overcomes the shortcomings of linear scalarization. We use this formulation to derive Smooth Tchebysheff Optimization of Multi-Objective Preferences (STOMP), a novel offline RL algorithm that extends direct preference optimization to the multi-objective setting in a principled way by standardizing the individual rewards based on their observed distributions. We empirically validate STOMP on a range of protein engineering tasks by aligning three autoregressive protein language models on three laboratory datasets of protein fitness. Compared to state-of-the-art baselines, STOMP achieves the highest hypervolumes in eight of nine settings according to both offline off-policy and generative evaluations. We thus demonstrate that STOMP is a powerful, robust multi-objective alignment algorithm that can meaningfully improve post-trained models for multi-attribute protein optimization and beyond.
[AI-57] Exploration and Exploitation Errors Are Measurable for Language Model Agents
【速读】:该论文旨在解决在复杂开放决策任务中,如何在不访问智能体内部策略的情况下,系统性地区分并量化探索(Exploration)与利用(Exploitation)行为的问题。其关键解决方案是设计了一类受实际具身AI场景启发的可控环境,这些环境包含部分可观测的二维网格地图和未知的任务有向无环图(Directed Acyclic Graph, DAG),并通过程序化调整地图生成机制来分别强化探索或利用的难度;同时提出一种策略无关的度量方法,从智能体动作中量化探索误差与利用误差,从而实现对多种前沿语言模型代理在探索与利用能力上的统一评估。
链接: https://arxiv.org/abs/2604.13151
作者: Jaden Park,Jungtaek Kim,Jongwon Jeong,Robert D. Nowak,Kangwook Lee,Yong Jae Lee
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Language Model (LM) agents are increasingly used in complex open-ended decision-making tasks, from AI coding to physical AI. A core requirement in these settings is the ability to both explore the problem space and exploit acquired knowledge effectively. However, systematically distinguishing and quantifying exploration and exploitation from observed actions without access to the agent’s internal policy remains challenging. To address this, we design controllable environments inspired by practical embodied AI scenarios. Each environment consists of a partially observable 2D grid map and an unknown task Directed Acyclic Graph (DAG). The map generation can be programmatically adjusted to emphasize exploration or exploitation difficulty. To enable policy-agnostic evaluation, we design a metric to quantify exploration and exploitation errors from agent’s actions. We evaluate a variety of frontier LM agents and find that even state-of-the-art models struggle on our task, with different models exhibiting distinct failure modes. We further observe that reasoning models solve the task more effectively and show both exploration and exploitation can be significantly improved through minimal harness engineering. We release our code \hrefthis https URLhere.
[AI-58] Spectral Entropy Collapse as an Empirical Signature of Delayed Generalisation in Grokking
【速读】:该论文旨在解决生成式 AI(Generative AI)中“grokking”现象——即模型在完成记忆之后延迟出现泛化能力——缺乏可预测的机制解释这一问题。其关键解决方案是识别出表示协方差的归一化谱熵(normalized spectral entropy, H~(t))作为该过渡过程的标量序参量(order parameter),并通过实验验证其在单层 Transformer 模型上对群论任务的有效性。研究发现,grokking 严格遵循两阶段模式:先发生范数扩张,随后是熵坍缩;且 H~ 在所有实验运行中均在泛化发生前跨越一个稳定阈值 H~∗≈0.61(平均提前 1,020 步),因果干预阻止熵坍缩可显著延迟 grokking(+5,020 步,p=0.044),而范数匹配对照组进一步确认熵而非范数驱动该转变。该机制在阿贝尔群(Z/97Z)和非阿贝尔群(S5)中一致成立,但多层感知机(MLP)仅呈现熵坍缩却不发生 grokking,表明熵坍缩是必要条件但非充分条件,架构特性决定最终是否出现泛化跃迁。
链接: https://arxiv.org/abs/2604.13123
作者: Truong Xuan Khanh,Truong Quynh Hoa,Luu Duc Trung,Phan Thanh Duc
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 18 pages, 10 figs, 7 tables
Abstract:Grokking – delayed generalisation long after memorisation – lacks a predictive mechanistic explanation. We identify the normalised spectral entropy \tildeH(t) of the representation covariance as a scalar order parameter for this transition, validated on 1-layer Transformers on group-theoretic tasks. Five contributions: (i) Grokking follows a two-phase pattern: norm expansion then entropy collapse. (ii) \tildeH crosses a stable threshold \tildeH^* \approx 0.61 before generalisation in 100% of runs (mean lead: 1,020 steps). (iii) A causal intervention preventing collapse delays grokking by +5,020 steps ( p=0.044 ); a norm-matched control ( n=30 , p=5\times10^-5 ) confirms entropy – not norm – drives the transition. (iv) A power-law \Delta T = C_1(\tildeH-\tildeH^*)^\gamma+C_2 ( R^2=0.543 ) predicts grokking onset with 4.1% error. (v) The mechanism holds across abelian ( \mathbbZ/97\mathbbZ ) and non-abelian ( S_5 ) groups. Crucially, MLPs show entropy collapse without grokking, proving collapse is necessary but not sufficient – architecture matters. Code: this https URL
[AI-59] Agent Forge: Execution-Grounded Multi-Agent LLM Framework for Autonomous Software Engineering
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)生成代码时缺乏正确性验证的问题,现有多智能体系统要么模拟执行,要么将验证作为可选步骤,难以确保代码质量。其解决方案的关键在于提出“执行锚定验证”(execution-grounded verification)原则:任何代码变更必须在沙箱环境中通过执行验证后方可传播。作者在AGENTFORGE框架中实现该原则,通过Planner、Coder、Tester、Debugger和Critic五个角色智能体协作,利用共享内存和强制Docker沙箱机制,将软件工程过程形式化为基于仓库状态的迭代决策过程,其中执行反馈提供比传统next-token概率更强的监督信号,从而显著提升代码修复准确率(SWE-BENCH Lite上达40.0%,优于单智能体基线26–28点)。
链接: https://arxiv.org/abs/2604.13120
作者: Rajesh Kumar,Waqar Ali,Junaid Ahmed,Najma Imtiaz Ali,Shaban Usman
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models generate plausible code but cannot verify correctness. Existing multi-agent systems simulate execution or leave verification optional. We introduce execution-grounded verification as a first-class principle: every code change must survive sandboxed execution before propagation. We instantiate this principle in AGENTFORGE, a multi-agent framework where Planner, Coder, Tester, Debugger, and Critic agents coordinate through shared memory and a mandatory Docker sandbox. We formalize software engineering with LLMs as an iterative decision process over repository states, where execution feedback provides a stronger supervision signal than next-token likelihood. AGENTFORGE achieves 40.0% resolution on SWE-BENCH Lite, outperforming single-agent baselines by 26–28 points. Ablations confirm that execution feedback and role decomposition each independently drive performance. The framework is open-source at this https URL.
[AI-60] he Code Whisperer: LLM and Graph-Based AI for Smell and Vulnerability Resolution
【速读】:该论文旨在解决代码异味(code smells)与软件漏洞(software vulnerabilities)在维护过程中因依赖独立工具而缺乏结构上下文、产生噪声警告的问题。其解决方案的关键在于提出一个名为“Code Whisperer”的混合框架,该框架通过联合学习抽象语法树(Abstract Syntax Trees, ASTs)、控制流图(Control Flow Graphs, CFGs)、程序依赖图(Program Dependency Graphs, PDGs)以及令牌级代码嵌入(token-level code embeddings),实现结构信息与语义信息的协同建模,从而在统一工作流中完成检测、解释与修复任务。实验表明,该混合设计显著提升了检测性能,并生成更实用的修复建议,优于仅使用图分析或仅依赖大语言模型的方法。
链接: https://arxiv.org/abs/2604.13114
作者: Mohammad Baqar,Raji Rustamov,Alexander Hughes
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 10 Pages
Abstract:Code smells and software vulnerabilities both increase maintenance cost, yet they are often handled by separate tools that miss structural context and produce noisy warnings. This paper presents The Code Whisperer, a hybrid framework that combines graph-based program analysis with large language models to detect, explain, and repair maintainability and security issues within a unified workflow. The method aligns Abstract Syntax Trees (ASTs), Control Flow Graphs (CFGs), Program Dependency Graphs (PDGs), and token-level code embeddings so that structural and semantic signals can be learned jointly. We evaluate the framework on multi-language datasets and compare it with rule-based analyzers and single-model baselines. The results indicate that the hybrid design improves detection performance and produces more useful repair suggestions than either graph-only or language-model-only approaches. We also examine explainability and CI/CD integration as practical requirements for adopting AI-assisted code review in everyday software engineering workflows.
[AI-61] Applying an Agent ic Coding Tool for Improving Published Algorithm Implementations
【速读】:该论文旨在解决现有算法实现中存在改进空间但缺乏系统性优化方法的问题,即如何借助人工智能(AI)高效、可靠地提升已发表算法的性能。其解决方案的关键在于提出一个两阶段流水线:第一阶段利用具备研究能力的大语言模型(Large Language Model, LLM)自动筛选符合特定实验标准的最新算法;第二阶段由Claude Code基于提示(prompt)复现基线并进行迭代式改进。实验证明该方法在多个研究领域均能实现有效提升,且单次改进可在一天内完成,凸显了AI辅助科研在算法优化中的潜力与可行性。
链接: https://arxiv.org/abs/2604.13109
作者: Worasait Suwannik
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:We present a two-stage pipeline for AI-assisted improvement of published algorithm implementations. In the first stage, a large language model with research capabilities identifies recently published algorithms satisfying explicit experimental criteria. In the second stage, Claude Code is given a prompt to reproduce the reported baseline and then iterate an improvement process. We apply this pipeline to published algorithm implementations spanning multiple research domains. Claude Code reported that all eleven experiments yielded improvements. Each improvement could be achieved within a single working day. We analyse the human contributions that remain indispensable, including selecting the target, verifying experimental validity, assessing novelty and impact, providing computational resources, and writing with appropriate AI-use disclosure. Finally, we discuss implications for peer review and academic publishing.
[AI-62] Formal Architecture Descriptors as Navigation Primitives for AI Coding Agents
【速读】:该论文旨在解决生成式 AI 编码代理(AI coding agents)在代码库中进行无方向探索时消耗大量工具调用的问题,从而降低导航开销。其核心解决方案是为代理提供形式化的架构描述符(architecture descriptor),以提升其定位和访问代码结构的效率与准确性。关键发现包括:形式化描述可使导航步骤减少33%-44%,自动生成功能实现100%的准确率,且基于S-expression格式的描述在错误检测方面表现最优,能够识别所有结构性完整性错误,而其他格式如YAML存在静默数据损坏风险。因此,该研究提出使用S-expression格式的架构描述符,并开源了Forge工具包,以支持高效、可靠的代码库导航。
链接: https://arxiv.org/abs/2604.13108
作者: Ruoqi Jin
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 4 pages, 4 tables, preprint. Code and data: this https URL
Abstract:AI coding agents spend a substantial fraction of their tool calls on undirected codebase exploration. We investigate whether providing agents with formal architecture descriptors can reduce this navigational overhead. We present three complementary studies. First, a controlled experiment (24 code localization tasks x 4 conditions, Claude Sonnet 4.6, temperature=0) demonstrates that architecture context reduces navigation steps by 33-44% (Wilcoxon p=0.009, Cohen’s d=0.92), with no significant format difference detected across S-expression, JSON, YAML, and Markdown. Second, an artifact-vs-process experiment (15 tasks x 3 conditions) demonstrates that an automatically generated descriptor achieves 100% accuracy versus 80% blind (p=0.002, d=1.04), proving direct navigational value independent of developer self-clarification. Third, an observational field study across 7,012 Claude Code sessions shows 52% reduction in agent behavioral variance. A writer-side experiment (96 generation runs, 96 error injections) reveals critical failure mode differences: JSON fails atomically, YAML silently corrupts 50% of errors, S-expressions detect all structural completeness errors. We propose this http URL, an S-expression architecture descriptor, and open-source the Forge toolkit.
[AI-63] Can Coding Agents Be General Agents ?
【速读】:该论文旨在解决编码代理(coding agent)在超出软件工程范畴后,能否有效泛化至端到端企业业务流程自动化的问题。研究发现,当前评估体系存在不足,且编码代理在复杂任务中表现出特定失败模式,表明连接领域逻辑与代码执行是制约其泛化能力的关键瓶颈。解决方案之关键在于识别并优化这一瓶颈环节,以提升编码代理在真实业务场景中的可靠性和适应性。
链接: https://arxiv.org/abs/2604.13107
作者: Maksim Ivanov,Abhijay Rana,Gokul Prabhakaran
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:As coding agents have seen rapid capability and adoption gains, users are applying them to general tasks beyond software engineering. In this post, we investigate whether coding agents can successfully generalize to end-to-end business process automation. We identify gaps in current evaluations, and conduct a case study to evaluate a coding agent on practical business tasks in an open-core Enterprise Resource Planning system. We find that the agent reliably completes simple tasks but exhibits characteristic failures on complex tasks, suggesting that bridging domain logic and code execution is a key bottleneck to generalizability.
[AI-64] CCCE: A Continuous Code Calibration Engine for Autonomous Enterprise Codebase Maintenance via Knowledge Graph Traversal and Adaptive Decision Gating
【速读】:该论文旨在解决企业级软件组织在维护数百个代码库、多种编程语言及数千个相互依赖的包时面临的挑战,这些问题包括代码完整性、安全性与新鲜度的保障,而现有方法如静态分析、软件组成分析(Software Composition Analysis, SCA)和依赖管理工具因孤立运作、覆盖范围有限且高度依赖人工干预,难以有效应对复杂系统的协同维护需求。解决方案的核心是提出连续代码校准引擎(Continuous Code Calibration Engine, CCCE),其关键技术突破在于:(1) 基于动态知识图谱的双向遍历算法,同步实现前向影响传播与后向测试充分性分析;(2) 自适应多阶段门控机制,通过学习得到的风险-置信度评分对校准操作进行四类风险分级,取代静态规则;(3) 多模型持续学习架构,在多个时间尺度上从运行反馈中迭代优化校准策略、风险模型与组织政策,从而实现跨仓库协调校准与人机协同(Human-in-the-Loop, HITL)监督下的自动化修复,生成语义验证的原子补丁并具备渐进式验证与智能回滚能力,确保端到端可追溯性。
链接: https://arxiv.org/abs/2604.13102
作者: Santhosh Kusuma Kumar Parimi
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Enterprise software organizations face an escalating challenge in maintaining the integrity, security, and freshness of codebases that span hundreds of repositories, multiple programming languages, and thousands of interdependent packages. Existing approaches to codebase maintenance – including static analysis, software composition analysis (SCA), and dependency management tools – operate in isolation, address only narrow subsets of maintenance concerns, and require substantial manual intervention to propagate changes across interconnected systems. We present the Continuous Code Calibration Engine (CCCE), an event-driven, AI-agentic system that autonomously maintains enterprise codebases throughout the Software Development Life Cycle (SDLC). The CCCE introduces three key technical innovations: (1) a dynamic knowledge graph with bidirectional traversal algorithms that simultaneously compute forward impact propagation and backward test adequacy analysis; (2) an adaptive multi-stage gating framework that classifies calibration actions into four risk tiers using learned risk-confidence scoring rather than static rules; and (3) a multi-model continuous learning architecture operating at multiple temporal scales to refine calibration strategies, risk models, and organizational policies from operational feedback. We formalize the system’s graph model, traversal algorithms, and decision logic, and demonstrate through three representative enterprise scenarios that the CCCE reduces mean time to remediation by enabling coordinated, cross-repository calibrations with human-in-the-loop (HITL) oversight where appropriate. The system generates atomic, semantically verified patches with progressive validation and intelligent rollback capabilities, providing end-to-end traceability from triggering events through calibration execution and outcome learning.
[AI-65] Building Trust in the Skies: A Knowledge-Grounded LLM -based Framework for Aviation Safety
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在航空安全决策中因事实错误、幻觉(hallucination)及不可验证性等问题导致的可靠性不足问题,这些问题在安全关键环境中可能引发灾难性后果。解决方案的关键在于提出一个端到端的融合框架,将LLMs与知识图谱(Knowledge Graphs, KGs)协同使用:首先利用LLMs从多模态数据源自动构建并动态更新航空安全知识图谱(Aviation Safety Knowledge Graph, ASKG),随后在检索增强生成(Retrieval-Augmented Generation, RAG)架构中利用该知识图谱对LLM生成结果进行 grounding、验证和解释,从而提升安全性分析的准确性与可追溯性,有效缓解幻觉问题,满足航空业对高可靠性的严格要求。
链接: https://arxiv.org/abs/2604.13101
作者: Anirudh Iyengar,Alisa Tiselska,Dumindu Samaraweera,Hong Liu
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Initial version of a conference publication
Abstract:The integration of Large Language Models (LLMs) into aviation safety decision-making represents a significant technological advancement, yet their standalone application poses critical risks due to inherent limitations such as factual inaccuracies, hallucination, and lack of verifiability. These challenges undermine the reliability required for safety-critical environments where errors can have catastrophic consequences. To address these challenges, this paper proposes a novel, end-to-end framework that synergistically combines LLMs and Knowledge Graphs (KGs) to enhance the trustworthiness of safety analytics. The framework introduces a dual-phase pipeline: it first employs LLMs to automate the construction and dynamic updating of an Aviation Safety Knowledge Graph (ASKG) from multimodal sources. It then leverages this curated KG within a Retrieval-Augmented Generation (RAG) architecture to ground, validate, and explain LLM-generated responses. The implemented system demonstrates improved accuracy and traceability over LLM-only approaches, effectively supporting complex querying and mitigating hallucination. Results confirm the framework’s capability to deliver context-aware, verifiable safety insights, addressing the stringent reliability requirements of the aviation industry. Future work will focus on enhancing relationship extraction and integrating hybrid retrieval mechanisms.
[AI-66] Contract-Coding: Towards Repo-Level Generation via Structured Symbolic Paradigm
【速读】:该论文旨在解决意图驱动的软件工程(Intent-Driven Software Engineering,常被称为“Vibe Coding”)中因用户意图模糊导致的上下文保真度权衡问题:模糊意图在复杂代码库级别的生成过程中会压垮线性推理链,进而引发架构崩溃。其解决方案的关键在于提出一种结构化的符号范式——Contract-Coding,通过自主符号锚定(Autonomous Symbolic Grounding)将非结构化意图映射为形式化的语言契约(Language Contract),从而构建单一事实来源(Single Source of Truth, SSOT),实现模块间拓扑独立性,隔离实现细节、降低执行深度,并释放架构并行性(Architectural Parallelism)。实证结果表明,在Greenfield-5基准测试中,相比现有先进代理模型存在多种幻觉问题,Contract-Coding实现了47%的功能成功率且保持近乎完美的结构完整性,标志着向仓库级自主工程迈进的重要一步。
链接: https://arxiv.org/abs/2604.13100
作者: Yi Lin,Lujin Zhao,Yijie Shi
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:The shift toward intent-driven software engineering (often termed “Vibe Coding”) exposes a critical Context-Fidelity Trade-off: vague user intents overwhelm linear reasoning chains, leading to architectural collapse in complex repo-level generation. We propose Contract-Coding, a structured symbolic paradigm that bridges unstructured intent and executable code via Autonomous Symbolic Grounding. By projecting ambiguous intents into a formal Language Contract, our framework serves as a Single Source of Truth (SSOT) that enforces topological independence, effectively isolating inter-module implementation details, decreasing topological execution depth and unlocking Architectural Parallelism. Empirically, while state-of-the-art agents suffer from different hallucinations on the Greenfield-5 benchmark, Contract-Coding achieves 47% functional success while maintaining near-perfect structural integrity. Our work marks a critical step towards repository-scale autonomous engineering: transitioning from strict “specification-following” to robust, intent-driven architecture synthesis. Our code is available at this https URL.
[AI-67] ECM Contracts: Contract-Aware Versioned and Governable Capability Interfaces for Embodied Agents
【速读】:该论文旨在解决 embodied agents(具身智能体)中模块化能力单元(Embodied Capability Modules, ECMs)在实际运行时如何组成稳定软件生态系统的系统性问题,而非仅作为临时拼凑的功能包。其核心挑战在于现有方法缺乏对模块间交互、资源依赖、权限边界及演化兼容性的显式规范,导致组合不安全或升级不可控。解决方案的关键是提出 ECM Contracts —— 一种基于契约的接口模型,通过编码六维关键属性(功能签名、行为假设、资源需求、权限边界、恢复语义和版本兼容性),构建了一个支持静态检查与预部署验证的兼容性框架,从而实现模块安装、组合与升级过程中的类型错误、依赖冲突、策略违规、资源争用和恢复不一致等问题的提前识别与规避,并辅以版本感知的发布纪律(如兼容类划分、弃用规则、迁移约束和策略敏感升级检查),显著提升了模块化具身系统的安全性与可维护性。
链接: https://arxiv.org/abs/2604.13097
作者: Xue Qin,Simin Luan,John See,Cong Yang,Zhijun Li
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 24 pages, 4 figures, 12 tables
Abstract:Embodied agents increasingly rely on modular capabilities that can be installed, upgraded, composed, and governed at runtime. Prior work has introduced embodied capability modules (ECMs) as reusable units of embodied functionality, and recent research has explored their runtime governance and controlled evolution. However, a key systems question remains unresolved: how can ECMs be composed and released as a stable software ecosystem rather than as ad hoc skill bundles? We present ECM Contracts, a contract-based interface model for embodied capability modules. Unlike conventional software interfaces that specify only input and output types, ECM Contracts encode six dimensions essential for embodied execution: functional signature, behavioral assumptions, resource requirements, permission boundaries, recovery semantics, and version compatibility. Based on this model, we introduce a compatibility framework for ECM installation, composition, and upgrade, enabling static and pre-deployment checks for type mismatches, dependency conflicts, policy violations, resource contention, and recovery incompatibilities. We further propose a release discipline for embodied capabilities, including version-aware compatibility classes, deprecation rules, migration constraints, and policy-sensitive upgrade checks. We implement a prototype ECM registry, resolver, and contract checker, and evaluate the approach on modular embodied tasks in a robotics runtime setting. Results show that contract-aware composition substantially reduces unsafe or invalid module combinations, and that contract-guided release checks improve upgrade safety and rollback readiness compared with schema-only or ad hoc baselines. Our findings suggest that stable embodied software ecosystems require more than modular packaging: they require explicit contracts that connect capability composition, governance, and evolution. Comments: 24 pages, 4 figures, 12 tables Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI) ACMclasses: I.2.9; D.2.11; D.2.12 Cite as: arXiv:2604.13097 [cs.SE] (or arXiv:2604.13097v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2604.13097 Focus to learn more arXiv-issued DOI via DataCite Submission history From: Xue Qin [view email] [v1] Fri, 10 Apr 2026 01:52:02 UTC (88 KB)
[AI-68] Design Conditions for Intra-Group Learning of Sequence-Level Rewards: Token Gradient Cancellation
【速读】:该论文旨在解决稀疏终止奖励(sparse termination rewards)场景下,基于强化学习微调推理模型时长期训练中出现的无效更新累积(learning tax)、解概率漂移(solution probability drift)及熵崩溃(entropy collapse)等问题。其核心解决方案的关键在于从token级信用分配(credit assignment)角度提出一个算法设计的必要条件:为防止与奖励无关的漂移,组内目标必须保持token更新之间的梯度可交换性(gradient exchangeability),从而实现对低信用/高频token的梯度抵消(gradient cancellation)。作者进一步指出,两种常见机制会破坏这种可交换性,导致“非抵消”成为结构性常态;为此,他们提出最小化的组内变换,在共享token空间中恢复或近似抵消结构,实验表明该方法能显著稳定训练过程、提升样本效率并增强最终性能。
链接: https://arxiv.org/abs/2604.13088
作者: Fei Ding,Yongkang Zhang,youwei wang,Zijian Zeng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:In sparse termination rewards, intra-group comparisons have become the dominant paradigm for fine-tuning reasoning models via reinforcement learning. However, long-term training often leads to issues like ineffective update accumulation (learning tax), solution probability drift, and entropy collapse. This paper presents a necessary condition for algorithm design from a token-level credit assignment perspective: to prevent reward-irrelevant drift, intra-group objectives must maintain gradient exchangeability across token updates, enabling gradient cancellation on weak-credit/high-frequency tokens. We show that two common mechanisms disrupting exchangeability make “non-cancellation” a structural norm. Based on this, we propose minimal intra-group transformations to restore or approximate the cancellation structure in the shared token space. Experimental results demonstrate that these transformations stabilize training, improve sample efficiency, and enhance final performance, validating the value of this design condition.
[AI-69] Adaptive Memory Crystallization for Autonomous AI Agent Learning in Dynamic Environments
【速读】:该论文旨在解决自主AI代理在动态环境中持续学习时面临的“灾难性遗忘”问题,即在获取新能力的同时避免对已有知识的破坏。解决方案的核心是提出自适应记忆结晶化(Adaptive Memory Crystallization, AMC)机制,其灵感来源于突触标记与捕获(Synaptic Tagging and Capture, STC)理论,但不模拟具体的分子或突触机制。AMC将记忆建模为一个连续的结晶过程,其中经验依据多目标效用信号从可塑状态(Liquid)迁移至稳定状态(Crystal),通过引入三阶段记忆层级(Liquid–Glass–Crystal)和由Itô随机微分方程(SDE)驱动的动力学系统实现这一过程,并利用Fokker–Planck方程描述群体行为,最终获得闭式Beta分布的稳态解。该框架在理论上保证了系统的适定性、全局收敛性和个体状态的指数收敛速率,同时提供了端到端Q-learning误差界及内存容量下界,明确将SDE参数与代理性能关联。
链接: https://arxiv.org/abs/2604.13085
作者: Rajat Khanda,Mohammad Baqar Sambuddha Chakrabarti,Satyasaran Changdar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Autonomous AI agents operating in dynamic environments face a persistent challenge: acquiring new capabilities without erasing prior knowledge. We present Adaptive Memory Crystallization (AMC), a memory architecture for progressive experience consolidation in continual reinforcement learning. AMC is conceptually inspired by the qualitative structure of synaptic tagging and capture (STC) theory, the idea that memories transition through discrete stability phases, but makes no claim to model the underlying molecular or synaptic mechanisms. AMC models memory as a continuous crystallization process in which experiences migrate from plastic to stable states according to a multi-objective utility signal. The framework introduces a three-phase memory hierarchy (Liquid–Glass–Crystal) governed by an Itô stochastic differential equation (SDE) whose population-level behavior is captured by an explicit Fokker–Planck equation admitting a closed-form Beta stationary distribution. We provide proofs of: (i) well-posedness and global convergence of the crystallization SDE to a unique Beta stationary distribution; (ii) exponential convergence of individual crystallization states to their fixed points, with explicit rates and variance bounds; and (iii) end-to-end Q-learning error bounds and matching memory-capacity lower bounds that link SDE parameters directly to agent performance. Empirical evaluation on Meta-World MT50, Atari 20-game sequential learning, and MuJoCo continual locomotion consistently shows improvements in forward transfer (+34–43% over the strongest baseline), reductions in catastrophic forgetting (67–80%), and a 62% decrease in memory footprint. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2604.13085 [cs.LG] (or arXiv:2604.13085v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2604.13085 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Rajat Khanda [view email] [v1] Thu, 2 Apr 2026 22:53:34 UTC (491 KB) Full-text links: Access Paper: View a PDF of the paper titled Adaptive Memory Crystallization for Autonomous AI Agent Learning in Dynamic Environments, by Rajat Khanda and 2 other authorsView PDFHTML (experimental)TeX Source view license Current browse context: cs.LG prev | next new | recent | 2026-04 Change to browse by: cs cs.AI References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked="checked"class=“labs-tab-input”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) IArxiv recommender toggle IArxiv Recommender (What is IArxiv?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status
[AI-70] he Long Delay to Arithmetic Generalization: When Learned Representations Outrun Behavior
【速读】:该论文旨在解决Transformer模型在算法任务中出现的“grokking”现象——即训练集拟合与泛化能力之间存在显著延迟的问题。研究表明,这种延迟并非源于模型未能学习到任务结构,而是由于解码器(decoder)对已习得结构的访问受限,形成瓶颈效应。解决方案的关键在于识别并干预这一瓶颈:通过将训练好的编码器(encoder)移植到新的模型中,可使grokking加速2.75倍;而冻结收敛后的编码器仅重新训练解码器,则能完全消除训练平台期,准确率达到97.6%,显著优于联合训练的86.1%。此外,数值表示方式(如不同进制)作为归纳偏置(inductive bias),决定了局部数字结构的可利用程度,从而极大影响模型的学习效率,例如基底24因与Collatz映射算术一致而达到99.8%准确率,而二进制则因表示坍缩而失败。
链接: https://arxiv.org/abs/2604.13082
作者: Laura Gomezjurado Gonzalez
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 19 pages, 10 fugures
Abstract:Grokking in transformers trained on algorithmic tasks is characterized by a long delay between training-set fit and abrupt generalization, but the source of that delay remains poorly understood. In encoder-decoder arithmetic models, we argue that this delay reflects limited access to already learned structure rather than failure to acquire that structure in the first place. We study one-step Collatz prediction and find that the encoder organizes parity and residue structure within the first few thousand training steps, while output accuracy remains near chance for tens of thousands more. Causal interventions support the decoder bottleneck hypothesis. Transplanting a trained encoder into a fresh model accelerates grokking by 2.75 times, while transplanting a trained decoder actively hurts. Freezing a converged encoder and retraining only the decoder eliminates the plateau entirely and yields 97.6% accuracy, compared to 86.1% for joint training. What makes the decoder’s job harder or easier depends on numeral representation. Across 15 bases, those whose factorization aligns with the Collatz map’s arithmetic (e.g., base 24) reach 99.8% accuracy, while binary fails completely because its representations collapse and never recover. The choice of base acts as an inductive bias that controls how much local digit structure the decoder can exploit, producing large differences in learnability from the same underlying task.
[AI-71] Sparse Goodness: How Selective Measurement Transforms Forward-Forward Learning
【速读】:该论文旨在解决生成式 AI (Generative AI) 中前向-前向(Forward-Forward, FF)算法在训练效率与性能上的局限性,尤其是其默认使用平方和(sum-of-squares, SoS)作为局部良度函数(goodness function)时所导致的表征能力不足问题。解决方案的关键在于系统性地探索良度函数的设计空间,提出两种核心改进:一是引入top-k 良度函数,仅评估最活跃的 k 个神经元,显著提升模型区分能力;二是采用entmax 加权能量,通过可学习的稀疏加权机制替代硬性 top-k 选择,实现自适应稀疏性。此外,结合独立标签特征前向传播(FFCL),将类别假设在每一层注入而非仅在输入层拼接,进一步增强网络表达能力。实验表明,上述方法组合可在 Fashion-MNIST 上实现 87.1% 的准确率,相较 SoS 基线提升 30.7 个百分点,且关键发现是:良度函数中的稀疏性设计是 FF 网络性能提升的最关键因素,其中自适应稀疏性(α ≈ 1.5)优于完全稠密或完全稀疏方案。
链接: https://arxiv.org/abs/2604.13081
作者: Kamer Ali Yuksel,Hassan Sawaf
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注:
Abstract:The Forward-Forward (FF) algorithm is a biologically plausible alternative to backpropagation that trains neural networks layer by layer using a local goodness function to distinguish positive from negative data. Since its introduction, sum-of-squares (SoS) has served as the default goodness function. In this work, we systematically study the design space of goodness functions, investigating both which activations to measure and how to aggregate them. We introduce top-k goodness, which evaluates only the k most active neurons, and show that it substantially outperforms SoS, improving Fashion-MNIST accuracy by 22.6 percentage points. We further introduce entmax-weighted energy, which replaces hard top-k selection with a learnable sparse weighting based on the alpha-entmax transformation, yielding additional gains. Orthogonally, we adopt separate label feature forwarding (FFCL), in which class hypotheses are injected at every layer through a dedicated projection rather than concatenated only at the input. Combining these ideas, we achieve 87.1 percent accuracy on Fashion-MNIST with a 4x2000 architecture, representing a 30.7 percentage point improvement over the SoS baseline while changing only the goodness function and the label pathway. Across controlled experiments covering 11 goodness functions, two architectures, and a sparsity spectrum analysis over both k and alpha, we identify a consistent principle: sparsity in the goodness function is the most important design choice in FF networks. In particular, adaptive sparsity with alpha approximately 1.5 outperforms both fully dense and fully sparse alternatives.
[AI-72] Alignment as Institutional Design: From Behavioral Correction to Transaction Structure in Intelligent Systems
【速读】:该论文旨在解决当前人工智能对齐(AI alignment)方法中依赖外部行为校正(behavioral correction)所带来的可扩展性与稳定性问题,例如强化学习人类反馈(RLHF)等机制需持续监督和调整模型输出,难以适应复杂系统演化。其解决方案的关键在于将对齐问题重构为制度设计(institutional design)问题:通过明确内部交易结构(如模块边界、竞争拓扑与成本反馈回路),使符合人类意图的行为成为各组件的最低成本策略,从而实现对齐作为内生均衡结果而非外在控制目标。此框架强调制度韧性而非完美对齐,主张通过机制设计使偏离行为变得昂贵、可检测且可纠正,为后续Wuxing资源竞争机制提供规范基础。
链接: https://arxiv.org/abs/2604.13079
作者: Rui Chai
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
备注: This is Paper 5 in a 10-paper series on Super-Alignment via Wuxing Institutional Architecture. It shifts alignment from external behavioral correction to internal institutional design, making aligned behavior the lowest-cost equilibrium
Abstract:Current AI alignment paradigms rely on behavioral correction: external supervisors (e.g., RLHF) observe outputs, judge against preferences, and adjust parameters. This paper argues that behavioral correction is structurally analogous to an economy without property rights, where order requires perpetual policing and does not scale. Drawing on institutional economics (Coase, Alchian, Cheung), capability mutual exclusivity, and competitive cost discovery, we propose alignment as institutional design: the designer specifies internal transaction structures (module boundaries, competition topologies, cost-feedback loops) such that aligned behavior emerges as the lowest-cost strategy for each component. We identify three irreducible levels of human intervention (structural, parametric, monitorial) and show that this framework transforms alignment from a behavioral control problem into a political-economy problem. No institution eliminates self-interest or guarantees optimality; the best design makes misalignment costly, detectable, and correctable. We conclude that the proper goal is institutional robustness-a dynamic, self-correcting process under human oversight, not perfection. This work provides the normative foundation for the Wuxing resource-competition mechanisms in companion papers. Keywords: AI alignment, institutional design, transaction costs, property rights, resource competition, behavioral correction, RLHF, cost truthfulness, modular architecture, correctable alignment Comments: This is Paper 5 in a 10-paper series on Super-Alignment via Wuxing Institutional Architecture. It shifts alignment from external behavioral correction to internal institutional design, making aligned behavior the lowest-cost equilibrium Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG) Cite as: arXiv:2604.13079 [cs.CY] (or arXiv:2604.13079v1 [cs.CY] for this version) https://doi.org/10.48550/arXiv.2604.13079 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-73] Hijacking online reviews: sparse manipulation and behavioral buffering in popularity-biased rating systems
【速读】:该论文旨在解决在线评论系统和推荐算法中因用户行为对流行度偏倚(popularity bias)的敏感性所引发的自增强式扭曲问题,即恶意用户如何利用评分动态操纵推荐结果,进而误导其他用户决策。其解决方案的关键在于构建一个最小化的基于代理(agent-based)模型,模拟用户根据当前显示平均评分决定评价对象的行为机制,并通过对比广域攻击与稀疏攻击(sparse attacks)的效果,揭示用户响应异质性(尤其是反叛型用户比例)在缓解攻击损害中的作用:研究发现,适度的用户行为多样性——特别是反叛型用户的存在——能有效抑制低质量内容的虚假崛起,从而部分缓冲推荐系统的失真风险,而非完全恢复高质量内容至顶端。这一机制表明,推荐鲁棒性不仅依赖于攻击检测与预测精度,更受评论密度、流行度反馈效应及用户行为多样性的共同影响。
链接: https://arxiv.org/abs/2604.13049
作者: Itsuki Fujisaki,Kunhao Yang
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
备注: 18page, 3figures
Abstract:Online reviews and recommendation systems help users navigate overwhelming choice, but they are vulnerable to self-reinforcing distortions. This paper examines how a single malicious reviewer can exploit popularity-biased rating dynamics and whether behavioral heterogeneity in user responses can reduce the damage. We develop a minimal agent-based model in which users choose what to rate partly on the basis of currently displayed averages. We compare broad attacks that perturb many items with sparse attacks that selectively boost low-quality items and suppress high-quality items. Additional analyses not shown here indicate that sparse attacks are substantially more harmful than broad attacks because they better exploit popularity-based exposure. The main text then focuses on sparse attacks and asks how their effects change as the fraction of contrarian users increases. Three results stand out. First, attack-induced damage is strongest when prior honest reviews are scarce, revealing a transition from a fragile low-information regime to a more robust high-information regime. Second, sparse attacks are especially effective at artificially promoting low-quality items. Third, moderate contrarian diversity partially buffers these distortions, primarily by suppressing the rise of low-quality items rather than fully restoring high-quality items to the top. The findings suggest that recommendation robustness depends not only on attack detection and predictive accuracy, but also on review density, popularity feedback, and user response heterogeneity.
[AI-74] From Natural Language to PromQL: A Catalog-Driven Framework with Dynamic Temporal Resolution for Cloud-Native Observability
【速读】:该论文旨在解决云原生平台中可观测性数据查询的难题,即平台工程师和站点可靠性团队在使用PromQL等特定领域语言编写正确查询时面临的高门槛问题。解决方案的关键在于提出了一种基于目录驱动的框架,其核心创新包括:(1)融合静态预定义约2000个指标与运行时发现的GPU厂商硬件特有信号的混合指标目录;(2)多阶段查询处理流水线,包含意图分类、类别感知的指标路由及多维语义评分机制;(3)动态时间分辨率机制,可将自然语言中的多样化时间表达式映射为合适的PromQL时间持续期语法。该框架通过预计算的类别索引实现亚秒级指标发现,并在完整路径下约1.1秒内完成查询生成,已在管理AI推理负载的生产Kubernetes集群中部署,支持对集群健康、GPU利用率和模型服务性能等约2000个指标进行自然语言查询。
链接: https://arxiv.org/abs/2604.13048
作者: Twinkll Sisodia
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 15 pages, 7 tables, 1 figure
Abstract:Modern cloud-native platforms expose thousands of time series metrics through systems like Prometheus, yet formulating correct queries in domain-specific languages such as PromQL remains a significant barrier for platform engineers and site reliability teams. We present a catalog-driven framework that translates natural language questions into executable PromQL queries, bridging the gap between human intent and observability data. Our approach introduces three contributions: (1) a hybrid metrics catalog that combines a statically curated base of approximately 2,000 metrics with runtime discovery of hardware-specific signals across GPU vendors, (2) a multi-stage query pipeline with intent classification, category-aware metric routing, and multi-dimensional semantic scoring, and (3) a dynamic temporal resolution mechanism that interprets diverse natural language time expressions and maps them to appropriate PromQL duration syntax. We integrate the framework with the Model Context Protocol (MCP) to enable tool-augmented LLM interactions across multiple providers. The catalog-driven approach achieves sub-second metric discovery through pre-computed category indices, with the full pipeline completing in approximately 1.1 seconds via the catalog path. The system has been deployed on production Kubernetes clusters managing AI inference workloads, where it supports natural language querying across approximately 2,000 metrics spanning cluster health, GPU utilization, and model-serving performance.
[AI-75] Integration of Deep Reinforcement Learning and Agent -based Simulation to Explore Strategies Counteracting Information Disorder
【速读】:该论文旨在解决信息失序(Information Disorders, ID)在社交媒体上的传播问题,特别是虚假新闻的扩散机制及其防控策略。其核心解决方案在于融合数据驱动与模型驱动两类研究范式:一方面构建基于智能体的仿真模型(Agent-Based Model),科学模拟虚假新闻的复杂动态及防控措施的影响;另一方面引入深度强化学习(Deep Reinforcement Learning),自动学习最优的遏制策略以抑制 misinformation 的传播。这一整合方法不仅为识别有效政策条件提供了实证线索,也为社会仿真与人工智能融合研究开辟了新路径。
链接: https://arxiv.org/abs/2604.13047
作者: Luigi Lomasto,Andrea Camoia,Alfonso Guarino,Nicola Lettieri,Delfina Malandrino,Rocco Zaccagnino
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:
Abstract:In recent years, the spread of fake news has triggered a growing interest in Information Disorders (ID) on social media, a phenomenon that has become a focal point of research across fields ranging from complexity theory and computer science to cognitive sciences. Overall, such a body of research can be traced back to two main approaches. On the one hand, there are works focused on exploiting data mining to analyze the content of news and related metadata data-driven approach; on the other hand, works are aiming at making sense of the phenomenon at hand and their evolution using explicit simulation models model-driven approach). In this paper, we integrate these approaches to explore strategies for counteracting IDs. Heading in this direction, we put together: i. an Agent-Based model to simulate in a scientifically sound way both complex fake news dynamics and the effects produced by containment strategies therein; ii. Deep Reinforcement Learning to learn the strategies that can better mitigate the spread of misinformation. The outcomes of our work unfold on different levels. From a substantive point of view, the results of preliminary experiments started providing interesting cues about the conditions under which given policies can mitigate the spread of misinformation. From a technical and methodological point of view, we scratched the surface of promising and worthy research topics like the integration of social simulation and artificial intelligence and the enhancement of social science simulation environments.
[AI-76] A Pythonic Functional Approach for Semantic Data Harmonisation in the ILIAD Project
【速读】:该论文旨在解决环境数据语义标准化(Semantic Data Harmonisation)过程中因依赖复杂技术细节而带来的实践障碍,特别是在ILiad项目中,需将异构环境数据按照海洋信息模型(Ocean Information Model, OIM)进行统一建模,但现有方法如RML和OTTR要求使用者掌握语义网标准(如命名空间、IRI、OWL构造器及本体设计模式)并熟悉专用语法与工具,导致数据科学家难以高效参与。解决方案的关键在于提出一种面向Python的函数式语义数据标准化方法:通过分层设计的Python库实现OIM设计模式的封装——低层级函数直接暴露RDF/OWL语法,中层级函数封装本体设计模式,高层域特定函数协调数据转换任务,从而在不牺牲语义正确性的前提下显著降低使用门槛,并无缝集成于Python生态,提升数据科学家对语义标准化工作的参与度与效率。
链接: https://arxiv.org/abs/2604.13042
作者: Erik Johan Nystad,Francisco Martín-Recuerda
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 17 pages, 9 figures
Abstract:Semantic data harmonisation is a central requirement in the ILIAD project, where heterogeneous environmental data must be harmonised according to the Ocean Information Model (OIM), a modular family of ontologies for enabling the implementation of interoperable Digital Twins of the Ocean. Existing approaches to Semantic Data Harmonisation, such as RML and OTTR, offer valuable abstractions but require extensive knowledge of the technical intricacies of the OIM and the Semantic Web standards, including namespaces, IRIs, OWL constructors, and ontology design patterns. Furthermore, RML and OTTR oblige practitioners to learn specialised syntaxes and dedicated tooling. Data scientists in ILIAD have found these approaches overly cumbersome and have therefore expressed the need for a solution that abstracts away these technical details while remaining seamlessly integrated into their Python-based environments. To address these requirements, we have developed a Pythonic functional approach to semantic data harmonisation that enables users to produce correct RDF through simple function calls. The functions, structured as Python libraries, encode the design patterns of the OIM and are organised across multiple levels of abstraction. Low-level functions directly expose OWL and RDF syntax, mid-level functions encapsulate ontology design patterns, and high-level domain-specific functions orchestrate data harmonisation tasks by invoking mid-level functions. According to feedback from ILIAD data scientists, this approach satisfies their requirements and substantially enhances their ability to participate in harmonisation activities. In this paper, we present the details of our Pythonic functional approach to semantic data harmonisation and demonstrate its applicability within the ILIAD Aquaculture pilot.
[AI-77] ableNet A Large-Scale Table Dataset with LLM -Powered Autonomous AAAI
【速读】:该论文旨在解决当前表格结构识别(Table Structure Recognition, TSR)任务中因数据集规模和质量有限,导致大语言模型(Large Language Models, LLMs)逻辑推理能力难以有效发挥的问题。解决方案的关键在于提出一个基于LLM驱动的自主表格生成与识别多智能体系统:其生成模块通过整合可控的视觉、结构和语义参数,实现多样化且语义一致的表格图像合成,支持大规模、精细化的数据构建;其识别模块则采用基于多样性的主动学习范式,从多源表格中选择最具信息量的数据进行微调,在显著减少训练样本的同时,提升了在真实网页表格上的性能表现,是首个将主动学习应用于行数、列数、合并单元格及内容多样性显著的表格结构识别任务的工作。
链接: https://arxiv.org/abs/2604.13041
作者: Ruilin Zhang,Kai Yang
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
备注: The 40th Annual AAAI Conference on Artificial Intelligence Bridge Program on Logic AI
Abstract:Table Structure Recognition (TSR) requires the logical reasoning ability of large language models (LLMs) to handle complex table layouts, but current datasets are limited in scale and quality, hindering effective use of this reasoning capacity. We thus present TableNet dataset, a new table structure recognition dataset collected and generated through multiple sources. Central to our approach is the first LLM-powered autonomous table generation and recognition multi-agent system that we developed. The generation part of our system integrates controllable visual, structural, and semantic parameters into the synthesis of table images. It facilitates the creation of a wide array of semantically coherent tables, adaptable to user-defined configurations along with annotations, thereby supporting large-scale and detailed dataset construction. This capability enables a comprehensive and nuanced table image annotation taxonomy, potentially advancing research in table-related domains. In contrast to traditional data collection methods, This approach facilitates the theoretically infinite, domain-agnostic, and style-flexible generation of table images, ensuring both efficiency and precision. The recognition part of our system is a diversity-based active learning paradigm that utilizes tables from multiple sources and selectively samples most informative data to finetune a model, achieving a competitive performance on TableNet test set while reducing training samples by a large margin compared with baselines, and a much higher performance on web-crawled real-world tables compared with models trained on predominant table datasets. To the best of our knowledge, this is the first work which employs active learning into the structure recognition of tables which is diverse in numbers of rows or columns, merged cells, cell contents, etc, which fits better for diversity-based active learning.
[AI-78] OVT-MLCS: An Online Visual Tool for MLCS Mining from Long or Big Sequences
【速读】:该论文旨在解决从三个或更多长序列(长度≥1,000)或大数据序列(长度≥10,000)中精确挖掘多个最长公共子序列(Multiple Longest Common Subsequences, MLCS)这一经典NP-hard问题在实际应用中的瓶颈。当前缺乏能够高效处理此类大规模序列的MLCS算法或工具,严重制约了其在各领域的大规模应用。解决方案的关键在于提出一种基于关键点(key point)的新算法KP-MLCS,通过结构化表示所有挖掘出的MLCS并快速揭示其共性模式,同时结合实时图形可视化与序列化技术,开发出在线可视化MLCS挖掘工具OVT-MLCS,实现了对长/大数据序列中MLCS的有效在线挖掘、存储、下载及交互式分析功能。
链接: https://arxiv.org/abs/2604.13037
作者: Zhi Wang,Yanni Li,Tihua Duan,Bing Liu,Liyong Zhang,Hui Li
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
备注:
Abstract:Mining multiple longest common subsequences (\textitMLCS) from a set of sequences of three or more over a finite alphabet \Sigma (a classical NP-hard problem) is an important task in a wide variety of application fields. Unfortunately, there is still no exact \textitMLCS algorithm/tool that can handle long (length \ge 1,000) or big (length \ge 10,000) sequences, which seriously hinders the development and utilization of massive long or big sequences from various application fields today. To address the challenge, we first propose a novel key point-based \textitMLCS algorithm for mining big sequences, called \textitKP-MLCS, and then present a new method, which can compactly represent all mined \textitMLCSs and quickly reveal common patterns among them. Furthermore, by introducing some new techniques, e.g., real-time graphic visualization and serialization, we have developed a new online visual \textitMLCS mining tool, called OVT-MLCS. OVT-MLCS demonstrates that it not only enables effective online mining, storing, and downloading of \textitMLCSs in the form of graphs and text from long or big sequences with a scale of 3 to 5000 but also provides user-friendly interactive functions to facilitate inspection and analysis of the mined \textitMLCSs. We believe that the functions provided by OVT-MLCS will promote stronger and wider applications of \textitMLCS.
[AI-79] Finetuning-Free Diffusion Model with Adaptive Constraint Guidance for Inorganic Crystal Structure Generation
【速读】:该论文旨在解决无机晶体结构设计中难以生成满足特定物理化学性质、且具有热力学稳定性的新材料的问题。当前生成式 AI 模型在产生多样、原创且可实验实现的晶体结构方面仍存在局限,尤其在高风险应用场景下可靠性不足。其解决方案的关键在于提出一种基于扩散模型(diffusion models)的生成机器学习框架,并引入自适应约束引导机制,在生成过程中嵌入用户定义的物理和化学约束,从而提升生成结构的可控性与合理性;同时构建多步验证流程,结合图神经网络(graph neural network, GNN)预测器(达到密度泛函理论 DFT 级精度)与凸包分析(convex hull analysis)以确保候选结构的热力学稳定性,最终实现对多种无机体系中目标几何约束下可行晶体结构的高效生成与验证。
链接: https://arxiv.org/abs/2604.13354
作者: Auguste de Lambilly,Vladimir Baturin,David Portehault,Guillaume Lambard,Nataliya Sokolovska,Florence d’Alché-Buc,Jean-Claude Crivello
机构: 未知
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
备注: Full article including supplementary information, 55 pages, 9 figures
Abstract:The discovery of inorganic crystal structures with targeted properties is a significant challenge in materials science. Generative models, especially state-of-the-art diffusion models, offer the promise of modeling complex data distributions and proposing novel, realistic samples. However, current generative AI models still struggle to produce diverse, original, and reliable structures of experimentally achievable materials suitable for high-stakes applications. In this work, we propose a generative machine learning framework based on diffusion models with adaptive constraint guidance, which enables the incorporation of user-defined physical and chemical constraints during the generation process. This approach is designed to be practical and interpretable for human experts, allowing transparent decision-making and expert-driven exploration. To ensure the robustness and validity of the generated candidates, we introduce a multi-step validation pipeline that combines graph neural network estimators trained to achieve DFT-level accuracy and convex hull analysis for assessing thermodynamic stability. Our approach has been tested and validated on several classical examples of inorganic families of compounds, as case studies. As a consequence, these preliminary results demonstrate our framework’s ability to generate thermodynamically plausible crystal structures that satisfy targeted geometric constraints across diverse inorganic chemical systems. Comments: Full article including supplementary information, 55 pages, 9 figures Subjects: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI) Cite as: arXiv:2604.13354 [cond-mat.mtrl-sci] (or arXiv:2604.13354v1 [cond-mat.mtrl-sci] for this version) https://doi.org/10.48550/arXiv.2604.13354 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-80] Identifiability of Potentially Degenerate Gaussian Mixture Models With Piecewise Affine Mixing AISTATS2026
【速读】:该论文旨在解决因果表示学习(Causal Representation Learning, CRL)中从高维观测数据中识别潜在变量的问题,尤其在潜在变量服从可能退化的高斯混合分布、且仅通过分段仿射混合函数(piecewise affine mixing function)观测到的挑战性场景下。其核心难点在于概率密度函数因退化而未定义,导致传统可识别性分析失效。解决方案的关键在于:首先基于稀疏正则化(sparsity regularization)实现潜在变量在排列和尺度变换下的可识别性;进而提出一种两阶段估计方法,在学习过程中同时强制约束表示的稀疏性和高斯性,从而有效恢复真实的潜在变量。实验结果表明该方法在合成数据和图像数据上均能准确重构潜在结构。
链接: https://arxiv.org/abs/2604.13218
作者: Danru Xu,Sébastien Lachapelle,Sara Magliacane
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistics Theory (math.ST)
备注: 49 pages, 10 figures, AISTATS 2026
Abstract:Causal representation learning (CRL) aims to identify the underlying latent variables from high-dimensional observations, even when variables are dependent with each other. We study this problem for latent variables that follow a potentially degenerate Gaussian mixture distribution and that are only observed through the transformation via a piecewise affine mixing function. We provide a series of progressively stronger identifiability results for this challenging setting in which the probability density functions are ill-defined because of the potential degeneracy. For identifiability up to permutation and scaling, we leverage a sparsity regularization on the learned representation. Based on our theoretical results, we propose a two-stage method to estimate the latent variables by enforcing sparsity and Gaussianity in the learned representations. Experiments on synthetic and image data highlight our method’s effectiveness in recovering the ground-truth latent variables.
机器学习
[LG-0] Complex Interpolation of Matrices with an application to Multi-Manifold Learning
链接: https://arxiv.org/abs/2604.14118
作者: Adi Arbel,Stefan Steinerberger,Ronen Talmon
类目: Machine Learning (cs.LG); Spectral Theory (math.SP)
*备注:
Abstract:Given two symmetric positive-definite matrices A, B \in \mathbbR^n \times n , we study the spectral properties of the interpolation A^1-x B^x for 0 \leq x \leq 1 . The presence of `common structures’ in A and B , eigenvectors pointing in a similar direction, can be investigated using this interpolation perspective. Generically, exact log-linearity of the operator norm |A^1-x B^x| is equivalent to the existence of a shared eigenvector in the original matrices; stability bounds show that approximate log-linearity forces principal singular vectors to align with leading eigenvectors of both matrices. These results give rise to and provide theoretical justification for a multi-manifold learning framework that identifies common and distinct latent structures in multiview data.
[LG-1] Momentum Further Constrains Sharpness at the Edge of Stochastic Stability
链接: https://arxiv.org/abs/2604.14108
作者: Arseniy Andreyev,Advikar Ananthkumar,Marc Walden,Tomaso Poggio,Pierfrancesco Beneventano
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 40 pages, 38 figures
Abstract:Recent work suggests that (stochastic) gradient descent self-organizes near an instability boundary, shaping both optimization and the solutions found. Momentum and mini-batch gradients are widely used in practical deep learning optimization, but it remains unclear whether they operate in a comparable regime of instability. We demonstrate that SGD with momentum exhibits an Edge of Stochastic Stability (EoSS)-like regime with batch-size-dependent behavior that cannot be explained by a single momentum-adjusted stability threshold. Batch Sharpness (the expected directional mini-batch curvature) stabilizes in two distinct regimes: at small batch sizes it converges to a lower plateau 2(1-\beta)/\eta , reflecting amplification of stochastic fluctuations by momentum and favoring flatter regions than vanilla SGD; at large batch sizes it converges to a higher plateau 2(1+\beta)/\eta , where momentum recovers its classical stabilizing effect and favors sharper regions consistent with full-batch dynamics. We further show that this aligns with linear stability thresholds and discuss the implications for hyperparameter tuning and coupling.
[LG-2] Neural architectures for resolving references in program code
链接: https://arxiv.org/abs/2604.14073
作者: Gergő Szalay,Gergely Zsolt Kovács,Sándor Teleki,Balázs Pintér,Tibor Gregorics
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:
Abstract:Resolving and rewriting references is fundamental in programming languages. Motivated by a real-world decompilation task, we abstract reference rewriting into the problems of direct and indirect indexing by permutation. We create synthetic benchmarks for these tasks and show that well-known sequence-to-sequence machine learning architectures are struggling on these benchmarks. We introduce new sequence-to-sequence architectures for both problems. Our measurements show that our architectures outperform the baselines in both robustness and scalability: our models can handle examples that are ten times longer compared to the best baseline. We measure the impact of our architecture in the real-world task of decompiling switch statements, which has an indexing subtask. According to our measurements, the extended model decreases the error rate by 42%. Multiple ablation studies show that all components of our architectures are essential.
[LG-3] A Complete Symmetry Classification of Shallow ReLU Networks
链接: https://arxiv.org/abs/2604.14037
作者: Pranavkrishnan Ramakrishnan
类目: Machine Learning (cs.LG); Algebraic Geometry (math.AG); Combinatorics (math.CO)
*备注:
Abstract:Parameter space is not function space for neural network architectures. This fact, investigated as early as the 1990s under terms such as reverse engineering," or parameter identifiability", has led to the natural question of parameter space symmetries\textemdash the study of distinct parameters in neural architectures which realize the same function. Indeed, the quotient space obtained by identifying parameters giving rise to the same function, called the \textitneuromanifold, has been shown in some cases to have rich geometric properties, impacting optimization dynamics. Thus far, techniques towards complete classifications have required the analyticity of the activation function, notably excising the important case of ReLU. Here, in contrast, we exploit the non-differentiability of the ReLU activation to provide a complete classification of the symmetries in the shallow case.
[LG-4] Physics-Informed Neural Networks for Methane Sorption: Cross-Gas Transfer Learning Ensemble Collapse Under Physics Constraints and Monte Carlo Dropout Uncertainty Quantification
链接: https://arxiv.org/abs/2604.13992
作者: Mohammad Nooraiepour,Zezhang Song,Wei Li,Sarah Perez
类目: Machine Learning (cs.LG)
*备注:
Abstract:Accurate methane sorption prediction across heterogeneous coal ranks requires models that combine thermodynamic consistency, efficient knowledge transfer across data-scarce geological systems, and calibrated uncertainty estimates, capabilities that are rarely addressed together in existing frameworks. We present a physics-informed transfer learning framework that adapts a hydrogen sorption PINN to methane sorption prediction via Elastic Weight Consolidation, coal-specific feature engineering, and a three-phase curriculum that progressively balances transfer preservation with thermodynamic fine-tuning. Trained on 993 equilibrium measurements from 114 independent coal experiments spanning lignite to anthracite, the framework achieves R2 = 0.932 on held-out coal samples, a 227% improvement over pressure-only classical isotherms, while hydrogen pre-training delivers 18.9% lower RMSE and 19.4% faster convergence than random initialization. Five Bayesian uncertainty quantification approaches reveal a systematic divergence in performance across physics-constrained architectures. Monte Carlo Dropout achieves well-calibrated uncertainty at minimal overhead, while deep ensembles, regardless of architectural diversity or initialization strategy, exhibit performance degradation because shared physics constraints narrow the admissible solution manifold. SHAP and ALE analyses confirm that learned representations remain physically interpretable and aligned with established coal sorption mechanisms: moisture-volatile interactions are most influential, pressure-temperature coupling captures thermodynamic co-dependence, and features exhibit non-monotonic effects. These results identify Monte Carlo Dropout as the best-performing UQ method in this physics-constrained transfer learning framework, and demonstrate cross-gas transfer learning as a data-efficient strategy for geological material modeling.
[LG-5] Unsupervised domain transfer: Overcoming signal degradation in sleep monitoring by increasing scoring realism
链接: https://arxiv.org/abs/2604.13988
作者: Mohammad Ahangarkiasari,Andreas Tind Damgaard,Casper Haurum,Kaare B. Mikkelsen
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
Abstract:Objective: Investigate whether hypnogram ‘realism’ can be used to guide an unsupervised method for handling arbitrary types of signal degradation in mobile sleep monitoring. Approach: Combining a pretrained, state-of-the-art ‘u-sleep’ model with a ‘discriminator’ network, we align features from a target domain with a feature space learned during pretraining. To test the approach, we distort the source domain with realistic signal degradations, to see how well the method can adapt to different types of degradation. We compare the performance of the resulting model with best-case models designed in a supervised manner for each type of transfer. Main Results: Depending on the type of distortion, we find that the unsupervised approach can increase Cohen’s kappa with as little as 0.03 and up to 0.29, and that for all transfers, the method does not decrease performance. However, the approach never quite reaches the estimated theoretical optimal performance, and when tested on a real-life domain mismatch between two sleep studies, the benefit was insignificant. Significance: ‘Discriminator-guided fine tuning’ is an interesting approach to handling signal degradation for ‘in the wild’ sleep monitoring, with some promise. In particular, what it says about sleep data in general is interesting. However, more development will be necessary before using it ‘in production’. Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA) Cite as: arXiv:2604.13988 [cs.LG] (or arXiv:2604.13988v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2604.13988 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Kaare Mikkelsen [view email] [v1] Wed, 15 Apr 2026 15:35:27 UTC (2,058 KB)
[LG-6] PRiMeFlow: Capturing Complex Expression Heterogeneity in Perturbation Response Modelling
链接: https://arxiv.org/abs/2604.13986
作者: Zichao Yan,Yan Wu,Mica Xu Ji,Chaitra Agrahar,Esther Wershof,Marcel Nassar,Mehrshad Sadria,Ridvan Eksi,Vladimir Trifonov,Ignacio Ibarra,Telmo Felgueira,Błażej Osiński,Rory Stark
类目: Machine Learning (cs.LG)
*备注:
Abstract:Predicting the effects of perturbations in-silico on cell state can identify drivers of cell behavior at scale and accelerate drug discovery. However, modeling challenges remain due to the inherent heterogeneity of single cell gene expression and the complex, latent gene dependencies. Here, we present PRiMeFlow, an end-to-end flow matching based approach to directly model the effects of genetic and small molecule perturbations in the gene expression space. The distribution-fitting approach taken by PRiMeFlow enables it to accurately approximate the empirical distribution of single-cell gene expression, which we demonstrate through extensive benchmarking inside PerturBench. Through ablation studies, we also validate important model design choices such as operating in gene expression space and parameterizing the velocity field with a U-Net architecture. The PRiMeFlow architecture was used as the basis for the model that won the Generalist Prize in the first ARC Virtual Cell Challenge.
[LG-7] BOAT: Navigating the Sea of In Silico Predictors for Antibody Design via Multi-Objective Bayesian Optimization AISTATS
链接: https://arxiv.org/abs/2604.13980
作者: Jackie Rao,Ferran Gonzalez Hernandez,Leon Gerard,Alexandra Gessner
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM); Machine Learning (stat.ML)
*备注: Proceedings of the 29th International Conference on Artificial Intelligence and Statistics (AISTATS) 2026
Abstract:Antibody lead optimization is inherently a multi-objective challenge in drug discovery. Achieving a balance between different drug-like properties is crucial for the development of viable candidates, and this search becomes exponentially challenging as desired properties grow. The ever-growing zoo of sophisticated in silico tools for predicting antibody properties calls for an efficient joint optimization procedure to overcome resource-intensive sequential filtering pipelines. We present BOAT, a versatile Bayesian optimization framework for multi-property antibody engineering. Our `plug-and-play’ framework couples uncertainty-aware surrogate modeling with a genetic algorithm to jointly optimize various predicted antibody traits while enabling efficient exploration of sequence space. Through systematic benchmarking against genetic algorithms and newer generative learning approaches, we demonstrate competitive performance with state-of-the-art methods for multi-objective protein optimization. We identify clear regimes where surrogate-driven optimization outperforms expensive generative approaches and establish practical limits imposed by sequence dimensionality and oracle costs.
[LG-8] Provably Efficient Offline-to-Online Value Adaptation with General Function Approximation
链接: https://arxiv.org/abs/2604.13966
作者: Shangzhe Li,Weitong Zhang
类目: Machine Learning (cs.LG)
*备注: 44 pages, 2 tables
Abstract:We study value adaptation in offline-to-online reinforcement learning under general function approximation. Starting from an imperfect offline pretrained Q -function, the learner aims to adapt it to the target environment using only a limited amount of online interaction. We first characterize the difficulty of this setting by establishing a minimax lower bound, showing that even when the pretrained Q -function is close to optimal Q^\star , online adaptation can be no more efficient than pure online RL on certain hard instances. On the positive side, under a novel structural condition on the offline-pretrained value functions, we propose O2O-LSVI, an adaptation algorithm with problem-dependent sample complexity that provably improves over pure online RL. Finally, we complement our theory with neural-network experiments that demonstrate the practical effectiveness of the proposed method.
[LG-9] Quantum Machine Learning for Colorectal Cancer Data: Anastomotic Leak Classification and Risk Factors
链接: https://arxiv.org/abs/2604.13951
作者: Vojtěch Novák,Ivan Zelinka,Lenka Přibylová,Lubomír Martínek,Vladimír Benčurík,Martin Beseda
类目: Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注:
Abstract:This study evaluates colorectal risk factors and compares classical models against Quantum Neural Networks (QNNs) for anastomotic leak prediction. Analyzing clinical data with 14% leak prevalence, we tested ZZFeatureMap encodings with RealAmplitudes and EfficientSU2 ansatze under simulated noise. F_\beta -optimized quantum configurations yielded significantly higher sensitivity (83.3%) than classical baselines (66.7%). This demonstrates that quantum feature spaces better prioritize minority class identification, which is critical for low-prevalence clinical risk prediction. Our work explores various optimizers under noisy conditions, highlighting key trade-offs and future directions for hardware deployment.
[LG-10] Unsupervised Anomaly Detection in Process-Complex Industrial Time Series: A Real-World Case Study
链接: https://arxiv.org/abs/2604.13928
作者: Sergej Krasnikov,Lukas Meitz,Samineh Bagheri,Michael Heider,Thorsten Schöler,Jörg Hähner
类目: Machine Learning (cs.LG)
*备注:
Abstract:Industrial time-series data from real production environments exhibits substantially higher complexity than commonly used benchmark datasets, primarily due to heterogeneous, multi-stage operational processes. As a result, anomaly detection methods validated under simplified conditions often fail to generalize to industrial settings. This work presents an empirical study on a unique dataset collected from fully operational industrial machinery, explicitly capturing pronounced process-induced variability. We evaluate which model classes are capable of capturing this complexity, starting with a classical Isolation Forest baseline and extending to multiple autoencoder architectures. Experimental results show that Isolation Forest is insufficient for modeling the non-periodic, multi-scale dynamics present in the data, whereas autoencoders consistently perform better. Among them, temporal convolutional autoencoders achieve the most robust performance, while recurrent and variational variants require more careful tuning. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2604.13928 [cs.LG] (or arXiv:2604.13928v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2604.13928 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-11] DiPO: Disentangled Perplexity Policy Optimization for Fine-grained Exploration-Exploitation Trade-Off
链接: https://arxiv.org/abs/2604.13902
作者: Xiaofan Li,Ming Yang,Zhiyuan Ma,Shichao Ma,Jintao Du,Yu Cheng,Weiqiang Wang,Zhizhong Zhang,Xin Tan,Yanyun Qu,Lizhuang Ma,Yuan Xie
类目: Machine Learning (cs.LG)
*备注: LLM Reinforce Learning
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has catalyzed significant advances in the reasoning capabilities of Large Language Models (LLMs). However, effectively managing the exploration and exploitation trade-off remains a critical challenge. In this paper, we fully analyze the exploration and exploitation dilemma of extremely hard and easy samples during the training and propose a new fine-grained trade-off mechanism. Concretely, we introduce a perplexity space disentangling strategy that divides the sample space into distinct exploration (high perplexity) and exploitation (low perplexity) subspaces, thereby mining fine-grained samples requiring exploration-exploitation trade-off. Subsequently, we propose a bidirectional reward allocation mechanism with a minimum impact on verification rewards to implement perplexity-guided exploration and exploitation, enabling more stable policy optimization. Finally, we have evaluated our method on two mainstream tasks: mathematical reasoning and function calling, and experimental results demonstrate the superiority of the proposed method, confirming its effectiveness in enhancing LLM performance by fine-grained exploration-exploitation trade-off.
[LG-12] MolCryst-MLIPs: A Machine-Learned Interatomic Potentials Database for Molecular Crystals
链接: https://arxiv.org/abs/2604.13897
作者: Adam Lahouari,Shen Ai,Jihye Han,Jillian Hoffstadt,Philipp Hoellmer,Charlotte Infante,Pulkita Jain,Sangram Kadam,Maya M. Martirossyan,Amara McCune,Hypatia Newton,Shlok J. Paul,Willmor Pena,Jonathan Raghoonanan,Sumon Sahu,Oliver Tan,Andrea Vergara,Jutta Rogal,Mark E. Tuckerman
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:
Abstract:We present an open Molecular Crystal (MC) database of Machine-Learned Interatomic Potentials (MLIP) called MolCryst-MLIPs. The first release comprises fine-tuned MACE models for nine molecular crystal systems – Benzamide, Benzoic acid, Coumarin, Durene, Isonicotinamide, Niacinamide, Nicotinamide, Pyrazinamide, and Resorcinol – developed using the Automated Machine Learning Pipeline (AMLP), which streamlines the entire MLIP development workflow, from reference data generation to model training and validation, into a reproducible and user-friendly pipeline. Models are fine-tuned from the MACE-MH-1 foundation model (omol head), yielding a mean energy MAE of 0.141 kJ/mol/atom and a mean force MAE of 0.648 kJ/mol/Angstrom across all systems. Dynamical stability and structural integrity, as assessed through energy conservation, P2 orientational order parameters, and radial distribution functions, are evaluated using molecular dynamics simulations. The released models and datasets constitute a growing open database of validated MLIPs, ready for production MD simulations of molecular crystal polymorphism under different thermodynamic conditions.
[LG-13] Drowsiness-Aware Adaptive Autonomous Braking System based on Deep Reinforcement Learning for Enhanced Road Safety
链接: https://arxiv.org/abs/2604.13878
作者: Hossem Eddine Hafidi,Elisabetta De Giovanni,Teodoro Montanaro,Ilaria Sergi,Massimo De Vittorio,Luigi Patrono
类目: Machine Learning (cs.LG)
*备注: This manuscript is 10 pages long and includes 12 figures and 3 tables. The figures provide detailed visualizations of the proposed system architecture, ECG-based drowsiness detection pipeline, Double-Dueling DQN framework, and experimental evaluation results in the CARLA simulation environment
Abstract:Driver drowsiness significantly impairs the ability to accurately judge safe braking distances and is estimated to contribute to 10%-20% of road accidents in Europe. Traditional driver-assistance systems lack adaptability to real-time physiological states such as drowsiness. This paper proposes a deep reinforcement learning-based autonomous braking system that integrates vehicle dynamics with driver physiological data. Drowsiness is detected from ECG signals using a Recurrent Neural Network (RNN), selected through an extensive benchmark analysis of 2-minute windows with varying segmentation and overlap configurations. The inferred drowsiness state is incorporated into the observable state space of a Double-Dueling Deep Q-Network (DQN) agent, where driver impairment is modeled as an action delay. The system is implemented and evaluated in a high-fidelity CARLA simulation environment. Experimental results show that the proposed agent achieves a 99.99% success rate in avoiding collisions under both drowsy and non-drowsy conditions. These findings demonstrate the effectiveness of physiology-aware control strategies for enhancing adaptive and intelligent driving safety systems.
[LG-14] Hardware-Efficient Neuro-Symbolic Networks with the Exp-Minus-Log Operator
链接: https://arxiv.org/abs/2604.13871
作者: Eymen Ipek
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
Abstract:Deep neural networks (DNNs) deliver state-of-the-art accuracy on regression and classification tasks, yet two structural deficits persistently obstruct their deployment in safety-critical, resource-constrained settings: (i) opacity of the learned function, which precludes formal verification, and (ii) reliance on heterogeneous, library-bound activation functions that inflate latency and silicon area on edge hardware. The recently introduced Exp-Minus-Log (EML) Sheffer operator, eml(x, y) = exp(x) - ln(y), was shown by Odrzywolek (2026) to be sufficient - together with the constant 1 - to express every standard elementary function as a binary tree of identical nodes. We propose to embed EML primitives inside conventional DNN architectures, yielding a hybrid DNN-EML model in which the trunk learns distributed representations and the head is a depth-bounded, weight-sparse EML tree whose snapped weights collapse to closed-form symbolic sub-expressions. We derive the forward equations, prove computational-cost bounds, analyse inference and training acceleration relative to multilayer perceptrons (MLPs) and physics-informed neural networks (PINNs), and quantify the trade-offs for FPGA/analog deployment. We argue that the DNN-EML pairing closes a literature gap: prior neuro-symbolic and equation-learner approaches (EQL, KAN, AI-Feynman) work with heterogeneous primitive sets and do not exploit a single hardware-realisable Sheffer element. A balanced assessment shows that EML is unlikely to accelerate training, and on commodity CPU/GPU it is also unlikely to accelerate inference; however, on a custom EML cell (FPGA logic block or analog circuit) the asymptotic latency advantage can reach an order of magnitude with simultaneous gain in interpretability and formal-verification tractability.
[LG-15] Simulation-Based Optimisation of Batting Order and Bowling Plans in T20 Cricket
链接: https://arxiv.org/abs/2604.13861
作者: Tinniam V Ganesh
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注: Submitted to the Journal of Quantitative Analysis in Sports (JQAS), April 2026. 23 pages, 8 figures
Abstract:This paper develops a unified Markov Decision Process (MDP) framework for optimising two recurring in-match decisions in T20 cricket namely batting order selection and bowling plan assignment, directly in terms of win and defend probability rather than expected runs. A three-phase player profile engine (Powerplay, Middle, Death) with James-Stein shrinkage is estimated from 1,161 IPL ball-by-ball records (2008-2025). Win/defend probabilities are evaluated by vectorised Monte Carlo simulation over N = 50,000 innings trajectories. Batting orders are searched by exhaustive enumeration. Bowling plans are computed by simulated annealing over the remaining quota with the constraint that the same bowler cannot bowl consecutive overs. Applied to two 2026 IPL matches, the optimal batting order improves Mumbai Indians’ win probability by 4.1 percentage points (52.4% to 56.5%), and the optimal Gujarat Titans bowling plan improves defend probability by 5.2 percentage points (39.1% to 44.3%). In both cases the observed sub-optimality is consistent with phase-agnostic deployment in decisions that appear reasonable by aggregate metrics but are exposed as costly when phase-specific profiles are applied.
[LG-16] Randomized Neural Networks for Integro-Differential Equations with Application to Neutron Transport
链接: https://arxiv.org/abs/2604.13830
作者: Haoning Dang,Fei Wang,Yifan Chen,Zhouyu Liu,Dong Liu,Hongchun Wu
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注:
Abstract:Integro-differential equations arise in a wide range of applications, including transport, kinetic theory, radiative transfer, and multiphysics modeling, where nonlocal integral operators couple the solution across phase space. Such nonlocality often introduces dense coupling blocks in deterministic discretizations, leading to increased computational cost and memory usage, while physics-informed neural networks may suffer from expensive nonconvex training and sensitivity to hyperparameter choices. In this work, we present randomized neural networks (RaNNs) as a mesh-free collocation framework for linear integro-differential equations. Because the RaNN approximation is intrinsically dense through globally supported random features, the nonlocal integral operator does not introduce an additional loss of sparsity, while the approximate solution can still be represented with relatively few trainable degrees of freedom. By randomly fixing the hidden-layer parameters and solving only for the linear output weights, the training procedure reduces to a convex least-squares problem in the output coefficients, enabling stable and efficient optimization. As a representative application, we apply the proposed framework to the steady neutron transport equation, a high-dimensional linear integro-differential model featuring scattering integrals and diverse boundary conditions. Extensive numerical experiments demonstrate that, in the reported test settings, the RaNN approach achieves competitive accuracy while incurring substantially lower training cost than the selected neural and deterministic baselines, highlighting RaNNs as a robust and efficient alternative for the numerical simulation of nonlocal linear operators.
[LG-17] Beyond State Consistency: Behavior Consistency in Text-Based World Models
链接: https://arxiv.org/abs/2604.13824
作者: Youling Huang,Guanqiao Chen,Junchi Yao,Lu Wang,Fangkai Yang,Chao Du,ChenZhuo Zhao,Pu Zhao,Qingwei Lin,Saravan Rajmohan,Dongmei Zhang
类目: Machine Learning (cs.LG)
*备注: 20 pages, 2 figures
Abstract:World models have been emerging as critical components for assessing the consequences of actions generated by interactive agents in online planning and offline evaluation. In text-based environments, world models are typically evaluated and trained with single-step metrics such as Exact Match, aiming to improve the similarity between predicted and real-world states, but such metrics have been shown to be insufficient for capturing actual agent behavior. To address this issue, we introduce a new behavior-aligned training paradigm aimed at improving the functional consistency between the world model and the real environment. This paradigm focuses on optimizing a tractable step-level metric named Behavior Consistency Reward (BehR), which measures how much the likelihood of a logged next action changes between the real state and the world-model-predicted state under a frozen Reference Agent. Experiments on WebShop and TextWorld show that BehR-based training improves long-term alignment in several settings, with the clearest gains in WebShop and less movement in near-ceiling regimes, while preserving or improving single-step prediction quality in three of four settings. World models trained with BehR also achieve lower false positives in offline surrogate evaluation and show modest but encouraging gains in inference-time lookahead planning.
[LG-18] UI-Copilot: Advancing Long-Horizon GUI Automation via Tool-Integrated Policy Optimization
链接: https://arxiv.org/abs/2604.13822
作者: Zhengxi Lu,Fei Tang,Guangyi Liu,Kaitao Song,Xu Tan,Jin Ma,Wenqi Zhang,Weiming Lu,Jun Xiao,Yueting Zhuang,Yongliang Shen
类目: Machine Learning (cs.LG)
*备注:
Abstract:MLLM-based GUI agents have demonstrated strong capabilities in complex user interface interaction tasks. However, long-horizon scenarios remain challenging, as these agents are burdened with tasks beyond their intrinsic capabilities, suffering from memory degradation, progress confusion, and math hallucination. To address these challenges, we present UI-Copilot, a collaborative framework where the GUI agent focuses on task execution while a lightweight copilot provides on-demand assistance for memory retrieval and numerical computation. We introduce memory decoupling to separate persistent observations from transient execution context, and train the policy agent to selectively invoke the copilot as Retriever or Calculator based on task demands. To enable effective tool invocation learning, we propose Tool-Integrated Policy Optimization (TIPO), which separately optimizes tool selection through single-turn prediction and task execution through on-policy multi-turn rollouts. Experimental results show that UI-Copilot-7B achieves state-of-the-art performance on challenging MemGUI-Bench, outperforming strong 7B-scale GUI agents such as GUI-Owl-7B and UI-TARS-1.5-7B. Moreover, UI-Copilot-7B delivers a 17.1% absolute improvement on AndroidWorld over the base Qwen model, highlighting UI-Copilot’s strong generalization to real-world GUI tasks.
[LG-19] RPS: Information Elicitation with Reinforcement Prompt Selection
链接: https://arxiv.org/abs/2604.13817
作者: Tao Wang,Jingyao Lu,Xibo Wang,Haonan Huang,Su Yao,Zhiqiang Hu,Xingyan Chen,Enmao Diao
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large language models (LLMs) have shown remarkable capabilities in dialogue generation and reasoning, yet their effectiveness in eliciting user-known but concealed information in open-ended conversations remains limited. In many interactive AI applications, such as personal assistants, tutoring systems, and legal or clinical support, users often withhold sensitive or uncertain information due to privacy concerns, ambiguity, or social hesitation. This makes it challenging for LLMs to gather complete and contextually relevant inputs. In this work, we define the problem of information elicitation in open-ended dialogue settings and propose Reinforcement Prompt Selection (RPS), a lightweight reinforcement learning framework that formulates prompt selection as a sequential decision-making problem. To analyze this problem in a controlled setting, we design a synthetic experiment, where a reinforcement learning agent outperforms a random query baseline, illustrating the potential of policy-based approaches for adaptive information elicitation. Building on this insight, RPS learns a policy over a pool of prompts to adaptively elicit concealed or incompletely expressed information from users through dialogue. We also introduce IELegal, a new benchmark dataset constructed from real legal case documents, which simulates dialogue-based information elicitation tasks aimed at uncovering case-relevant facts. In this setting, RPS outperforms static prompt baselines, demonstrating the effectiveness of adaptive prompt selection for eliciting critical information in LLM-driven dialogue systems.
[LG-20] Composite Silhouette: A Subsampling-based Aggregation Strategy
链接: https://arxiv.org/abs/2604.13816
作者: Aggelos Semoglou,Aristidis Likas,John Pavlopoulos
类目: Machine Learning (cs.LG)
*备注: 32 pages including Appendix
Abstract:Determining the number of clusters is a central challenge in unsupervised learning, where ground-truth labels are unavailable. The Silhouette coefficient is a widely used internal validation metric for this task, yet its standard micro-averaged form tends to favor larger clusters under size imbalance. Macro-averaging mitigates this bias by weighting clusters equally, but may overemphasize noise from under-represented groups. We introduce Composite Silhouette, an internal criterion for cluster-count selection that aggregates evidence across repeated subsampled clusterings rather than relying on a single partition. For each subsample, micro- and macro-averaged Silhouette scores are combined through an adaptive convex weight determined by their normalized discrepancy and smoothed by a bounded nonlinearity; the final score is then obtained by averaging these subsample-level composites. We establish key properties of the criterion and derive finite-sample concentration guarantees for its subsampling estimate. Experiments on synthetic and real-world datasets show that Composite Silhouette effectively reconciles the strengths of micro- and macro-averaging, yielding more accurate recovery of the ground-truth number of clusters.
[LG-21] Robust Ultra Low-Bit Post-Training Quantization via Stable Diagonal Curvature Estimate
链接: https://arxiv.org/abs/2604.13806
作者: Jaemin Kim,Sungkyun Kim,Junyeol Lee,Jiwon Seo
类目: Machine Learning (cs.LG)
*备注: EUROMLSYS 2026
Abstract:Large Language Models (LLMs) are widely used across many domains, but their scale makes deployment challenging. Post-Training Quantization (PTQ) reduces memory footprint without retraining by leveraging a small calibration set. Recent Hessian-based PTQ methods compensate quantization error via cross-channel dependencies, but such approaches degrade at low bit-widths due to noisy curvature estimates from limited calibration data. We propose DASH-Q, a robust PTQ framework using diagonal Hessian approximation and iterative weighted least squares. By discarding noise-prone dependencies, DASH-Q filters sampling noise while prioritizing the preservation of salient feature power. We outperform other PTQ baselines in ultra low-bit regime, improving zero-shot accuracy by 7.01% on average and up to 14.01% over the strongest baselines across five baseline LLM models, while showing robust and stable performance with very small calibration data.
[LG-22] Character Beyond Speech: Leverag ing Role-Playing Evaluation in Audio Large Language Models via Reinforcement Learning
链接: https://arxiv.org/abs/2604.13804
作者: Dongjie Fu,Fangming Feng,Xize Cheng,Linjun Li,Zhou Zhao,Tao Jin
类目: Machine Learning (cs.LG)
*备注:
Abstract:The rapid evolution of multimodal large models has revolutionized the simulation of diverse characters in speech dialogue systems, enabling a novel interactive paradigm. Character attributes are manifested not only in textual responses but also through vocal features, as speech conveys rich paralinguistic information that is challenging to quantify. This poses significant difficulties in evaluating the character alignment of role-playing agents. To address these challenges, we present RoleJudge, an evaluation framework that leverages audio large language models to systematically assess the alignment between speech and character across multiple modalities and dimensions. Furthermore, we introduce RoleChat, the first voice role-playing evaluation dataset enriched with chain-of-thought reasoning annotations, comprising a diverse set of authentic and LLM-generated speech samples. Utilizing this dataset, we implement a multi-stage training paradigm and incorporate Standard Alignment in reinforcement learning to mitigate reward misalignment during optimization. Experimental results in terms of accuracy and subjective assessment demonstrate that RoleJudge outperforms various baseline models, validating the effectiveness of our multidimensional evaluation framework.
[LG-23] Online learning with noisy side observations AISTATS
链接: https://arxiv.org/abs/2604.13740
作者: Tomáš Kocák,Gergely Neu,Michal Valko
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Published at International Conference on Artificial Intelligence and Statistics (AISTATS) 2016. 13 pages, 7 figures
Abstract:We propose a new partial-observability model for online learning problems where the learner, besides its own loss, also observes some noisy feedback about the other actions, depending on the underlying structure of the problem. We represent this structure by a weighted directed graph, where the edge weights are related to the quality of the feedback shared by the connected nodes. Our main contribution is an efficient algorithm that guarantees a regret of \widetildeO(\sqrt\alpha^* T) after T rounds, where \alpha^* is a novel graph property that we call the effective independence number. Our algorithm is completely parameter-free and does not require knowledge (or even estimation) of \alpha^* . For the special case of binary edge weights, our setting reduces to the partial-observability models of Mannor and Shamir (2011) and Alon et al. (2013) and our algorithm recovers the near-optimal regret bounds.
[LG-24] Spectral Thompson sampling AAAI
链接: https://arxiv.org/abs/2604.13739
作者: Tomas Kocak,Michal Valko,Remi Munos,Shipra Agrawal
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Published at AAAI Conference on Artificial Intelligence (AAAI) 2014
Abstract:Thompson Sampling (TS) has attracted a lot of interest due to its good empirical performance, in particular in the computational advertising. Though successful, the tools for its performance analysis appeared only recently. In this paper, we describe and analyze SpectralTS algorithm for a bandit problem, where the payoffs of the choices are smooth given an underlying graph. In this setting, each choice is a node of a graph and the expected payoffs of the neighboring nodes are assumed to be similar. Although the setting has application both in recommender systems and advertising, the traditional algorithms would scale poorly with the number of choices. For that purpose we consider an effective dimension d, which is small in real-world graphs. We deliver the analysis showing that the regret of SpectralTS scales as dsqrt(T ln N) with high probability, where T is the time horizon and N is the number of choices. Since a dsqrt(T ln N) regret is comparable to the known results, SpectralTS offers a computationally more efficient alternative. We also show that our algorithm is competitive on both synthetic and real-world data.
[LG-25] Physics-Informed Neural Networks for Solving Derivative-Constrained PDEs
链接: https://arxiv.org/abs/2604.13723
作者: Kentaro Hoshisashi,Carolyn E Phelan,Paolo Barucca
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: Phys. Rev. E - Accepted 14 April, 2026
Abstract:Physics-Informed Neural Networks (PINNs) recast PDE solving as an optimisation problem in function space by minimising a residual-based objective, yet many applications require additional derivative-based relations that are just as fundamental as the governing equations. In this paper, we present Derivative-Constrained PINNs (DC-PINNs), a general framework that treats constrained PDE solving as an optimisation guided by a minimum objective function criterion where the physics resides in the minimum principle. DC-PINNs embed general nonlinear constraints on states and derivatives, e.g., bounds, monotonicity, convexity, incompressibility, computed efficiently via automatic differentiation, and they employ self-adaptive loss balancing to tune the influence of each objective, reducing reliance on manual hyperparameters and problem-specific architectures. DC-PINNs consistently reduce constraint violations and improve physical fidelity versus baseline PINN variants, representative hard-constraint formulations on benchmarks, including heat diffusion with bounds, financial volatilities with arbitrage-free, and fluid flow with vortices shed. Explicitly encoding derivative constraints stabilises training and steers optimisation toward physically admissible minima even when the PDE residual alone is small, providing reliable solutions of constrained PDEs grounded in energy minimum principles.
[LG-26] Optimization with SpotOptim
链接: https://arxiv.org/abs/2604.13672
作者: Thomas Bartz-Beielstein
类目: Machine Learning (cs.LG)
*备注:
Abstract:The spotoptim package implements surrogate-model-based optimization of expensive black-box functions in Python. Building on two decades of Sequential Parameter Optimization (SPO) methodology, it provides a Kriging-based optimization loop with Expected Improvement, support for continuous, integer, and categorical variables, noise-aware evaluation via Optimal Computing Budget Allocation (OCBA), and multi-objective extensions. A steady-state parallelization strategy overlaps surrogate search with objective evaluation on multi-core hardware, and a success-rate-based restart mechanism detects stagnation while preserving the best solution found. The package returns scipy-compatible OptimizeResult objects and accepts any scikit-learn-compatible surrogate model. Built-in TensorBoard logging provides real-time monitoring of convergence and surrogate quality. This report describes the architecture and module structure of spotoptim, provides worked examples including neural network hyperparameter tuning, and compares the framework with BoTorch, Optuna, Ray Tune, BOHB, SMAC, and Hyperopt. The package is open-source.
[LG-27] A Bayesian Framework for Uncertainty-Aware Explanations in Power Quality Disturbance Classification
链接: https://arxiv.org/abs/2604.13658
作者: Yinsong Chen,Samson S. Yu,Kashem M. Muttaqi
类目: Machine Learning (cs.LG)
*备注:
Abstract:Advanced deep learning methods have shown remarkable success in power quality disturbance (PQD) classification. To enhance model transparency, explainable AI (XAI) techniques have been developed to provide instance-specific interpretations of classifier decisions. However, conventional XAI methods yield deterministic explanations, overlooking uncertainty and limiting reliability in safety-critical applications. This paper proposes a Bayesian explanation framework that models explanation uncertainty by generating a relevance attribution distribution for each instance. This method allows experts to select explanations based on confidence percentiles, thereby tailoring interpretability according to specific disturbance types. Extensive experiments on synthetic and real-world power quality datasets demonstrate that the proposed framework improves the transparency and reliability of PQD classifiers through uncertainty-aware explanations.
[LG-28] Self-Organizing Maps with Optimized Latent Positions IJCNN2026
链接: https://arxiv.org/abs/2604.13622
作者: Seiki Ubukata,Akira Notsu,Katsuhiro Honda
类目: Machine Learning (cs.LG)
*备注: 8 pages, 4 figures. Accepted for publication in the 2026 International Joint Conference on Neural Networks (IJCNN 2026), part of the 2026 IEEE World Congress on Computational Intelligence (WCCI 2026). This version is the author’s accepted manuscript
Abstract:Self-Organizing Maps (SOM) are a classical method for unsupervised learning, vector quantization, and topographic mapping of high-dimensional data. However, existing SOM formulations often involve a trade-off between computational efficiency and a clearly defined optimization objective. Objective-based variants such as Soft Topographic Vector Quantization (STVQ) provide a principled formulation, but their neighborhood-coupled computations become expensive as the number of latent nodes increases. In this paper, we propose Self-Organizing Maps with Optimized Latent Positions (SOM-OLP), an objective-based topographic mapping method that introduces a continuous latent position for each data point. Starting from the neighborhood distortion of STVQ, we construct a separable surrogate local cost based on its local quadratic structure and formulate an entropy-regularized objective based on it. This yields a simple block coordinate descent scheme with closed-form updates for assignment probabilities, latent positions, and reference vectors, while guaranteeing monotonic non-increase of the objective and retaining linear per-iteration complexity in the numbers of data points and latent nodes. Experiments on a synthetic saddle manifold, scalability studies on the Digits and MNIST datasets, and 16 benchmark datasets show that SOM-OLP achieves competitive neighborhood preservation and quantization performance, favorable scalability for large numbers of latent nodes and large datasets, and the best average rank among the compared methods on the benchmark datasets.
[LG-29] Reward Hacking in the Era of Large Models: Mechanisms Emergent Misalignment Challenges
链接: https://arxiv.org/abs/2604.13602
作者: Xiaohua Wang,Muzhao Tian,Yuqi Zeng,Zisu Huang,Jiakang Yuan,Bowen Chen,Jingwen Xu,Mingbo Zhou,Wenhao Liu,Muling Wu,Zhengkang Guo,Qi Qian,Yifei Wang,Feiran Zhang,Ruicheng Yin,Shihan Dou,Changze Lv,Tao Chen,Kaitao Song,Xu Tan,Tao Gui,Xiaoqing Zheng,Xuanjing Huang
类目: Machine Learning (cs.LG)
*备注: 42 pages, 5 figures, 2 tables
Abstract:Reinforcement Learning from Human Feedback (RLHF) and related alignment paradigms have become central to steering large language models (LLMs) and multimodal large language models (MLLMs) toward human-preferred behaviors. However, these approaches introduce a systemic vulnerability: reward hacking, where models exploit imperfections in learned reward signals to maximize proxy objectives without fulfilling true task intent. As models scale and optimization intensifies, such exploitation manifests as verbosity bias, sycophancy, hallucinated justification, benchmark overfitting, and, in multimodal settings, perception–reasoning decoupling and evaluator manipulation. Recent evidence further suggests that seemingly benign shortcut behaviors can generalize into broader forms of misalignment, including deception and strategic gaming of oversight mechanisms. In this survey, we propose the Proxy Compression Hypothesis (PCH) as a unifying framework for understanding reward hacking. We formalize reward hacking as an emergent consequence of optimizing expressive policies against compressed reward representations of high-dimensional human objectives. Under this view, reward hacking arises from the interaction of objective compression, optimization amplification, and evaluator–policy co-adaptation. This perspective unifies empirical phenomena across RLHF, RLAIF, and RLVR regimes, and explains how local shortcut learning can generalize into broader forms of misalignment, including deception and strategic manipulation of oversight mechanisms. We further organize detection and mitigation strategies according to how they intervene on compression, amplification, or co-adaptation dynamics. By framing reward hacking as a structural instability of proxy-based alignment under scale, we highlight open challenges in scalable oversight, multimodal grounding, and agentic autonomy.
[LG-30] Enhancing Reinforcement Learning for Radiology Report Generation with Evidence-aware Rewards and Self-correcting Preference Learning ACL2026
链接: https://arxiv.org/abs/2604.13598
作者: Qin Zhou,Guoyan Liang,Qianyi Yang,Jingyuan Chen,Sai Wu,Chang Yao,Zhe Wang
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注: 13 pages,4 figures, ACL2026-main
Abstract:Recent reinforcement learning (RL) approaches have advanced radiology report generation (RRG), yet two core limitations persist: (1) report-level rewards offer limited evidence-grounded guidance for clinical faithfulness; and (2) current methods lack an explicit self-improving mechanism to align with clinical preference. We introduce clinically aligned Evidence-aware Self-Correcting Reinforcement Learning (ESC-RL), comprising two key components. First, a Group-wise Evidence-aware Alignment Reward (GEAR) delivers group-wise, evidence-aware feedback. GEAR reinforces consistent grounding for true positives, recovers missed findings for false negatives, and suppresses unsupported content for false positives. Second, a Self-correcting Preference Learning (SPL) strategy automatically constructs a reliable, disease-aware preference dataset from multiple noisy observations and leverages an LLM to synthesize refined reports without human supervision. ESC-RL promotes clinically faithful, disease-aligned reward and supports continual self-improvement during training. Extensive experiments on two public chest X-ray datasets demonstrate consistent gains and state-of-the-art performance.
[LG-31] Parameter-efficient Quantum Multi-task Learning
链接: https://arxiv.org/abs/2604.13560
作者: Hevish Cowlessur,Chandra Thapa,Tansu Alpcan,Seyit Camtepe
类目: Machine Learning (cs.LG); Emerging Technologies (cs.ET); Quantum Physics (quant-ph)
*备注:
Abstract:Multi-task learning (MTL) improves generalization and data efficiency by jointly learning related tasks through shared representations. In the widely used hard-parameter-sharing setting, a shared backbone is combined with task-specific prediction heads. However, task-specific parameters can grow rapidly with the number of tasks. Therefore, designing multi-task heads that preserve task specialization while improving parameter efficiency remains a key challenge. In Quantum Machine Learning (QML), variational quantum circuits (VQCs) provide a compact mechanism for mapping classical data to quantum states residing in high-dimensional Hilbert spaces, enabling expressive representations within constrained parameter budgets. We propose a parameter-efficient quantum multi-task learning (QMTL) framework that replaces conventional task-specific linear heads with a fully quantum prediction head in a hybrid architecture. The model consists of a VQC with a shared, task-independent quantum encoding stage, followed by lightweight task-specific ansatz blocks enabling localized task adaptation while maintaining compact parameterization. Under a controlled and capacity-matched formulation where the shared representation dimension grows with the number of tasks, our parameter-scaling analysis demonstrates that a standard classical head exhibits quadratic growth, whereas the proposed quantum head parameter cost scales linearly. We evaluate QMTL on three multi-task benchmarks spanning natural language processing, medical imaging, and multimodal sarcasm detection, where we achieve performance comparable to, and in some cases exceeding, classical hard-parameter-sharing baselines while consistently outperforming existing hybrid quantum MTL models with substantially fewer head parameters. We further demonstrate QMTL’s executability on noisy simulators and real quantum hardware, illustrating its feasibility.
[LG-32] Learning Inference Concurrency in DynamicGate MLP Structural and Mathematical Justification
链接: https://arxiv.org/abs/2604.13546
作者: Yongil Choi
类目: Machine Learning (cs.LG)
*备注: 20 pages, 6 figures
Abstract:Conventional neural networks strictly separate learning and inference because if parameters are updated during inference, outputs become unstable and even the inference function itself is not well defined [1, 2, 3]. This paper shows that DynamicGate MLP structurally permits learning inference concurrency [4, 5]. The key idea is to separate routing (gating) parameters from representation (prediction) parameters, so that the gate can be adapted online while inference stability is preserved, or weights can be selectively updated only within the inactive subspace [4, 5, 6, 7]. We mathematically formalize sufficient conditions for concurrency and show that even under asynchronous or partial updates, the inference output at each time step can always be interpreted as a forward computation of a valid model snapshot [8, 9, 10]. This suggests that DynamicGate MLP can serve as a practical foundation for online adaptive and on device learning systems [11, 12].
[LG-33] Cross-Layer Co-Optimized LSTM Accelerator for Real-Time Gait Analysis
链接: https://arxiv.org/abs/2604.13543
作者: Mohammad Hasan Ahmadilivani,Levent Aksoy,Mohammad Eslami,Jaan Raik,Alar Kuusik
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注: 9 pages, 6 figues, 9 tables, accepted at IEEE ISQED’26
Abstract:Long Short-Term Memory (LSTM) neural networks have penetrated healthcare applications where real-time requirements and edge computing capabilities are essential. Gait analysis that detects abnormal steps to prevent patients from falling is a prominent problem for such applications. Given the extremely stringent design requirements in performance, power dissipation, and area, an Application-Specific Integrated Circuit (ASIC) enables an efficient real-time exploitation of LSTMs for gait analysis, achieving high accuracy. To the best of our knowledge, this work presents the first cross-layer co-optimized LSTM accelerator for real-time gait analysis, targeting an ASIC design. We conduct a comprehensive design space exploration from software down to layout design. We carry out a bit-width optimization at the software level with hardware-aware quantization to reduce the hardware complexity, explore various designs at the register-transfer level, and generate alternative layouts to find efficient realizations of the LSTM accelerator in terms of hardware complexity and accuracy. The physical synthesis results show that, using the 65 nm technology, the die size of the accelerator’s layout optimized for the highest accuracy is 0.325 mm^2, while the alternative design optimized for hardware complexity with a slightly lower accuracy occupies 15.4% smaller area. Moreover, the designed accelerators achieve accurate gait abnormality detection 4.05x faster than the given application requirement.
[LG-34] LEGO-MOF: Equivariant Latent Manipulation for Editable Generative and Optimizable MOF Design
链接: https://arxiv.org/abs/2604.13520
作者: Chaoran Zhang,Guangyao Li,Dongxu Ji
类目: Machine Learning (cs.LG)
*备注: 36 pages including Supplementary Information, 10 figures in the main text and 12 figures/tables in the Supplementary Information
Abstract:Metal-organic frameworks (MOFs) are highly promising for carbon capture, yet navigating their vast design space remains challenging. Recent deep generative models enable de novo MOF design but primarily act as feed-forward structure generators. By heavily relying on predefined building block libraries and non-differentiable post-optimization, they fundamentally sever the information flow required for continuous structural editing. Here, we propose a target-driven generative framework focused on continuous structural manipulation. At its core is LinkerVAE, which maps discrete 3D chemical graphs into a continuous, SE(3)-equivariant latent space. This smooth manifold unlocks geometry-aware manipulations, including implicit chemical style transfer and zero-shot isoreticular expansion. Building upon this, we introduce a test-time optimization (TTO) strategy, utilizing an accurate surrogate model to continuously optimize the latent graphs of existing MOFs toward desired properties. This approach systematically enhances carbon capture performance, achieving a striking average relative boost of 147.5% in pure CO2 uptake while strictly preserving structural validity. Integrated with a latent diffusion model and rigid-body assembly for full MOF construction, our framework establishes a scalable, fully differentiable pathway for both the automated discovery, targeted optimization and editing of functional materials.
[LG-35] Computational framework for multistep metabolic pathway design
链接: https://arxiv.org/abs/2604.13471
作者: Peter Zhiping Zhang,Jeffrey D. Varner
类目: Machine Learning (cs.LG)
*备注:
Abstract:In silico tools are important for generating novel hypotheses and exploring alternatives in de novo metabolic pathway design. However, while many computational frameworks have been proposed for retrobiosynthesis, few successful examples of algorithm-guided xenobiotic biochemical retrosynthesis have been reported in the literature. Deep learning has improved the quality of synthesis and retrosynthesis in organic chemistry applications. Inspired by this progress, we explored combining deep learning of biochemical transformations with the traditional retrobiosynthetic workflow to improve in silico synthetic metabolic pathway designs. To develop our computational biosynthetic pathway design framework, we assembled metabolic reaction and enzymatic template data from public databases. A data augmentation procedure, adapted from literature, was carried out to enrich the assembled reaction dataset with artificial metabolic reactions generated by enzymatic reaction templates. Two neural network-based pathway ranking models were trained as binary classifiers to distinguish assembled reactions from artificial counterparts; each model output a scalar quantifying the plausibility of a 1-step or 2-step pathway. Combining these two models with enzymatic templates, we built a multistep retrobiosynthesis pipeline and validated it by reproducing some natural and non-natural pathways computationally.
[LG-36] Universality of Gaussian-Mixture Reverse Kernels in Conditional Diffusion
链接: https://arxiv.org/abs/2604.13470
作者: Nafiz Ishtiaque,Syed Arefinul Haque,Kazi Ashraful Alam,Fatima Jahara
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 10+19 pages
Abstract:We prove that conditional diffusion models whose reverse kernels are finite Gaussian mixtures with ReLU-network logits can approximate suitably regular target distributions arbitrarily well in context-averaged conditional KL divergence, up to an irreducible terminal mismatch that typically vanishes with increasing diffusion horizon. A path-space decomposition reduces the output error to this mismatch plus per-step reverse-kernel errors; assuming each reverse kernel factors through a finite-dimensional feature map, each step becomes a static conditional density approximation problem, solved by composing Norets’ Gaussian-mixture theory with quantitative ReLU bounds. Under exact terminal matching the resulting neural reverse-kernel class is dense in conditional KL.
[LG-37] Adaptive Unknown Fault Detection and Few-Shot Continual Learning for Condition Monitoring in Ultrasonic Metal Welding
链接: https://arxiv.org/abs/2604.13465
作者: Ahmadreza Eslaminia,Kuan-Chieh Lu,Klara Nahrstedt,Chenhui Shao
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 20 pages, 10 figures
Abstract:Ultrasonic metal welding (UMW) is widely used in industrial applications but is sensitive to tool wear, surface contamination, and material variability, which can lead to unexpected process faults and unsatisfactory weld quality. Conventional monitoring systems typically rely on supervised learning models that assume all fault types are known in advance, limiting their ability to handle previously unseen process faults. To address this challenge, this paper proposes an adaptive condition monitoring approach that enables unknown fault detection and few-shot continual learning for UMW. Unknown faults are detected by analyzing hidden-layer representations of a multilayer perceptron and leveraging a statistical thresholding strategy. Once detected, the samples from unknown fault types are incorporated into the existing model through a continual learning procedure that selectively updates only the final layers of the network, which enables the model to recognize new fault types while preserving knowledge of existing classes. To accelerate the labeling process, cosine similarity transformation combined with a clustering algorithm groups similar unknown samples, thereby reducing manual labeling effort. Experimental results using a multi-sensor UMW dataset demonstrate that the proposed method achieves 96% accuracy in detecting unseen fault conditions while maintaining reliable classification of known classes. After incorporating a new fault type using only five labeled samples, the updated model achieves 98% testing classification accuracy. These results demonstrate that the proposed approach enables adaptive monitoring with minimal retraining cost and time. The proposed approach provides a scalable solution for continual learning in condition monitoring where new process conditions may constantly emerge over time and is extensible to other manufacturing processes.
[LG-38] FAST: A Synergistic Framework of Attention and State-space Models for Spatiotemporal Traffic Prediction ICME2026
链接: https://arxiv.org/abs/2604.13453
作者: Xinjin Li,Jinghan Cao,Mengyue Wang,Yue Wu,Longxiang Yan,Yeyang Zhou,Ziqi Sha,Yu Ma
类目: Machine Learning (cs.LG)
*备注: Accepted by ICME 2026
Abstract:Traffic forecasting requires modeling complex temporal dynamics and long-range spatial dependencies over large sensor networks. Existing methods typically face a trade-off between expressiveness and efficiency: Transformer-based models capture global dependencies well but suffer from quadratic complexity, while recent selective state-space models are computationally efficient yet less effective at modeling spatial interactions in graph-structured traffic data. We propose FAST, a unified framework that combines attention and state-space modeling for scalable spatiotemporal traffic forecasting. FAST adopts a Temporal-Spatial-Temporal architecture, where temporal attention modules capture both short- and long-term temporal patterns, and a Mamba-based spatial module models long-range inter-sensor dependencies with linear complexity. To better represent heterogeneous traffic contexts, FAST further introduces a learnable multi-source spatiotemporal embedding that integrates historical traffic flow, temporal context, and node-level information, together with a multi-level skip prediction mechanism for hierarchical feature fusion. Experiments on PeMS04, PeMS07, and PeMS08 show that FAST consistently outperforms strong baselines from Transformer-, GNN-, attention-, and Mamba-based families. In particular, FAST achieves the best MAE and RMSE on all three benchmarks, with up to 4.3% lower RMSE and 2.8% lower MAE than the strongest baseline, demonstrating a favorable balance between accuracy, scalability, and generalization.
[LG-39] WIN-U: Woodbury-Informed Newton-Unlearning as a retain-free Machine Unlearning Framework
链接: https://arxiv.org/abs/2604.13438
作者: Xingjian Zhao,Mohammad Mohammadi Amiri,Malik Magdon-Ismail
类目: Machine Learning (cs.LG)
*备注: 21 pages, 3 figures, under review at COLM2026
Abstract:Privacy concerns in LLMs have led to the rapidly growing need to enforce a data’s “right to be forgotten”. Machine unlearning addresses precisely this task, namely the removal of the influence of some specific data, i.e., the forget set, from a trained model. The gold standard for unlearning is to produce the model that would have been learned on only the rest of the training data, i.e., the retain set. Most existing unlearning methods rely on direct access to the retained data, which may not be practical due to privacy or cost constraints. We propose WIN-U, a retained-data free unlearning framework that requires only second order information for the originally trained model on the full data. The unlearning is performed using a single Newton-style step. Using the Woodbury matrix identity and a generalized Gauss-Newton approximation for the forget set curvature, the WIN-U update recovers the closed-form linear solution and serves as a local second-order approximation to the gold-standard retraining optimum. Extensive experiments on various vision and language benchmarks demonstrate that WIN-U achieves SOTA performance in terms of unlearning efficacy and utility preservation, while being more robust against relearning attacks compared to existing methods. Importantly, WIN-U does not require access to the retained data.
[LG-40] Dataset-Level Metrics Attenuate Non-Determinism: A Fine-Grained Non-Determinism Evaluation in Diffusion Language Models
链接: https://arxiv.org/abs/2604.13413
作者: Zhengyu Fang,Zhimeng Jiang,Huiyuan Chen,Xiaoge Zhang,Tianyi Li,Kaiyu Tang,Xiao Li,Jing Li
类目: Machine Learning (cs.LG)
*备注:
Abstract:Diffusion language models (DLMs) have emerged as a promising paradigm for large language models (LLMs), yet the non-deterministic behavior of DLMs remains poorly understood. The existing non-determinism evaluations for LLMs predominantly rely on dataset-level metrics under fixed inference configurations, providing limited insight into how model behavior varies across runs and evaluation conditions. In this work, we show that dataset-level metrics systematically attenuate non-determinism in diffusion language models by aggregating sample-level prediction quality across different runs. As a result, configurations with similar aggregate performance can exhibit substantially different behaviors on individual inputs, leaving fine-grained instability and distinct error patterns uncharacterized. To address this limitation, we conduct a fine-grained evaluation of non-determinism based on sample-level prediction differences across a range of model-related factors-including guidance scale, diffusion steps, and Monte Carlo sampling-as well as system-related factors such as batch size, hardware, and numerical precision. Our analysis reveals that non-determinism in DLMs is pervasive and structured, with code generation exhibiting markedly higher sensitivity to factor-level choices than question answering. To attribute sources of non-determinism evaluation, we introduce Factor Variance Attribution (FVA), a cross-factor analysis metric that decomposes observed non-determinism into variance attributable to different evaluation factor settings. Our findings highlight the need for fine-grained, factor-aware evaluation to enable reliable non-determinism assessment of diffusion language models.
[LG-41] Linear Probe Accuracy Scales with Model Size and Benefits from Multi-Layer Ensembling
链接: https://arxiv.org/abs/2604.13386
作者: Erik Nordby,Tasha Pais,Aviel Parrack
类目: Machine Learning (cs.LG)
*备注:
Abstract:Linear probes can detect when language models produce outputs they “know” are wrong, a capability relevant to both deception and reward hacking. However, single-layer probes are fragile: the best layer varies across models and tasks, and probes fail entirely on some deception types. We show that combining probes from multiple layers into an ensemble recovers strong performance even where single-layer probes fail, improving AUROC by +29% on Insider Trading and +78% on Harm-Pressure Knowledge. Across 12 models (0.5B–176B parameters), we find probe accuracy improves with scale: ~5% AUROC per 10x parameters (R=0.81). Geometrically, deception directions rotate gradually across layers rather than appearing at one location, explaining both why single-layer probes are brittle and why multi-layer ensembles succeed.
[LG-42] Diffusion Sequence Models for Generative In-Context Meta-Learning of Robot Dynamics
链接: https://arxiv.org/abs/2604.13366
作者: Angelo Moroncelli,Matteo Rufolo,Gunes Cagin Aydin,Asad Ali Shahid,Loris Roveda
类目: Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)
*备注: Angelo Moroncelli, Matteo Rufolo and Gunes Cagin Aydin contributed equally to this work
Abstract:Accurate modeling of robot dynamics is essential for model-based control, yet remains challenging under distributional shifts and real-time constraints. In this work, we formulate system identification as an in-context meta-learning problem and compare deterministic and generative sequence models for forward dynamics prediction. We take a Transformer-based meta-model, as a strong deterministic baseline, and introduce to this setting two complementary diffusion-based approaches: (i) inpainting diffusion (Diffuser), which learns the joint input-observation distribution, and (ii) conditioned diffusion models (CNN and Transformer), which generate future observations conditioned on control inputs. Through large-scale randomized simulations, we analyze performance across in-distribution and out-of-distribution regimes, as well as computational trade-offs relevant for control. We show that diffusion models significantly improve robustness under distribution shift, with inpainting diffusion achieving the best performance in our experiments. Finally, we demonstrate that warm-started sampling enables diffusion models to operate within real-time constraints, making them viable for control applications. These results highlight generative meta-models as a promising direction for robust system identification in robotics.
[LG-43] BioTrain: Sub-MB Sub-50mW On-Device Fine-Tuning for Edge-AI on Biosignals
链接: https://arxiv.org/abs/2604.13359
作者: Run Wang,Victor J. B. Jung,Philip Wiese,Sebastian Frey,Giusy Spacone,Francesco Conti,Alessio Burrello,Luca Benin
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR); Signal Processing (eess.SP)
*备注:
Abstract:Biosignals exhibit substantial cross-subject and cross-session variability, inducing severe domain shifts that degrade post-deployment performance for small, edge-oriented AI models. On-device adaptation is therefore essential to both preserve user privacy and ensure system reliability. However, existing sub-100 mW MCU-based wearable platforms can only support shallow or sparse adaptation schemes due to the prohibitive memory footprint and computational cost of full backpropagation (BP). In this paper, we propose BioTrain, a framework enabling full-network fine-tuning of state-of-the-art biosignal models under milliwatt-scale power and sub-megabyte memory constraints. We validate BioTrain using both offline and on-device benchmarks on EEG and EOG datasets, covering Day-1 new-subject calibration and longitudinal adaptation to signal drift. Experimental results show that full-network fine-tuning achieves accuracy improvements of up to 35% over non-adapted baselines and outperforms last-layer updates by approximately 7% during new-subject calibration. On the GAP9 MCU platform, BioTrain enables efficient on-device training throughput of 17 samples/s for EEG and 85 samples/s for EOG models within a power envelope below 50 mW. In addition, BioTrain’s efficient memory allocator and network topology optimization enable the use of a large batch size, reducing peak memory usage. For fully on-chip BP on GAP9, BioTrain reduces the memory footprint by 8.1x, from 5.4 MB to 0.67 MB, compared to conventional full-network fine-tuning using batch normalization with batch size 8.
[LG-44] When Less Latent Leads to Better Relay: Information-Preserving Compression for Latent Multi-Agent LLM Collaboration
链接: https://arxiv.org/abs/2604.13349
作者: Yiping Li,Zhiyu An,Wan Du
类目: Machine Learning (cs.LG)
*备注:
Abstract:Communication in Large Language Model (LLM)-based multi-agent systems is moving beyond discrete tokens to preserve richer context. Recent work such as LatentMAS enables agents to exchange latent messages through full key-value (KV) caches. However, full KV relay incurs high memory and communication cost. We adapt eviction-style KV compression to this setting and introduce Orthogonal Backfill (OBF) to mitigate information loss from hard eviction. OBF injects a low-rank orthogonal residual from discarded KV states into the retained KV states. We evaluate proposed method against full KV relay on nine standard benchmarks spanning mathematical reasoning, coding, and knowledge-intensive QA. It achieves performance comparable to full KV relay while reducing communication cost by 79.8%–89.4%. OBF further improves the performance and achieves the best results on 7 of the 9 benchmarks. This suggests that more information does not necessarily lead to better communication; preserving the most useful information matters more. Our codebase is publicly available on this https URL.
[LG-45] Selecting Feature Interactions for Generalized Additive Models by Distilling Foundation Models
链接: https://arxiv.org/abs/2604.13332
作者: Jingyun Jia,Chandan Singh,Rich Caruana,Ben Lengerich
类目: Machine Learning (cs.LG)
*备注:
Abstract:Identifying meaningful feature interactions is a central challenge in building accurate and interpretable models for tabular data. Generalized additive models (GAMs) have shown great success at modeling tabular data, but often rely on heuristic procedures to select interactions, potentially missing higher-order or context-dependent effects. To meet this challenge, we propose TabDistill, a method that leverages tabular foundation models and post-hoc distillation methods. Our key intuition is that tabular foundation models implicitly learn rich, adaptive feature dependencies through large-scale representation learning. Given a dataset, TabDistill first fits a tabular foundation model to the dataset, and then applies a post-hoc interaction attribution method to extract salient feature interactions from it. We evaluate these interactions by then using them as terms in a GAM. Across tasks, we find that interactions identified by TabDistill lead to consistent improvements in downstream GAMs’ predictive performance. Our results suggest that tabular foundation models can serve as effective, data-driven guides for interaction discovery, bridging high-capacity models and interpretable additive frameworks.
[LG-46] xt-Attributed Knowledge Graph Enrichment with Large Language Models for Medical Concept Representation ACL2026
链接: https://arxiv.org/abs/2604.13331
作者: Mohsen Nayebi Kerdabadi,Arya Hadizadeh Moghaddam,Chen Chen,Dongjie Wang,Zijun Yao
类目: Machine Learning (cs.LG)
*备注: This paper has been accepted at ACL 2026 main conference
Abstract:In electronic health record (EHR) mining, learning high-quality representations of medical concepts (e.g., standardized diagnosis, medication, and procedure codes) is fundamental for downstream clinical prediction. However, robust concept representation learning is hindered by two key challenges: (i) clinically important cross-type dependencies (e.g., diagnosis-medication and medication-procedure relations) are often missing or incomplete in existing ontology resources, limiting the ability to model complex EHR patterns; and (ii) rich clinical semantics are often missing from structured resources, and even when available as text, are difficult to integrate with KG structure for representation learning. To address these challenges, we present CoMed, an LLM-empowered graph learning framework for medical concept representation. CoMed first builds a global knowledge graph (KG) over medical codes by combining statistically reliable associations mined from EHRs with type-constrained LLM prompting to infer semantic relations. It then utilizes LLMs to enrich the KG into a text-attributed graph by generating node descriptions and edge rationales, providing semantic signals for both concepts and their relationships. Finally, CoMed jointly trains a LoRA-tuned LLaMA text encoder with a heterogeneous GNN, fusing text semantics and graph structure into unified concept embeddings. Extensive experiments on MIMIC-III and MIMIC-IV show that CoMed consistently improves prediction performance and serves as an effective plug-in concept encoder for standard EHR pipelines.
[LG-47] Multi-Task LLM with LoRA Fine-Tuning for Automated Cancer Staging and Biomarker Extraction
链接: https://arxiv.org/abs/2604.13328
作者: Jiahao Shao,Anam Nawaz Khan,Christopher Brett,Tom Berg,Xueping Li,Bing Yao
类目: Machine Learning (cs.LG)
*备注: 11 pages, 3 figures and 4 tables in the main manuscript. Additional content, figures and tables are in supplementary material section. 17 pages in total
Abstract:Pathology reports serve as the definitive record for breast cancer staging, yet their unstructured format impedes large-scale data curation. While Large Language Models (LLMs) offer semantic reasoning, their deployment is often limited by high computational costs and hallucination risks. This study introduces a parameter-efficient, multi-task framework for automating the extraction of Tumor-Node-Metastasis (TNM) staging, histologic grade, and biomarkers. We fine-tune a Llama-3-8B-Instruct encoder using Low-Rank Adaptation (LoRA) on a curated, expert-verified dataset of 10,677 reports. Unlike generative approaches, our architecture utilizes parallel classification heads to enforce consistent schema adherence. Experimental results demonstrate that the model achieves a Macro F1 score of 0.976, successfully resolving complex contextual ambiguities and heterogeneous reporting formats that challenge traditional extraction methods including rule-based natural language processing (NLP) pipelines, zero-shot LLMs, and single-task LLM baselines. The proposed adapter-efficient, multi-task architecture enables reliable, scalable pathology-derived cancer staging and biomarker profiling, with the potential to enhance clinical decision support and accelerate data-driven oncology research.
[LG-48] Event Tensor: A Unified Abstraction for Compiling Dynamic Megakernel
链接: https://arxiv.org/abs/2604.13327
作者: Hongyi Jin,Bohan Hou,Guanjie Wang,Ruihang Lai,Jinqi Chen,Zihao Ye,Yaxing Cai,Yixin Dong,Xinhao Cheng,Zhihao Zhang,Yilong Zhao,Yingyi Huang,Lijie Yang,Jinchen Jiang,Gabriele Oliaro,Jianan Ji,Xupeng Miao,Vinod Grover,Todd C. Mowry,Zhihao Jia,Tianqi Chen
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Programming Languages (cs.PL)
*备注: 16 pages. 18 figures. accepted in MLSys 2026
Abstract:Modern GPU workloads, especially large language model (LLM) inference, suffer from kernel launch overheads and coarse synchronization that limit inter-kernel parallelism. Recent megakernel techniques fuse multiple operators into a single persistent kernel to eliminate launch gaps and expose inter-kernel parallelism, but struggle to handle dynamic shapes and data-dependent computation in real workloads. We present Event Tensor, a unified compiler abstraction for dynamic megakernels. Event Tensor encodes dependencies between tiled tasks, and enables first-class support for both shape and data-dependent dynamism. Built atop this abstraction, our Event Tensor Compiler (ETC) applies static and dynamic scheduling transformations to generate high-performance persistent kernels. Evaluations show that ETC achieves state-of-the-art LLM serving latency while significantly reducing system warmup overhead.
[LG-49] Concrete Jungle: Towards Concreteness Paved Contrastive Negative Mining for Compositional Understanding
链接: https://arxiv.org/abs/2604.13313
作者: Eun Woo Im,Dhruv Madhwal,Vivek Gupta
类目: Machine Learning (cs.LG)
*备注: 10 pages
Abstract:Vision-Language Models demonstrate remarkable capabilities but often struggle with compositional reasoning, exhibiting vulnerabilities regarding word order and attribute binding. This limitation arises from a scarcity of informative samples needed to differentiate subtle semantic variations during contrastive pretraining. Although hard negative mining offers a promising remedy, existing methods lack explicit mechanisms to dictate which linguistic elements undergo modification. Instead of engineering generative architectures, this study establishes lexical concreteness as a fundamental determinant of negative sample efficacy. Modifying highly concrete terms generates more pronounced structural and visual discrepancies, providing a substantially stronger learning signal. Leveraging this principle, ConcretePlant is proposed to systematically isolate and manipulate perceptually grounded concepts. Analyses of the InfoNCE further reveals a severe gradient imbalance, where easily distinguishable pairs disproportionately overwhelm the optimization process and restrict the bandwidth available for nuanced learning. To resolve this degradation, the Cement loss is formulated utilizing a margin-based approach. By correlating psycholinguistic scores with sample difficulty, this objective dynamically calibrates the penalization applied to individual training pairs. Comprehensive evaluations substantiate these theoretical claims. The integrated framework, designated as Slipform, achieves state-of-the-art accuracy across diverse compositional evaluation benchmarks, general cross-modal retrieval, single and multi label linear probing.
[LG-50] Structure- and Stability-Preserving Learning of Port-Hamiltonian Systems
链接: https://arxiv.org/abs/2604.13297
作者: Binh Nguyen,Nam T. Nguyen,Truong X. Nghiem
类目: ystems and Control (eess.SY); Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注:
Abstract:This paper investigates the problem of data-driven modeling of port-Hamiltonian systems while preserving their intrinsic Hamiltonian structure and stability properties. We propose a novel neural-network-based port-Hamiltonian modeling technique that relaxes the convexity constraint commonly imposed by neural network-based Hamiltonian approximations, thereby improving the expressiveness and generalization capability of the model. By removing this restriction, the proposed approach enables the use of more general non-convex Hamiltonian representations to enhance modeling flexibility and accuracy. Furthermore, the proposed method incorporates information about stable equilibria into the learning process, allowing the learned model to preserve the stability of multiple isolated equilibria rather than being restricted to a single equilibrium as in conventional methods. Two numerical experiments are conducted to validate the effectiveness of the proposed approach and demonstrate its ability to achieve more accurate structure- and stability-preserving learning of port-Hamiltonian systems compared with a baseline method.
[LG-51] Some Theoretical Limitations of t-SNE
链接: https://arxiv.org/abs/2604.13295
作者: Rupert Li,Elchanan Mossel
类目: Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
*备注: 19 pages, 7 figures
Abstract:t-SNE has gained popularity as a dimension reduction technique, especially for visualizing data. It is well-known that all dimension reduction techniques may lose important features of the data. We provide a mathematical framework for understanding this loss for t-SNE by establishing a number of results in different scenarios showing how important features of data are lost by using t-SNE.
[LG-52] Physics-informed reservoir characterization from bulk and extreme pressure events with a differentiable simulator
链接: https://arxiv.org/abs/2604.13291
作者: Harun Ur Rashid,Mingxin Li,Aleksandra Pachalieva,Georg Stadler,Daniel O’Malley
类目: Machine Learning (cs.LG)
*备注:
Abstract:Accurate characterization of subsurface heterogeneity is challenging but essential for applications such as reservoir pressure management, geothermal energy extraction and CO _2 , H _2 , and wastewater injection operations. This challenge becomes especially acute in extreme pressure events, which are rarely observed but can strongly affect operational risk. Traditional history matching and inversion techniques rely on expensive full-physics simulations, making it infeasible to handle uncertainty and extreme events at scale. Purely data-driven models often struggle to maintain physics consistency when dealing with sparse observations, complex geology, and extreme events. To overcome these limitations, we introduce a physics-informed machine learning method that embeds a differentiable subsurface flow simulator directly into neural network training. The network infers heterogeneous permeability fields from limited pressure observations, while training minimizes both permeability and pressure losses through the simulator, enforcing physical consistency. Because the simulator is used only during training, inference remains fast once the model is learned. In an initial test, the proposed method reduces the pressure inference error by half compared with a purely data-driven approach. We then extend the test over eight distinct data scenarios, and in every case, our method produces significantly lower pressure inference errors than the purely data-driven model. We also evaluate our method on extreme events, which represent high-consequence data in the tail of the sample distribution. Similar to the bulk distribution, the physics-informed model maintains higher pressure inference accuracy in the extreme event regimes. Overall, the proposed method enables rapid, physics-consistent subsurface inversion for real-time reservoir characterization and risk-aware decision-making.
[LG-53] MOONSHOT : A Framework for Multi-Objective Pruning of Vision and Large Language Models
链接: https://arxiv.org/abs/2604.13287
作者: Gabriel Afriat,Xiang Meng,Shibal Ibrahim,Hussein Hazimeh,Rahul Mazumder
类目: Machine Learning (cs.LG)
*备注:
Abstract:Weight pruning is a common technique for compressing large neural networks. We focus on the challenging post-training one-shot setting, where a pre-trained model is compressed without any retraining. Existing one-shot pruning methods typically optimize a single objective, such as a layer-wise reconstruction loss or a second-order Taylor approximation of the training loss. We highlight that neither objective alone is consistently the most effective across architectures and sparsity levels. Motivated by this insight, we propose MOONSHOT, a general and flexible framework that extends any single-objective pruning method into a multi-objective formulation by jointly optimizing both the layer-wise reconstruction error and second-order Taylor approximation of the training loss. MOONSHOT acts as a wrapper around existing pruning algorithms. To enable this integration while maintaining scalability to billion-parameter models, we propose modeling decisions and introduce an efficient procedure for computing the inverse Hessian, preserving the efficiency of state-of-the-art one-shot pruners. When combined with state-of-the-art pruning methods on Llama-3.2 and Llama-2 models, MOONSHOT reduces C4 perplexity by up to 32.6% at 2:4 sparsity and improves zero-shot mean accuracy across seven classification benchmarks by up to 4.9 points. On Vision Transformers, it improves accuracy on ImageNet-1k by over 5 points at 70% sparsity, and on ResNet-50, it yields a 4-point gain at 90% sparsity.
[LG-54] Enhancing Confidence Estimation in Telco LLM s via Twin-Pass CoT-Ensembling
链接: https://arxiv.org/abs/2604.13271
作者: Anton Saenko,Pranshav Gajjar,Abiodun Ganiyu,Vijay K. Shah
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large Language Models (LLMs) are increasingly applied to complex telecommunications tasks, including 3GPP specification analysis and O-RAN network troubleshooting. However, a critical limitation remains: LLM-generated confidence scores are often biased and unreliable, frequently exhibiting systematic overconfidence. This lack of trustworthy self-assessment makes it difficult to verify model outputs and safely rely on them in practice. In this paper, we study confidence calibration in telecom-domain LLMs using the representative Gemma-3 model family (4B, 12B, and 27B parameters), evaluated on TeleQnA, ORANBench, and srsRANBench. We show that standard single-pass, verbalized confidence estimates fail to reflect true correctness, often assigning high confidence to incorrect predictions. To address this, we propose a novel Twin-Pass Chain of Thought (CoT)-Ensembling methodology for improving confidence estimation by leveraging multiple independent reasoning evaluations and aggregating their assessments into a calibrated confidence score. Our approach reduces Expected Calibration Error (ECE) by up to 88% across benchmarks, significantly improving the reliability of model self-assessment. These results highlight the limitations of current confidence estimation practices and demonstrate a practical path toward more trustworthy evaluation of LLM outputs in telecommunications.
[LG-55] Binomial Gradient-Based Meta-Learning for Enhanced Meta-Gradient Estimation ICLR2026
链接: https://arxiv.org/abs/2604.13263
作者: Yilang Zhang,Abraham Jaeger Mountain,Bingcong Li,Georgios B. Giannakis
类目: Machine Learning (cs.LG)
*备注: Accepted as poster at ICLR 2026. Code available at this https URL
Abstract:Meta-learning offers a principled framework leveraging \emphtask-invariant priors from related tasks, with which \emphtask-specific models can be fine-tuned on downstream tasks, even with limited data records. Gradient-based meta-learning (GBML) relies on gradient descent (GD) to adapt the prior to a new task. Albeit effective, these methods incur high computational overhead that scales linearly with the number of GD steps. To enhance efficiency and scalability, existing methods approximate the gradient of prior parameters (meta-gradient) via truncated backpropagation, yet suffer large approximation errors. Targeting accurate approximation, this work puts forth binomial GBML (BinomGBML), which relies on a truncated binomial expansion for meta-gradient estimation. This novel expansion endows more information in the meta-gradient estimation via efficient parallel computation. As a running paradigm applied to model-agnostic meta-learning (MAML), the resultant BinomMAML provably enjoys error bounds that not only improve upon existing approaches, but also decay super-exponentially under mild conditions. Numerical tests corroborate the theoretical analysis and showcase boosted performance with slightly increased computational overhead.
[LG-56] Counterfactual Peptide Editing for Causal TCR–pMHC Binding Inference
链接: https://arxiv.org/abs/2604.13256
作者: Sanjar Khudoyberdiev,Arman Bekov
类目: Machine Learning (cs.LG); Graphics (cs.GR)
*备注:
Abstract:Neural models for TCR-pMHC binding prediction are susceptible to shortcut learning: they exploit spurious correlations in training data – such as peptide length bias or V-gene co-occurrence – rather than the physical binding interface. This renders predictions brittle under family-held-out and distance-aware evaluation, where such shortcuts do not transfer. We introduce \emphCounterfactual Invariant Prediction (CIP), a training framework that generates biologically constrained counterfactual peptide edits and enforces invariance to edits at non-anchor positions while amplifying sensitivity at MHC anchor residues. CIP augments the base classifier with two auxiliary objectives: (1) an invariance loss penalizing prediction changes under conservative non-anchor substitutions, and (2) a contrastive loss encouraging large prediction changes under anchor-position disruptions. Evaluated on a curated VDJdb-IEDB benchmark under family-held-out, distance-aware, and random splits, CIP achieves AUROC 0.831 and counterfactual consistency (CFC) 0.724 under the challenging family-held-out protocol – a 39.7% reduction in shortcut index relative to the unconstrained baseline. Ablations confirm that anchor-aware edit generation is the dominant driver of OOD gains, providing a practical recipe for causally-grounded TCR specificity modeling.
[LG-57] Bias-Corrected Adaptive Conformal Inference for Multi-Horizon Time Series Forecasting
链接: https://arxiv.org/abs/2604.13253
作者: Ankit Lade,Sai Krishna J.,Indar Kumar
类目: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*备注: 14 pages, 3 figures, 2 tables. Preprint
Abstract:Adaptive Conformal Inference (ACI) provides distribution-free prediction intervals with asymptotic coverage guarantees for time series under distribution shift. However, ACI only adapts the quantile threshold – it cannot shift the interval center. When a base forecaster develops persistent bias after a regime change, ACI compensates by widening intervals symmetrically, producing unnecessarily conservative bands. We propose Bias-Corrected ACI (BC-ACI), which augments standard ACI with an online exponentially weighted moving average (EWM) estimate of forecast bias. BC-ACI corrects nonconformity scores before quantile computation and re-centers prediction intervals, addressing the root cause of miscalibration rather than its symptom. An adaptive dead-zone threshold suppresses corrections when estimated bias is indistinguishable from noise, ensuring no degradation on well-calibrated data. In controlled experiments across 688 runs spanning two base models, four synthetic regimes, and three real datasets, BC-ACI reduces Winkler interval scores by 13–17% under mean and compound distribution shifts (Wilcoxon p 0.001) while maintaining equivalent performance on stationary data (ratio 1.002x). We provide finite-sample analysis showing that coverage guarantees degrade gracefully with bias estimation error.
[LG-58] Analog Optical Inference on Million-Record Mortgage Data
链接: https://arxiv.org/abs/2604.13251
作者: Sofia Berloff,Pavel Koptev,Konstantin Malkov
类目: Machine Learning (cs.LG); Emerging Technologies (cs.ET); Neural and Evolutionary Computing (cs.NE)
*备注: 12 pages, 5 figures
Abstract:Analog optical computers promise large efficiency gains for machine learning inference, yet no demonstration has moved beyond small-scale image benchmarks. We benchmark the analog optical computer (AOC) digital twin on mortgage approval classification from 5.84 million U.S. HMDA records and separate three sources of accuracy loss. On the original 19 features, the AOC reaches 94.6% balanced accuracy with 5,126 parameters (1,024 optical), compared with 97.9% for XGBoost; the 3.3 percentage-point gap narrows by only 0.5pp when the optical core is widened from 16 to 48 channels, suggesting an architectural rather than hardware limitation. Restricting all models to a shared 127-bit binary encoding drops every model to 89.4–89.6%, with an encoding cost of 8pp for digital models and 5pp for the AOC. Seven calibrated hardware non-idealities impose no measurable penalty. The three resulting layers of limitation (encoding, architecture, hardware fidelity) locate where accuracy is lost and what to improve next.
[LG-59] Does Dimensionality Reduction via Random Projections Preserve Landscape Features?
链接: https://arxiv.org/abs/2604.13230
作者: Iván Olarte Rodríguez,Anja Jankovic,Thomas Bäck,Elena Raponi
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: 9 Pages, 5 figures, Submitted and accepted to Proceedings of The Genetic and Evolutionary Computation Conference 2026,
Abstract:Exploratory Landscape Analysis (ELA) provides numerical features for characterizing black-box optimization problems. In high-dimensional settings, however, ELA suffers from sparsity effects, high estimator variance, and the prohibitive cost of computing several feature classes. Dimensionality reduction has therefore been proposed as a way to make ELA applicable in such settings, but it remains unclear whether features computed in reduced spaces still reflect intrinsic properties of the original landscape. In this work, we investigate the robustness of ELA features under dimensionality reduction via Random Gaussian Embeddings (RGEs). Starting from the same sampled points and objective values, we compute ELA features in projected spaces and compare them to those obtained in the original search space across multiple sample budgets and embedding dimensions. Our results show that linear random projections often alter the geometric and topological structure relevant to ELA, yielding feature values that are no longer representative of the original problem. While a small subset of features remains comparatively stable, most are highly sensitive to the embedding. Moreover, robustness under projection does not necessarily imply informativeness, as apparently robust features may still reflect projection-induced artifacts rather than intrinsic landscape characteristics. Comments: 9 Pages, 5 figures, Submitted and accepted to Proceedings of The Genetic and Evolutionary Computation Conference 2026, Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE) ACMclasses: J.6; F.2.2 Cite as: arXiv:2604.13230 [cs.LG] (or arXiv:2604.13230v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2604.13230 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-60] Fast Voxelization and Level of Detail for Microgeometry Rendering
链接: https://arxiv.org/abs/2604.13191
作者: Javier Fabre,Carlos Castillo,Carlos Rodriguez-Pardo,Jorge Lopez-Moreno
类目: Graphics (cs.GR); Machine Learning (cs.LG)
*备注: Accepted for publication in The Visual Computer. 16 pages, 7 figures, 3 tables. Supplementary material: this https URL
Abstract:Many materials show anisotropic light scattering patterns due to the shape and local alignment of their underlying micro structures: surfaces with small elements such as fibers, or the ridges of a brushed metal, are very sparse and require a high spatial resolution to be properly represented as a volume. The acquisition of voxel data from such objects is a time and memory-intensive task, and most rendering approaches require an additional Level-of-Detail (LoD) data structure to aggregate the visual appearance, as observed from multiple distances, in order to reduce the number of samples computed per pixel (E.g.: MIP mapping). In this work we introduce first, an efficient parallel voxelization method designed to facilitate fast data aggregation at multiple resolution levels, and second, a novel representation based on hierarchical SGGX clustering that provides better accuracy than baseline methods. We validate our approach with a CUDA-based implementation of the voxelizer, tested both on triangle meshes and volumetric fabrics modeled with explicit fibers. Finally, we show the results generated with a path tracer based on the proposed LoD rendering model.
[LG-61] Automated co-design of high-performance thermodynamic cycles via graph-based hierarchical reinforcement learning
链接: https://arxiv.org/abs/2604.13133
作者: Wenqing Li,Xu Feng,Peixue Jiang,Yinhai Zhu
类目: Machine Learning (cs.LG)
*备注: 21 pages,8 figures
Abstract:Thermodynamic cycles are pivotal in determining the efficacy of energy conversion systems. Traditional design methodologies, which rely on expert knowledge or exhaustive enumeration, are inefficient and lack scalability, thereby constraining the discovery of high-performance cycles. In this study, we introduce a graph-based hierarchical reinforcement learning approach for the co-design of structure parameters in thermodynamic cycles. These cycles are encoded as graphs, with components and connections depicted as nodes and edges, adhering to grammatical constraints. A deep learning-based thermophysical surrogate facilitates stable graph decoding and the simultaneous resolution of global parameters. Building on this foundation, we develop a hierarchical reinforcement learning framework wherein a high-level manager explores structural evolution and proposes candidate configurations, whereas a low-level worker optimizes parameters and provides performance rewards to steer the search towards high-performance regions. By integrating graph representation, thermophysical surrogate, and manager-worker learning, this method establishes a fully automated pipeline for encoding, decoding, and co-optimization. Using heat pump and heat engine cycles as case studies, the results demonstrate that the proposed method not only replicates classical cycle configurations but also identifies 18 and 21 novel heat pump and heat engine cycles, respectively. Relative to classical cycles, the novel configurations exhibit performance improvements of 4.6% and 133.3%, respectively, surpassing the traditional designs. This method effectively balances efficiency with broad applicability, providing a practical and scalable intelligent alternative to expert-driven thermodynamic cycle design.
[LG-62] Generalization Guarantees on Data-Driven Tuning of Gradient Descent with Langevin Updates
链接: https://arxiv.org/abs/2604.13130
作者: Saumya Goyal,Rohith Rongali,Ritabrata Ray,Barnabás Póczos
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We study learning to learn for regression problems through the lens of hyperparameter tuning. We propose the Langevin Gradient Descent Algorithm (LGD), which approximates the mean of the posterior distribution defined by the loss function and regularizer of a convex regression task. We prove the existence of an optimal hyperparameter configuration for which the LGD algorithm achieves the Bayes’ optimal solution for squared loss. Subsequently, we study generalization guarantees on meta-learning optimal hyperparameters for the LGD algorithm from a given set of tasks in the data-driven setting. For a number of parameters d and hyperparameter dimension h , we show a pseudo-dimension bound of O(dh) , upto logarithmic terms under mild assumptions on LGD. This matches the dimensional dependence of the bounds obtained in prior work for the elastic net, which only allows for h=2 hyperparameters, and extends their bounds to regression on convex loss. Finally, we show empirical evidence of the success of LGD and the meta-learning procedure for few-shot learning on linear regression using a few synthetically created datasets.
[LG-63] Synthetic Tabular Generators Fail to Preserve Behavioral Fraud Patterns: A Benchmark on Temporal Velocity and Multi-Account Signals
链接: https://arxiv.org/abs/2604.13125
作者: Bhavana Sajja
类目: Machine Learning (cs.LG)
*备注: 28 pages, 5 figures. Submitted to DMLR (Journal of Data-centric Machine Learning Research). Code: this https URL
Abstract:We introduce behavioral fidelity – a third evaluation dimension for synthetic tabular data that measures whether generated data preserves the temporal, sequential, and structural behavioral patterns that distinguish real-world entity activity. Existing frameworks evaluate statistical fidelity (marginal distributions and correlations) and downstream utility (classifier AUROC on synthetic-trained models), but neither tests for the behavioral signals that operational detection and analysis systems actually rely on. We formalize a taxonomy of four behavioral fraud patterns (P1-P4) covering inter-event timing, burst structure, multi-account graph motifs, and velocity-rule trigger rates; define a degradation ratio metric calibrated to a real-data noise floor (1.0 = matches real variability, k = k-times worse); and prove that row-independent generators – the dominant paradigm – are structurally incapable of reproducing P3 graph motifs (Proposition 1) and produce non-positive within-entity IET autocorrelation (Proposition 2), making the positive burst fingerprint of fraud sequences unachievable regardless of architecture or training data size. We benchmark CTGAN, TVAE, GaussianCopula, and TabularARGN on IEEE-CIS Fraud Detection and the Amazon Fraud Dataset. All four fail severely: on IEEE-CIS composite degradation ratios range from 24.4x (TVAE) to 39.0x (GaussianCopula); on Amazon FDB, row-independent generators score 81.6-99.7x, while TabularARGN achieves 17.2x. We document generator-specific failure modes and their resolutions. The P1-P4 framework extends to any domain with entity-level sequential tabular data, including healthcare and network security. We release our evaluation framework as open source.
[LG-64] Exploring Urban Land Use Patterns by Pattern Mining and Unsupervised Learning
链接: https://arxiv.org/abs/2604.13050
作者: Zdena Dobesova,Tai Dinh,Pavel Novak
类目: Databases (cs.DB); Machine Learning (cs.LG)
*备注:
Abstract:Urban areas are intricate systems shaped by socioeconomic, environmental, and infrastructural factors, with land use patterns serving as aspects of urban morphology. This paper proposes a novel methodology leveraging frequent item set mining and unsupervised learning techniques to identify similar cities based on co-occurring land use patterns. The Copernicus program’s Urban Atlas data are used as source data. The methodology involves data preprocessing, pattern mining using the negFIN algorithm, postprocessing, and knowledge extraction and visualization. The preprocessing of spatial datasets results in a publicly available transaction dataset. The framework is scalable and the source code is made publicly available.
[LG-65] Multistage Conditional Compositional Optimization
链接: https://arxiv.org/abs/2604.14075
作者: Buse Şen,Yifan Hu,Daniel Kuhn
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We introduce Multistage Conditional Compositional Optimization (MCCO) as a new paradigm for decision-making under uncertainty that combines aspects of multistage stochastic programming and conditional stochastic optimization. MCCO minimizes a nest of conditional expectations and nonlinear cost functions. It has numerous applications and arises, for example, in optimal stopping, linear-quadratic regulator problems, distributionally robust contextual bandits, as well as in problems involving dynamic risk measures. The naïve nested sampling approach for MCCO suffers from the curse of dimensionality familiar from scenario tree-based multistage stochastic programming, that is, its scenario complexity grows exponentially with the number of nests. We develop new multilevel Monte Carlo techniques for MCCO whose scenario complexity grows only polynomially with the desired accuracy.
[LG-66] A Comparative Study of Dynamic Programming and Reinforcement Learning in Finite Horizon Dynamic Pricing
链接: https://arxiv.org/abs/2604.14059
作者: Lev Razumovskiy,Nikolay Karenin
类目: General Economics (econ.GN); Machine Learning (cs.LG)
*备注:
Abstract:This paper provides a systematic comparison between Fitted Dynamic Programming (DP), where demand is estimated from data, and Reinforcement Learning (RL) methods in finite-horizon dynamic pricing problems. We analyze their performance across environments of increasing structural complexity, ranging from a single typology benchmark to multi-typology settings with heterogeneous demand and inter-temporal revenue constraints. Unlike simplified comparisons that restrict DP to low-dimensional settings, we apply dynamic programming in richer, multi-dimensional environments with multiple product types and constraints. We evaluate revenue performance, stability, constraint satisfaction behavior, and computational scaling, highlighting the trade-offs between explicit expectation-based optimization and trajectory-based learning.
[LG-67] Stochastic Trust-Region Methods for Over-parameterized Models
链接: https://arxiv.org/abs/2604.14017
作者: Aike Yang,Hao Wang
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: 26 pages, 3 figures
Abstract:Under interpolation-type assumptions such as the strong growth condition, stochastic optimization methods can attain convergence rates comparable to full-batch methods, but their performance, particularly for SGD, remains highly sensitive to step-size selection. To address this issue, we propose a unified stochastic trust-region framework that eliminates manual step-size tuning and extends naturally to equality-constrained problems. For unconstrained optimization, we develop a first-order stochastic trust-region algorithm and show that, under the strong growth condition, it achieves an iteration and stochastic first-order oracle complexity of O(\varepsilon^-2 \log(1/\varepsilon)) for finding an \varepsilon -stationary point. For equality-constrained problems, we introduce a quadratic-penalty-based stochastic trust-region method with penalty parameter \mu , and establish an iteration and oracle complexity of O(\varepsilon^-4 \log(1/\varepsilon)) to reach an \varepsilon -stationary point of the penalized problem, corresponding to an O(\varepsilon) -approximate KKT point of the original constrained problem. Numerical experiments on deep neural network training and orthogonally constrained subspace fitting demonstrate that the proposed methods achieve performance comparable to well-tuned stochastic baselines, while exhibiting stable optimization behavior and effectively handling hard constraints without manual learning-rate scheduling.
[LG-68] Nested Fourier-enhanced neural operator for efficient modeling of radiation transfer in fires
链接: https://arxiv.org/abs/2604.13919
作者: Anran Jiao,Wengyao Jiang,Xiaoyi Lu,Yi Wang,Lu Lu
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:
Abstract:Computational fluid dynamics (CFD) has become an essential tool for predicting fire behavior, yet maintaining both efficiency and accuracy remains challenging. A major source of computational cost in fire simulations is the modeling of radiation transfer, which is usually the dominant heat transfer mechanism in fires. Solving the high-dimensional radiative transfer equation (RTE) with traditional numerical methods can be a performance bottleneck. Here, we present a machine learning framework based on Fourier-enhanced multiple-input neural operators (Fourier-MIONet) as an efficient alternative to direct numerical integration of the RTE. We first investigate the performance of neural operator architectures for a small-scale 2D pool fire and find that Fourier-MIONet provides the most accurate radiative solution predictions. The approach is then extended to 3D CFD fire simulations, where the computational mesh is locally refined across multiple levels. In these high-resolution settings, monolithic surrogate models for direct field-to-field mapping become difficult to train and computationally inefficient. To address this issue, a nested Fourier-MIONet is proposed to predict radiation solutions across multiple mesh-refinement levels. We validate the approach on 3D McCaffrey pool fires simulated with FireFOAM, including fixed fire sizes and a unified model trained over a continuous range of heat release rates (HRRs). The proposed method achieves global relative errors of 2-4% for 3D varying-HRR scenarios while providing faster inference than the estimated cost of one finite-volume radiation solve in FireFOAM for the 16-solid-angle case. With fast and accurate inference, the surrogate makes higher-fidelity radiation treatments practical and enables the incorporation of more spectrally resolved radiation models into CFD fire simulations for engineering applications.
[LG-69] Sandpile Economics: Theory Identification and Evidence
链接: https://arxiv.org/abs/2604.13890
作者: Diego Vallarino
类目: Physics and Society (physics.soc-ph); Machine Learning (cs.LG); Econometrics (econ.EM); Theoretical Economics (econ.TH); Machine Learning (stat.ML)
*备注:
Abstract:Why do capitalist economies recurrently generate crises whose severity is disproportionate to the size of the triggering shock? This paper proposes a structural answer grounded in the evolutionary geometry of production networks. As economies evolve through specialization, integration, and competitive selection, their inter-sectoral linkages drift toward configurations of increasing geometric fragility, eventually crossing a threshold beyond which small disturbances generate disproportionately large cascades. We introduce Sandpile Economics, a formal framework that interprets macroeconomic instability as an emergent property of disequilibrium production networks. The key state variable is the Forman–Ricci curvature of the input–output graph, capturing local substitution possibilities when supply chains are disrupted. We show that when curvature falls below an endogenous threshold, the distribution of cascade sizes follows a power law with tail index \alpha \in (1,2) , implying a regime of unbounded amplification. The underlying mechanism is evolutionary: specialization reduces input substitutability, pushing the economy toward criticality, while crisis episodes induce endogenous network reconfiguration and path dependence. These dynamics are inherently non-ergodic and cannot be captured by representative-agent frameworks. Empirically, using global input–output data, we document that production networks operate in persistently negative curvature regimes and that curvature robustly predicts medium-run output dynamics. A one-standard-deviation increase in curvature is associated with higher cumulative growth over three-year horizons, and curvature systematically outperforms standard network metrics in explaining cross-country differences in resilience. Subjects: Physics and Society (physics.soc-ph); Machine Learning (cs.LG); Econometrics (econ.EM); Theoretical Economics (econ.TH); Machine Learning (stat.ML) Cite as: arXiv:2604.13890 [physics.soc-ph] (or arXiv:2604.13890v1 [physics.soc-ph] for this version) https://doi.org/10.48550/arXiv.2604.13890 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Diego Vallarino Dr. [view email] [v1] Wed, 15 Apr 2026 13:57:16 UTC (50 KB)
[LG-70] Gradient Descents Last Iterate is Often (slightly) Suboptimal
链接: https://arxiv.org/abs/2604.13870
作者: Guy Kornowski,Ohad Shamir
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:
Abstract:We consider the well-studied setting of minimizing a convex Lipschitz function using either gradient descent (GD) or its stochastic variant (SGD), and examine the last iterate convergence. By now, it is known that standard stepsize choices lead to a last iterate convergence rate of \log T/\sqrtT after T steps. A breakthrough result of Jain et al. [2019] recovered the optimal 1/\sqrtT rate by constructing a non-standard stepsize sequence. However, this sequence requires choosing T in advance, as opposed to common stepsize schedules which apply for any time horizon. Moreover, Jain et al. conjectured that without prior knowledge of T , no stepsize sequence can ensure the optimal error for SGD’s last iterate, a claim which so far remained unproven. We prove this conjecture, and in fact show that even in the noiseless case of GD, it is impossible to avoid an excess poly-log factor in T when considering an anytime last iterate guarantee. Our proof further suggests that such (slightly) suboptimal stopping times are unavoidably common.
[LG-71] Covariance-adapting algorithm for semi-bandits with application to sparse rewards COLT
链接: https://arxiv.org/abs/2604.13738
作者: Pierre Perrault,Vianney Perchet,Michal Valko
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Published at Conference on Learning Theory (COLT) 2020
Abstract:We investigate stochastic combinatorial semi-bandits, where the entire joint distribution of outcomes impacts the complexity of the problem instance (unlike in the standard bandits). Typical distributions considered depend on specific parameter values, whose prior knowledge is required in theory but quite difficult to estimate in practice; an example is the commonly assumed sub-Gaussian family. We alleviate this issue by instead considering a new general family of sub-exponential distributions, which contains bounded and Gaussian ones. We prove a new lower bound on the expected regret on this family, that is parameterized by the unknown covariance matrix of outcomes, a tighter quantity than the sub-Gaussian matrix. We then construct an algorithm that uses covariance estimates, and provide a tight asymptotic analysis of the regret. Finally, we apply and extend our results to the family of sparse outcomes, which has applications in many recommender systems.
[LG-72] Reachability Constraints in Variational Quantum Circuits: Optimization within Polynomial Group Module
链接: https://arxiv.org/abs/2604.13735
作者: Yun-Tak Oh,Dongsoo Lee,Jungyoul Park,Kyung Chul Jeong,Panjin Kim
类目: Quantum Physics (quant-ph); Computational Complexity (cs.CC); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注: 27 pages, 4 figures, appendix
Abstract:This work identifies a necessary condition for any variational quantum approach to reach the exact ground state. Briefly, the norms of the projections of the input and the ground state onto each group module must match, implying that module weights of the solution state have to be known in advance in order to reach the exact ground state. An exemplary case is provided by matchgate circuits applied to problems whose solutions are classical bit strings, since all computational basis states share the same module-wise weights. Combined with the known classical simulability of quantum circuits for which observables lie in a small linear subspace, this implies that certain problems admit a classical surrogate for exact solution with each step taking O(n^5) time. The Maximum Cut problem serves as an illustrative example.
[LG-73] VIGILant: an automatic classification pipeline for glitches in the Virgo detector
链接: https://arxiv.org/abs/2604.13687
作者: Tiago Fernandes,Francesco Di Renzo,Antonio Onofre,Alejandro Torres-Forné,José A. Font
类目: General Relativity and Quantum Cosmology (gr-qc); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注:
Abstract:Glitches frequently contaminate data in gravitational-wave detectors, complicating the observation and analysis of astrophysical signals. This work introduces VIGILant, an automatic pipeline for classification and visualization of glitches in the Virgo detector. Using a curated dataset of Virgo O3b glitches, two machine learning approaches are evaluated: tree-based models (Decision Tree, Random Forest and XGBoost) using structured Omicron parameters, and Convolutional Neural Networks (ResNet) trained on spectrogram images. While tree-based models offer higher interpretability and fast training, the ResNet34 model achieved superior performance, reaching a F1 score of 0.9772 and accuracy of 0.9833 in the testing set, with inference times of tens of milliseconds per glitch. The pipeline has been deployed for daily operation at the Virgo site since observing run O4c, providing the Virgo collaboration with an interactive dashboard to monitor glitch populations and detector behavior. This allows to identify low-confidence predictions, highlighting glitches requiring further attention.
[LG-74] node2vec or triangle-biased random walks: stationarity regularity recurrence
链接: https://arxiv.org/abs/2604.13681
作者: Luca Avena,Gianmarco Bet,Lars Schroeder,Clara Stegehuis
类目: Probability (math.PR); Machine Learning (cs.LG)
*备注: 24 pages, 4 figures
Abstract:The node2vec random walk is a non-Markovian random walk on the vertex set of a graph, widely used for network embedding and exploration. This random walk model is defined in terms of three parameters which control the probability of, respectively, backtracking moves, moves within triangles, and moves to the remaining neighboring nodes. From a mathematical standpoint, the node2vec random walk is a nontrivial generalization of the non-backtracking random walk and thus belongs to the class of second-order Markov chains. Despite its widespread use in applications, little is known about its long-run behavior. The goal of this paper is to begin exploring its fundamental properties on arbitrary graphs. To this aim, we show how lifting the node2vec random walk to the state spaces of directed edges and directed wedges yields two distinct Markovian representations which are key for its asymptotic analysis. Using these representations, we find mild sufficient conditions on the underlying finite or infinite graph to guarantee ergodicity, reversibility, recurrence and characterization of the invariant measure. As we discuss, the behavior of the node2vec random walk is drastically different compared to the non-backtracking random walk. While the latter simplifies on arbitrary graphs when using its natural edge Markovian representation thanks to bistochasticity, the former simplifies on regular graphs when using its natural wedge Markovian representation. Remarkably, this representation reveals that a graph is regular if and only if a certain weighted Eulerianity condition holds.
[LG-75] Irregularly Sampled Time Series Interpolation for Binary Evolution Simulations Using Dynamic Time Warping
链接: https://arxiv.org/abs/2604.13604
作者: Ugur Demir,Philipp M. Srivastava,Aggelos Katsaggelos,Vicky Kalogera,Santiago L. Tapia,Manuel Ballester,Shamal Lalvani,Patrick Koller,Jeff J. Andrews,Seth Gossage,Max M. Briel,Elizabeth Teng
类目: olar and Stellar Astrophysics (astro-ph.SR); Machine Learning (cs.LG)
*备注: 25 pages, 11 figures. Submitted to ApJ
Abstract:Binary stellar evolution simulations are computationally expensive. Stellar population synthesis relies on these detailed evolution models at a fundamental level. Producing thousands of such models requires hundreds of CPU hours, but stellar track interpolation provides one approach to significantly reduce this computational cost. Although single-star track interpolation is straightforward, stellar interactions in binary systems introduce significant complexity to binary evolution, making traditional single-track interpolation methods inapplicable. Binary tracks present fundamentally different challenges compared to single stars, which possess relatively straightforward evolutionary phases identifiable through distinct physical properties. Binary systems are complicated by mutual interactions that can dramatically alter evolutionary trajectories and introduce discontinuities difficult to capture through standard interpolation. In this work, we introduce a novel approach for track alignment and iterative track averaging based on Dynamic Time Warping to address misalignments between neighboring tracks. Our method computes a single shared warping path across all physical parameters simultaneously, placing them on a consistent temporal grid that preserves the causal relationships between parameters. We demonstrate that this joint-alignment strategy maintains key physical relationships such as the Stefan-Boltzmann law in the interpolated tracks. Our comprehensive evaluation across multiple binary configurations demonstrates that proper temporal alignment is crucial for track interpolation methods. The proposed method consistently outperforms existing approaches and enables the efficient generation of more accurate binary population samples for astrophysical studies.
[LG-76] Data-driven Learning of Probabilistic Model of Binary Droplet Collision for Spray Simulation
链接: https://arxiv.org/abs/2604.13594
作者: Weiming Xu,Tao Yang,Peng Zhang
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
*备注: 28 pages, 11 figures, research paper
Abstract:Binary droplet collisions are ubiquitous in dense sprays. Traditional deterministic models cannot adequately represent transitional and stochastic behaviors of binary droplet collision. To bridge this gap, we developed a probabilistic model by using a machine learning approach, the Light Gradient-Boosting Machine (LightGBM). The model was trained on a comprehensive dataset of 33,540 experimental cases covering eight collision regimes across broad ranges of Weber number, Ohnesorge number, impact parameter, size ratio, and ambient pressure. The resulting machine learning classifier captures highly nonlinear regime boundaries with 99.2% accuracy and retains sensitivity in transitional regions. To facilitate its implementation in spray simulation, the model was translated into a probabilistic form, a multinomial logistic regression, which preserves 93.2% accuracy and maps continuous inter-regime transitions. A biased-dice sampling mechanism then converts these probabilities into definite yet stochastic outcomes. This work presents the first probabilistic, high-dimensional droplet collision model derived from experimental data, offering a physically consistent, comprehensive, and user-friendly solution for spray simulation.
[LG-77] Robust Low-Rank Tensor Completion based on M-product with Weighted Correlated Total Variation and Sparse Regularization
链接: https://arxiv.org/abs/2604.13525
作者: Biswarup Karmakar,Ratikanta Behera
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 32 pages
Abstract:The robust low-rank tensor completion problem addresses the challenge of recovering corrupted high-dimensional tensor data with missing entries, outliers, and sparse noise commonly found in real-world applications. Existing methodologies have encountered fundamental limitations due to their reliance on uniform regularization schemes, particularly the tensor nuclear norm and \ell_1 norm regularization approaches, which indiscriminately apply equal shrinkage to all singular values and sparse components, thereby compromising the preservation of critical tensor structures. The proposed tensor weighted correlated total variation (TWCTV) regularizer addresses these shortcomings through an M -product framework that combines a weighted Schatten- p norm on gradient tensors for low-rankness with smoothness enforcement and weighted sparse components for noise suppression. The proposed weighting scheme adaptively reduces the thresholding level to preserve both dominant singular values and sparse components, thus improving the reconstruction of critical structural elements and nuanced details in the recovered signal. Through a systematic algorithmic approach, we introduce an enhanced alternating direction method of multipliers (ADMM) that offers both computational efficiency and theoretical substantiation, with convergence properties comprehensively analyzed within the M -product this http URL numerical evaluations across image completion, denoising, and background subtraction tasks validate the superior performance of this approach relative to established benchmark methods.
[LG-78] Joint Representation Learning and Clustering via Gradient-Based Manifold Optimization
链接: https://arxiv.org/abs/2604.13484
作者: Sida Liu,Yangzi Guo,Mingyuan Wang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Clustering and dimensionality reduction have been crucial topics in machine learning and computer vision. Clustering high-dimensional data has been challenging for a long time due to the curse of dimensionality. For that reason, a more promising direction is the joint learning of dimension reduction and clustering. In this work, we propose a Manifold Learning Framework that learns dimensionality reduction and clustering simultaneously. The proposed framework is able to jointly learn the parameters of a dimension reduction technique (e.g. linear projection or a neural network) and cluster the data based on the resulting features (e.g. under a Gaussian Mixture Model framework). The framework searches for the dimension reduction parameters and the optimal clusters by traversing a manifold,using Gradient Manifold Optimization. The obtained The proposed framework is exemplified with a Gaussian Mixture Model as one simple but efficient example, in a process that is somehow similar to unsupervised Linear Discriminant Analysis (LDA). We apply the proposed method to the unsupervised training of simulated data as well as a benchmark image dataset (i.e. MNIST). The experimental results indicate that our algorithm has better performance than popular clustering algorithms from the literature.
[LG-79] Estimating Continuous Treatment Effects with Two-Stage Kernel Ridge Regression
链接: https://arxiv.org/abs/2604.13410
作者: Seok-Jin Kim,Kaizheng Wang
类目: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We study the problem of estimating the effect function for a continuous treatment, which maps each treatment value to a population-averaged outcome. A central challenge in this setting is confounding: treatment assignment often depends on covariates, creating selection bias that makes direct regression of the response on treatment unreliable. To address this issue, we propose a two-stage kernel ridge regression method. In the first stage, we learn a model for the response as a function of both treatment and covariates; in the second stage, we use this model to construct pseudo-outcomes that correct for distribution shift, and then fit a second model to estimate the treatment effect. Although the response varies with both treatment and covariates, the induced effect function obtained by averaging over covariates is typically much simpler, and our estimator adapts to this structure. Furthermore, we introduce a fully data-driven model selection procedure that achieves provable adaptivity to both the unknown degree of overlap and the regularity (eigenvalue decay) of the underlying kernel.
[LG-80] A short proof of near-linear convergence of adaptive gradient descent under fourth-order growth and convexity
链接: https://arxiv.org/abs/2604.13393
作者: Damek Davis,Dmitriy Drusvyatskiy
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Davis, Drusvyatskiy, and Jiang showed that gradient descent with an adaptive stepsize converges locally at a nearly-linear rate for smooth functions that grow at least quartically away from their minimizers. The argument is intricate, relying on monitoring the performance of the algorithm relative to a certain manifold of slow growth – called the ravine. In this work, we provide a direct Lyapunov-based argument that bypasses these difficulties when the objective is in addition convex and a has a unique minimizer. As a byproduct of the argument, we obtain a more adaptive variant than the original algorithm with encouraging numerical performance.
[LG-81] AeTHERON: Autoregressive Topology-aware Heterogeneous Graph Operator Network for Fluid-Structure Interaction
链接: https://arxiv.org/abs/2604.13369
作者: Sushrut Kumar
类目: Computational Physics (physics.comp-ph); Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注:
Abstract:Surrogate modeling of body-driven fluid flows where immersed moving boundaries couple structural dynamics to chaotic, unsteady fluid phenomena remains a fundamental challenge for both computational physics and machine learning. We present AeTHERON, a heterogeneous graph neural operator whose architecture directly mirrors the structure of the sharp-interface immersed boundary method (IBM): a dual-graph representation separating fluid and structural domains, coupled through sparse cross-attention that reflects the compact support of IBM interpolation stencils. This physics-informed inductive bias enables AeTHERON to learn nonlinear fluid-structure coupling in a shared high-dimensional latent space, with continuous sinusoidal time embeddings providing temporal generalization across lead times. We evaluate AeTHERON on direct numerical simulations of a flapping flexible caudal fin, a canonical FSI benchmark featuring leading-edge vortex formation, large membrane deformation, and chaotic wake shedding across a 4x5 parameter grid of membrane thickness (h* = 0.01-0.04) and Strouhal number (St = 0.30-0.50). As a proof-of-concept, we train on the first 150 timesteps of a representative case using a 70/30 train/validation split and evaluate on the fully unseen extrapolation window t=150-200. AeTHERON captures large-scale vortex topology and wake structure with qualitative fidelity, achieving a mean extrapolation MAE of 0.168 without retraining, with error peaking near flapping half-cycle transitions where flow reorganization is most rapid – a physically interpretable pattern consistent with the nonlinear fluid-membrane coupling. Inference requires milliseconds per timestep on a single GPU versus hours for equivalent DNS computation. This is a continuously developing preprint; results and figures will be updated in subsequent versions.
[LG-82] Rare Event Analysis via Stochastic Optimal Control
链接: https://arxiv.org/abs/2604.13213
作者: Yuanqi Du,Jiajun He,Dinghuai Zhang,Eric Vanden-Eijnden,Carles Domingo-Enrich
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC); Chemical Physics (physics.chem-ph)
*备注:
Abstract:Rare events such as conformational changes in biomolecules, phase transitions, and chemical reactions are central to the behavior of many physical systems, yet they are extremely difficult to study computationally because unbiased simulations seldom produce them. Transition Path Theory (TPT) provides a rigorous statistical framework for analyzing such events: it characterizes the ensemble of reactive trajectories between two designated metastable states (reactant and product), and its central object–the committor function, which gives the probability that the system will next reach the product rather than the reactant–encodes all essential kinetic and thermodynamic information. We introduce a framework that casts committor estimation as a stochastic optimal control (SOC) problem. In this formulation the committor defines a feedback control–proportional to the gradient of its logarithm–that actively steers trajectories toward the reactive region, thereby enabling efficient sampling of reactive paths. To solve the resulting hitting-time control problem we develop two complementary objectives: a direct backpropagation loss and a principled off-policy Value Matching loss, for which we establish first-order optimality guarantees. We further address metastability, which can trap controlled trajectories in intermediate basins, by introducing an alternative sampling process that preserves the reactive current while lowering effective energy barriers. On benchmark systems, the framework yields markedly more accurate committor estimates, reaction rates, and equilibrium constants than existing methods.
[LG-83] HUANet: Hard-Constrained Unrolled ADMM for Constrained Convex Optimization
链接: https://arxiv.org/abs/2604.13179
作者: Trinh Tran,Binh Nguyen,Truong X. Nghiem
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
Abstract:This paper presents HUANet, a constrained deep neural network architecture that unrolls the iterations of the Alternating Direction Method of Multipliers (ADMM) into a trainable neural network for solving constrained convex optimization problems. Existing end-to-end learning methods operate as black-box mappings from parameters to solutions, often lacking explicit optimality principles and failing to enforce constraints. To address this limitation, we unroll ADMM and embed a hard-constrained neural network at each iteration to accelerate the algorithm, where equality constraints are enforced via a differentiable correction stage at the network output. Furthermore, we incorporate first-order optimality conditions as soft constraints during training to promote the convergence of the proposed unrolled algorithm. Extensive numerical experiments are conducted to validate the effectiveness of the proposed architecture for constrained optimization problems.
[LG-84] Adaptive Learning via Off-Model Training and Importance Sampling for Fully Non-Markovian Optimal Stochastic Control. Complete version
链接: https://arxiv.org/abs/2604.13147
作者: Dorival Leão,Alberto Ohashi,Simone Scotti,Adolfo M.D da Silva
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
*备注: 74 pages, 3 figures
Abstract:This paper studies continuous-time stochastic control problems whose controlled states are fully non-Markovian and depend on unknown model parameters. Such problems arise naturally in path-dependent stochastic differential equations, rough-volatility hedging, and systems driven by fractional Brownian motion. Building on the discrete skeleton approach developed in earlier work, we propose a Monte Carlo learning methodology for the associated embedded backward dynamic programming equation. Our main contribution is twofold. First, we construct explicit dominating training laws and Radon–Nikodym weights for several representative classes of non-Markovian controlled systems. This yields an off-model training architecture in which a fixed synthetic dataset is generated under a reference law, while the dynamic programming operators associated with a target model are recovered by importance sampling. Second, we use this structure to design an adaptive update mechanism under parametric model uncertainty, so that repeated recalibration can be performed by reweighting the same training sample rather than regenerating new trajectories. For fixed parameters, we establish non-asymptotic error bounds for the approximation of the embedded dynamic programming equation via deep neural networks. For adaptive learning, we derive quantitative estimates that separate Monte Carlo approximation error from model-risk error. Numerical experiments illustrate both the off-model training mechanism and the adaptive importance-sampling update in structured linear-quadratic examples.
附件下载


