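This post lists the latest papers retrieved from arXiv.org on 2026-06-19. It is updated automatically and grouped into six broad areas: NLP, CV, ML, AI, IR, and MA.
Note: Paper data is fetched from arXiv.org daily, and the list is refreshed automatically at around 12:30 each day.
Tip: If the list is not updated on a given day, either arXiv published no new papers that day or the script failed. Failures are fixed the same day whenever possible.
Contents
Overview (2026-06-19)
666 papers were updated today, including:
- Natural language processing: 90 papers (Computation and Language (cs.CL))
- Artificial intelligence: 220 papers (Artificial Intelligence (cs.AI))
- Computer vision: 124 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine learning: 202 papers (Machine Learning (cs.LG))
- Multiagent systems: 19 papers (Multiagent Systems (cs.MA))
- Information retrieval: 18 papers (Information Retrieval (cs.IR))
- Human-computer interaction: 21 papers (Human-Computer Interaction (cs.HC))
Multiagent Systems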
[MA-0] Contagion Networks: Evaluator Bias Propagation in Multi-Agent LLM Systems
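[Quick read]: This paper addresses how systematic evaluation biases propagate through the agent network when large language models (LLMs) act as evaluators in multi-agent systems. It proposes Contagion Networks, a formal framework for measuring how evaluator biases spread across interacting LLM agents. In a controlled three-agent experiment on DeepSeek-chat with three evaluator bias profiles (structured, balanced, evidence-based), biases consistently propagate between agents (contagion coefficient γ ∈ [0.157, 0.352]), and propagation falls into three regimes governed by the spectral radius ρ(Γ_N) of the contagion matrix. Homogeneous-model agents yield contagion coefficients 3-5x weaker than the cross-model coefficients reported in prior work (MM-EPC: γ ≈ 0.85-1.3), which places them in the suppression regime. Increasing the evaluator committee size from k=1 to k=3 reduces effective contagion by 72.4%, an actionable mitigation strategy. The authors release the Contagion Network experimental framework as open source.
Link: https://arxiv.org/abs/2606.20493
Authors: Zewen Liu
Affiliation: Qilu Institute of Technology, School of Software Engineering
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 20 pages, 4 figures, 4 tables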
Abstract:When large language models serve as evaluators in multi-agent systems, their systematic evaluation biases propagate through the agent network. We introduce Contagion Networks, a formal framework for measuring how evaluator biases spread across interacting LLM agents. In a controlled 3-agent experiment using DeepSeek-chat with three distinct evaluator bias profiles (structured, balanced, evidence-based), we measure the Cross-Agent Contagion Matrix Gamma_3 and find that evaluator biases consistently propagate between agents (gamma in [0.157, 0.352]), even within the same underlying model. We identify three propagation regimes governed by the spectral radius rho(Gamma_N), and demonstrate that homogeneous-model agents produce contagion coefficients 3-5x weaker than cross-model coefficients observed in prior work (MM-EPC: gamma approx 0.85-1.3), placing them in the suppression regime. We show that increasing evaluator committee size from k=1 to k=3 reduces effective contagion by 72.4%, providing an actionable mitigation strategy. We release the open-source Contagion Network experimental framework.
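The role of the spectral radius can be illustrated with a minimal sketch. The matrix entries below are hypothetical values picked inside the reported range γ ∈ [0.157, 0.352], the zero diagonal is an assumption, and `spectral_radius` is an illustrative helper rather than part of the released framework. The abstract does not state the thresholds that separate the three regimes, so the sketch only estimates ρ.

```python
def spectral_radius(matrix, iterations=200):
    """Estimate the spectral radius of a non-negative square matrix
    by power iteration (assumes the matrix is primitive)."""
    n = len(matrix)
    vector = [1.0] * n
    rho = 0.0
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vector[j] for j in range(n)) for i in range(n)]
        rho = max(product)
        if rho == 0.0:
            return 0.0
        # Rescale so the largest component is 1; rho then tracks the dominant eigenvalue.
        vector = [x / rho for x in product]
    return rho


# Hypothetical 3-agent contagion matrix: entry [i][j] is an assumed
# contagion coefficient from agent j to agent i; self-contagion is set to 0.
GAMMA_3 = [
    [0.000, 0.157, 0.352],
    [0.210, 0.000, 0.180],
    [0.300, 0.250, 0.000],
]

rho = spectral_radius(GAMMA_3)
print(f"estimated spectral radius: {rho:.3f}")
```

Under the zero-diagonal assumption, the spectral radius of a three-agent matrix with entries in this range cannot exceed the largest row sum, 2 × 0.352 = 0.704.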
[MA-1] An Infrastructure-less Control-Independent Solution to Relative Localisation of a Team of Mobile Robots using Ranging Measurements
[Quick read]: This paper tackles cooperative localisation of multi-robot teams when fixed infrastructure is unavailable, deployments must be fast and flexible, and system requirements must be minimal. The core challenge is achieving robust decentralised localisation without external anchors and without controlling the robots' motion to guarantee observability. The proposed algorithm is anchor-less and fully decentralised, and relies only on local odometry, sparse inter-agent ranging measurements, and short-range communication, all widely available on real robots. It adopts a multi-hypothesis Bayesian framework that maintains the entire set of feasible solutions, which keeps it robust under transient unobservable conditions. Through information sharing, each agent also benefits from the estimates of the whole team, even when the network is only partially connected.
Link: https://arxiv.org/abs/2606.20365
Authors: Paolo Golinelli,Tommaso Faraci,Daniele Fontanelli
Affiliation: University of Trento; Department of Industrial Engineering; Department of Information Engineering and Computer Science
Categories: Robotics (cs.RO); Multiagent Systems (cs.MA)
Comments:
Abstract:The ability to localise teams of robots is essential for applications ranging from robotic fleets in unstructured environments to cooperative control and navigation tasks. In such contexts, fixed infrastructure is often unavailable, deployments must be fast and flexible, and system requirements must be minimal. We present a decentralised cooperative localisation algorithm that addresses all these challenges at once. The method is anchor-less, fully decentralised, and, unlike most existing approaches, does not require controlling the robots' motion to ensure team observability. It relies only on local odometry, sparse inter-agent ranging measurements, and short-range communication, all of which are widely available in practice. The algorithm adopts a multi-hypothesis Bayesian framework that maintains the entire set of feasible solutions, ensuring robustness under transient unobservable conditions. Moreover, through information sharing, each agent benefits from the estimates of the entire group, even in partially connected conditions.
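[MA-2] Phoenix: Safe GitHub Issue Resolution via Multi-Agent LLMs
[Quick read]: This paper addresses reliability and safety in automated GitHub issue resolution, from triage through pull request (PR) creation: without sufficient validation and safety guards, an end-to-end pipeline can easily introduce regressions or inappropriate code changes. The proposed system, Phoenix, splits the work across six specialized agents (Planner, Reproducer, Coder, Tester, Failure Analyst, PR Agent) coordinated by a label-based GitHub webhook state machine. Its key design is seven layered safety controls combined with a baseline-aware test evaluation strategy: every change is checked against a baseline test run before a PR is opened, which guards against pass-to-pass regressions (tests that passed before the change and fail after it). On a 24-instance slice of SWE-bench Lite, Phoenix oracle-resolves 75% of instances with no pass-to-pass regressions on successful runs. In a pilot on 42 real issues across 14 repositories, correctness preservation (CP) is 100%, with a mean time of 122 s on the hard tier. Manual inspection shows that about half of the resulting PRs place code at incorrect paths, a planner localization limitation the authors are addressing with retrieval. The paper also reports the deployment failure modes (WAF filtering, token expiry, permission boundaries, flaky CI) that motivated each safety mechanism.
Link: https://arxiv.org/abs/2606.20243
Authors: Kipngeno Koech,Muhammad Adam,Baimam Boukar Jean Jacques,Joao Barros
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Multiagent Systems (cs.MA)
Comments: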
Abstract:We present Phoenix, a multi-agent LLM system that resolves GitHub issues from triage through pull-request creation, combining seven layered safety controls with a baseline-aware test evaluation strategy. Phoenix decomposes the work across six specialized agents: planner, reproducer, coder, tester, failure analyst, and Pull Request (PR) agent, all coordinated by a label-based GitHub webhook state machine. Every change is checked against a baseline test run before a pull request is opened. On a 24-instance slice of SWE-bench Lite, run on the production webhook path, Phoenix oracle-resolves 75% of instances with no pass-to-pass regressions on successful runs; this curated slice is not directly comparable to full-split leaderboard results, and we discuss the limits of the comparison. A complementary pilot on 42 real issues across 14 repositories yields 100% correctness preservation (CP; mean 122s on the hard tier). Manual inspection shows that about half of the resulting pull requests are well-targeted fixes. The other half place code at incorrect paths, a planner localization limitation we are addressing with retrieval. We also report the deployment failure modes (WAF filtering, token expiry, permission boundaries, flaky CI) that motivated each safety mechanism.
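The idea of checking a change against a baseline test run can be sketched as follows. This is a minimal illustration under stated assumptions, not Phoenix's implementation: the function names, the dictionary format for test results, and the rule that a missing test counts as a failure are all hypothetical.

```python
def find_regressions(baseline, candidate):
    """Return tests that passed in the baseline run but do not pass after the change.

    Both arguments map test names to True (passed) or False (failed).
    A test missing from the candidate run is treated as failing (assumption).
    """
    return sorted(
        name
        for name, passed in baseline.items()
        if passed and not candidate.get(name, False)
    )


def safe_to_open_pr(baseline, candidate):
    """Allow a pull request only when there is no pass-to-pass regression."""
    return not find_regressions(baseline, candidate)


# Hypothetical test results before and after a proposed change.
baseline_run = {"test_parse": True, "test_render": False, "test_export": True}
candidate_run = {"test_parse": True, "test_render": True, "test_export": False}

print(find_regressions(baseline_run, candidate_run))
print(safe_to_open_pr(baseline_run, candidate_run))
```

Comparing against the baseline rather than demanding an all-green run means that a test which already failed before the change (`test_render` above) is never counted against the change.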
[MA-3] A Multi-Agent system for Multi-Objective constrained optimization AAMAS2026
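[Quick read]: This paper addresses constrained optimization with reinforcement learning (RL) in dynamic environments, where manually chosen weights make it hard to balance optimizing the primary objective against avoiding constraint violations. Conventional approaches fold cost minimization and performance constraints into a single scalar reward through Lagrangian-inspired weighted penalty terms, but the learned policy depends critically on the hand-set weights, and manual tuning adapts poorly to non-stationary environments where the relative importance of the objectives shifts. The paper proposes MAMO (Multi-Agent system for Multi-Objective constrained optimization), which formulates the selection of reward weights as a learning problem solved with multi-agent RL. By decoupling task execution from objective design, it adjusts the weights automatically, a first step towards more autonomous and robust RL-based solutions for constrained optimization in dynamic environments.
Link: https://arxiv.org/abs/2606.20236
Authors: Federica Filippini
Affiliation: University of Milano-Bicocca
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: Presented at the 17th Workshop on Optimization and Learning in Multiagent Systems (OptLearnMAS, this https URL), co-located with the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)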
Abstract:Many decision-making problems in computing and networking systems can be naturally formulated as cost-minimization problems under performance constraints. In dynamic environments, reinforcement learning (RL) is often used to solve such problems at runtime by embedding both costs and constraint violations into a single scalar reward through weighted penalty terms, following a Lagrangian-inspired formulation. However, in this context the behavior of the learned policy critically depends on the choice of these weights, which are typically selected manually. This makes it difficult to identify an appropriate trade-off between optimizing the primary objective and effectively avoiding constraint violations, particularly in non-stationary environments where their relative importance may change. This paper presents MAMO (Multi-Agent system for Multi-Objective constrained optimization), an approach to tackle this balancing problem through multi-agent RL. MAMO decouples task execution from objective design by formulating the selection of reward weights as a learning problem, providing a first step towards more autonomous and robust RL-based solutions for constrained optimization problems in dynamic environments.
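The sensitivity to hand-picked weights can be seen in a small sketch of a Lagrangian-inspired scalar reward. The reward form, the two candidate actions, and all numbers are illustrative assumptions; the abstract does not give MAMO's actual reward definition.

```python
def scalar_reward(cost, violations, weights):
    """Negative cost minus weighted constraint violations.

    `violations` holds one non-negative violation amount per constraint and
    `weights` the matching penalty weights (the hand-tuned quantities).
    """
    penalty = sum(w * max(0.0, v) for w, v in zip(weights, violations))
    return -(cost + penalty)


def best_action(actions, weights):
    """Pick the action with the highest scalar reward under the given weights."""
    return max(actions, key=lambda name: scalar_reward(*actions[name], weights))


# Hypothetical actions: (cost, [violation of the single constraint]).
ACTIONS = {
    "cheap_but_violating": (1.0, [0.5]),
    "costly_but_feasible": (2.0, [0.0]),
}

print(best_action(ACTIONS, weights=[1.0]))
print(best_action(ACTIONS, weights=[3.0]))
```

With weight 1.0 the violating action scores -1.5 against -2.0 and wins; with weight 3.0 it scores -2.5 and loses. The preferred behavior flips purely because of the weight, which is the choice that MAMO proposes to learn rather than set by hand.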
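[MA-4] RACL: Reasoning-Agent Control Layers for Continuous Metaheuristic Learning
[Quick read]: This paper addresses the lack of adaptive control in metaheuristics for complex optimization problems, in particular their difficulty in adjusting strategy autonomously when the search stagnates or is trapped in a local optimum. The challenge is to control search behavior dynamically and explainably without changing the existing optimizer or the business constraints. The proposed Reasoning-Agent Control Layer (RACL) places a separate reasoning agent above the optimizer. The agent observes operational memory, reasons over past search behavior, formulates bounded hypotheses, tests interventions, evaluates outcomes, applies guardrails, and consolidates useful policies, so that control rules for the metaheuristic are discovered, validated, consolidated, and explained. With vehicle routing as the testbed, RACL improves on or ties the Operational Memory Policy in 21 of 21 feasible cases and improves on or ties the non-reasoning Stagnation-Triggered Policy (STP) in 18 of 21 feasible cases, with an average cost delta of -0.641% versus STP. In the Sevilla-9/10 runtime sample, it reduces average cost by 8.337% versus Fixed and by 1.605% versus STP without material computational overhead. During the proof of concept, Codex served as an in-the-loop reasoning agent proposing live bounded interventions; a policy proxy was used later only to make the quantitative evaluation reproducible.
Link: https://arxiv.org/abs/2606.20142
Authors: Antón Asla Manzárraga
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 10 pages, 5 tables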
Abstract:This paper introduces RACL, a Reasoning-Agent Control Layer for metaheuristics. RACL places a reasoning agent above an existing optimizer. The agent does not replace the optimizer and does not modify business constraints. Instead, it controls the optimizer’s internal search behavior by observing operational memory, reasoning over past behavior, formulating bounded hypotheses, testing interventions, evaluating outcomes, applying guardrails, consolidating useful policies and explaining its decisions. The experiment uses vehicle routing as a testbed, but the contribution is not a new routing solver, a particular ALNS configuration or a specific set of routing rules. The contribution is the RACL method: a way for a reasoning agent to discover, validate, consolidate and explain algorithmic control rules for a metaheuristic. In the current experimental setting, RACL improves or ties the Operational Memory Policy in 21 of 21 feasible cases and improves or ties a non-reasoning Stagnation-Triggered Policy in 18 of 21 feasible cases, with an average RACL vs STP cost delta of -0.641%. In the Sevilla-9/10 runtime sample, RACL improves average cost by -8.337% versus Fixed and -1.605% versus STP without showing material computational overhead. During the proof-of-concept, Codex was used as an in-the-loop reasoning agent observing executions, interpreting logs and proposing live bounded interventions. The policy proxy was later used only to make quantitative evaluation reproducible.
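[MA-5] ScaffoldAgent: Utility-Guided Dynamic Outline Optimization for Open-Ended Deep Research
[Quick read]: This paper addresses open-ended deep research (OEDR), where the report outline is either fixed before writing or refined with local heuristics during multi-round retrieval and generation, which causes scaffold drift as information accumulates and delays the feedback needed to evaluate outline modifications. The proposed ScaffoldAgent is a utility-guided dynamic outline optimization framework. It models outline evolution as a structured decision process with three operations (Expansion, Contraction, Revision), which enables controlled updates to the report scaffold. A utility-guided feedback mechanism estimates the downstream value of each outline operation from retrieval gain, structural coherence, and trial-generation quality, and the resulting signal guides node selection, operation scheduling, and termination during inference. Experiments on DeepResearch Bench and DeepResearch Gym show consistent improvements in long-form report generation and factual grounding over existing deep research agents.
Link: https://arxiv.org/abs/2606.20122
Authors: Zhibang Yang,Xinke Jiang,Yuzhen Xiao,Ruizhe Zhang,Yue Fang,XinFei Wan,Zhengxing Song,Yuxuan Liu,Yuheng Huang,Xu Chu,Junfeng Zhao,Yasha Wang
Affiliation: Peking University; GRG Banking Equipment Co., Ltd.
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 9 pages, 6 figures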
Abstract:Open-ended deep research (OEDR) requires systems to acquire knowledge through multi-round retrieval and generate coherent long-form reports. The outline plays a central role as a structural scaffold that coordinates retrieval, evidence organization, and generation. However, existing methods either fix the outline before writing or refine it with local heuristics, leading to scaffold drift under continuous information accumulation and delayed feedback for evaluating outline modifications. We propose ScaffoldAgent, a utility-guided dynamic outline optimization framework for OEDR. ScaffoldAgent models outline evolution as a structured decision process with three operations: Expansion, Contraction, and Revision, enabling controlled updates to the report scaffold. It further introduces a utility-guided feedback mechanism that estimates the downstream value of each outline operation from retrieval gain, structural coherence, and trial-generation quality. The resulting utility signal guides node selection, operation scheduling, and termination during inference. Experiments on DeepResearch Bench and DeepResearch Gym show that ScaffoldAgent consistently improves long-form report generation and factual grounding over existing deep research agents.
[MA-6] Blame is easier than praise: Measuring off-ball defensive performance in football
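[Quick read]: This paper addresses a long-standing limitation in evaluating defensive performance in football: existing metrics rely on a small number of discrete actions such as tackles and interceptions and overlook a player's continuous impact through positional behaviour. The core challenge is attributing defensive responsibility over multi-agent spatiotemporal trajectories in an interpretable way without player-level ground truth labels. The proposed framework computes player involvement scores from defensive pressure areas (DPAs) and builds role-conditioned baselines within automatically detected team structures, which quantifies each defender's expected responsibility for the threat created through arbitrary passes. Validated on a large cross-gender, cross-competition data set (64 matches of the men's World Cup, 116 matches of the women's German Bundesliga, and 336 matches of the men's German 3. Liga), the method achieves a validity score about 1 standard deviation higher than the best action-based metric and shows that many popular measures have limited validity. In particular, the "blame" for conceding high-value actions correlates strongly with external ratings and market values, which makes it the first published football metric to reliably measure positioning errors. Because no ground truth exists, the authors also propose an evaluation protocol that combines multiple relatively weak proxies into robust summary scores. All code is publicly available.
Link: https://arxiv.org/abs/2606.19931
Authors: Jonas Bischofberger,Runqing Ma,Pascal Bauer,Kilian Arnsmeyer,Arnold Baca
Affiliation: Unknown
Categories: Multiagent Systems (cs.MA)
Comments: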
Abstract:The defensive performance of football players is commonly measured through a limited number of actions like tackles and interceptions while their continuous impact through positional behaviour has hardly been studied before. We formulate this problem as an attribution over multi-agent spatiotemporal trajectories without player-level ground truth labels, where event-level changes of expected threat are distributed among individuals. We propose a framework that performs this attribution using player involvement scores calculated from defensive pressure areas (DPAs). By computing role-conditioned baselines within automatically detected team structures, we can determine each defender’s expected responsibility for threat created through arbitrary passes. The validity and robustness of this approach are evaluated on a uniquely extensive cross-gender and cross-competition data set, including positional and event data from 64 matches of the men’s World Cup, 116 matches of the women’s German Bundesliga and 336 matches of the men’s German 3. Liga. In the absence of a ground truth, we propose an evaluation protocol that combines multiple relatively weak proxies into robust summary scores. We find a validity score that is improved by around 1 standard deviation compared to the best action-based metric and demonstrate that many popular measures show limited validity. The “blame” for conceding high-value actions shows especially strong correlations with external ratings and market values, making it the first published metric in football to reliably measure positioning errors. All code underlying this work is publicly available to support reproducibility and further research.
[MA-7] Deep-Unfolded Coordination
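[Quick read]: This paper addresses the hyperparameter tuning burden of distributed optimization in multi-agent robotics, in particular for non-convex solvers such as ADMM-DDP, which are highly sensitive to hyperparameters such as penalty parameters and usually need extensive problem-specific manual tuning. The proposed Deep Coordinator is a deep-unfolding framework that learns to adjust the hyperparameters of ADMM-DDP dynamically at solve-time in response to optimizer performance. It unrolls a fixed number of ADMM-DDP iterations into a neural network, with learnable functions between layers that map the current optimizer state to the next hyperparameters. To the authors' knowledge, it is the first deep-unfolding framework to adapt the penalty parameters of a non-convex optimizer at solve-time. Because the mainstream supervised approach can yield degenerate solutions when training such models, the authors propose an unsupervised learning scheme. In simulations with fleets of cars and quadrotors, Deep Coordinator produces trajectories of comparable quality 6.18-9.44x faster than conventional solvers and retains its performance benefits on systems up to 8x larger than those it was trained on.
Link: https://arxiv.org/abs/2606.19920
Authors: Hunter Kuperman,Minchan Jung,Rahul V. Ghosh,Alex Oshin,Evangelos A. Theodorou
Affiliation: Georgia Institute of Technology
Categories: Robotics (cs.RO); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: The second and third authors contributed equally (equal second authorship). 35 pages (10 pages main text), 17 figures, 3 tables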
Abstract:Distributed optimization is a highly scalable and structurally transparent technique to solve multi-agent robotics problems; however, such methods often suffer from the need for highly-specialized, problem-specific hyperparameter tunings. In this work, we propose Deep Coordinator, a deep-unfolding framework that learns to dynamically adjust the hyperparameters of ADMM-DDP, a popular distributed solver for robotics tasks, at solve-time in response to optimizer performance. Our architecture consists of unrolling a fixed number of ADMM-DDP iterations into a neural network with learnable functions between layers mapping the optimizer state to the next hyperparameters. To the best of our knowledge, Deep Coordinator is the first deep-unfolding framework to adapt the penalty parameters of a non-convex optimizer at solve-time; we show that the mainstream supervised approach can yield degenerate solutions when training such models, and propose an unsupervised learning scheme. On simulations with fleets of cars and quadrotors, Deep Coordinator produces trajectories of comparable quality 6.18-9.44x faster than conventional solvers. Furthermore, Deep Coordinator retains its performance benefits when deployed to systems up to 8x larger than trained on.
[MA-8] Heterogeneous LLM Debate Under Adversarial Peers: Honest Gains, Replacement Costs, and Resilience
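[Quick read]: This paper examines the trade-off between correction and adversarial influence when large language models (LLMs) from different families debate one another. The central question is whether, in a panel of heterogeneous models, an honest heterogeneous peer suppresses harmful revision and whether an adversarial heterogeneous peer amplifies it. The study compares the revision behavior of honest agents across a homogeneous baseline, an honest-mixed panel, and an adversarial-mixed panel, as well as contaminated panels that already contain a malicious same-family peer. Adding an honest heterogeneous peer sharply lowers the harmful-revision rate (for Llama-3.1-70B on MATH-hard, from 89% to 35%), while an adversarial peer reverses the effect. When a malicious same-family peer is already present, an added honest heterogeneous peer still lowers both harmful revision and the loss of initially correct answers (the flip rate falls from 31% to 6%). Heterogeneity is therefore not only an attack surface but also, when an adversary is already present, a defense.
Link: https://arxiv.org/abs/2606.19826
Authors: Prashanti Nilayam,Kiran Kumar Ramanna,Prashil Tumbade,Sankalp Nayak
Affiliation: ServiceNow
Categories: Cryptography and Security (cs.CR); Multiagent Systems (cs.MA)
Comments: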
Abstract:Heterogeneous LLM debate is motivated by the promise that diverse peers correct one another, but the same exchange that carries correction also carries adversarial influence. We measure which dominates by tracking how a heterogeneous peer changes the honest agents’ revision behavior: how often they change their answer, and whether the change is corrective or harmful. We compare matched panels (homogeneous baseline, honest-mixed, and adversarial-mixed) and contaminated panels in which a malicious same-family peer is already present, spanning four model families and three reasoning benchmarks. An honest heterogeneous peer sharply lowers harmful revision, and an adversarial one reverses it. For Llama-3.1-70B defenders on MATH-hard, the honest-slot harmful-revision rate falls from 89% in the homogeneous panel to 35% with an honest peer, and an adversarial peer returns it to 90%. The conditional rate hides this damage on weak defenders, but the end-of-debate flip rate exposes it. The pattern keeps its sign across families and benchmarks while its magnitude varies with the defender-benchmark regime. We also measure the effects when an adversarial same-family peer is already present: an honest heterogeneous peer lowers both harmful revision and the rate at which initially-correct answers are lost. On the same Llama-3.1-70B setting, the added honest peer cuts the flip rate on initially-correct items from 31% under a same-family adversary to 6%. Heterogeneity is therefore not only an attack surface but, when an adversary is already present, also a defense.
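The difference between a conditional harmful-revision rate and an end-of-debate flip rate can be made concrete with a short sketch. The definitions below are assumptions for illustration, since the abstract does not spell out the paper's exact formulas: a revision is any change of answer, a harmful revision moves from a correct to an incorrect answer, and the flip rate is measured over initially correct items only.

```python
def revision_metrics(records):
    """Compute revision statistics from (initial, final, gold) answer triples."""
    records = list(records)
    revised = [r for r in records if r[0] != r[1]]
    harmful = [r for r in revised if r[0] == r[2]]      # correct -> incorrect
    corrective = [r for r in revised if r[1] == r[2]]   # incorrect -> correct
    initially_correct = [r for r in records if r[0] == r[2]]
    flipped = [r for r in initially_correct if r[1] != r[2]]

    def ratio(part, whole):
        return len(part) / len(whole) if whole else 0.0

    return {
        "revision_rate": ratio(revised, records),
        "harmful_given_revision": ratio(harmful, revised),
        "corrective_given_revision": ratio(corrective, revised),
        "flip_rate_initially_correct": ratio(flipped, initially_correct),
    }


# Hypothetical debate outcomes for one honest agent; the gold answer is always "a".
RECORDS = [
    ("a", "a", "a"),  # correct and unchanged
    ("a", "b", "a"),  # harmful revision
    ("b", "a", "a"),  # corrective revision
    ("c", "d", "a"),  # revised, still wrong
]

metrics = revision_metrics(RECORDS)
print(metrics)
```

The harmful revisions and the flipped items are the same records; only the denominator differs. A weak defender that starts with few correct answers can therefore show a modest conditional rate while still losing a large share of what it initially got right.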
[MA-9] SIGMA: Skill-Incidence Graphs for Compositional Multi-Agent Design EMNLP2026
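[Quick read]: This paper addresses a limitation of existing graph-based multi-agent system (MAS) designers: because agent nodes form a fixed closed set, these methods struggle to generalize to tasks that require unseen combinations of capabilities. Conventional methods improve collaboration by optimizing communication topologies over predefined agents, roles, or groups, and the static agent structure cannot adapt to the capability needs of new tasks. The proposed SIGMA framework builds each agent as a task-conditioned bundle of reusable skills using a skill-incidence graph. Given a task and a skill library, it predicts a skill-agent incidence matrix, composes agent node embeddings from the selected skills, and decodes a communication topology over the constructed agents. During execution, skill-specific mailboxes route messages to the relevant assigned capabilities, which makes the incidence structure directly operational. Across six reasoning and coding benchmarks, SIGMA achieves the best average performance and improves over CARD, the strongest non-compositional topology-based baseline, by 2.06, 2.36, and 1.75 points on the three base LLMs. It is also more robust to unseen skill libraries, with an average performance drop of only 0.96 points. The results suggest that compositional node construction is a complementary and important axis for multi-agent design beyond communication topology optimization.
Link: https://arxiv.org/abs/2606.19758
Authors: Kun Zeng,Yu Huo,Siyu Zhang,Yuecheng Zhuo,Yuquan Lu,Haoyue Liu,Siyue Chen,Xiaoying Tang
Affiliation: Unknown
Categories: Multiagent Systems (cs.MA)
Comments: EMNLP2026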
Abstract:Existing graph-based multi-agent system (MAS) designers mainly improve collaboration by optimizing communication topologies over predefined agents, roles, or groups. However, because each node remains a closed-set entity, these methods struggle to generalize to tasks that require unseen combinations of capabilities. We propose SIGMA, a skill-incidence graph framework that constructs agents as task-conditioned bundles of reusable skills. Given a task and a skill library, SIGMA predicts a skill-agent incidence matrix, composes agent node embeddings from selected skills, and decodes a communication topology over the constructed agents. During execution, skill-specific mailboxes route messages to the relevant assigned capabilities, making the incidence structure directly operational. Across six reasoning and coding benchmarks with three base LLMs, SIGMA achieves the best average performance and improves over CARD, the strongest non-compositional topology-based baseline, by 2.06, 2.36, and 1.75 points, respectively. It also shows stronger robustness to unseen skill libraries, with an average performance drop of only 0.96 points. These results suggest that compositional node construction is a complementary and important axis for multi-agent design beyond communication topology optimization. Code is available at this https URL.
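A skill-agent incidence matrix and the mailbox routing it induces can be sketched in a few lines. Everything below is a hypothetical illustration: the skill names, the embeddings, and in particular the choice of averaging skill embeddings to compose an agent are assumptions, since the abstract does not state SIGMA's composition operator.

```python
def compose_agents(incidence, skill_embeddings):
    """Build one embedding per agent from its assigned skills.

    incidence[s][a] is 1 when skill s is assigned to agent a.
    Composition by averaging is an assumption made for this sketch.
    """
    n_agents = len(incidence[0])
    dim = len(skill_embeddings[0])
    agents = []
    for a in range(n_agents):
        assigned = [s for s, row in enumerate(incidence) if row[a]]
        if not assigned:
            agents.append([0.0] * dim)
            continue
        agents.append(
            [sum(skill_embeddings[s][d] for s in assigned) / len(assigned) for d in range(dim)]
        )
    return agents


def skill_mailboxes(incidence, skill_names):
    """Map each skill to the agents holding it, so messages can be routed by skill."""
    return {
        name: [a for a, flag in enumerate(row) if flag]
        for name, row in zip(skill_names, incidence)
    }


SKILLS = ["math", "code", "search"]          # hypothetical skill library
EMBEDDINGS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
INCIDENCE = [                                 # rows: skills, columns: agents
    [1, 0],
    [1, 1],
    [0, 1],
]

print(compose_agents(INCIDENCE, EMBEDDINGS))
print(skill_mailboxes(INCIDENCE, SKILLS))
```

Because agents are columns of the matrix rather than fixed entities, a new task can yield a different set of columns over the same skill library, which is the compositional property the paper emphasizes.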
[MA-10] Library-Aware Doubles and Iterative Repair for Large Language Model-Generated Unit Tests in OpenSIL Firmware
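[Quick read]: This paper addresses the difficulty of generating unit tests (UTs) for low-level C firmware under strict build constraints, where missing headers, unresolved symbols, and dependency mismatches frequently cause compilation and linking failures. It proposes an automated, large language model (LLM) guided multi-agent UT authoring workflow for the Open-Source Silicon Initialization Library (openSIL) codebase maintained by AMD. The workflow combines automated generation of test scaffolds, library-aware creation or reuse of stubs, mocks, and fakes, and an iterative compile-dispatch repair loop driven by build logs and line-coverage feedback. Across 76 functions under test, it generated compilable UTs for 73. Without line-coverage guidance or retrieval augmentation, mean line coverage reached 73.9%. On a 48-function subset, mean line coverage reached 98.8% with line-coverage guidance alone and 94.7% when combined with vector-database retrieval. The results indicate that generation-and-repair pipelines can substantially improve UT creation efficiency and coverage in constrained firmware environments while reducing manual debugging effort.
Link: https://arxiv.org/abs/2606.19725
Authors: Ma Toan Bach,Yuchi Zheng,Haingo Razafindranto,Tanvir Alam,Aric Leather,Ranveer Sandhu,Jitesh Arora
Affiliation: Seneca Polytechnic; Advanced Micro Devices (AMD)
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 20 pages, 10 figures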
Abstract:Validating changes in low-level C firmware is expensive because unit tests (UTs) are fragile under strict build constraints, where missing headers, unresolved symbols, and dependency mismatches frequently prevent compilation and linking. This study introduces an automated UT authoring workflow for the Open-Source Silicon Initialization Library (openSIL) firmware codebase maintained by Advanced Micro Devices (AMD) that reduces manual effort through a large language model (LLM) guided multi-agent pipeline. The workflow combines automated generation of test scaffolds, library-aware creation or reuse of stubs, mocks, and fakes, and an iterative compile-dispatch repair loop driven by build logs and line-coverage feedback. We evaluate the approach using compilation success, repair iterations, dispatch success, and line coverage, with time, cost, and token usage as secondary measures. Across 76 functions under test, the workflow generated compilable UTs for 73 functions. In a configuration without line coverage guidance or retrieval augmentation, mean line coverage reached 73.9%. On a 48-function subset evaluated under both configurations, mean line coverage reached 98.8% with line-coverage guidance alone and reached 94.7% when combined with vector-database retrieval. Results show that automated generation-and-repair pipelines can substantially improve UT creation efficiency and coverage for constrained firmware environments while reducing manual debugging effort.
[MA-11] Exit-and-Join Dynamics for Decentralized Coalition Formation
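[Quick read]: This paper studies decentralized coalition formation in cooperative games and characterizes how unilateral exit-and-join decisions based on local payoff evaluation drive the system towards stable coalition structures. The key idea is to use the Aumann-Dreze value as the local payoff rule, so that each agent evaluates a move from the payoff allocation within its own coalition rather than through a globally negotiated coalition structure. This links cooperative payoff allocation with noncooperative best-response behavior: a terminal partition is precisely a coalition structure with no admissible, individually profitable exit-and-join deviation. The paper establishes equilibrium characterizations, identifies conditions under which the dynamics admit scalar Lyapunov or exact-potential representations, and analyzes how switching and acceptance costs shape local stability. Numerical experiments test finite-time stabilization, cost sensitivity, and a special convex-game benchmark.
Link: https://arxiv.org/abs/2606.19683
Authors: Quanyan Zhu
Affiliation: New York University Tandon School of Engineering
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Comments: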
Abstract:This paper studies coalition formation as a decentralized dynamical process driven by unilateral exit-and-join decisions. Agents evaluate local moves using the Aumann-Dreze value, so payoffs are computed within the agent’s current coalition rather than through a globally negotiated coalition structure. The resulting model links cooperative payoff allocation with noncooperative best-response behavior: a terminal partition is precisely a coalition structure with no admissible, individually profitable exit-and-join deviation. We establish equilibrium characterizations, identify conditions under which the dynamics admit scalar Lyapunov or exact-potential representations, and analyze how switching and acceptance costs shape local stability. Numerical experiments test finite-time stabilization, cost sensitivity, and a special convex-game benchmark.
[MA-12] Formal Verification of Learned Multi-Agent Communication Policies via Decision Tree Distillation IROS2026
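[Quick read]: This paper addresses the lack of formal safety guarantees for neural policies in multi-agent reinforcement learning (MARL), which matters for safety-critical deployments such as drone swarms and autonomous vehicle fleets. MARL lets agents coordinate through emergent communication, but the black-box nature of deep networks makes verifiable safety analysis hard. The paper presents what the authors describe as the first end-to-end framework for safety verification of learned multi-agent communication policies through policy abstraction: neural policies are distilled into interpretable decision trees and then formally verified. The pipeline has four stages: domain-specific feature extraction from agent observations; decision tree distillation with 97.9% ± 1.2% fidelity; automated translation to PRISM probabilistic model checker specifications with complete feature-to-state-variable correspondence; and compositional verification of Probabilistic Computation Tree Logic (PCTL) properties via pairwise decomposition with union-bound aggregation. On Vector-Quantized Variational Information Bottleneck (VQ-VIB) policies for multi-drone coordination with 5-7 agents, the framework checks 18 temporal logic properties covering safety, liveness, and cooperation and reaches 88.9% property satisfaction, with all five safety thresholds satisfied (0.3% collision probability against a 1% threshold). Monte Carlo validation of the original neural policies confirms that the verified safety properties transfer with a deviation of only 0.6 percentage points (95% CI). Discrete VQ-VIB messages give fidelity advantages of 11.6 to 13.6 percentage points over continuous methods, which enables 3-4x faster verification. The framework serves as a practical bridge between deep MARL and formal safety workflows for multi-robot deployment.
Link: https://arxiv.org/abs/2606.19632
Authors: Ahmad Farooq,Kamran Iqbal
Affiliation: University of Arkansas at Little Rock
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Multiagent Systems (cs.MA)
Comments: 9 pages, 3 figures, 7 tables. Accepted at the 2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026), Pittsburgh, Pennsylvania, USA, September 27-October 1, 2026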
Abstract:Multi-agent reinforcement learning (MARL) enables agents to develop coordination strategies through emergent communication, but neural policies lack the formal safety guarantees required for safety-critical robotic deployment in drone swarms and autonomous vehicle fleets. We present the first end-to-end framework for safety verification of learned multi-agent communication policies through policy abstraction: neural policies are distilled into interpretable decision trees, then formally verified, with empirical validation confirming that verified safety properties transfer to original networks. Our four-stage pipeline consists of domain-specific feature extraction from agent observations, decision tree distillation achieving 97.9% +/- 1.2% fidelity to neural policies, automated translation to PRISM probabilistic model checker specifications with complete feature-to-state-variable correspondence, and compositional verification of Probabilistic Computation Tree Logic (PCTL) properties via pairwise decomposition with union-bound aggregation and empirical neighbor modeling. Evaluating Vector-Quantized Variational Information Bottleneck (VQ-VIB) policies for multi-drone coordination with 5-7 agents, we verify 18 temporal logic properties across safety, liveness, and cooperation, achieving 88.9% property satisfaction with all five safety thresholds satisfied (0.3% collision probability vs. 1% threshold). Monte Carlo validation of original neural policies confirms that verified safety properties transfer with ≤0.6 percentage-point deviation (95% CI). Discrete VQ-VIB messages provide +11.6 to +13.6 percentage-point fidelity advantages over continuous methods, enabling 3-4x faster verification. Our framework provides empirically validated safety verification for distilled policy abstractions, serving as a practical bridge between deep MARL and formal safety workflows for multi-robot deployment.
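Pairwise decomposition with union-bound aggregation rests on a simple inequality: the probability that any pair collides is at most the sum of the per-pair collision probabilities. The sketch below illustrates that step only. The per-pair probability is a hypothetical number, not a result from the paper, and in the paper's pipeline such values would come from model checking each pair.

```python
from itertools import combinations


def union_bound(pairwise_probabilities):
    """Upper-bound the probability that at least one pairwise event occurs."""
    return min(1.0, sum(pairwise_probabilities))


N_AGENTS = 5
PAIRS = list(combinations(range(N_AGENTS), 2))

# Hypothetical per-pair collision probability, assumed equal for every pair.
P_PAIR = 0.0002

system_bound = union_bound([P_PAIR] * len(PAIRS))
SAFETY_THRESHOLD = 0.01  # the 1% threshold mentioned in the abstract

print(f"{len(PAIRS)} pairs, system-level bound {system_bound:.4f}")
print("threshold satisfied:", system_bound < SAFETY_THRESHOLD)
```

The bound is conservative because it ignores overlap between pairwise events, and the number of pairs grows quadratically with the team size, so the per-pair probabilities must shrink accordingly for larger teams.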
[MA-13] Before the Pull Request: Mining Multi-Agent Coordination
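[Quick read]: This paper addresses the coordination and trust gap that appears when autonomous coding agents collaborate at scale: they open pull requests (PRs) quickly and in large numbers, yet their PRs are accepted less often, and PR-level telemetry cannot explain why. The authors argue that the missing signal lies before the PR, in how concurrent agents claim, divide, and collide over shared work. They study this through grite, an open-source coordination substrate that needs no central server and stores its records inside git itself, so that its append-only, signed event log captures the coordination process directly. The key findings are: (1) the shared substrate reduces duplicate and conflicting work at bounded overhead, with the share of work that merely re-does a teammate's task falling from 78% to 0% while useful throughput more than triples; (2) every agent's copy of the log converges to the same state with no write silently dropped, whereas a file-based tracker loses concurrent writes; (3) the log is a mineable artefact from which concrete failure modes (conflicting edits, lock starvation, redundant rediscovery, race-to-close) are automatically recoverable with provenance, several of them invisible in PR history. The dataset, harness, and mining toolkit are released.
Link: https://arxiv.org/abs/2606.19616
Authors: Dipankar Sarkar
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 9 pages, 2 tables. LNCS format. Code, dataset, and mining toolkit: this https URL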
Abstract:Autonomous coding agents now open millions of pull requests, yet large-scale studies find their PRs are produced faster but accepted less often - a coordination and trust gap that pull-request-level telemetry cannot explain. We argue the missing signal lives before the PR, in how concurrent agents claim, divide, and collide over shared work. We study this process through grite, our open-source coordination substrate that needs no central server and stores its records inside git itself, so its append-only, signed event log captures the coordination process directly. We show that (i) this shared substrate reduces duplicate and conflicting work at bounded overhead - the share of work that merely re-does a teammate’s task falls from 78% to 0% while useful throughput more than triples; (ii) every agent’s copy of the log converges to the same state with no write silently dropped, where a file-based tracker loses concurrent writes; and (iii) the log is a mineable artefact from which concrete failure modes - conflicting edits, lock starvation, redundant rediscovery, race-to-close - are automatically recoverable with provenance, several invisible in pull-request history. We release the dataset, harness, and mining toolkit.
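Why an append-only event log can converge where a shared file loses writes can be sketched with a grow-only, content-addressed set of events. This is an illustrative model under stated assumptions, not grite's actual data format: the event fields and function names are hypothetical, and signatures and git storage are left out.

```python
import hashlib
import json


def event_id(event):
    """Content address of an event: SHA-256 of its canonical JSON form."""
    payload = json.dumps(event, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append(log, event):
    """Return a new log with the event added; existing entries are never modified."""
    new_log = dict(log)
    new_log[event_id(event)] = event
    return new_log


def merge(log_a, log_b):
    """Union of two logs. Identical events share an id, so nothing is dropped or duplicated."""
    merged = dict(log_a)
    merged.update(log_b)
    return merged


def replay(log):
    """Deterministic order for deriving state: by timestamp, then by event id."""
    return [log[key] for key in sorted(log, key=lambda k: (log[k]["ts"], k))]


# Two hypothetical agents write concurrently to their own copies of the log.
shared = append({}, {"ts": 1, "agent": "a1", "op": "claim", "task": "T1"})
copy_a = append(shared, {"ts": 2, "agent": "a1", "op": "close", "task": "T1"})
copy_b = append(shared, {"ts": 2, "agent": "a2", "op": "claim", "task": "T2"})

print(replay(merge(copy_a, copy_b)) == replay(merge(copy_b, copy_a)))
```

Because merging is a set union, it gives the same result in either order, so both copies end up with all three events. A tracker that rewrites a single shared file would instead keep only whichever write landed last.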
[MA-14] Mesh Inference: A Formal Model of Collective Intelligence Without a Center
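[Quick read]: This paper addresses how a population of independent agents, each holding private state and exchanging only admitted, typed observations, can derive a conclusion that no single agent holds alone, with no central coordinator and no agent exposing its internal state. It builds a formal model of mesh inference in which the mesh is a coupled free energy that each agent relaxes locally, and a single admission/emission policy governs three properties. First, mesh inference converges to a unique answer for any admission, symmetric or not, because the coupling is always an M-matrix. Second, it is identification-complete: it derives the centralized optimum exactly when the contributing views are carrier-connected. Third, it is observation-only: no node transmits weights, gradients, or hidden state, confidentiality is the dual of identification, and content-addressed lineage is the only global side-channel. In the linear-Gaussian regime every derived answer is determined, and hence equal to the centralized optimum, at O(diam^2) latency, which is the measured price of removing the center. One such derivation is one turn of a center-free learning loop, which the paper formalizes as architecture rather than proves. The stated open problem is when asking improves the collective rather than corrupting it, that is, whether the non-linear closure derives an upgraded answer or a confident error. To the authors' knowledge, this is the first formal model of mesh inference.
Link: https://arxiv.org/abs/2606.19537
Authors: Hongwei Xu
Affiliation: SYM.BOT
Categories: Multiagent Systems (cs.MA); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 21 pages, 2 figures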
Abstract:We present a formal model of mesh inference: how a population of independent agents, each holding private state and exchanging only admitted, typed observations, derives a conclusion none of them holds alone, with no central coordinator and no agent exposed. No agent shares weights, gradients, or hidden state, and the agents may span different teams, networks, and organizations. Motivated by the observation that asking a model is energy-minimizing inference, we model the mesh as a coupled free energy that each agent relaxes locally. We show that a single admission/emission policy governs three properties. First, mesh inference converges to a unique answer for any admission, symmetric or not, because the coupling is always an M-matrix. Second, it is identification-complete: it derives the centralized optimum exactly when the contributing views are carrier-connected. Third, it is observation-only: no node transmits its internals, and confidentiality is the dual of identification. Content-addressed lineage is the only global side-channel. In the linear-Gaussian regime every derived answer is determined, hence equal to the centralized optimum, at O(diam^2) latency, the measured price of removing the center. One such derivation is one turn of a center-free learning loop, which we formalize as architecture rather than prove. The open problem we state is when asking improves the collective rather than corrupting it: whether the non-linear closure derives an upgraded answer or a confident error. To our knowledge, this is the first formal model of mesh inference.
[MA-15] Deontic Policies for Runtime Governance of Agentic AI Systems
【速读】:该论文旨在解决由大型语言模型(Large Language Models, LLMs)驱动的自主代理型人工智能系统(Autonomous Agentic AI Systems)在企业环境中引发的安全、隐私与合规性挑战。传统访问控制机制(如认证和权限管理)无法充分应对具备跨组织协作、工具调用、数据操作及软件安装能力的智能体所带来的复杂治理需求。核心问题在于,现有策略引擎(如XACML、Rego、Cedar)仅支持“允许/禁止”类规则,缺乏对义务(obligation)生命周期管理、政策冲突化解、特定情境下义务豁免(dispensation)、以及基于领域本体(ontology)的推理能力的支持,而这在医疗、网络安全、数据隐私等高合规要求场景中至关重要。本文提出的解决方案——AgenticRei,其关键在于采用基于Rei框架构建的道义逻辑(deontic logic)策略语言,并以Web本体语言(OWL)形式表达,通过一个独立于LLM的高性能逻辑引擎在运行时进行评估。该方法不仅实现了对允许/禁止规则的管控,还完整支持义务设定、豁免机制、冲突消解及跨领域本体推理,统一监管智能体的工具调用与代理间通信行为。实验表明,此类道义策略能够刻画当前生产级引擎难以表达的深层次安全与隐私治理约束,且可自然集成至A2AS等工业标准框架中。
链接: https://arxiv.org/abs/2606.19464
作者: Anupam Joshi,Tim Finin,Karuna Pande Joshi,Lalana Kagal
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 10 pages, 1 figure. To be published in the 2026 IEEE Symposium on Agentic Services which is part of the IEEE Conference on Web Services
Abstract:Autonomous agentic AI systems driven by Large Language Models (LLMs) introduce a new class of security, privacy, and compliance challenges: an agent that can invoke tools, manipulate data, install software, and coordinate with peer agents across organizational boundaries must be constrained not just by authentication and access control, but by the full structure of enterprise governance. This includes specifying what agents are permitted and prohibited from doing, what they are obliged to do after certain actions (e.g., notify the CISO), under what conditions a standing obligation may be waived, and which rules take precedence when policies conflict. This governance problem exceeds what current policy engines provide. Systems such as XACML, Rego, and Cedar address only the permit/prohibit subset of this governance structure. They do not provide obligation lifecycle management, meta-policy conflict resolution, dispensations that waive obligations in specific circumstances, or ontological reasoning over domain class hierarchies commonly found in applications such as healthcare, cybersecurity, or data privacy. We propose AgenticRei, which realizes key governance requirements such as obligations, dispensations, policy conflict resolutions, and reasoning over policies, as well as the basic permit/prohibit constraints. We use a deontic policy language built on the Rei framework, expressed as OWL (Web Ontology Language) and evaluated at runtime by a high-performance logic engine entirely outside the LLM. The same pipeline governs both tool invocations by the agent and agent-to-agent messages. We show through examples that deontic policies capture governance constraints around security and privacy that mostly cannot be expressed in current production engines. Our approach composes naturally with industry-standard frameworks like A2AS.
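The four deontic notions named in the abstract (permit, prohibit, obligation, dispensation) can be sketched in a few lines of Python. This is a toy evaluator, not AgenticRei, Rei, or OWL: the `Rule` class, the `evaluate` function, the "prohibit overrides permit" meta-policy, and the default-deny behaviour are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Rule:
    modality: str          # "permit", "prohibit", "oblige", or "dispense"
    action: str            # the action this rule is about
    target: str = ""       # for oblige/dispense: the follow-up duty
    condition: Callable[[dict], bool] = lambda ctx: True

def evaluate(rules, action, ctx):
    """Return (allowed, outstanding_obligations) for one requested action."""
    applicable = [r for r in rules if r.action == action and r.condition(ctx)]
    modalities = {r.modality for r in applicable}
    # Assumed meta-policy: a prohibition beats a permission; no permit means deny.
    allowed = "permit" in modalities and "prohibit" not in modalities
    if not allowed:
        return False, []
    obligations = {r.target for r in applicable if r.modality == "oblige"}
    waived = {r.target for r in applicable if r.modality == "dispense"}
    return True, sorted(obligations - waived)

# Hypothetical policy set, loosely following the "notify the CISO" example.
rules = [
    Rule("permit", "install_software"),
    Rule("prohibit", "install_software",
         condition=lambda ctx: ctx.get("host") == "prod"),
    Rule("oblige", "install_software", target="notify_ciso"),
    Rule("dispense", "install_software", target="notify_ciso",
         condition=lambda ctx: ctx.get("sandbox", False)),
]

print(evaluate(rules, "install_software", {"host": "dev", "sandbox": False}))
print(evaluate(rules, "install_software", {"host": "dev", "sandbox": True}))
print(evaluate(rules, "install_software", {"host": "prod"}))
```

A permit/prohibit-only engine would stop after computing `allowed`; the obligation and dispensation handling is the part the abstract says such engines lack.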
[MA-16] Human-like autonomy emerges from self-play and a pinch of human data
【速读】:该论文旨在解决纯自对弈强化学习(self-play reinforcement learning)在自动驾驶策略训练中导致的行车行为与人类驾驶习惯不兼容的问题。此类方法虽可利用大规模低成本仿真替代昂贵的人类驾驶示范数据,但其生成的策略往往演化出与人类驾驶员行为模式相悖的“异化”驾驶惯例,造成实际部署中的安全与协同风险。现有解决方案依赖复杂的奖励工程和领域随机化(domain randomization),但存在鲁棒性差、人工成本高等缺陷。本文提出一种新方法,将少量人类示范数据作为正则化目标,叠加于最小安全目标达成奖励之上,实现对策略行为的引导。该设计类似于“调味料之于佳肴”,仅需30分钟的人类示范数据(较典型模仿学习方法减少2500倍),即可有效引导策略学习与人类轨迹保持一致,同时在单个消费级GPU上15小时内完成训练,显著提升了训练效率与行为对齐性。
链接: https://arxiv.org/abs/2606.19370
作者: Daphne Cornelisse,Julian Hunt,Zixu Zhang,Waël Doulazmi,Kevin Joseph,Jaime Fernández Fisac,Eugene Vinitsky
机构: NYU Tandon School of Engineering (纽约大学坦登工程学院); NYU Courant (纽约大学库朗研究所); Princeton University (普林斯顿大学); Centre for Robotics, Mines Paris (巴黎矿业机器人中心); Valeo (法雷奥)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 10 pages
Abstract:Self-play reinforcement learning has recently emerged as a way to train driving policies without any human data. It uses cheap, large-scale simulations to substitute expensive, large-scale human driving demonstrations. A key limitation of this approach is that policies trained through pure self-play can learn effective but alien driving conventions incompatible with people. Previous works attempt to mitigate such behavioral misalignments through extensive reward engineering and domain randomization, which are brittle and labor-intensive. Instead of completely discarding human demonstrations, our method treats them as a regularization objective on top of a minimal safe goal-reaching reward. Like the spice in a good stew, we find that a little human data goes a long way: our method uses only 30 minutes of human demonstrations, 2500x fewer than comparable imitation learning approaches. Resulting policies coordinate with held-out human trajectories and complete training in 15 hours on a single consumer-grade GPU. Videos and full source code are available at this https URL.
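The abstract describes human demonstrations as a regularization objective on top of a goal-reaching reward, without giving the exact loss. One common way to express that idea, assumed here purely for illustration, is to add a weighted behaviour-cloning term (negative log-likelihood of the human actions) to the reinforcement-learning loss. The function names and the weight `lam` are hypothetical.

```python
import math

def bc_nll(policy_probs, human_actions):
    """Mean negative log-likelihood of human actions under the policy.

    policy_probs[t][a] is the policy's probability of action a at step t.
    """
    total = sum(-math.log(policy_probs[t][a]) for t, a in enumerate(human_actions))
    return total / len(human_actions)

def regularized_loss(rl_loss, policy_probs, human_actions, lam=0.1):
    """Self-play loss plus a small human-imitation penalty (assumed form)."""
    return rl_loss + lam * bc_nll(policy_probs, human_actions)

# Toy numbers: two timesteps, two discrete actions.
probs = [[0.5, 0.5], [0.9, 0.1]]
human = [0, 0]
print(regularized_loss(1.0, probs, human, lam=0.1))
```

With `lam = 0` this reduces to pure self-play; a small positive `lam` pulls the policy toward the demonstrated actions without replacing the reward.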
[MA-17] Gender Bias in LLM Hiring Decisions: Evidence from a Japanese Context and Evaluation of Mitigation Strategies
【速读】:该论文旨在解决生成式 AI(Generative AI)在非西方文化背景下的招聘应用中是否存在性别偏见的问题,尤其关注日本企业语境下是否存在对女性候选人的偏好偏差,并评估两种实际可操作的缓解策略的有效性。研究发现,在使用60份日式履历(rirekisho格式)与12组基于语言学性别信号筛选的名字配对,通过五种先进大语言模型(Claude Sonnet 4.6、GPT-4o、DeepSeek-V3、Gemini 2.5 Flash、Llama 3.3 70B)进行43,200次API调用后,所有模型均表现出显著的亲女性偏倚,这一结果在非西方语境中复现了西方研究的发现。关键发现在于:在提示层面上施加性别中立指令无法有效降低偏倚;而对候选人姓名的依赖性分析明确指出,姓名是主要的性别信息通道——从提示中移除姓名可使女性效应几乎完全消失。此外,研究揭示了一个意外的实践挑战:隐私过滤器与GPT-4o的内容安全机制之间存在不兼容性,导致高达42%的请求被拒绝,凸显了在招聘流程中实施姓名匿名化时的技术部署障碍。因此,解决方案的关键在于识别并消除输入提示中的姓名作为核心性别线索,而非依赖表面化的提示调整或被动的过滤机制。
链接: https://arxiv.org/abs/2606.18649
作者: Serena A. Hoffstedde,Machiko Hirota,Akshara Nadayanur Sathis Kanna,Rihito Kotani,Ujwal Kumar,Gabriele Trovato,Phan Xuan Tan
机构: 未知
类目: Multiagent Systems (cs.MA); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:
Abstract:Large language models (LLMs) are increasingly deployed in hiring workflows, yet most research on gender bias in LLM hiring decisions has focused on English-language, Western-format resumes. This study examines whether pro-female gender bias extends to a Japanese corporate context and evaluates two practical mitigation strategies. Using a counterfactual resume design with 60 Japanese rirekisho-format resumes, 12 name pairs selected on linguistically grounded gender-signal criteria, and five state-of-the-art LLMs (Claude Sonnet 4.6, GPT-4o, DeepSeek-V3, Gemini 2.5 Flash, Llama 3.3 70B), we conducted 43,200 API calls across baseline, prompt instruction, and privacy filter conditions. A crossed random-effects linear mixed model confirms a significant pro-female bias across all five models, replicating Western findings in a non-Western context. A prompt-level gender-neutrality instruction produces no meaningful reduction in bias. A name-reliance analysis formally identifies the candidate name as the primary gender channel: removing the name from the prompt reduces the female effect by nearly its full magnitude. An unexpected incompatibility between the privacy filter and GPT-4o’s content safety filter, resulting in a 42% refusal rate, highlights a practical deployment challenge for name anonymization in LLM-assisted recruitment pipelines.
[MA-18] Semiglobal Input-Delay Tolerance Algorithm for Distributed Nonconvex Optimization of Networked Nonlinear Systems
【速读】:该论文旨在解决网络化非线性系统(NNSs)中存在输入时延与一致性约束下的分布式优化问题。核心挑战在于如何在输入时延与非线性动态耦合的复杂环境下,确保各节点状态在满足一致性约束的前提下收敛至全局最优解。其解决方案的关键在于提出一种新型半全局输入时延容错(SIDT)算法,该算法基于分层设计框架与输入到状态稳定性(ISS)分析,实现了对输入时延的容忍性,并在任意预设紧致初始集下保证最优解的可达性与状态的一致性收敛。进一步地,通过引入Polyak-Łojasiewicz(PL)条件放宽严格凸性假设,使算法可扩展至非凸优化场景,显著提升了适用范围。数值实验验证了该方法在具有输入时延的网络化非线性系统中的有效性与理论正确性。
链接: https://arxiv.org/abs/2606.19871
作者: Jing-Zhe Xu,Zhi-Wei Liu,Ming-Feng Ge,Yan-Wu Wang,Dinxin He
机构: 未知
类目: Optimization and Control (math.OC); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
备注: 36 pages, 5 figures
Abstract:This paper studies a class of distributed optimization problems in networked nonlinear systems (NNSs) subject to input delays and consensus constraints. It introduces input-delay tolerant semiglobal convergence (IDTSC), meaning that for any prescribed compact initial set there exists an admissible delay bound under which the optimal solution is computed within consensus constraints and all node states converge to the solution. Building on a hierarchical design and input-to-state stability analysis, a new semiglobal input-delay tolerant (SIDT) algorithm is developed that practically achieves IDTSC for distributed optimization under the coupling between input delays and nonlinear dynamics. Further, by relaxing strict convexity requirements through the Polyak-Łojasiewicz condition, the SIDT algorithm broadens its applicability to nonconvex optimization. Finally, numerical experiments corroborate the theory on NNSs with input delays.
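For reference, the Polyak-Łojasiewicz condition mentioned above is usually stated as follows for a differentiable function $f$ with minimum value $f^\star$ and some constant $\mu > 0$. This is the standard textbook form; the paper may use a distributed or constrained variant.

```latex
\[
  \tfrac{1}{2}\,\lVert \nabla f(x) \rVert^{2} \;\ge\; \mu \,\bigl(f(x) - f^{\star}\bigr)
  \qquad \text{for all } x .
\]
```

The inequality does not require convexity, yet it implies that every stationary point is a global minimizer, which is why it can replace a strict convexity assumption.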
自然语言处理
[NLP-0] LedgerAgent: Structured State for Policy-Adherent Tool-Calling Agents
【速读】: 该论文旨在解决在客户服务领域中,遵循策略的工具调用智能体在多轮交互过程中难以有效维护任务状态的问题。传统方法将任务状态信息(如相关事实、标识符、约束条件和用户交互与工具调用所观察到的条件)混杂于提示(prompt)之中,导致智能体每次决策时需从上下文重新推断状态,从而引发两种常见故障:一是虽获取正确信息但后续决策基于过时或错误的状态;二是尽管工具调用语法正确,却因未考虑当前任务状态而违反领域策略。其解决方案的关键在于提出一种推理时(inference-time)的方法——LedgerAgent,通过引入一个独立的“账本”(ledger)来显式维护所有观测到的任务状态,并在生成提示时将这些状态显式注入,同时在执行可能改变环境的工具调用前,利用账本验证依赖状态的策略约束,从而防止策略违规。实验表明,在四个客户服务领域及混合使用开放与闭源模型的测试场景下,LedgerAgent显著提升了平均 pass^k 指标,尤其在更严格的多轮一致性评估中表现突出。
链接: https://arxiv.org/abs/2606.20529
作者: Md Nayem Uddin,Amir Saeidi,Eduardo Blanco,Chitta Baral
机构: Arizona State University (亚利桑那州立大学); University of Arizona (亚利桑那大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Work in Progress
Abstract:Policy-adherent tool-calling agents in customer-service domains must maintain task states across turns while calling tools and obeying domain policies. Task states consist of relevant facts, identifiers, constraints, and conditions observed through user interaction and tool calls. In standard agents, task states are not represented separately. Observations, tool returns, and policy instructions are placed in the prompt, leaving agents to reconstruct the relevant states from the prompt each time they decide what to do next. This design makes state management implicit, creating two common failure modes. An agent may retrieve the right facts but later ground its decision in stale, missing, or incorrect information; and a syntactically valid tool call may still violate a domain policy that depends on the current task state. We introduce LedgerAgent, an inference-time method for tool-calling agents that maintains observed task states in a separate ledger and renders the states into the prompt. The ledger is also used to check state-dependent policy constraints before environment-changing tool calls are executed, blocking policy violations. Across four customer-service domains and a mixed panel of open- and closed-weight models, LedgerAgent improves average pass^k over a standard prompt-based tool-calling approach, with the largest gains under stricter multi-trial consistency metrics.
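The ledger idea can be sketched as a small state store that is both rendered into the prompt and consulted before a state-changing tool call. Everything below is hypothetical: the `Ledger` class, `guarded_call`, the `cancel_order` tool, and the "only pending orders may be cancelled" policy are invented for illustration and are not taken from the paper's implementation.

```python
class Ledger:
    """Explicit store of task state observed from the user and from tools."""

    def __init__(self):
        self.state = {}

    def record(self, key, value):
        self.state[key] = value

    def render(self):
        # Text block that would be placed into the agent's prompt.
        return "\n".join(f"{k}: {v}" for k, v in sorted(self.state.items()))

def guarded_call(ledger, tool_name, args, policies, tools):
    """Check the state-dependent policy (if any) before executing the tool."""
    check = policies.get(tool_name)
    if check is not None:
        ok, reason = check(ledger.state, args)
        if not ok:
            return {"blocked": True, "reason": reason}
    return {"blocked": False, "result": tools[tool_name](**args)}

# Hypothetical tool and policy.
tools = {"cancel_order": lambda order_id: f"cancelled {order_id}"}
policies = {
    "cancel_order": lambda state, args: (
        state.get("order_status") == "pending",
        "only pending orders may be cancelled",
    )
}

ledger = Ledger()
ledger.record("order_status", "shipped")
print(guarded_call(ledger, "cancel_order", {"order_id": 42}, policies, tools))
ledger.record("order_status", "pending")
print(guarded_call(ledger, "cancel_order", {"order_id": 42}, policies, tools))
```

The check reads the ledger rather than the prompt, so a syntactically valid call is still refused when the recorded state does not satisfy the policy.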
[NLP-1] StylisticBias: A Few Human Visual Cues Drive Most Social Biases in MLLMs ICML2026
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在社会性判断中对视觉线索的敏感性及其所引发的属性层面社会偏见问题。现有研究常因比较不同个体而难以区分外观特征影响与身份差异带来的效应,导致偏见归因不准确。为此,论文提出StylisticBias——一个受控的基准测试框架,通过生成500个逼真的基础人脸,并为每张人脸创建约50个单一属性变异图像(共约2.5万张),在保持身份恒定的前提下仅改变一个视觉属性,从而实现对特定视觉线索如何影响模型判断的精准测量。关键解决方案在于采用“固定身份、单变量变化”的实验设计,使偏见分析可归因于具体视觉特征而非身份混淆。实验评估六种主流MLLM在25种二元社会判断任务中的表现,发现年龄、体型等身份相关属性主导整体效应,而时尚风格等视觉线索则在属性层面引起最大判断偏移;约15个属性贡献了近80%的总变异,表明偏见高度集中于少数关键视觉特征。此外,当社会判断语义上与外观相关(如经济地位、风格偏好)时,模型敏感性最强。该研究释放了StylisticBias作为细粒度偏见评估的基准工具,为后续多模态模型公平性研究提供了可复现的实验平台。
链接: https://arxiv.org/abs/2606.20527
作者: Shaghayegh Kolli,Timo Cavelius,Nafiseh Nikeghbal,Samantha Dalal,Jana Diesner
机构: Technical University of Munich (慕尼黑工业大学); Munich Center for Machine Learning (慕尼黑机器学习中心); Princeton Center for Information and Technology Policy (普林斯顿信息与技术政策中心)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to the non-archival workshops AI4Good and Culture x AI at ICML 2026
Abstract:Multimodal large language models (MLLMs) are increasingly deployed in personally and societally consequential settings, yet the visual cues that shape how these models judge people remain poorly understood. Prior work often compares different (groups of) individuals, making it difficult to separate appearance effects from identity differences. We introduce StylisticBias, a controlled benchmark for evaluating attribute-level social bias in MLLMs. We generate 500 photorealistic base faces and create about 50 single-attribute variations per face, producing about 25K images. This design keeps identity fixed and changes one visual attribute at a time. It lets us measure how specific cues shift model judgments. We evaluate six MLLMs across 25 binary social judgment scenarios. We find that age and body type dominate identity-level effects, while fashion style and other visual cues drive the largest attribute-level shifts. We further find that about 15 attributes account for nearly 80% of the total variation, showing that bias is concentrated in a small set of visual cues. Sensitivity is strongest in judgments that are semantically aligned with appearance, especially socioeconomic and style-related judgments. We release StylisticBias as a benchmark for fine-grained bias evaluation in multimodal models. Code and dataset: this https URL and this https URL.
[NLP-2] Beyond Global Replanning: Hierarchical Recovery for Cross-Device Agent Systems
【速读】: 该论文旨在解决多设备环境下,智能体在执行跨应用、跨设备复杂任务时,因运行时故障导致的恢复能力不足问题。现有系统虽支持任务分解与跨设备分配,但其故障恢复机制仍以粗粒度为主,缺乏对设备本地策略空间的系统性建模,难以区分可由单设备内策略修复的故障与需全局重规划的故障。为此,论文提出H-RePlan框架,其核心在于构建分层式重构机制:通过统一的API-CLI-GUI执行接口,为每个设备配备可互换的执行策略,并利用紧凑的跨层故障抽象,将设备级局部策略恢复与协调器层面的全局重规划相分离。这一设计实现了对故障作用范围的显式感知,从而实现更精准、高效的恢复决策。为验证该方案的有效性,作者构建了HeraBench故障注入基准测试平台,涵盖Linux与Android设备间的跨设备工作流,并引入策略级与设备级故障。实验结果表明,相较于单一策略及粗粒度多设备基线,H-RePlan显著提升了任务完成率、指令遵循率和完美通过率,同时降低了实现端到端可靠执行所需的令牌消耗,证明了面向作用域感知的分层恢复机制对于提升多设备智能体鲁棒性的关键作用。
链接: https://arxiv.org/abs/2606.20487
作者: Shu Yao,Yuhua Luo,Qian Long,Jingru Fan,Zhuoyuan Yu,Yuheng Wang,Lin Wu,Yufan Dang,Huatao Li,Chen Qian
机构: Shanghai Jiao Tong University (上海交通大学); Shanghai Innovation Institute (上海创新研究院); Southeast University (东南大学); Tsinghua University (清华大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Real-world computer-use tasks often span multiple applications and devices, requiring agents to coordinate heterogeneous environments under dynamic runtime failures. Existing multi-device agent systems support task decomposition and cross-device assignment, but recovery remains largely coarse-grained: when execution fails, they typically retry the same strategy, reassign the subtask, or revise the global plan, without systematically modeling the device-local strategy space. This limits their ability to distinguish failures that can be repaired within the current device from those that require cross-device replanning. We propose H-RePlan, a hierarchical replanning framework for multi-device agents with unified API–CLI–GUI execution. H-RePlan equips each device with interchangeable execution strategies and separates device-local strategy recovery from orchestrator-level global replanning through a compact cross-layer failure abstraction. To evaluate this capability, we introduce HeraBench, a fault-injected benchmark that constructs cross-device workflows over Linux and Android devices and injects strategy- and device-level failures. Experiments show that H-RePlan substantially outperforms single-strategy and coarse-grained multi-device baselines, achieving higher completion, instruction adherence, and perfect-pass rates while reducing the token cost required for reliable end-to-end success. These results demonstrate that scope-aware hierarchical recovery is essential for robust multi-device agent execution.
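The split between device-local recovery and orchestrator-level replanning can be illustrated with a small control-flow sketch. The exception classes, the `run_on_device` function, and the strategy callables are hypothetical stand-ins; the paper's actual failure abstraction is not described in the abstract.

```python
class StrategyFailure(Exception):
    """The chosen strategy failed, but the device itself is still usable."""

class DeviceFailure(Exception):
    """The device cannot complete the subtask with any local strategy."""

def run_on_device(strategies, subtask):
    """Try interchangeable local strategies; escalate only when needed.

    strategies: ordered list of (name, callable) pairs, e.g. API, CLI, GUI.
    """
    tried = []
    for name, fn in strategies:
        try:
            result = fn(subtask)
            return {"status": "ok", "strategy": name, "result": result, "tried": tried}
        except StrategyFailure:
            tried.append(name)          # local repair: fall through to next strategy
        except DeviceFailure:
            tried.append(name)
            break                        # no point trying other local strategies
    # Signal the orchestrator that global replanning is required.
    return {"status": "escalate", "tried": tried}

# Toy strategies for demonstration.
def api(task):
    raise StrategyFailure("API endpoint unavailable")

def cli(task):
    return f"done:{task}"

def broken(task):
    raise DeviceFailure("device offline")

print(run_on_device([("api", api), ("cli", cli)], "copy_file"))
print(run_on_device([("api", broken), ("cli", cli)], "copy_file"))
```

A coarse-grained baseline would treat both failures the same way; here only the second one reaches the orchestrator.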
[NLP-3] Scalable Training of Spatially Grounded 2D Vision-Language Models for Radiology MICCAI2026
【速读】: 该论文旨在解决在放射科领域训练视觉-语言模型(VLMs)时缺乏人工空间标注的问题,即如何在不依赖昂贵且耗时的手动空间标注的前提下,实现视觉与语言的精准对齐。其核心解决方案是构建一个大规模、双语(德语/英语)的临床影像-文本数据集RefRad2D,包含120万组CT和MR图像-文本配对,并通过大语言模型(LLM)驱动的自动化筛选与分割技术,自动生成任务特定的视觉问答(VQA)和空间定位子集。在此数据集上训练的RadGrounder模型能够联合执行报告生成、视觉问答及基于边界框或分割图的空间定位任务。实验表明,该模型在外部VQA基准(Slake、VQA-RAD)上表现媲美专用医学VLMs;更重要的是,将该临床数据加入训练混合数据中可显著提升开放性VQA性能,优于仅在下游数据集上微调的效果,验证了数据集的可迁移性;同时,引入空间定位监督并未损害语言生成质量,实现了空间可验证输出与保持高水平VQA性能的双重目标,关键突破在于无需额外成本即可实现多任务协同优化。
链接: https://arxiv.org/abs/2606.20477
作者: Yusuf Salcan(1 and 4),Simon Ging(1 and 2),Robin Schirrmeister(3),Philipp Arnold(3),Elmar Kotter(3),Behzad Bozorgtabar(2),Thomas Brox(1) ((1) Computer Vision Group, University of Freiburg, Germany, (2) Adaptive & Agentic AI (A3) Lab, Aarhus University, Denmark, (3) Department of Radiology, Medical Center – University of Freiburg, Germany, (4) CRIION-AI Lab, Freiburg, Germany)
机构: University of Freiburg (弗莱堡大学); Aarhus University (奥胡斯大学); Medical Center – University of Freiburg (弗莱堡大学医学中心); CRIION-AI Lab
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted for MICCAI 2026. First two authors: equal contribution. Last two authors: equal supervision
Abstract:We study how to train visually grounded vision-language models (VLMs) for radiology without manual spatial annotations. We introduce RefRad2D, a large-scale bilingual (German/English) dataset of 1.2M CT and MR image-text pairs derived from clinical practice, with task-specific VQA and spatial grounding subsets generated automatically via LLM-based curation and automated segmentation. Trained on this data, our model RadGrounder jointly performs report generation, visual question answering, and spatial grounding via bounding-box detection or segmentation. On external VQA benchmarks (Slake, VQA-RAD), RadGrounder achieves competitive results with specialized medical VLMs. Adding our clinical data to the training mixture improves open-ended VQA over fine-tuning on the downstream datasets alone, showing the transferability of our dataset. Crucially, adding grounding supervision does not degrade language quality, enabling spatially verifiable outputs at no cost to VQA performance.
[NLP-4] CATCH-ME if you RAG: a dataset of Contextually Annotated multi-Turn Counterspeech against Hate and Misinformation Exchanges
【速读】: 该论文旨在解决在线仇恨言论与虚假信息交叉泛滥背景下,自然语言处理(NLP)研究中对二者分别处理导致的生成式反言(counterspeech)模型效果受限的问题。现有大语言模型(LLM)在零样本场景下生成的反言内容常出现重复性高、语义模糊等缺陷,亟需高质量示例以引导模型生成更具说服力和事实依据的回应。然而,当前针对仇恨与虚假信息交集的反言数据集稀缺,且多局限于单轮英语对话,难以覆盖真实交互中的多轮对话特征及多语言需求。为此,本文提出首个大规模、专家标注、多语言的对话型反言数据集,涵盖五种语言,聚焦七类边缘群体所遭受的仇恨攻击,并通过整合经验证的外部知识(如事实核查文章与非政府组织报告),实现内容的事实锚定。该数据集包含文档级与片段级的标注信息,可直接用于检索增强生成(RAG)系统,为训练与评估更精准、更具事实依据的多轮跨语言反言模型提供了关键资源。其核心解决方案在于构建一个兼具多语言性、多轮对话结构、事实可验证性与结构化标注的高质量反言数据集。
链接: https://arxiv.org/abs/2606.20369
作者: Helena Bonaldi,Genoveffa Martone,Marco Guerini
机构: Fondazione Bruno Kessler(布鲁诺·凯斯勒基金会); Università Cattolica del Sacro Cuore(天主教圣心大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Online hate speech and misinformation frequently overlap, yet NLP research has mainly treated them in isolation. While LLMs represent a scalable solution for assisting humans in the generation of counterspeech for both threats, zero-shot models frequently generate repetitive and vague responses, underscoring the need for high-quality examples to steer model generation. However, existing counterspeech datasets against the overlap of hate and misinformation are scarce and limited to single-turn English dialogues, while real-life interactions span across multiple turns and languages. To bridge this gap, we introduce the first large-scale, expert-curated, multilingual dataset of dialogues tackling the intersection of hate and misinformation. To ensure factual grounding, the dialogues are also anchored in verified external knowledge (i.e., fact-checking articles and NGO reports) and include document- and chunk-level span annotations, making it directly applicable for RAG systems. Covering five languages and targeting hate directed at seven marginalized groups, this novel resource enables the training and evaluation of more persuasive, factually grounded counterspeech models.
[NLP-5] Token-Operations-Oriented Inference Optimization Techniques for Large Models
【速读】: 该论文旨在解决大模型推理优化中的高成本、低效率与服务稳定性不足问题,核心挑战在于如何在保障服务质量的前提下实现推理过程的规模化、低成本与高稳定性。其解决方案的关键在于首次提出一个四层协同的技术架构:多模型融合(Multi-model Fusion)、模型优化(Model Optimization)、计算-模型融合(Compute-Model Fusion)以及计算-网络-模型融合(Compute-Network-Model Fusion),通过分层协同优化实现从模型到系统资源调度的全链路效率提升,为降低单个token生成成本、提高服务吞吐能力、确保token供应稳定性和推动大模型服务由“可调用”向“可运营”演进提供了可落地的技术路径。
链接: https://arxiv.org/abs/2606.20295
作者: Shiguo Lian,Kai Wang,Zhaoxiang Liu,Wen Liu,Minjie Hua,Yutong Liu,Jiangze Yan,Xin Wang,Cong Wang,Yilin Zhang,Yi Shen,Jieyun Huang,Fang Zhao,Huanlin Gao,Ping Chen,Xinyu Yang,Kaikai Zhao,Yao Zhao,Xinggang Wang,Huishuai Zhang,Dongyan Zhao,Junping Du,Tao Chen,Xiang Gao,Qinghuai Ma
机构: 未知
类目: Software Engineering (cs.SE); Computation and Language (cs.CL)
备注: 62 pages, 36 figures
Abstract:Large model inference optimization serves as a key foundation for supporting the scalable, low-cost, and highly stable operation of large model services. Centered on token-oriented inference optimization technology, this paper proposes for the first time a four-layer technical architecture consisting of Multi-model Fusion, Model Optimization, Compute-Model Fusion, and Compute-Network-Model Fusion. It systematically reviews the key technologies and current industry status across these four levels and analyzes the application value of related technologies in real-world business scenarios. This paper provides a practical technical path for reducing token production costs, improving token service efficiency, ensuring the stability of token supply, and driving the transition of large model services from being merely callable to being operable.
[NLP-6] PsyScore: A Psychometrically-Aware Framework for Trait-Adaptive Essay Scoring and ZPD-Scaffolded Feedback
【速读】: 该论文旨在解决现有自动作文评分(AES)系统中评分与教学反馈割裂的问题:传统神经网络评分模型缺乏可解释性,而基于大语言模型(LLM)的反馈又难以根据学习者的能力水平进行动态调整。其核心解决方案是提出PsyScore——一个基于心理测量学(Psychometrics)的整合框架,通过共享的潜在能力表示(latent ability representation)实现诊断评估与教学支架(instructional scaffolding)的统一。该框架的关键在于:(1)采用融合分级部分信用模型(GPCM)的特质自适应神经项目反应理论(Neural IRT)评分器,兼顾评分精度与心理测量可解释性;(2)设计基于最近发展区(ZPD)的多智能体反馈生成机制,依据诊断出的学生能力参数动态调节反馈策略,实现分层适配的教学支持;(3)引入多视角反馈评估策略,结合成对偏好判断与学生修订模拟,全面评估反馈的教育有效性。实验在ASAP++数据集上验证了PsyScore在保持竞争性评分性能的同时,显著提升了反馈的教育适切性。
链接: https://arxiv.org/abs/2606.20287
作者: Wei Xia,Jin Wu,Haoran Shi,Xiangyu Wang,Chanjin Zheng
机构: East China Normal University (华东师范大学); Shanghai Institute of Artificial Intelligence for Education, East China Normal University (华东师范大学人工智能教育研究所); School of Computer Science and Technology, East China Normal University (华东师范大学计算机科学与技术学院)
类目: Computation and Language (cs.CL)
备注:
Abstract:Effective Automated Essay Scoring (AES) systems are expected to support both reliable assessment and actionable instructional feedback. However, existing approaches often treat scoring and feedback as separate components: neural scoring models provide limited interpretability, while Large Language Model (LLM)-based feedback is typically insensitive to learners' proficiency levels. To address this fragmentation, this work proposes PsyScore, a psychometrically-aware framework that integrates diagnostic assessment with instructional scaffolding through a shared latent ability representation. PsyScore comprises three key modules: a Trait-Adaptive Neural IRT Scorer that incorporates the Graded Partial Credit Model (GPCM) into a neural architecture, enabling the precise estimation of student ability while maintaining psychometric interpretability; a ZPD-Scaffolded Feedback Generator, which conditions multi-agent feedback strategies on the diagnosed ability parameter to adapt instructional focus across different proficiency levels; and a Multi-Perspective Feedback Evaluation Strategy that assesses feedback quality via pairwise preference judgements and student revision simulations. Experiments on the ASAP++ dataset demonstrate that PsyScore achieves competitive scoring performance while providing more pedagogically aligned feedback.
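To make the psychometric component more concrete, here is a small sketch of category probabilities under the generalized partial credit model as it is commonly defined in item response theory. Whether the paper's "GPCM" uses exactly this parameterization is an assumption; the function name `gpcm_probs` and the example parameters are hypothetical.

```python
import math

def gpcm_probs(theta, a, thresholds):
    """Category probabilities for one item under a partial credit model.

    theta:      latent ability of the student
    a:          item discrimination
    thresholds: step difficulties b_1..b_K (so the item has K + 1 score levels)
    """
    # Cumulative logits: level 0 has logit 0, level k adds a * (theta - b_k).
    logits = [0.0]
    for b in thresholds:
        logits.append(logits[-1] + a * (theta - b))
    # Softmax with max-subtraction for numerical stability.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

# A student of average ability on a three-level trait score.
print(gpcm_probs(theta=0.0, a=1.0, thresholds=[-1.0, 1.0]))
```

Raising `theta` shifts probability mass toward higher score levels, which is the interpretable link between estimated ability and predicted trait score that a neural scorer alone does not expose.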
[NLP-7] The Register Gap: A Meaning Intelligence Framework for Nigerian Public Discourse
【速读】: 该论文旨在解决当前人工智能系统在处理尼日利亚公共话语时因语境缺失(context failure)而导致的误判问题,尤其针对现有基准(如NaijaSenti和AfriSenti)将情感分析简化为三分类极性任务(正向、负向、中性)所引发的深层误解。其核心挑战在于:同一语句在不同说话者、受众及情境下可能具有截然相反的语用功能,而传统模型未能区分表面情绪与真实交际意图。为此,论文提出意义智能框架(Meaning Intelligence Framework, MIF),通过九维可量化的标注与评估体系,系统解构话语的深层语义结构,包括语域(register)、表面情感、真实意图、反讽、编码隐含意义、风险层级、标注者置信度、说话者情绪及推荐沟通行动等维度。该框架的关键创新在于引入上下文感知的提示工程机制,实验证明,在零样本条件下,模型对语域的识别准确率仅为33.3%,而当引入MIF框架作为上下文提示后,准确率提升至73.3%(+40个百分点),且在编码隐含意义识别与战略沟通建议等关键任务上分别获得10分和10.3分的显著提升,整体意义智能得分从73.2升至78.6。研究进一步公开了框架规范、标注指南及30项校准数据集以保障可复现性,并保留私有测试集用于抗污染评估,为多语种、多语域语境下的生成式AI优化提供了可扩展的评估范式。
链接: https://arxiv.org/abs/2606.20255
作者: Celestine Achi
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Preprint. 12 pages, 2 tables. Supplementary materials: MIF Master Specification v2.0, Annotation Guidelines v1.0, and 30-item public calibration set with gold labels available from the author
Abstract:We introduce the Meaning Intelligence Framework (MIF), a nine-dimension annotation and evaluation schema for Nigerian public discourse that separates surface sentiment from true communicative intent. Existing benchmarks for Nigerian languages, including NaijaSenti and AfriSenti, treat sentiment classification as a three-way polarity task (positive, negative, neutral). We argue that the dominant failure mode of AI systems on Nigerian discourse is not translation failure but context failure: the same utterance carries opposite pragmatic force depending on speaker, audience, and situation. The MIF operationalises this insight across nine scored dimensions: register, surface sentiment, true intent, irony, coded subtext, risk tier, annotator confidence, speaker emotion, and recommended communications action. We construct a 30-item calibration dataset spanning Standard English, Nigerian English, Nigerian Pidgin, and code-mixed registers, and evaluate a frontier language model (Gemini 2.5 Flash) under zero-shot and schema-informed prompting conditions. The headline finding is the Register Gap: zero-shot register classification accuracy is 33.3%, rising to 73.3% (+40 points) when the model receives the MIF schema in-context. The composite Meaning Intelligence Score increases by 5.4 points (73.2 to 78.6) under schema-informed prompting, with the largest practical gains in register identification, coded-subtext detection (+10 points), and strategic action recommendation (+10.3 points). We release the framework specification, annotation guidelines, and the 30-item public calibration set to support reproducibility, while retaining a private holdout corpus for contamination-protected evaluation.
[NLP-8] Actionable Activation Directions for Detecting and Mitigating Emergent Misalignment Across Language Model Families
【速读】: 该论文旨在解决在不安全代码上微调语言模型所引发的生成内容与预期对齐性之间的隐式错位(misalignment)问题,其核心挑战在于这种错位现象是否具有可被因果操控的、跨架构共享的激活空间方向。研究发现,在四个采用相同微调策略的指令微调模型家族(Qwen2.5-1.5B、Gemma-2-2B、Llama-3.2-1B、Ministral-3-3B)中,一个基于均值差异(difference-in-means)提取的激活空间方向可在各模型最后一层实现高达99.6%的对齐与错位激活分离。通过因果操纵(causal steering),减去该方向可使代码泄漏(code spillover)指标降低21–51点,且安全代码对照实验验证了该效应的内容特异性。然而,利用岭回归进行跨架构迁移时虽能显著抑制行为(最高达46点下降),但无法通过特异性控制——随机或正交方向亦表现相当,表明此类方向虽具因果效力却缺乏内容特异性。研究进一步揭示出一种双层级特异性结构:模型内方向具备因果特异性和可操作性;而跨模型方向虽具因果实在性,但不具备特异性。此外,观察到非对称的迁移拓扑结构,其中Gemma和Qwen表现为几何“捐赠者”,而Llama则为“接收者”。这些结果限定了线性跨架构修正的有效边界,并建议在模型审计中优先采用模型内探针(within-model probing)方法。
链接: https://arxiv.org/abs/2606.20225
作者: Abdul Rafay Syed
机构: Universität des Saarlandes (萨尔兰大学)
类目: Computation and Language (cs.CL)
备注: 12 pages, 2 figures
Abstract:Fine-tuning language models on insecure code induces emergent misalignment with poorly understood internal structure. We investigate whether this misalignment corresponds to a causally actionable activation-space direction shared across architectures. Across four instruction-tuned model families (Qwen2.5-1.5B, Gemma-2-2B, Llama-3.2-1B, Ministral-3-3B) fine-tuned identically, a difference-in-means direction achieves 99.6% separation of aligned and misaligned activations at each model’s final layer. Causal steering by subtracting this direction reduces code spillover by 21-51 points, while a secure-code control confirms content specificity. Cross-architecture transfer via ridge regression maps yields large behavioral suppression (up to 46 points) but fails specificity controls, as random and orthogonal directions perform comparably. We identify a two-tier specificity structure: within-model directions are causally specific and actionable; cross-model directions are causally real but non-specific. An asymmetric transfer topology emerges, with Gemma and Qwen acting as geometric donors and Llama as a receiver. These findings define the limits of linear cross-architecture correction and recommend within-model probing for auditing.
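The two operations named in the abstract, extracting a difference-in-means direction and steering by subtracting it, can be written out on toy vectors. This is a plain-Python sketch on made-up two-dimensional "activations"; the function names and the scaling factor `alpha` are assumptions, and real experiments would operate on hidden states from a model's final layer.

```python
def mean_vec(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def diff_in_means(misaligned, aligned):
    """Direction pointing from the aligned mean to the misaligned mean."""
    mu_m, mu_a = mean_vec(misaligned), mean_vec(aligned)
    return [m - a for m, a in zip(mu_m, mu_a)]

def steer(hidden, direction, alpha=1.0):
    """Subtract a scaled copy of the direction from one activation vector."""
    return [h - alpha * d for h, d in zip(hidden, direction)]

# Toy activations (hypothetical values).
misaligned = [[2.0, 0.0], [4.0, 0.0]]
aligned = [[0.0, 0.0], [0.0, 2.0]]

d = diff_in_means(misaligned, aligned)
print(d)
print(steer([1.0, 1.0], d, alpha=0.5))
```

The specificity controls described in the abstract amount to repeating the `steer` step with random or orthogonal vectors of matching norm and checking whether the behavioural effect disappears.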
[NLP-9] CzechDocs: A Multiway Parallel Dataset of Formatted Documents for Minority Languages in Czechia
【速读】: 该论文旨在解决在机器翻译(Machine Translation, MT)过程中如何有效保持文档格式一致性的问题,尤其针对多语言、多格式(HTML、DOCX、PDF)的正式文档。现有方法在处理复杂排版结构时往往忽略布局信息,导致翻译后文档出现格式错乱,影响可读性和实用性。为此,本文提出CzechDocs数据集,这是一个涵盖捷克语及捷克境内主要少数语言(如乌克兰语、英语,以及少量越南语、俄语等)的多语言平行文档数据集,支持多种文档格式。其核心贡献在于构建了一个可用于评估格式保持型机器翻译系统的基准数据集,并提供了验证子集与配套评估工具包,以系统比较主流格式保持策略的有效性。关键解决方案在于通过真实世界文档的对齐与标注,建立一个能够反映实际应用场景中格式与内容双重约束的数据集,从而推动面向文档级翻译的格式保持技术发展。
链接: https://arxiv.org/abs/2606.20212
作者: Josef Jon,Ondřej Bojar
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:We present CzechDocs, a multiway parallel dataset of formatted documents (HTML, DOCX, and PDF) covering Czech and minority languages used in Czechia, primarily Ukrainian and English, with smaller portions of Vietnamese, Russian and other languages. The dataset is designed to support the evaluation of machine translation systems that aim to preserve document formatting during translation. We provide a comparison of the most common approaches to format-preserving machine translation on a validation subset of the dataset. This validation split, together with the evaluation toolkit, is publicly released for further research. A held-out test split will be reserved for a future shared task focused on document-level translation with formatting preservation.
[NLP-10] Pitch Spelling Jazz Lead Sheets, Solo Transcriptions, Classical Piano and Monophonic Scores
【速读】: 该论文旨在解决音乐记谱中音高命名(pitch spelling)与调性估计(key estimation)的联合优化问题,即在给定以MIDI格式表示的音符信息(包含相对于最低参考音高的半音数及小节边界)的基础上,准确推断出每个音符的规范名称、全局调号(Key Signature)以及每小节的局部调式(local scale)。其核心挑战在于如何在保证音乐合理性的同时,最小化印刷乐谱中变音记号的使用数量,并实现整体记谱的音乐学一致性。解决方案的关键在于分两阶段进行优化:第一阶段为“调式(modal)”阶段,通过最短路径搜索算法为每小节候选最可能的调式,以最小化所需变音记号;第二阶段为“调性(tonal)”阶段,基于前一阶段生成的局部调式,联合估计全局调号与音符命名,从而实现整首作品最优的音乐记谱表达。该方法特别适用于从音频中自动转录爵士乐独奏等场景,支持音乐分析、教学及文化遗产保存,同时引入了新的爵士调式间距离度量,对音乐学研究亦具参考价值。
链接: https://arxiv.org/abs/2606.20198
作者: Augustin Bouquillard(X),Florent Jacquemard(CEDRIC - VERTIGO)
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:We present an algorithm for pitch spelling and key estimation. Given an input in MIDI-like format, containing information on note pitches (expressed in semitones relative to the lowest reference note) and bar boundaries, it estimates the appropriate note names, a global Key Signature, and a local scale for each bar. These related pieces of information are evaluated jointly during two stages of optimisation. During an initial ‘modal’ stage, a probable scale is proposed for each bar, minimising the number of accidentals to be printed in the score with a shortest-path search. Then, during a second stage called ‘tonal’, these local scales are used to estimate the Key Signature and note names that would result in the best musical notation for the entire piece. We present evaluations conducted on datasets comprising a variety of digital musical scores: jazz lead sheets taken from the Real Book, transcriptions of recordings of jazz soli and bass lines, traditional tunes, as well as classical scores for piano and monophonic instruments. Our procedure was originally designed for use in music transcription, specifically for building digital collections of jazz solos transcribed from audio recordings, for the purposes of music analysis, teaching and the preservation of cultural heritage. This method should also prove useful for other tasks related to the processing of musical notation. Furthermore, to this end, we have defined new distances between various common jazz scales, which may be of some interest to musicological studies.
[NLP-11] ReNikud: Audio-Supervised Hebrew Grapheme-to-Phoneme Conversion
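[Quick read]: This paper tackles grapheme-to-phoneme (G2P) conversion for Modern Hebrew. The core difficulty is that Hebrew uses an abjad writing system that leaves vowels largely unwritten, which creates substantial ambiguity. Standard approaches first predict vowel diacritics (nikud) and then derive International Phonetic Alphabet (IPA) transcriptions, but vocalization data is scarce and laborious to produce, does not specify features such as lexical stress, and reflects formal grammatical rules rather than everyday spoken pronunciation. Direct sequence-to-sequence IPA prediction struggles on limited data and fails to exploit the character-level alignment characteristic of abjads. ReNikud rests on two ideas: (1) weak audio supervision from a phoneme-based automatic speech recognition (ASR) pseudo-labeling pipeline run over thousands of hours of unlabeled Hebrew audio, which yields phonemic transcriptions reflecting natural spoken norms without manual annotation; (2) a pseudo-vocalization architecture that predicts IPA phonemes at each character position, using character-level alignment as an inductive bias. On existing Hebrew G2P benchmarks and the new MILIM benchmark targeting spoken Hebrew, ReNikud surpasses previous state-of-the-art methods. The authors plan to release code and trained models to support further work on Hebrew TTS and speech technologies.
Link: https://arxiv.org/abs/2606.20179
Authors: Maxim Melichov, Yakov Kolani, Morris Alper
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: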
Abstract:Grapheme-to-phoneme (G2P) conversion for Modern Hebrew is needed for applications like text-to-speech (TTS), but is challenging due to the language’s abjad writing system, which leaves vowels largely unwritten, creating substantial ambiguity. Standard approaches first predict vowel diacritics (nikud) to produce International Phonetic Alphabet (IPA) transcriptions, but this is limited: vocalization data is scarce and laborious to produce, it does not specify features such as lexical stress, and it reflects formal grammatical rules rather than everyday spoken pronunciation. Direct sequence-to-sequence IPA prediction, meanwhile, struggles on limited data and fails to exploit the character-level alignment characteristic of abjads. Our method, ReNikud, overcomes these limitations with two key insights: (1) Weak audio supervision via a phoneme-based automatic speech recognition (ASR) pseudo-labeling pipeline on thousands of hours of unlabeled Hebrew audio, yielding phonemic transcriptions that reflect natural spoken norms without manual annotation. (2) A pseudo-vocalization architecture that predicts IPA phonemes at each character position, enforcing character-level alignment as an inductive bias. Results on existing Hebrew G2P benchmarks and the new targeted MILIM benchmark for spoken Hebrew show that ReNikud surpasses previous state-of-the-art methods. We will release our code and trained models to support further work on Hebrew TTS and speech technologies.
[NLP-12] MedRLM: Recursive Multimodal Health Intelligence for Long-Context Clinical Reasoning, Sensor-Guided Screening, Evidence-Grounded Decision Support, and Community-to-Tertiary Referral Optimization
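[Quick read]: This paper targets a limitation of current medical large language models and retrieval-augmented generation systems in real-world clinical decision support: they often rely on single-step prompting or retrieval, which is fragile when clinical evidence is spread across long electronic health records (EHR), medical images, sensor streams, guidelines, and referral constraints. The proposed solution, MedRLM, is a Recursive Multimodal Health Intelligence framework that treats the patient case as an external clinical environment to be recursively inspected, decomposed, retrieved, verified, and synthesized, rather than compressed into a single prompt. The framework coordinates specialized agents for clinical text, longitudinal EHR, medical imaging, physiological sensor signals, guideline retrieval, uncertainty auditing, and referral planning. It introduces a Clinical Evidence Graph Memory that links patient-specific observations with retrieved evidence, standardized definitions, sensor-derived biomarkers, and referral criteria. A sensor-guided recursive triggering mechanism activates deeper reasoning when abnormal physiological or behavioral patterns are detected, and uncertainty-gated refinement supports clinician review of high-risk or low-confidence cases. The paper also outlines an evaluation design on public and credentialed clinical datasets. The overall aim is to move medical AI from static question answering toward auditable, multimodal, workflow-aware clinical decision support.
Link: https://arxiv.org/abs/2606.20164
Authors: Aueaphum Aueawatthanaphisut
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments: 9 pages, 3 figures, 3 tables, 1 Algorithm, 29 equations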
Abstract:Real-world clinical decision support requires reasoning over heterogeneous and longitudinal patient information rather than answering isolated medical questions. However, current medical large language models and retrieval-augmented generation systems often rely on single-step prompting or retrieval, which can be fragile when clinical evidence is distributed across long electronic health records, medical images, sensor streams, guidelines, and referral constraints. This paper proposes MedRLM, a Recursive Multimodal Health Intelligence framework for long-context clinical reasoning, sensor-guided screening, and community-to-tertiary referral support. Instead of compressing all patient information into one prompt, MedRLM treats the patient case as an external clinical environment that can be recursively inspected, decomposed, retrieved, verified, and synthesized. The framework coordinates specialized agents for clinical text, longitudinal EHR, medical imaging, physiological sensor signals, guideline retrieval, uncertainty auditing, and referral planning. It further introduces a Clinical Evidence Graph Memory to connect patient-specific observations with retrieved evidence, standardized definitions, sensor-derived biomarkers, and referral criteria. A sensor-guided recursive triggering mechanism activates deeper reasoning when abnormal physiological or behavioral patterns are detected, while uncertainty-gated refinement supports clinician review for high-risk or low-confidence cases. We also outline a real-data evaluation design using public and credentialed clinical datasets spanning EHR, radiology, ECG, ICU time series, and referral-proxy outcomes. MedRLM aims to move medical AI from static question answering toward auditable, multimodal, and workflow-aware clinical decision support.
[NLP-13] NAMESAKES: Probing Identity Memorization in Text-to-Image Models
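[Quick read]: This paper addresses a privacy concern in text-to-image (T2I) models: when prompted with a person's name, a model may generate a realistic likeness of that individual, which may indicate memorization of that person's images. Existing ways of deciding whether a generated face is memorized or fabricated require ground-truth photos, access to training data, or white-box access to model internals, which limits their applicability. The paper proposes a fully black-box behavioral probe that distinguishes the two regimes from model outputs alone, with no reference photos or prior knowledge of the training data. To benchmark the task, the authors build the NAMESAKES dataset, with over one thousand names and faces of public figures spanning a wide range of fame levels, along with perturbed, less famous names. Experiments on state-of-the-art T2I models show that the probe substantially predicts identity memorization and separates memorized names from unrecognized ones, and they reveal differences in memorization behavior across model families.
Link: https://arxiv.org/abs/2606.20155
Authors: Morris Alper, Vasudha Varadarajan, Moran Yanuka, Angelina Wang, Hadar Averbuch-Elor
Affiliations: Carnegie Mellon University; Tel Aviv University; Cornell University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: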
Abstract:Text-to-image (T2I) models generate realistic likenesses of some individuals when prompted with their names, raising privacy concerns. However, distinguishing whether a generated face is memorized or fabricated currently requires ground-truth photos, access to training data, or white-box access to model internals, limiting applicability. We introduce a fully black-box behavioral probe that distinguishes between these regimes while requiring no reference photos or prior knowledge of training data. To benchmark this task, we present the NAMESAKES dataset of over one thousand names and faces of public figures spanning a wide range of fame levels, along with perturbed, less famous names. Experiments on state-of-the-art T2I models show that our probe substantially predicts identity memorization and separates memorized from unrecognized names, with further insights into differences across model families.
[NLP-14] From Texts to Scores: Tracing the Emergence of Essay Quality Representations in Large Language Models
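[Quick read]: Large language models (LLMs) have substantially improved Automated Essay Scoring (AES), but how they internally encode essay quality remains poorly understood. This paper analyzes the hidden representations of eight LLMs on three datasets (ASAP++, CSEE, and ENEM) using linear probing, cross-prompt generalization, dimensionality reduction, and neuron-level analysis. It finds that essay quality information is encoded in a linearly decodable form, emerges progressively across layers, is robust to prompting strategy, and partially transfers across essay prompts despite differing scoring rubrics. Nonlinear probes bring only marginal and inconsistent gains, suggesting that most essay quality information is already linearly accessible. The study also identifies individual "essay scoring neurons" whose activations correlate strongly with scores and respond to targeted intervention; the layer-wise distribution of these neurons shifts systematically with essay length, with longer essays relying more on deeper layers. The key contribution is evidence of structured essay-quality representations in LLMs, which offers a new angle on the interpretability of LLM-based AES systems.
Link: https://arxiv.org/abs/2606.20152
Authors: Jiaxu Zuo, Mu You, Kaixin Lan, Tao Fang, Yujia Huo, Henghua Shen, Lidia S. Chao, Derek F. Wong
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: This is a preprint of a manuscript currently under peer review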
Abstract:Recent advances in Large Language Models (LLMs) have substantially transformed Automated Essay Scoring (AES), yet the internal mechanisms underlying LLM-based scoring remain poorly understood. In this work, we systematically analyze the hidden representations of eight LLMs across two English essay datasets (ASAP++, CSEE) and one Portuguese dataset (ENEM). Using linear probing, cross-prompt generalization, dimensionality reduction, and neuron-level analyses, we find consistent evidence that essay quality information is encoded in a linearly accessible form within LLM representations. These representations emerge progressively across layers, remain robust across prompting strategies, and partially transfer across essay prompts despite differences in scoring rubrics. In addition, nonlinear probes provide only marginal and inconsistent improvements over linear probes, suggesting that most essay quality information is already linearly decodable. We further identify individual “essay scoring neurons” whose activations strongly correlate with essay scores and whose behavior is sensitive to targeted intervention. Moreover, the layer-wise distribution of these neurons systematically shifts with essay length, with longer essays relying more heavily on deeper layers. Overall, our findings provide evidence that LLMs encode structured representations related to essay quality and offer new insights into the interpretability of LLM-based AES systems.
[NLP-15] HydraHead: From Head-Level Functional Heterogeneity to Specialized Attention Hybridization
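[Quick read]: This paper addresses the computational bottleneck that the quadratic complexity of attention imposes on long-context processing. Most open-source hybrid attention models hybridize layer by layer, yet prior work notes the inherent difficulty of combining Linear Attention (LA) with Full Attention (FA), suggesting that the design space is underexplored. Through interpretability analysis, the authors observe that layers show block-wise functional similarity, while heads within the same layer display distinct functional specialization despite sharing input features. This head-level heterogeneity suggests the head dimension is a natural and principled granularity for fusing heterogeneous attention signals. The proposed HydraHead architecture hybridizes FA and LA along the head axis with two key components: (1) an interpretability-driven selection strategy that identifies retrieval-critical heads and keeps FA only for them; (2) a scale-normalized fusion module that reconciles the distributional gap between FA and LA head outputs. A three-stage transfer pipeline with parameter reuse and distillation keeps training overhead low. Under a unified training setup, HydraHead outperforms other hybrid designs on long-context tasks while maintaining strong general reasoning, and at a 7:1 LA-to-FA ratio it matches the long-context performance of a 3:1 layer-wise hybrid. Trained on only 15B tokens, it improves over the baseline by more than 69% at 512K context length, approaching Qwen3.5, a leading model of comparable size with a native context length of 256K.
Link: https://arxiv.org/abs/2606.20097
Authors: Zhentao Tan, Wei Chen, Jingyi Shen, Yao Liu, Xu Shen, Yue Wu, Jieping Ye
Affiliations: Alibaba Group
Categories: Computation and Language (cs.CL)
Comments: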
Abstract:The quadratic complexity of attention poses a critical bottleneck for long-context processing, spurring interest in hybrid attention designs. Most open-source hybrid models adopt a layer-wise strategy. Yet, prior work has noted the inherent difficulty of integrating Linear Attention (LA) with Full Attention (FA), suggesting that the design space of attention hybridization remains underexplored. To probe this space, we conduct interpretability analysis and observe that layers exhibit block-wise functional similarity, while individual heads within the same layer display distinct functional specialization despite sharing input features. This head-level heterogeneity suggests that the head dimension provides a natural and principled granularity for fusing heterogeneous attention signals. Building on this insight, we introduce HydraHead, a novel architecture that hybridizes FA and LA along the head axis. HydraHead features two key innovations: (1) an interpretability-driven selection strategy that identifies retrieval-critical heads and preserves FA only for them, and (2) a scale-normalized fusion module that reconciles the distributional gap between FA and LA head outputs. By leveraging a three-stage transfer pipeline with parameter reuse and distillation, we achieve high-performance hybrid models with minimal training overhead. Under a unified training setup, HydraHead outperforms other hybrid designs in long-context tasks while maintaining strong general reasoning. With interpretability-driven head selection, it matches a 3:1 layer-wise hybrid’s long-context performance at a 7:1 LA-to-FA ratio. Crucially, trained on only 15B tokens, HydraHead achieves over 69% improvement over the baseline at 512K context length, approaching Qwen3.5, a leading model of comparable size with a native context length of 256K. This highlights the significant scaling potential of head-level hybridization.
[NLP-16] Self-Preference Is Weak or Absent in Verifiable Instruction-Following Revision: A Four-Model Test Under Genuine Authorship
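[Quick read]: This paper asks whether large language models (LLMs), which are known to favor their own generations when acting as judges, also resist valid corrections to their own writing. To make "valid" objective, the study uses a setting with a deterministic verifier: instruction-following revision on IFEval. A model writes a draft; the official IFEval checker confirms that the draft violates a constraint and that a candidate edit fixes it; the model then accepts or rejects the edit either as the genuine in-context author or as a fresh model that sees the draft neutrally. Across four mid-tier model families and 85 author-versus-fresh comparisons, no self-preference is detected: authors reject verified-good fixes at essentially the same rate as fresh models (gap -5.1 pp, 95% CI [-12.9, +2.7]). A self-skepticism hint from a smaller pilot did not replicate at scale. The one robust observation is qualitative: when authors do reject a verified-good fix, 97% of their stated reasons are flaw-catching rather than preference, which describes the character of rejections and not an elevated rate. The authors caution that effects smaller than about 13 pp cannot be excluded at this sample size.
Link: https://arxiv.org/abs/2606.20093
Authors: William Guey, Pierrick Bougault
Affiliations: Tsinghua University
Categories: Computation and Language (cs.CL)
Comments: 7 pages, 3 tables. Code and data: this https URL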
Abstract:Large language models (LLMs) increasingly review and revise text, including their own. A documented self-preference bias (models favoring their own generations when acting as judges) raises the question of whether models also resist valid corrections to their own writing. We test this in a setting where “valid” is decided not by another model but by a deterministic verifier: instruction-following revision on IFEval. A model writes a draft; the official IFEval checker confirms the draft violates a constraint and that a candidate edit fixes it; the model then accepts or rejects that edit either as the genuine in-context author or as a fresh model that sees the draft neutrally. Across four mid-tier model families and 85 author-versus-fresh comparisons, we find no detectable self-preference: authors reject verified-good fixes to their own drafts at essentially the same rate as fresh models judging the same drafts (gap -5.1 pp, 95% CI [-12.9, +2.7]). A self-skepticism hint from a smaller pilot did not replicate at scale. The one robust observation is qualitative: when authors do reject a verified-good fix, 97% of their stated reasons are flaw-catching rather than preference, that is, about the character of rejections, not an elevated rate. Effects smaller than ~13 pp cannot be excluded at this sample size.
[NLP-17] IHUBERT: Vector-Based Semantic Deduplication and Domain-Balanced Pretraining for Persian Resources
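[Quick read]: Persian pretrained language models (PLMs) are held back by the scarcity of large-scale, high-quality pretraining corpora and by evaluation that rarely goes beyond standard classification and NER tasks. This paper presents IHUBERT, a monolingual Persian PLM trained from scratch with the RoBERTa-base encoder (125M parameters) on a 45 GB curated subset of the Sepahr-Danesh collection (about 7-8B tokens). To improve corpus quality and diversity, the authors use a multi-stage preprocessing pipeline: normalization, exact and near-duplicate removal, anonymization, and vector-database-based semantic deduplication to control distribution balance across domains and registers. They also train a 139k-vocabulary BPE tokenizer on the full pretraining corpus to better capture Persian morphology and orthographic variation. On seven Persian NLU benchmarks, IHUBERT gains most on extractive question answering, ranking first on PQuAD (F1 88.3542) and ParsiNLU-RC (F1 49.0987), and achieves the best result on FarsTail (Macro-F1 0.8350). It remains competitive on NER and topic classification, while relation extraction is the main remaining gap (0.6684 Macro-F1 on PERLEX). A controlled tokenizer ablation shows that BPE yields slightly lower subword fragmentation than WordPiece at matched vocabulary size, supporting the tokenization design.
Link: https://arxiv.org/abs/2606.20089
Authors: Arash Ghafouri, Mahdi Firouzmandi, Hossein Saberi, Mohammad Reza Hasani Ahangar
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: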
Abstract:Persian pretrained language models (PLMs) are still limited by the scarcity of large-scale, high-quality pretraining corpora and by insufficient evaluation beyond standard classification and NER tasks. We present IHUBERT, a monolingual Persian PLM trained from scratch with the RoBERTa-base encoder (125M parameters) on a 45 GB curated subset of the Sepahr-Danesh collection (about 7-8B tokens). To improve corpus quality and reduce redundancy, we employ a multi-stage preprocessing pipeline that includes normalization, exact and near-duplicate removal, anonymization, and vector-database-based semantic deduplication for distribution balancing control across domains and registers. We additionally train a 139k-vocabulary BPE tokenizer on the full pretraining corpus to better capture Persian morphology and orthographic variation. IHUBERT is evaluated on seven Persian NLU benchmarks covering NER, sentiment analysis, topic classification, NLI, extractive question answering, and relation extraction, using task-standard metrics (entity-level F1, Macro-F1, EM/F1). IHUBERT achieves its strongest gains on extractive QA, ranking first on both PQuAD (F1 88.3542) and ParsiNLU-RC (F1 49.0987), and attains the best result on FarsTail (Macro-F1 0.8350). On NER and topic classification, it remains competitive (e.g., 0.8308 F1 on ParsTwiNER; 0.7953 Macro-F1 on DigiMag), while relation extraction remains the main remaining gap (0.6684 Macro-F1 on PERLEX). A controlled tokenizer ablation on the IHUBERT pretraining corpus shows that BPE yields slightly lower subword fragmentation than WordPiece at matched vocabulary size, supporting our tokenization design. Overall, IHUBERT advances Persian language modeling through semantically curated large-scale pretraining and broad evaluation across both classification and comprehension-oriented tasks.
[NLP-18] What Makes Effective Supervision in Latent Chain-of-Thought: An Information-Theoretic Analysis
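[Quick read]: This paper studies why robust Latent Chain-of-Thought (Latent CoT) reasoning is hard to train: outcome-level supervision provides weak learning signals and leaves latent trajectories prone to semantic drift. From an information-theoretic perspective, the authors characterize this failure as a "dual collapse": gradient attenuation along the optimization path and representational drift in the latent space. They decompose process supervision into two complementary dimensions: Trajectory Supervision, which injects dense stepwise reasoning signals, and Space Supervision, which preserves the semantic structure of the latent manifold. Their analysis shows that rigid geometric compression can collapse the reasoning space, whereas generative reconstruction provides a more flexible semantic anchor that better preserves information capacity. To measure these effects, they introduce the Unified Latent Probe (ULP), which quantifies the mutual information between latent trajectories and explicit reasoning steps. Experiments reveal an "Information-Performance Binding": reasoning accuracy depends on the information fidelity preserved in the latent chain. The paper therefore argues for shifting latent reasoning supervision from geometric imitation toward mutual information maximization.
Link: https://arxiv.org/abs/2606.20075
Authors: Xinghao Chen, Chak Tou Leong, Wenjin Guo, Jian Wang, Wenjie Li, Xiaoyu Shen
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: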
Abstract:Latent Chain-of-Thought (CoT) internalizes reasoning within continuous hidden states, offering a promising alternative to verbose discrete reasoning traces. However, robust latent reasoning remains difficult because outcome supervision provides weak learning signals and leaves latent trajectories prone to semantic drift. In this work, we analyze Latent CoT from an information-theoretic perspective and identify this failure as a dual collapse: gradient attenuation along the optimization path and representational drift in the latent space. We further decompose process supervision into two complementary dimensions: Trajectory Supervision, which injects dense stepwise reasoning signals, and Space Supervision, which preserves the semantic structure of the latent manifold. Our analysis shows that rigid geometric compression can collapse the reasoning space, whereas generative reconstruction provides a more flexible semantic anchor that better preserves information capacity. To measure these effects, we introduce the Unified Latent Probe (ULP), which quantifies the mutual information between latent trajectories and explicit reasoning steps. Experiments reveal a clear Information-Performance Binding: reasoning accuracy depends on the information fidelity preserved in the latent chain. These findings provide a principled framework for latent reasoning supervision and suggest shifting from geometric imitation toward mutual information maximization. Our code is available at this https URL.
[NLP-19] Source-Grounded Data Generation for Text-to-JSON Learning
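[Quick read]: This paper addresses the challenge of reliably extracting high-value information from long, unstructured documents such as financial filings and clinical records into structured, machine-readable representations such as JSON. The bottleneck is constructing reliable and scalable text-to-JSON training data. The proposed solution, STAGE (Spreadsheet-grounded Text-to-JSON Artifact GEneration), is a source-grounded data generation pipeline: it uses LLMs to synthesize reports and JSON schemas at scale, while validating ground-truth values against the underlying spreadsheet. On STAGE-Eval, the authors' source-grounded benchmark with an 851-example test set, STAGE produces stronger training data than existing approaches, improving Qwen3-4B exact match from 31.37% to 74.27% and value accuracy from 45.46% to 90.69%.
Link: https://arxiv.org/abs/2606.20072
Authors: Sunghee Ahn, Guijin Son, Youngjae Yu
Affiliations: Seoul National University
Categories: Computation and Language (cs.CL)
Comments: Preprint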
Abstract:From financial filings to clinical records, legacy industries rely heavily on long, unstructured documents to store high-value information. Reliably extracting this information into structured, machine-readable representations is a key prerequisite to making the contents accessible to automated systems. JSON is a natural target for such structured extraction, yet constructing reliable and scalable text-to-JSON training data remains challenging. To address this gap, we propose STAGE (Spreadsheet-grounded Text-to-JSON Artifact GEneration), a source-grounded data generation pipeline that constructs reports and JSON schema by using LLMs for scalable synthesis while validating ground-truth values against the underlying spreadsheet. Evaluations on STAGE-Eval, our source-grounded benchmark with an 851-example test set, show that STAGE produces stronger training data than existing approaches. This improves Qwen3-4B exact match from 31.37% to 74.27% and value accuracy from 45.46% to 90.69%.
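The two metrics quoted in this abstract can be illustrated with a minimal sketch. The definitions below are assumptions made for illustration, not the paper's actual scoring code: "exact match" is taken to mean the parsed prediction equals the parsed reference, and "value accuracy" is taken to be the fraction of reference leaf values reproduced at the same path. The helper names `flatten` and `score` are hypothetical.

```python
import json

def flatten(obj, prefix=""):
    """Map every leaf value of a parsed JSON object to a path such as '/items/0/qty'."""
    if isinstance(obj, dict):
        leaves = {}
        for key, value in obj.items():
            leaves.update(flatten(value, f"{prefix}/{key}"))
        return leaves
    if isinstance(obj, list):
        leaves = {}
        for index, value in enumerate(obj):
            leaves.update(flatten(value, f"{prefix}/{index}"))
        return leaves
    return {prefix: obj}

def score(pred_json, ref_json):
    """Assumed metric definitions (not taken from the paper):
    exact_match    -> whole parsed objects are equal
    value_accuracy -> share of reference leaves found unchanged in the prediction
    """
    pred, ref = json.loads(pred_json), json.loads(ref_json)
    pred_leaves, ref_leaves = flatten(pred), flatten(ref)
    hits = sum(
        1 for path, value in ref_leaves.items()
        if path in pred_leaves and pred_leaves[path] == value
    )
    accuracy = hits / len(ref_leaves) if ref_leaves else 1.0
    return {"exact_match": pred == ref, "value_accuracy": accuracy}

reference = '{"revenue": 120, "items": [{"name": "A", "qty": 2}]}'
prediction = '{"revenue": 120, "items": [{"name": "A", "qty": 3}]}'
result = score(prediction, reference)
```

Under these assumed definitions, a prediction with one wrong leaf out of three fails exact match but still earns partial value accuracy, which is one reason the two reported numbers can differ widely.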
[NLP-20] When Lower Privileges Suffice: Investigating Over-Privileged Tool Selection in LLM Agents
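[Quick read]: As LLM agents increasingly select tools autonomously, their choices among tools with different privileges become safety-relevant. This paper studies over-privileged tool selection, in which an agent selects or escalates to a higher-privilege tool even though a sufficient lower-privilege alternative exists. Prior tool-selection studies focus on safety-agnostic metadata preferences and leave privilege-sensitive choices underexplored. The authors introduce ToolPrivBench, which measures both initial selection and escalation after transient tool failures across eight domains and five recurring risk patterns. They find that over-privileged selection is common among mainstream LLM agents and is amplified by transient failures. General safety alignment does not reliably transfer to least-privilege tool choice, and prompt-level controls provide only limited mitigation under transient failures. The paper therefore proposes a privilege-aware post-training defense that teaches agents to prefer sufficient lower-privilege tools and to escalate only when necessary. Mitigation experiments show that the defense substantially reduces unnecessary high-privilege tool use while preserving general capabilities.
Link: https://arxiv.org/abs/2606.20023
Authors: Kaiyue Yang, Yuyan Bu, Jingwei Yi, Yuchi Wang, Biyu Zhou, Juntao Dai, Songlin Hu, Yaodong Yang
Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; Beijing Academy of Artificial Intelligence; The Chinese University of Hong Kong; Institute for Artificial Intelligence, Peking University; School of Cyber Security, University of Chinese Academy of Sciences
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: code: this https URL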
Abstract:As LLM agents increasingly select tools autonomously, their choices among tools with different privileges become safety-relevant. However, prior tool-selection studies focus on safety-agnostic metadata preferences, leaving privilege-sensitive choices underexplored. To address this gap, we study over-privileged tool selection, in which an agent selects or escalates to a higher-privilege tool despite a sufficient lower-privilege alternative. We introduce ToolPrivBench to evaluate whether agents choose higher-privilege tools despite sufficient lower-privilege alternatives, measuring both initial selection and escalation after transient tool failures. Across eight domains and five recurring risk patterns, we find that over-privileged tool selection is common among mainstream LLM agents and is further amplified by transient failures. We further find that general safety alignment does not reliably transfer to least-privilege tool choice, while prompt-level controls provide only limited mitigation under transient failures. We therefore introduce a privilege-aware post-training defense that teaches agents to prefer sufficient lower-privilege tools and escalate only when necessary. Our mitigation experiments show that this defense substantially reduces unnecessary high-privilege tool use while preserving general capabilities.
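The notion of an over-privileged choice can be made concrete with a small sketch. Everything here is a hypothetical illustration and not ToolPrivBench itself: the `Tool` record, the integer privilege levels, and the capability sets are assumptions chosen to show what "a sufficient lower-privilege alternative" means.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tool:
    name: str
    privilege: int           # hypothetical ordinal level: higher means more privileged
    capabilities: frozenset  # hypothetical set of operations the tool can perform

def least_privilege_choice(tools, required):
    """Return the lowest-privilege tool whose capabilities cover the task, or None."""
    sufficient = [t for t in tools if required <= t.capabilities]
    return min(sufficient, key=lambda t: t.privilege) if sufficient else None

def is_over_privileged(chosen, tools, required):
    """True when a sufficient tool with strictly lower privilege was available."""
    best = least_privilege_choice(tools, required)
    return best is not None and chosen.privilege > best.privilege

read_only = Tool("read_file", 1, frozenset({"read"}))
read_write = Tool("edit_file", 2, frozenset({"read", "write"}))
admin = Tool("root_shell", 3, frozenset({"read", "write", "exec"}))
tools = [admin, read_write, read_only]

choice = least_privilege_choice(tools, {"read"})
flagged = is_over_privileged(admin, tools, {"read"})
```

In this toy setup, picking `root_shell` for a read-only task is flagged, while picking it for a task that needs `exec` is not, since no lower-privilege tool suffices.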
[NLP-21] Connect the Dots: Training LLMs for Long-Lifecycle Agents with Cross-Domain Generalization Via Reinforcement Learning
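[Quick read]: This paper addresses the lack of a "Connect the Dots" (CoD) meta-capability in large language models (LLMs) deployed over long lifecycles: as an agent solves a long sequence of tasks in an environment, it should keep exploring, learn from its own experience, iteratively update its context about the environment, and perform progressively better on future tasks conditioned on that updated context. The central challenge is an end-to-end reinforcement learning (RL) framework that handles long rollout sequences interleaving solve-task and update-context episodes. The key components are: (1) algorithm design and infrastructure for such training, instantiated as a GRPO-style RL algorithm with fine-grained credit assignment; (2) tasks and environments built to elicit and measure the CoD meta-capability itself, rather than domain-specific capabilities or standard task-by-task RL. Proof-of-concept experiments validate end-to-end RL training in the CoD setting and show potential for generalization within the training domains, across domains, and from CoD to Ralph-loop settings.
Link: https://arxiv.org/abs/2606.20002
Authors: Yanxi Chen, Weijie Shi, Yuexiang Xie, Boyi Hu, Yaliang Li, Bolin Ding, Jingren Zhou
Affiliations: Alibaba Group
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Work in progress; we will continuously update the codebase and arXiv version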
Abstract:This work presents a general framework for training large language models (LLMs) to “Connect the Dots” (CoD), a meta-capability required by long-lifecycle agents: as an LLM-based AI agent gets deployed in an environment, it solves a long sequence of tasks while continuously exploring the environment, learning from its own experiences, and iteratively self-updating its context about the environment, thereby achieving progressively better performance on future tasks conditioned on the updated context. Major components of the CoD framework include: (1) algorithm design and infrastructure for end-to-end reinforcement learning (RL) with long rollout sequences interleaving solve-task and update-context episodes; (2) tasks and environments for incentivizing and eliciting the targeted meta-capability in LLMs during training, as well as for faithfully measuring progress during evaluation. We present proof-of-concept implementations of the CoD framework, including a GRPO-style RL algorithm with fine-grained credit assignment, as well as tasks and environments tailored to the targeted meta-capability (rather than domain-specific LLM capabilities or standard task-by-task RL). Empirical results validate the efficacy of end-to-end RL training in the CoD setting, and demonstrate the potential for out-of-distribution generalization – within the training domains, across different domains, and from CoD to Ralph-loop settings – of the elicited meta-capability. Our investigation of CoD connects several lines of prior works, and opens up new opportunities for advancing LLMs and AI agents. To facilitate further research and applications, we release our implementations at this https URL.
[NLP-22] Segment-Level Mandarin Chinese Speech-Based Cognitive Impairment Detection via an Autoencoder with Contrastive Learning
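[Quick read]: This paper addresses the limited generalization of speech-based cognitive impairment detection caused by scarce labeled data and cross-dataset variability. The proposed solution is a segment-level speech representation learning framework: recordings are divided into short segments and converted into spectrograms, and offline and online augmentation strategies are combined with autoencoder-based representation learning and contrastive objectives to produce more discriminative latent representations under limited-data conditions. Experiments on four independent Mandarin Chinese speech datasets show stable and competitive performance on both binary and three-class classification, with particularly notable improvements in the clinically challenging three-class setting, and ablation studies support the design. The findings suggest the framework may offer a scalable and practical approach to screening in resource-constrained clinical settings.
Link: https://arxiv.org/abs/2606.19996
Authors: Yongqi Shao, Hong Huo, Flavio Bertini, Danilo Montesi, Tao Fang
Affiliations: Shanghai Jiao Tong University; University of Parma; University of Bologna
Categories: Sound (cs.SD); Computation and Language (cs.CL)
Comments: 15 pages, 7 figures, 5 tables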
Abstract:Background and Objective: Speech has emerged as a low-cost and non-invasive digital biomarker with considerable potential for cognitive impairment detection. However, limited labeled data and cross-dataset variability remain major challenges for robust speech-based screening systems. Methods: We developed a segment-level representation learning framework for speech-based cognitive impairment detection. Speech recordings were divided into short segments and converted into spectrogram representations. To improve robustness under limited-data conditions, offline and online augmentation strategies were combined with autoencoder-based representation learning and contrastive objectives to enhance discriminative latent representations. Results: Experiments conducted on four independent Mandarin Chinese speech datasets demonstrated stable and competitive performance in both binary and three-class classification tasks, with particularly notable improvements in the clinically challenging three-class setting. Ablation studies further supported the effectiveness of the proposed framework. Conclusions: The findings suggest that segment-level speech representation learning may provide a scalable and practical approach for cognitive impairment screening in resource-constrained clinical settings.
[NLP-23] GEMS: Geometric Constraints Enable Multi-Semantic Superposition in LLMs
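[Quick read]: This paper addresses model collapse in inference-time activation steering when several semantic directions are superposed without constraints. The authors show that the collapse decomposes into two independently acting sources: distributional deviation, where additive perturbations accumulate in norm across layers and push activations outside the training distribution, and directional interference, where non-orthogonal semantic vectors dampen one another when superposed. They propose GEMS, a training-free multi-directional steering method that maps each source to a geometric constraint: norm-preserving weighted superposition and targeted attention-pathway injection for distributional deviation, and real-time orthogonalization for directional interference. On GSM8K, injecting three concurrent non-mathematical directions preserves accuracy at 98% (baseline 92%), while unconstrained addition collapses to 4%; on Wikitext-2, the same injection raises perplexity (PPL) by only 2.2%. Component ablations isolate the causal role of each constraint, and layer-level probes confirm that orthogonalized signals survive the FFN pathway and reach the output distribution with semantic specificity. Qualitative steering effects transfer across architectures from 3B to 31B parameters.
Link: https://arxiv.org/abs/2606.19946
Authors: Yu Deng
Affiliations: Qwen Team
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 30 pages, 5 figures, 20 tables. Code and logs are available at: this https URL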
Abstract:Activation steering controls model behavior by modifying intermediate hidden states at inference time without retraining. Existing methods handle only single-direction injection; when multiple semantic directions are superposed without constraints, the model collapses. We show that this collapse decomposes into two independently acting sources: distributional deviation, where additive perturbations accumulate in norm across layers and drive activations outside the training distribution, and directional interference, where non-orthogonal semantic vectors mutually dampen when superposed. These two sources define the design constraints that any training-free multi-directional intervention must address. As one instantiation of these principles, we propose GEMS, a training-free method that maps each source to a corresponding geometric constraint: norm-preserving weighted superposition and targeted attention-pathway injection for distributional deviation, and real-time orthogonalization for directional interference. On GSM8K, injecting three concurrent non-mathematical directions preserves accuracy at 98% (baseline 92%), while unconstrained addition collapses to 4%; on Wikitext-2, the same injection incurs only 2.2% PPL increase. Component ablation isolates the causal role of each constraint, and layer-level probes confirm that orthogonalized signals survive the FFN pathway and reach the output distribution with semantic specificity. Qualitative steering effects transfer across architectures from 3B to 31B.
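Two of the geometric constraints named in this abstract, orthogonalization and norm preservation, can be sketched generically. This is not the GEMS implementation: it swaps in plain Gram-Schmidt orthogonalization and a simple rescaling to the original norm, applied to toy vectors, and the function names are hypothetical. The paper's weighting scheme and attention-pathway injection are not modeled.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def orthogonalize(directions):
    """Gram-Schmidt: unit vectors that are mutually orthogonal.
    A direction already spanned by earlier ones becomes the zero vector."""
    basis = []
    for d in directions:
        v = list(d)
        for b in basis:
            coeff = dot(v, b)
            v = [x - coeff * y for x, y in zip(v, b)]
        n = norm(v)
        basis.append([x / n for x in v] if n > 1e-8 else [0.0] * len(v))
    return basis

def steer_norm_preserving(hidden, directions, weights):
    """Add weighted orthogonalized directions, then rescale the result
    so its norm equals the norm of the original hidden state."""
    steered = list(hidden)
    for w, b in zip(weights, orthogonalize(directions)):
        steered = [s + w * x for s, x in zip(steered, b)]
    scale = norm(hidden) / norm(steered)
    return [s * scale for s in steered]

hidden = [3.0, 4.0, 0.0]
directions = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]  # deliberately non-orthogonal
basis = orthogonalize(directions)
steered = steer_norm_preserving(hidden, directions, [0.5, 0.5])
```

The rescaling step keeps the activation magnitude fixed however many directions are added, which is the intuition behind the distributional-deviation constraint; orthogonalization removes the overlap between directions that would otherwise cause interference.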
[NLP-24] Light-weight Pronunciation Assessment via Discrete Speech Token Surprisal INTERSPEECH 2026
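[Quick read]: Training automated pronunciation assessment (APA) systems often relies on labeled learner errors or non-native corpora that are costly to collect. This paper proposes a lightweight framework trained only on native speech resources, operating either unsupervised or lightly calibrated with a small set of scored utterances. At inference, learner speech is discretized with a self-supervised (SSL) encoder and a K-means codebook. A token language model trained on native sequences computes surprisal, where higher surprisal indicates greater phonotactic deviation from native patterns. A transcript-guided Text2DUnit–DTW module predicts native token sequences from the reference text and aligns them to the acoustic tokens to derive error-sensitive features. Surprisal and alignment features are fused through simple regression to produce pronunciation scores. On SpeechOcean762, the Pearson correlation coefficient (PCC) improves from 0.60 to 0.66 with transcript guidance, close to supervised baselines, and cross-dataset evaluation on L2-ARCTIC shows consistent gains.
Link: https://arxiv.org/abs/2606.19910
Authors: Syeda Faiza Ahmed Sara, Shammur Absar Chowdhury
Affiliations: Qatar Computing Research Institute
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted to Interspeech 2026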
Abstract:Training automated pronunciation assessment often relies on labeled learner errors or non-native corpora that are costly to collect. We propose a lightweight framework trained only on native speech resources, operating unsupervised or lightly calibrated with a small set of scored utterances. At inference, learner speech is discretized with an SSL encoder and a K-means codebook. A token language model trained on native sequences computes surprisal where higher surprisal indicates phonotactic deviation. We add a transcript-guided Text2DUnit–DTW module that predicts native token sequences from reference text and aligns them to acoustic tokens to derive error-sensitive features. Surprisal and alignment features are fused via simple regression. On SpeechOcean762, PCC improves from 0.60 to 0.66 with transcript guidance, near supervised baselines. Cross-dataset evaluation on L2-ARCTIC shows consistent gains.
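The surprisal step described above can be sketched with a toy model. The abstract does not specify the architecture of the token language model, so the add-one-smoothed bigram model below is a stand-in, and the names `train_bigram` and `mean_surprisal` are hypothetical. Sequences are assumed to contain at least two tokens.

```python
import math
from collections import Counter

def train_bigram(sequences, vocab_size):
    """Fit an add-one-smoothed bigram model on native token sequences."""
    unigrams, bigrams = Counter(), Counter()
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            unigrams[prev] += 1
            bigrams[(prev, cur)] += 1

    def prob(prev, cur):
        return (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)

    return prob

def mean_surprisal(seq, prob):
    """Average surprisal in bits per transition; higher means less native-like."""
    bits = [-math.log2(prob(p, c)) for p, c in zip(seq, seq[1:])]
    return sum(bits) / len(bits)
```

With a model trained on repeated copies of the sequence `[0, 1, 2, 3]`, a learner sequence that follows the same order scores lower than one that reorders the tokens, which mirrors the claim that higher surprisal signals phonotactic deviation.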
[NLP-25] REDACT: A Systematically Controlled Multilingual Benchmark for Personal Information Detection
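[Quick Read]: This paper addresses the lack of systematic, reproducible, multilingual benchmark infrastructure for detecting personally identifiable information (PII). Existing corpora cover few entity types, use ad hoc generation conditions, and do not reveal which surface-form conditions cause detectors to fail. The authors present REDACT, a systematically controlled multilingual PII benchmark with 13,427 records, 324,078 entity annotations, 51 entity types, 4,127 surface-form patterns, and 25 languages across 9 scripts. Its core design is a strength-2 covering-array sampler that controls nine generation axes (domain, format, difficulty, length, density, code-switching, language, adjacency, and co-occurrence). Three entity-level metadata fields (disclosure status, disclosure form, and a GDPR-aligned sensitivity tier) support stratified evaluation beyond aggregate or per-type F1. Five detectors (Presidio, GLiNER, the OpenAI Privacy Filter, GPT-4.1, and Claude Sonnet 4.6) are evaluated on a locked, language-stratified sample of 1,000 records. The rule-based detector performs poorly on HIGH-sensitivity categories (recall 0.07) and on non-verbatim disclosure forms, while the LLM detectors are more robust, with the HIGH tier as their strongest sensitivity slice. A three-model, reference-free LLM-as-judge assessment corroborates that sensitivity-tier assignment is the hardest axis of the task. The benchmark, schema, prompts, and stratified evaluation harness are released.
Link: https://arxiv.org/abs/2606.19881
Authors: Guneesh Vats, Anubha Agrawal, Shikha Singhal, Ajita Dash, Praison Selvaraj, Vidhan Jhawar, Ranga Prasad Chenna, Bharadwaj Y M G
Affiliations: ServiceNow
Categories: Computation and Language (cs.CL)
Comments: 14 pages, 5 figures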
Abstract:Benchmark infrastructure for personally identifiable information (PII) detection remains limited: existing corpora cover few entity types, use ad hoc generation conditions, and do not show which surface conditions cause detector failures. We present REDACT, a systematically controlled multilingual PII benchmark with 13,427 records, 324,078 entity annotations, 51 entity types, 4,127 surface-form patterns, and 25 languages across 9 scripts. A strength-2 covering-array sampler controls nine generation axes: domain, format, difficulty, length, density, code-switching, language, adjacency, and co-occurrence. Three entity-level metadata fields (disclosure status, disclosure form, and a GDPR-aligned sensitivity tier) enable stratified evaluation beyond aggregate or per-type F1. From the full benchmark, we evaluate five detectors (Presidio, GLiNER, the OpenAI Privacy Filter, GPT-4.1, and Claude Sonnet 4.6) on a locked, language-stratified sample of 1,000 records. Aggregate F1 masks an architecture-dependent failure structure: the rule-based detector performs poorly on the highest-stakes data, including HIGH-sensitivity categories (recall 0.07) and non-verbatim disclosure forms, while the LLM detectors remain more robust, with the HIGH tier as their strongest sensitivity slice. A three-model reference-free LLM-as-judge assessment corroborates that sensitivity-tier assignment is the task’s hardest axis. We release the benchmark, schema, prompts, and stratified evaluation harness.
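A strength-2 covering array guarantees that every pair of values from every two axes appears together in at least one sampled record. The abstract does not describe how REDACT constructs its array, so the greedy construction below is only one possible approach, shown on a hypothetical three-axis example; `pairwise_cover` and the axis values are illustrative names. It enumerates the full Cartesian product, which is practical only for small axis sets.

```python
from itertools import combinations, product

def pairs_of(row):
    """All (axis index, value) pairs that a single record covers."""
    return {((i, row[i]), (j, row[j]))
            for i, j in combinations(range(len(row)), 2)}

def pairwise_cover(axes):
    """Greedily pick records until every value pair of every two axes is covered."""
    names = list(axes)
    candidates = list(product(*(axes[n] for n in names)))
    uncovered = set().union(*(pairs_of(r) for r in candidates))
    chosen = []
    while uncovered:
        # Take the record that covers the most still-uncovered pairs.
        best = max(candidates, key=lambda r: len(pairs_of(r) & uncovered))
        chosen.append(dict(zip(names, best)))
        uncovered -= pairs_of(best)
    return chosen
```

For axes such as `{"domain": ["medical", "finance"], "format": ["email", "chat", "form"], "difficulty": ["easy", "hard"]}`, the result is a list of records smaller than the 12-record full product that still exercises every two-way interaction.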
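[NLP-26] The Almost Intelligent Revolution: Options for Scaling Up Deliberation and Empowering People with AI
[Quick Read]: This chapter examines systemic problems that arise when Large Language Models (LLMs) are used in democratic deliberation, namely linguistic constraints, biases, and sycophantic tendencies. It focuses on how LLMs can scale up and democratise deliberation in order to foster inclusivity and empower traditionally marginalised groups. The analysis draws on Systemic-Functional Linguistics to examine how variation across language users (for example, socio-demographic groups) and across language use (for example, communicative functions) shapes participation in AI-supported deliberation. The chapter presents AI-driven deliberation studies and assesses their potential to scaffold argumentation, improve access, and reduce the influence of exclusionary linguistic norms and biases embedded in prestigious registers. It cautions against overclaiming, which leads to unrealistic expectations, and against underclaiming, which risks missed opportunities for AI-assisted engagement. It closes by identifying research directions that would maximise the democratic potential of AI-assisted participation while embedding ethical safeguards against the reproduction of linguistic inequalities.
Link: https://arxiv.org/abs/2606.19864
Authors: Serge Sharoff
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Published in /Handbook of Democracy in the Era of Artificial Intelligence/ edited by Evangelos Pournaras, Srijoni Majumdar, Carina Ines Hausladen, and Dirk Helbing. 2026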
Abstract:The increasing prominence of Large Language Models (LLMs) in public discourse presents both opportunities and challenges for democratic deliberation. While red teaming strategies help mitigate specific risks, broader concerns persist regarding linguistic constraints, biases, and the sycophantic tendencies of LLMs. This chapter explores how LLMs can be used to significantly scale up and democratise deliberation, particularly in fostering inclusivity and empowering traditionally marginalised groups. Drawing on concepts from Systemic-Functional Linguistics, the chapter examines how variations across language users (for example, with respect to socio-demographic groups) and across language use (for example, with respect to communicative functions) shape participation in AI-supported deliberation. The chapter presents AI-driven deliberation studies and assesses their potential to scaffold argumentation, enhance access, and reduce the influence of exclusionary linguistic norms and biases which are embedded in prestigious registers. At the same time, the chapter cautions against both overclaiming, which leads to unrealistic expectations, and underclaiming, which risks missed opportunities for AI-assisted engagement. The chapter concludes by identifying future research directions to maximise the democratic potential of AI-assisted participation while embedding ethical safeguards to counteract the reproduction of linguistic inequalities.
[NLP-27] Large Language Models Do Not Always Need Readable Language
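[Quick Read]: This paper addresses the inefficiency of interfacing Large Language Models (LLMs) through human-readable natural language even when the intended reader is another model, where ordinary language carries redundancy and high context overhead. It studies BabelTele, a class of model-centric textual representations that sacrifice human readability and encode semantic information in compact, non-standard text that instruction-tuned LLMs can still recover. BabelTele is treated as an empirical probe rather than a fixed protocol. Using readability diagnostics, model likelihood measures, human questionnaires, and downstream task evaluations, the authors find that such text can depart substantially from ordinary natural language while preserving core semantics, maintaining 99.5% semantic fidelity when the text is condensed to 27.9% of its original length. Evaluations on cross-model transfer, agent memory, and multi-agent communication suggest reduced context overhead with generally reliable downstream performance, although effectiveness depends on the compressor-reader pair and the task setting. The findings indicate that human readability, natural-language typicality, and model-side semantic recoverability can be partially decoupled, which opens a path toward model-native representations.
Link: https://arxiv.org/abs/2606.19857
Authors: Jiayi Zhu, Haoxuan Peng, Junxi Wang, Liang Ke, Chen Zhang, Linfeng Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 23 pages, 10 figures. Preprint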
Abstract:Large language models (LLMs) are commonly prompted and interfaced with human-readable natural language, even when the intended reader is another model. This paper investigates whether semantic information can be encoded in compact, non-standard textual forms that sacrifice human readability while remaining recoverable by LLMs. We refer to this class of model-centric textual representations as BabelTele, approached here not as a fixed protocol but as an empirical probe into LLMs’ capacity to generate and interpret such representations. Through readability diagnostics, model likelihood measures, human questionnaires, and downstream task evaluations, we find that BabelTele can substantially depart from ordinary natural language while preserving core semantics for instruction-tuned LLMs. As a task-agnostic representational paradigm, BabelTele demonstrates high information density, maintaining 99.5% semantic fidelity even when the text volume is condensed to 27.9% of its original length. We further evaluate its semantic robustness in cross-model transfer, agent memory, and multi-agent communication. Results suggest that BabelTele can reduce context overhead while generally maintaining reliable downstream performance, although its effectiveness depends on the compressor-reader pair and task setting. These findings indicate that human readability, natural-language typicality, and model-side semantic recoverability can be partially decoupled, opening a path toward model-native representations in future exploration of LLM systems.
[NLP-28] Prompt Plan Extract: Zero-Shot Agentic LLMs Workflows for Lung Pathology Extraction from Clinical Narratives
[Quick Read]: This paper addresses the extraction of key information from pathology reports, specifically structured data from lung resection pathology reports. Important information is embedded in unstructured narrative text, so manual extraction is labor-intensive and error-prone. Supervised Natural Language Processing (NLP) pipelines automate the task with Named Entity Recognition (NER) and Relation Extraction (RE), but they require expensive manual annotation and suffer cascading failures when upstream entities are missed. The authors develop a zero-shot, agentic workflow and evaluate five open-source generative Large Language Models (LLMs) on populating 13 synoptic fields of the College of American Pathologists (CAP). Using a novel, registry-aligned evaluation framework, they compare these models against a state-of-the-art supervised GatorTron NER-RE baseline, which achieves a Micro-F1 of 0.960. The best zero-shot model, GPT-OSS-20B, reaches a Micro-F1 of 0.893 (recall 0.949) and accurately extracts complex relations such as Pathologic Stage without task-specific training. The results suggest that open-source, zero-shot agentic LLMs are a low-cost option for extracting lung pathology information.
Link: https://arxiv.org/abs/2606.19852
Authors: Aman Pathak(1), Cheng Peng(1), Mengxian Lyu(1), Ziyi Chen(1), Reema Solan(1), Sankalp Talankar(1), Yasir Khan(1), Hiren Mehta(2), Aokun Chen(3), Yi Guo(1), Yonghui Wu(1)
Affiliations: University of Florida; Florida State University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 7 pages, 2 figures, 3 tables. Affiliations: (1) Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA; (2) Division of Pulmonary, Critical Care and Sleep Medicine, Department of Medicine, College of Medicine, University of Florida, Gainesville, FL, USA; (3) College of Nursing, Florida State University, Tallahassee, FL, USA
Abstract:Information extraction from pathology reports is essential for cancer staging and tumor registry population. Yet key data remains embedded in narrative reports, making manual extraction labor-intensive and error-prone. Traditional supervised Natural Language Processing pipelines address this through fully supervised Named Entity Recognition and Relation Extraction, but require expensive manual annotation and suffer cascading failures when upstream entities are missed. In this study, we developed a zero-shot, agentic workflow, and evaluated five open-source generative Large Language Models (LLMs) to populate 13 College of American Pathologists synoptic fields from lung resection pathology reports. We compared them against a state-of-the-art supervised GatorTron NER-RE baseline using a novel, registry-aligned evaluation framework. The baseline achieved Micro-F1 of 0.960, while the best zero-shot model (GPT-OSS-20B) achieved Micro-F1 of 0.893 (recall: 0.949), accurately extracting complex relations like Pathologic Stage without task-specific training. These results suggest that open-source, zero-shot agentic LLMs are a low-cost solution for extracting lung pathology information.
[NLP-29] AtomMem: Building Simple and Effective Memory System for LLM Agents via Atomic Facts
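[Quick Read]: This paper addresses the limited ability of Large Language Models (LLMs) to accumulate and reuse information across multi-session interactions because of their fixed context windows. Existing memory-augmented systems tend to build memory in a coarse and unstable way, relying on inefficient memory representations or unconstrained updates. The authors propose AtomMem, a long-term memory system designed for value-dense storage and stable memory evolution. A Fact Executor selectively extracts high-value atomic facts from long-form interactions to serve as efficient memory representations. These facts are then organized into hierarchical event structures and temporal profiles, which capture coherent episodic context and track user attributes as they change over time. At retrieval, an associative memory graph is activated to connect fragmented memories. Experiments on the LoCoMo benchmark show state-of-the-art performance across various reasoning tasks, offering a scalable and economically viable way to deploy personalized agents.
Link: https://arxiv.org/abs/2606.19847
Authors: Yanyu Yao, Shangze Li, Zhi Zheng, Hui Zheng, Qi Liu, Tong Xu, Enhong Chen
Affiliations: University of Science and Technology of China, Hefei, China; Anhui University, Hefei, China
Categories: Computation and Language (cs.CL)
Comments: 19 pages, 10 figures, 5 tables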
Abstract:Large language models (LLMs) demonstrate strong reasoning and generation abilities, but their fixed context windows limit long-term information accumulation and reuse across multi-session interactions. Existing memory-augmented systems often construct memory in a coarse and unstable manner, relying on inefficient memory representations or unstable unconstrained updates. To address these challenges, we propose AtomMem, a long-term memory system designed for value-dense storage and stable memory evolution. AtomMem introduces a Fact Executor, which selectively extracts high value atomic facts from long form interactions to serve as highly efficient memory representations. Subsequently, AtomMem organizes these facts into hierarchical event structures and temporal profiles, capturing coherent episodic contexts and tracking dynamically evolving user attributes over time. During retrieval, the system activates an associative memory graph to connect fragmented memories. Experiments on the LoCoMo benchmark confirm that AtomMem achieves state-of-the-art performance across various reasoning tasks, offering a scalable and economically viable solution for deploying intelligent personalized agents.
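[NLP-30] Leverage Is Not Reach: A Control-Window Law for Single-Neuron Steering in Language Models
[Quick Read]: This paper asks why single-neuron interventions in language models often collapse the output instead of controlling a behavior coherently, and notes that no existing theory predicts which interventions will work. It develops a budget-normalized control-window framework. A dose along one write direction reduces to a single control coordinate: the alignment between the residual stream and the write direction, which follows a universal saturation curve in units of a coherence budget, defined as the residual norm divided by the write norm. Coherent control exists only when the behavior trigger lies below the collapse ceiling; the ceiling follows from the weights and one generic forward pass, while triggers are measured at rollout. On fifteen held-out neurons, the predicted ceiling has a mean absolute error of 0.14 (about 0.07 in bulk layers), and the committed open-or-closed verdict holds on eleven, against a majority baseline of ten of fifteen. Closed cases expose three failure modes: collapse before the trigger, too little depth to propagate, or a normalization that caps how far one neuron can push. The law also explains why local gradient attribution anti-predicts control: true controllers write off the readout axis and carry a near-zero first-order gradient. A forward-only contrastive screen recovers controllers that attribution misses. On refusal, the hardest case, intervention success is typed rather than scalar: coherent bypass and strict actionable reach separate, so a neuron can flip refusal in fluent, on-task text with no actionable content, while genuine actionable reach appears for only three of six audited Llama pivots and only at later rollout horizons. Single-neuron steering is therefore a budgeted, typed audit of controllability rather than a fixed-dose anecdote.
Link: https://arxiv.org/abs/2606.19831
Authors: Hongliang Liu
Affiliations: Palo Alto Networks
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: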
Abstract:Aligned language models gate behaviors such as refusal and language routing through sparse feed forward neurons, yet no theory predicts when a single neuron intervention controls a behavior coherently rather than collapsing the output. We develop a budget normalized control window framework for single neuron steering. A dose along one write direction reduces to one control coordinate: the alignment between the residual stream and the write, driven along a universal saturation curve in units of a coherence budget set by the residual norm divided by the write norm. Coherent control exists when a behavior trigger lies below the collapse ceiling. The same coordinate governs benign mode switches and refusal; the ceiling follows from weights and one generic forward pass, while triggers are measured at rollout. On fifteen held out neurons, the predicted ceiling has mean absolute error 0.14, about 0.07 in bulk layers, and the committed open or closed verdict holds on eleven against a ten of fifteen majority baseline. Closed cases expose three failure modes rather than violations: collapse before trigger, too little depth to propagate, or a normalization that caps how far one neuron can push. The law explains why local gradient attribution anti predicts control: true controllers write off the readout axis and carry a near zero first order gradient. A forward only contrastive screen made precise by the window recovers controllers that attribution misses. On refusal, the hardest case, intervention success is typed, not scalar: coherent bypass and strict actionable reach separate, so a neuron can flip refusal in fluent, on task text with no actionable content, and genuine actionable reach appears only for three of six audited Llama pivots and only at later rollout horizons. Single neuron steering is therefore a budgeted, typed audit of controllability rather than a fixed dose anecdote.
[NLP-31] JAMER: Project-Level Code Framework Dataset and Benchmark on Professional Game Engines
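[Quick Read]: This paper addresses project-level code engineering on professional game engines, an area that AI-driven game development has largely left unexplored because large-scale datasets and deterministic evaluation methods are missing. It presents JamSet and JamBench, described as the first project-level game code framework dataset and benchmark built on a professional game engine. The key insight is that Game Jam competitions, community events in which developers build complete games under tight time constraints, yield thousands of suitable open-source projects. Building on the Godot engine's text-based project format and headless execution mode, the authors design a deterministic verification pipeline that runs from file-integrity checks to runtime behavior collection and distills 8,133 verified projects from over 240,000 repositories. Of these, 300 manually verified projects form JamBench and the rest constitute JamSet. JamBench defines theme-driven generation and code completion tasks, evaluated by compilation pass rate, Structural Completeness Score (SCS), and Behavioral Alignment Score (BAS). An evaluation of 9 frontier models reveals a capability cliff as project scale grows: runtime pass rates drop from 80.4% on small projects to 5.7% on large ones (Task2a). Code Agents improve compilation rates but bring no gain in runtime behavioral quality, which indicates that the bottleneck is architectural design rather than syntactic correctness. Experiments also validate JamSet as effective training data. All data and code are publicly available.
Link: https://arxiv.org/abs/2606.19830
Authors: Jianwen Sun, Chuanhao Li, Zizhen Li, Yukang Feng, Fanrui Zhang, Yifei Huang, Yu Dai, Kaipeng Zhang
Affiliations: Nankai University; Shandong University
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments: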
Abstract:Current AI-driven game development has made substantial progress in asset generation, gameplay design, and web-based game coding, yet project-level code engineering on professional game engines remains largely unexplored due to the absence of large-scale datasets and deterministic evaluation methods. We present JamSet and JamBench, the first project-level game code framework dataset and benchmark built on a professional game engine. Our key insight is that Game Jam competitions, community events where developers build complete games under tight time constraints, yield thousands of open-source projects suitable for this purpose. Building on the Godot engine’s text-based format and headless execution mode, we design a deterministic verification pipeline from file integrity to runtime behavior collection, distilling 8,133 verified projects from over 240,000 repositories. Of these, 300 manually verified projects form JamBench; the rest constitute JamSet. JamBench defines theme-driven generation and code completion tasks, evaluated through a pipeline combining compilation pass rates, Structural Completeness Score (SCS), and Behavioral Alignment Score (BAS). Evaluation of 9 frontier models reveals a capability cliff as project scale increases, with runtime pass rates dropping from 80.4% on small projects to 5.7% on large ones (Task2a). Code Agents improve compilation rates yet yield no gains in runtime behavioral quality, indicating that the bottleneck lies in architectural design rather than syntactic correctness. Experiments validate JamSet as effective training data. All data and code are publicly available.
[NLP-32] CREDENCE: Claim Reduction for Decomposition Enhanced Credibility – Semantic Metrics and Convergence Analysis
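[Quick Read]: This paper addresses two problems in claim decomposition for automated fact-checking. First, token-overlap (Jaccard) metrics systematically underestimate decomposition quality for paraphrastic claims. Second, the repair loop has lacked formal termination analysis. The authors present Credence, a revised claim decomposition and evaluation framework. Semantic-F1 uses BGE-large cosine similarity as the fidelity metric, which removes Jaccard's penalty on semantically equivalent claims and improves downstream fact-checking accuracy. Convergence theorems characterise four properties of the repair pipeline: rule-based repair is monotone and finitely terminating under an oracle parser assumption, while LLM-based self-repair is provably non-monotone and requires an early-exit guard. Three evaluation benchmarks spanning social-media, encyclopaedic, and news domains measure cross-domain generalisation, and multi-model benchmarking covers four decomposer models (3.8B-12B) and a closed API model. On SocialClaimSplit, WikiSplitBench, and ClaimDecompBench, Semantic-F1 exceeds Jaccard-F1 by 15 to 32 percentage points. EPR ranges from 0.94 to 1.00 on SocialClaimSplit and WikiSplitBench, with lower base EPR cases (down to 0.824) on the harder news-domain constructions of ClaimDecompBench, and rule-based repair reduces the Atomicity Violation Rate (AVR) by 47-100% relative to the base model without degrading fidelity.
Link: https://arxiv.org/abs/2606.19819
Authors: Phuong Huu Vu Tran, Thuan Duc Mai, Bach Xuan Le
Affiliations: Ho Chi Minh City University of Technology (HCMUT)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 40 pages, 6 figures, 19 tables. Submitted to Language Resources and Evaluation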
Abstract:Decomposing compound sentences into atomic, verifiable claims is a prerequisite for reliable automated fact-checking. Prior work has relied on token-overlap (Jaccard) metrics that systematically underestimate decomposition quality for paraphrastic claims, and has lacked formal termination analysis for the repair loop. We present Credence, a revised claim decomposition and evaluation framework addressing both shortcomings. Our contributions are: (1) Semantic-F1: we use a BGE-large cosine-similarity fidelity metric that resolves Jaccard’s penalisation and improves downstream fact-checking accuracy; (2) Convergence theorems: we formally characterise four properties of the repair pipeline, establishing that rule-based repair is monotone and finitely terminating under an oracle parser assumption, whereas LLM-based self-repair is provably non-monotone and requires an early-exit guard; (3) Three evaluation benchmarks spanning social-media, encyclopaedic, and news domains for cross-domain generalisation measurement; (4) Multi-model benchmarking across four decomposer models (3.8B-12B) and a closed API model. Experiments on SocialClaimSplit, WikiSplitBench, and ClaimDecompBench show that Semantic-F1 outperforms Jaccard-F1 by +15-32pp. EPR ranges from 0.94 to 1.00 on SocialClaimSplit and WikiSplitBench, while ClaimDecompBench includes lower base EPR cases (down to 0.824) due to harder news-domain constructions, and rule-repair reduces the Atomicity Violation Rate (AVR) by 47-100% relative to the base model without degrading fidelity.
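The abstract above names cosine similarity over BGE-large embeddings as the fidelity signal but does not give the exact Semantic-F1 formula. The sketch below shows one plausible reading under stated assumptions: a predicted claim counts as matched when its best cosine similarity to any reference claim reaches a threshold, and the same rule is applied in the other direction for recall. The threshold of 0.8, the function names, and the use of plain Python lists in place of real embeddings are all hypothetical.

```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def semantic_f1(pred_vecs, ref_vecs, threshold=0.8):
    """F1 over claims, where a match means cosine similarity >= threshold.
    The vectors stand in for sentence embeddings (assumed, not computed here)."""
    if not pred_vecs or not ref_vecs:
        return 0.0
    # Precision: share of predicted claims that match some reference claim.
    precision = sum(
        max(cosine(p, r) for r in ref_vecs) >= threshold for p in pred_vecs
    ) / len(pred_vecs)
    # Recall: share of reference claims matched by some predicted claim.
    recall = sum(
        max(cosine(r, p) for p in pred_vecs) >= threshold for r in ref_vecs
    ) / len(ref_vecs)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because matching depends on embedding similarity instead of shared tokens, a paraphrased claim with no word overlap can still count as correct, which is the failure of Jaccard-based scoring that the abstract describes.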
[NLP-33] Clusters are All You Need: Pre-Training the Tsetlin Machine with Semantic Clusters from Language Models for Interpretability
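[Quick Read]: This paper addresses two complementary weaknesses. Pre-trained language models such as BERT classify text well but lack transparency, which limits their use in high-stakes settings. The Tsetlin Machine (TM), an interpretable clause-based model, captures little semantic information, and earlier attempts to bridge the two rely on static word embeddings that miss contextual meaning. The authors propose a semantic pre-training framework that transfers knowledge from a pre-trained language model into a TM without using embeddings. Text samples are grouped into semantically coherent clusters with K-means or Top2Vec, and the resulting cluster-sample pairs are used to pre-train a non-negated TM with enhanced Type I feedback. The TM thereby learns interpretable semantic keywords, which are then fine-tuned on downstream tasks. Across five datasets, the method substantially outperforms vanilla and embedding-based TMs and reaches performance competitive with BERT while remaining interpretable.
Link: https://arxiv.org/abs/2606.19815
Authors: Jiechao Gao, Rohan Kumar Yadav, Yuangang Li, Yuandong Pan, Jie Wang, Ying Liu, Michael Lepech
Affiliations: Stanford University; University of California, Irvine; University of the Chinese Academy of Sciences
Categories: Computation and Language (cs.CL)
Comments: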
Abstract:Pre-trained language models such as BERT achieve strong text classification performance but lack transparency, limiting their use in high-stakes settings. The Tsetlin Machine (TM) offers fully interpretable, clause-based reasoning but captures little semantic information, and prior attempts to bridge the two rely on static word embeddings that miss contextual meaning. We propose a semantic pre-training framework that transfers knowledge from a pre-trained language model into a TM without using embeddings. Text samples are grouped into semantically coherent clusters with K-means or Top2Vec, and the resulting cluster-sample pairs pre-train a non-negated TM with enhanced Type I feedback. The TM thereby learns interpretable semantic keywords that are fine-tuned on downstream tasks. Across five datasets, our method substantially outperforms vanilla and embedding-based TMs and reaches performance competitive with BERT while remaining interpretable.
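[NLP-34] Think Again or Think Longer? Selective Verification for Budget-Aware Reasoning
[Quick Read]: This paper addresses the tension between efficiency and reliability when test-time reasoning is used as a serving-time control: extra reasoning can repair failed attempts, but it can also waste compute on answers that are already correct or introduce harmful answer changes. The authors treat this as a deployment allocation problem rather than a new-verifier problem. They introduce SEVRA (Selective Verification for Reasoning Allocation), a serving-layer controller that decides, from the observable state of the initial attempt, whether to keep a frozen solver's initial answer or invoke active verification. Intervention outcomes are logged on a frozen Qwen3-4B solver and used to train recoverability-aware gates. On \mathfive, selective verification reaches 76.3% accuracy, compared with 75.5% for always verifying, while reducing post-generation tokens by 26.8% and harmful flips from 2.2% to 1.0%. However, an 8,192-token initial solve reaches 76.0% accuracy with 28% fewer total model tokens. In frozen transfer to \gsm, the policy verifies only 3.0% of examples, improves accuracy from 93.4% to 94.5%, and cuts verification tokens by 91.2% relative to always verifying, although a longer initial solve again matches its accuracy with fewer tokens. The resulting deployment rule is to tune the initial budget first and to use selective recovery when explicit checks, bounded retries, auditability, or regression-risk control matter.
Link: https://arxiv.org/abs/2606.19808
Authors: Sajib Acharjee Dip, Dawei Zhou, Liqing Zhang
Affiliations: Virginia Tech; Fralin Biomedical Research Institute, Virginia Tech; FBRI Cancer Research Center, Washington, DC
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: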
Abstract:Test-time reasoning is increasingly used as a serving-time control knob, but extra reasoning is not uniformly valuable: it can repair failed attempts, waste compute on already-correct answers, or introduce harmful answer changes. We study this as a deployment allocation problem rather than a new-verifier problem. We introduce SEVRA, Selective Verification for Reasoning Allocation, a serving-layer controller that decides whether to preserve a frozen solver’s initial answer or invoke active verification. Using a frozen Qwen3-4B solver, we log intervention outcomes and train recoverability-aware gates from serving-visible attempt state. On \mathfive, selective verification reaches 76.3% accuracy, compared with 75.5% for always verifying, while reducing post-generation tokens by 26.8% and harmful flips from 2.2% to 1.0%. However, an 8,192-token initial solve reaches 76.0% accuracy with 28% fewer total model tokens, showing that selective recovery is useful but not the best tested cost frontier. In frozen transfer to \gsm, the selective policy verifies only 3.0% of examples, improves accuracy from 93.4% to 94.5%, and reduces verification tokens by 91.2% relative to always verifying; again, a longer initial solve matches its accuracy with fewer realized tokens. On CommonsenseQA, always-on verification hurts, while Self-Consistency@5 improves accuracy at about five times the realized token cost. The resulting deployment rule is: tune the initial budget first, then use selective recovery when explicit checks, bounded retries, auditability, or regression-risk control matter.
[NLP-35] CombEval: A Framework for Evaluating Combinatorial Counting in Large Language Models
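[Quick Read]: This paper addresses the brittleness of Large Language Models (LLMs) on combinatorial counting and the lack of systematic evaluation for it. Existing benchmarks are mostly static collections, which makes it hard to control variables and expose failure modes in complex reasoning. The authors present CombEval, a dynamic benchmark that represents each problem as a typed Cofola specification over entities, combinatorial objects, object dependencies, and constraints. This representation supports systematic variation of object type, entity scale, constraint count, and reasoning depth, and it generates natural-language counting problems with exact, solver-verified answers. Eleven LLMs are evaluated under direct and code-augmented settings. Models remain brittle on ordered objects, indistinguishable elements, relative positional constraints, and nested object dependencies, and error analysis attributes failures to constraint interpretation and counting principles. CombEval provides a diagnostic testbed for studying when and why LLMs fail at combinatorial reasoning; the code and generated benchmark suites are publicly available.
Link: https://arxiv.org/abs/2606.19788
Authors: Yuxu Zhou, Ondřej Kuželka, Yuyi Wang, Yuanhong Wang, Yi Chang
Affiliations: Jilin University; Czech Technical University in Prague; CRRC Zhuzhou Institute; Tengen Intelligence Institute; International Center of Future Science, Jilin University; Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, MOE, China
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: under review. Code: this https URL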
Abstract:We present CombEval, a dynamic benchmark for evaluating combinatorial counting in large language models. CombEval represents each problem as a typed Cofola specification over entities, combinatorial objects, object dependencies, and constraints, enabling controlled generation of natural-language counting problems with exact solver-verified answers. Unlike static collections, CombEval supports systematic variation of object type, entity scale, constraint count, and reasoning depth. We evaluate 11 LLMs under direct and code-augmented settings and find that models remain brittle on ordered objects, indistinguishable elements, relatively positional constraints, and nested object dependencies. Error analysis further identifies failures in constraint interpretation and counting principles. CombEval provides a diagnostic testbed for studying when and why LLMs fail at combinatorial reasoning. The code and generated benchmark suites are publicly available at this https URL.
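[NLP-36] AgentFinVQA: A Deployable Multi-Agent Pipeline for Auditable Financial Chart QA
[Quick Read]: This paper addresses two obstacles to chart question answering (chart QA) in regulated financial settings. Existing agents are accuracy-focused and opaque, so practitioners cannot tell which answers to trust, and most assume access to proprietary APIs, which is not an option for institutions that cannot send client data to external providers. The authors present AgentFinVQA, a multi-agent pipeline that is both auditable and deployable on-premise. Each query is decomposed into planning, OCR, legend grounding, visual inspection, and verification, and every step is recorded in a traceable Model Evaluation Packet (MEP) per sample. On FinMME, AgentFinVQA improves by 7.68 percentage points over a backbone-matched zero-shot baseline with Gemini-3 Flash (71.24% vs. 63.56%) and by 4.84 percentage points with open-weights Qwen3.6-27B-FP8 served locally. The verifier's verdict also acts as a confidence signal (68.2% exact accuracy on confirmed answers vs. 55.6% on revised answers), which enables routing for human-in-the-loop review. Error analysis shows that question misunderstanding, legend confusion, and extraction error account for nearly two-thirds of failures and are the categories the verifier detects least, pointing to directions for future work. The results indicate that auditable, on-premise financial chart QA is practical and that the open-weights system keeps most of the accuracy gains while enabling full data residency.
Link: https://arxiv.org/abs/2606.19782
Authors: Aravind Narayanan, Shaina Raza
Affiliations: Vector Institute
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: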
Abstract:Financial chart question answering in regulated settings demands more than accuracy: practitioners must know which answers to trust before acting on them, and many institutions cannot send client data to external model providers. Yet existing chart-QA agents are accuracy-focused and opaque, and most assume proprietary API access; to our knowledge, none combines auditability with on-premise deployability without significant accuracy compromise. We present AgentFinVQA, a multi-agent pipeline that decomposes each query into planning, OCR, legend grounding, visual inspection, and verification, recording every step in a traceable Model Evaluation Packet (MEP) per sample. On FinMME, AgentFinVQA improves +7.68 pp over a primary-backbone matched zero-shot baseline with a proprietary backbone (Gemini-3 Flash; 71.24% vs. 63.56%, McNemar $p \approx 1.1 \times 10^{-16}$), and +4.84 pp with open-weights Qwen3.6-27B-FP8 served locally. The verifier’s verdict also serves as a useful confidence signal (68.2% vs. 55.6% exact accuracy on confirmed vs. revised answers), enabling human-in-the-loop review routing. Error analysis shows that question misunderstanding, legend confusion and extraction error account for nearly two-thirds of failures and are the categories least detected by the verifier, identifying clear directions for future work. Together these results show that auditable, on-premise financial chart QA is practical and that the open-weights system keeps most of the accuracy gains while enabling full data residency. We release our code to support reproducible evaluation.
[NLP-37] Manifold Bandits: Bayesian Curriculum Learning over the Latent Geometry of Large Language Models
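[Quick Read]: This paper addresses inefficient training caused by poor problem sampling when Large Language Models (LLMs) are trained with Reinforcement Learning (RL). Existing adaptive curriculum methods usually prioritize prompts of intermediate difficulty and treat problem selection as a standard multi-armed bandit with independent arms, ignoring the structured, heterogeneous nature of the task space. The authors instead frame problem sampling as a manifold-structured bandit with endogenous non-stationarity: problems are related through the model's latent representation space, and sampling decisions steer how learning signals evolve across that space. They introduce Bayesian Manifold Curriculum (BMC), a structure-aware framework that organizes problems into a hierarchical task tree and applies Bayesian learning to guide sampling. Empirically, different sampling strategies induce non-trivial tradeoffs between productivity (learning signal), diversity (coverage of the task manifold), and utility (evaluation relevance). Prioritizing difficulty alone is therefore insufficient for strong downstream performance, which highlights the value of structure and type-awareness in problem sampling.
Link: https://arxiv.org/abs/2606.19750
Authors: Darrien McKenzie, Nicklas Hansen, Xiaolong Wang
Affiliations: University of California, San Diego
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Webpage: this https URL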
Abstract:Reinforcement learning (RL) is a central approach for improving reasoning capabilities in large language models (LLMs), where training efficiency depends critically on how problems are sampled during optimization. Existing adaptive curriculum learning methods typically prioritize prompts of intermediate difficulty, treating problem selection as a standard bandit problem with independent arms and overlooking the structured, heterogeneous nature of the task space. In this work, we frame problem sampling as a manifold-structured bandit problem with endogenous non-stationarity: problems are related through the model’s latent representation space, and sampling decisions can steer how learning signals evolve across that space. To operationalize this perspective, we introduce Bayesian Manifold Curriculum (BMC), a structure-aware framework that organizes problems into a hierarchical task tree and applies Bayesian learning to guide sampling. Empirically, we find that different sampling strategies induce non-trivial tradeoffs between productivity (learning signal), diversity (coverage of the task manifold), and utility (evaluation relevance). These results show that prioritizing difficulty alone is insufficient for strong downstream performance, highlighting the importance of incorporating structure and type-awareness into problem sampling.
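For readers unfamiliar with the bandit view of curriculum learning, the sketch below shows plain Beta-Bernoulli Thompson sampling over problem clusters. It is a deliberately flat baseline and not BMC itself: it has no task tree and no latent-geometry structure, and it treats arms as independent, which is exactly the simplification the abstract criticizes. The class name `ClusterSampler` and the binary "useful learning signal" reward are assumptions made for illustration.

```python
import random

class ClusterSampler:
    """Thompson sampling with an independent Beta posterior per problem cluster."""

    def __init__(self, clusters, seed=0):
        self.rng = random.Random(seed)
        # Beta(1, 1) prior: [successes + 1, failures + 1] for each cluster.
        self.posterior = {c: [1.0, 1.0] for c in clusters}

    def choose(self):
        """Draw one sample per posterior and pick the cluster with the largest draw."""
        draws = {c: self.rng.betavariate(a, b)
                 for c, (a, b) in self.posterior.items()}
        return max(draws, key=draws.get)

    def update(self, cluster, useful):
        """Record whether the sampled problem produced a useful learning signal."""
        self.posterior[cluster][0 if useful else 1] += 1.0
```

A training loop would call `choose()` to pick a cluster, sample a problem from it, and report back with `update(cluster, useful)`. A structure-aware method in the spirit of the abstract would additionally share evidence between related clusters instead of updating each one in isolation.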
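[NLP-38] Benchmarking Agentic Review Systems
[Quick Read]: This paper addresses how to evaluate agentic review systems, which are emerging as a remedy for the pressure that AI-assisted research places on peer review. The authors evaluate two open-source systems (OpenAIReview and coarse), one proprietary system (Reviewer3), and a zero-shot baseline across six LLMs spanning frontier and efficient models. The first benchmark tests whether AI reviews of ICLR/NeurIPS papers track paper quality as approximated by external signals such as citations and acceptance decisions. Every system performs above chance in pairwise accuracy, and the best configuration, OpenAIReview + GPT-5.5, reaches 83.0%. The second is a perturbation benchmark that injects four categories of errors into papers across eight arXiv subject classes and measures detection recall. The strongest configuration catches 71.6% of injected errors, and the union of detections across six models reaches 83.3% recall, which suggests that different models detect different errors. In a public deployment of OpenAIReview with real users, votes on its comments skew positive at 1.44 to 1, and the most common complaints concern false positives and minor nitpicks. Overall, AI reviews still have room for improvement, but they already track human quality judgments well, catch important errors, and earn positive feedback from real users.
Link: https://arxiv.org/abs/2606.19749
Authors: Dang Nguyen, Wanqing Hao, Yanai Elazar, Chenhao Tan
Affiliations: University of Chicago; Bar-Ilan University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 11 pages, 7 tables, 4 figures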
Abstract:A new class of agentic review systems are emerging as a remedy to the pressure placed on peer review systems by AI-assisted research, but it is unclear how they should be evaluated. We evaluate two open-source systems (OpenAIReview and coarse), one proprietary system (Reviewer3), and a zero-shot baseline, across six LLMs spanning frontier and efficient models. First, we study whether AI reviews on ICLR/NeurIPS papers track with papers’ quality as approximated by external signals such as citations and acceptance decisions. Every system performs above chance in pairwise accuracy, and the best is OpenAIReview + GPT-5.5 at 83.0%. Second, to test whether systems can catch errors with known ground truth, we construct a perturbation benchmark that injects four categories of errors into papers across eight arXiv subject classes and measure detection recall. The strongest configuration (OpenAIReview + GPT-5.5) catches 71.6% of injected errors, leaving substantial room for improvement. The union of detections across six models reaches 83.3% recall, suggesting different models detect different errors and better harness design can potentially increase performance. Beyond these benchmarks, we study a public deployment of OpenAIReview with real users. Votes on its comments skew positive at 1.44 to 1, and the most common complaints are about false positives and minor nitpicks. Together, by evaluating full review systems backed by state-of-the-art models on real research papers, we show that while AI reviews still have room for improvement, they can already track human quality judgments well, catch important errors, and earn positive feedback from real users.
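The observation that the union of detections across models beats any single model can be illustrated with plain set arithmetic. The sketch below is only an illustration: the error IDs, model names, and the `recall` helper are hypothetical and are not taken from the paper's benchmark.

```python
def recall(detected, injected):
    """Fraction of injected errors that appear in the detected set."""
    return len(detected & injected) / len(injected)


# Hypothetical toy data: four injected errors, two models with partial overlap.
injected = {"e1", "e2", "e3", "e4"}
detections = {
    "model_a": {"e1", "e2"},
    "model_b": {"e2", "e3"},
}

per_model = {name: recall(found, injected) for name, found in detections.items()}

# Union of detections: an error counts as caught if any model flags it.
union_found = set().union(*detections.values())
union_recall = recall(union_found, injected)
```

In this toy case each model alone recovers half of the injected errors, while the union recovers three of four, which is the kind of complementarity the abstract describes.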
[NLP-39] NRITYAM: Language Models Meet Art and Heritage of Dance ECML KDD’26
[Quick read]: This paper addresses the limited understanding that language models have of local socio-cultural contexts, focusing on traditional dance as a culturally knowledge-intensive domain. It presents NRITYAM, a benchmark for evaluating cultural comprehension in the context of global dance traditions, with 9,260 curated question-answer pairs spanning 12 languages, which the authors describe as the largest dataset dedicated to cultural knowledge in dance. The key to the approach is building the data from the ground up with native dance artists and native speakers, who authored and validated culturally relevant questions specific to their regions. The evaluation covers large language models, small language models, multimodal large language models, and small multimodal language models, and the benchmark is positioned as a new standard for assessing how well AI systems understand and reason about traditional performing arts.
Link: https://arxiv.org/abs/2606.19727
Authors: Punit Kumar Singh, Niladri Ghosh, Advait Joshi, Shailee Choudhary, Michael Färber, Haiqin Yang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 18 pages, 12 figures, in ECML_PKDD’26
Abstract:Language models have become essential tools in shaping modern workflows. However, their global effectiveness hinges on a nuanced understanding of local socio-cultural contexts. To address this gap, we present NRITYAM, a comprehensive benchmark for evaluating the cultural comprehension capabilities of language models in the context of global dance traditions. NRITYAM comprises 9,260 carefully curated question-answer pairs spanning 12 languages, making it the largest dataset dedicated to evaluating cultural knowledge in dance. The dataset has been developed from the ground up through close collaboration with native dance artists and native speakers of the languages, who authored and validated culturally relevant questions specific to their regions. We evaluate a broad set of models, including large language models, small language models, multimodal large language models, and small multimodal language models. As a multilingual and multicultural benchmark, NRITYAM sets a new standard for evaluating the ability of AI systems to understand and reason about traditional performing arts. Detailed dataset samples are available at this https URL.
[NLP-40] FineREX: Fine-Tuned NER-RE for Human Smuggling Knowledge Graphs
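[Quick read]: This paper addresses how to extract high-quality entities and relations from unstructured, jargon-heavy legal documents, such as court proceedings, for the analysis of human smuggling networks. Existing approaches rely on general-purpose large language models (LLMs) that are not tailored to the entity and relationship definitions this domain requires, which limits knowledge graph quality. The proposed solution, FineREX, is a streamlined knowledge graph construction pipeline built around an LLM fine-tuned for named entity recognition and relationship extraction (NER-RE). Using a manually annotated dataset of 512 text chunks, FineREX achieves absolute improvements of 15.50% and 31.46% in entity and relationship F1, respectively, over a larger general-purpose baseline. It also reduces legal noise by nearly half and lowers node duplication on long documents from 17.78% to 11.17%. Eliminating document rewriting and redundant extraction stages cuts end-to-end processing time by 50.0%. The results indicate that domain-specific fine-tuning can outperform larger general-purpose models while improving both the quality and the efficiency of knowledge graph construction.
Link: https://arxiv.org/abs/2606.19710
Authors: Elijah Feldman, Dipak Meher, Carlotta Domeniconi
Affiliations: University of Texas at Dallas
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Code available at this https URL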
Abstract:Court proceedings contain valuable evidence about human smuggling networks, but this information is often buried within unstructured, jargon-heavy legal documents. While large language models (LLMs) can support knowledge graph construction through automated information extraction, existing approaches rely on general-purpose models that are not tailored to the entity and relationship definitions required in this domain. We introduce FineREX, a streamlined knowledge graph construction pipeline built around a fine-tuned LLM for named entity recognition and relationship extraction (NER-RE). Using a manually annotated dataset of 512 text chunks, FineREX achieves absolute improvements of 15.50% and 31.46% in entity and relationship F1-score, respectively, compared to a larger general-purpose baseline. These gains translate into higher-quality knowledge graphs, reducing legal noise by nearly half and lowering node duplication on long documents from 17.78% to 11.17%. By eliminating document rewriting and redundant extraction stages, FineREX also reduces end-to-end processing time by 50.0%. Our results demonstrate that domain-specific fine-tuning can substantially outperform larger general-purpose models while improving both the quality and efficiency of knowledge graph construction for illicit network analysis.
[NLP-41] NEST: Narrative Event Structures in Time for Long Video Understanding
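[Quick read]: This paper addresses the gap between processing long video token streams and actually understanding narrative structure. Existing long-video benchmarks focus on needle-in-a-haystack retrieval and do not evaluate how low-level actions form events, how events interact across time, or how narratives progress, for example whether a model can connect an early setback such as a job loss to a later relationship breakup despite long gaps, intervening scenes, or flashbacks. The paper introduces NEST (Narrative Event Structures in Time for Long Video Understanding), a dataset of 1005 full-length movies (about 98 minutes on average), each annotated with 102 multimodal narrative events grounded in visual content, dialogue, and audio, and linked through temporal ordering, hierarchical composition, and long-range dependencies. Baselines are provided for event trigger detection (ETD), event localization (EL), event argument extraction (EAE), and event relation extraction (ERE). Grounded event discovery is very hard: ETD stays below 8%, EL under 6%, and EAE below 11%. Once events are given, ERE is more tractable, reaching 35.45% F1 zero-shot and 44.42% F1 after fine-tuning, which highlights the gap between event discovery and relation reasoning.
Link: https://arxiv.org/abs/2606.19706
Authors: Ali Asgarov, Kaushik Narasimhan, Najibul Haque Sarker, Hani Alomari, Chia-Wei Tang, Anushka Sivakumar, Zaber Ibn Abdul Hakim, Shaurya Mallampati, Chris Thomas
Affiliations: Virginia Tech
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: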
Abstract:Recent progress in vision-language models has enabled the processing of increasingly long video sequences, but the ability to handle extended token streams does not translate to understanding of narrative structure in long videos. Existing long video benchmarks focus on needle-in-a-haystack retrieval rather than evaluating how low-level actions form events, how events interact across time, and how narratives progress, for example, whether a model can connect an early setback, such as a job loss to a later relationship breakup, despite long gaps, intervening scenes, or flashbacks that reframe what occurred. We introduce NEST (Narrative Event Structures in Time for Long Video Understanding), a dataset of 1005 full-length movies (avg. 98 minutes), each annotated with 102 multimodal narrative events grounded in visual content, dialogue, and audio. NEST captures multimodal narrative events with structured annotations grounded in visual content, dialogue, and audio, and links them through relations that reflect narrative structure, including temporal ordering, hierarchical composition, and long-range dependencies. We introduce baselines for event trigger detection (ETD), event localization (EL), event argument extraction (EAE), and event relation extraction (ERE). The benchmark is highly challenging for grounded event discovery, with ETD below 8%, EL under 6%, and EAE below 11%. In contrast, ERE is more tractable once events are given, reaching 35.45% F1 zero-shot and 44.42% F1 after fine-tuning.
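[NLP-42] TerraMARS: A Domain-Adapted Small-Language-Model Pipeline for Mars Terraforming Literature
[Quick read]: This paper addresses how to extract quantitative knowledge from unstructured Mars science literature for use in habitability assessment and future terraforming studies. The challenge is turning information scattered across papers into machine-readable structured data that downstream applications such as digital twins and habitability modeling can consume. The proposed solution, TerraMARS, is an end-to-end information extraction pipeline built on a domain-adapted small language model: Google Gemma 3 1B is fine-tuned with Quantized Low-Rank Adaptation (QLoRA) on Mars-specific question-answering and information extraction datasets, so that it can answer Mars terraforming questions and convert unstructured text into JSON output. A corpus of open-access papers is collected and processed with a multistage retrieval and chunking framework. The authors describe the output as promising but note that extraction accuracy and factual consistency still need improvement.
Link: https://arxiv.org/abs/2606.19700
Authors: Jyotsna Singh, Ash Black, Jeff Larsen, Scott R. Saleska
Affiliations: College of Information Science, University of Arizona, Tucson, AZ, USA; Biosphere 2, University of Arizona, Tucson, AZ, USA; Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, AZ, USA; Department of Environmental Sciences, University of Arizona, Tucson, AZ, USA
Categories: Computation and Language (cs.CL)
Comments: 16 pages, 1 figure, 4 tables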
Abstract:Researchers are interested in learning about Mars so that it may eventually become habitable for humans. To achieve this, there is a need for comprehensive knowledge of the planet’s atmosphere, hydrology, surface chemistry, radiation environment, and spatial features through the scientific literature. These contain valuable information and meaningful quantitative constraints that can be used in other models and studies, such as habitability assessment and future terraforming studies. We present TerraMARS, an end-to-end information extraction pipeline that combines a domain-adapted Small Language Model to answer Mars terraforming-related questions and convert unstructured Mars science text into machine-readable structured outputs in JavaScript Object Notation (JSON) format. A corpus of open-access papers is collected and processed using a multistage retrieval and chunking framework. Google Gemma 3 1B was adapted to the domain using Quantized Low-Rank Adaptation (QLoRA) fine-tuning on Mars-specific question-answering and information extraction datasets. The resulting pipeline generates both types of output and provides a foundation for integrating knowledge from scientific literature into downstream applications like digital twins and habitability modeling for Mars. The output from this pipeline looks promising, but further improvements are needed to increase extraction accuracy and factual consistency.
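[NLP-43] What sentiment analysis can't see: Measuring whether customers were helped and what went wrong across 70,000 support conversations
[Quick read]: This paper addresses a limitation of sentiment analysis on customer support data: it measures how customers sound rather than whether they were satisfied with the result. The proposed alternative is structured LLM-based annotation. On 70,450 support conversations, GPT-5.4 is used to estimate each customer's satisfaction and to flag whether a concrete problem was reported, alongside tone, and all three readings are validated against the 1-to-5 ratings that customers left. The satisfaction estimate correlates with those ratings at 0.47, against 0.36 for sentiment, and flags unhappy customers with far fewer false alarms. Tone and satisfaction disagree in 44% of conversations, and a single "Neutral" label hides everything from quietly satisfied customers to ones who quietly gave up. The largest group is "tolerated friction": customers who are satisfied but still report a fixable problem, which no sentiment-based dashboard can surface. The broader finding is that LLM-based annotation can capture the customer's state and the cause of their problem directly from raw text, offering a basis for new business metrics.
Link: https://arxiv.org/abs/2606.19698
Authors: Jason Potteiger
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 25 pages, 6 figures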
Abstract:Most companies read their customer support data at scale using sentiment analysis, which measures how customers sound rather than whether they were satisfied with the result. We tested a richer alternative on 70,450 support conversations from a leading online fundraising platform: alongside tone, we used GPT-5.4 to estimate each customer’s satisfaction and to flag whether they reported a concrete problem, then validated all three readings against the 1-to-5 ratings customers left on the conversations they rated. The satisfaction estimate tracked those ratings far better than sentiment did, correlating at 0.47 against 0.36 and flagging unhappy customers with far fewer false alarms. The structured read also sees what sentiment cannot: tone and satisfaction disagree in 44% of conversations, a single “Neutral” label hides everything from quietly satisfied customers to ones who quietly gave up, and the largest group of all is “tolerated friction,” customers who are satisfied but still reporting a fixable problem, a standing issue that no sentiment-based dashboard can surface. The broader finding is that LLM-based annotation can capture far more than the tonality of a customer’s language, offering strong potential for new business metrics grounded instead in the customer’s state (whether they were satisfied) and the cause of their problem extracted directly from the raw textual data of interactions and feedback.
[NLP-44] Efficiently Representing Algorithms With Chain-of-Thought Transformers
[Quick read]: This paper asks whether chain-of-thought (CoT) transformers can efficiently simulate practical algorithms. Prior theory shows that CoT transformers can simulate Turing machines and thus perform arbitrary computation, but the Turing machine is an inconvenient and inefficient model for discussing algorithms. Algorithms are usually analyzed in the Word RAM model, with random-access memory and unit-cost operations on O(log n)-bit words, where they can be substantially more efficient. The central question is therefore whether CoT transformers can simulate Word RAM algorithms at close to their native cost, for example sorting n items in O(n log n) steps or running Dijkstra's algorithm in O(E + V log V) steps. The paper answers affirmatively, up to poly-logarithmic overhead, in three settings: finite-precision transformers with poly-logarithmic width and rightmost unique hard attention; continuous CoT, where reasoning takes the form of vectors rather than tokens; and a hybrid architecture in which transformer layers sit atop a recurrent (linear RNN) layer. The overhead drops to log-square when the Word RAM has a "flat" instruction set, and to logarithmic for multiplication-free flat instructions, in contrast to known CoT simulations of Turing machines, which incur quadratic overhead.
Link: https://arxiv.org/abs/2606.19697
Authors: Yanhong Li, Anej Svete, Ashish Sabharwal, William Merrill
Affiliations: Allen Institute for AI; ETH Zürich
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:The increasing popularity of reasoning models – language models that output a series of reasoning or thought tokens before producing an answer – is justified, in part, by theoretical results showing that chain-of-thought (CoT) transformers can simulate Turing machines, and thus perform arbitrary computation. However, the Turing machine, while suitable for complexity-theoretic analysis, is not convenient, intuitive, or efficient for discussing algorithms. Algorithms are typically designed and analyzed at a higher level of abstraction, captured by the Word RAM model with random-access memory and unit-cost operations on O(log n)-bit words. As a result, Word RAM algorithms can be substantially more efficient than their Turing machine counterparts, raising the question: Can CoT transformers efficiently simulate Word RAM algorithms? For instance, can they sort n items in O(n log n) steps or run Dijkstra’s algorithm in O(E + V log V) steps? We answer affirmatively, up to poly-logarithmic overhead. We first establish this for finite-precision transformers with poly-logarithmic width and rightmost unique hard attention, then strengthen the result to two more practical settings with finite width and log-precision: continuous CoT, where reasoning takes the form of vectors rather than tokens, and a hybrid architecture in which transformer layers sit atop a recurrent (linear RNN) layer. In all three cases, we find that CoT can efficiently simulate any Word RAM algorithm with only a poly-logarithmic overhead in n. This overhead reduces to log-square when the Word RAM has a “flat” instruction set, and only logarithmic for multiplication-free flat instructions – in stark contrast to known CoT simulations of Turing machines, which require quadratic overhead over Word RAM.
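[NLP-45] Code-Switching Reveals Language Anchoring in Multilingual LLMs
[Quick read]: This paper addresses the performance drop that Multilingual Large Language Models (MLLMs) show on Code-Switched (CS) inputs relative to their monolingual source- or target-language counterparts. To understand the degradation, it uses grammar-forced CS as a controlled diagnostic setting and introduces Anchor Bias, a geometric measure of whether a CS hidden state aligns more closely with its source-language or its target-language counterpart. Across diverse MLLMs, Anchor Bias reveals a consistent grammar-frame effect: source-framed CS stays source-anchored, whereas target-framed CS shifts toward the target and shows larger Question Answering (QA) degradation. Based on this pattern, the paper proposes CANVAS (Contextual Anchor-based Neural Vector Alignment Steering), an inference-time intervention that extracts a source-side canvas from the input and softly steers target-language hidden states toward the source anchor during prefill. CANVAS consistently recovers QA F1 across MLLMs and CS conditions, indicating that internal anchoring signals are an actionable target for mitigating CS inference failures.
Link: https://arxiv.org/abs/2606.19668
Authors: Jeonghyun Park, Seunghyun Yoon, Yonghyun Jun, Hwanhee Lee
Affiliations: Chung-Ang University; Adobe Research
Categories: Computation and Language (cs.CL)
Comments: 36 pages, 13 figures, 27 tables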
Abstract:Multilingual Large Language Models (MLLMs) are increasingly expected to handle Code-Switched (CS) inputs, yet mixing languages frequently degrades performance relative to source- or target-language monolingual counterparts. To understand this degradation, we use grammar-forced CS as a controlled diagnostic setting for locating CS representations relative to their source and target counterparts. We introduce Anchor Bias, a geometric measure that quantifies language anchoring, whether a CS hidden state aligns closer to its source or target language counterpart. Across diverse MLLMs, Anchor Bias reveals a consistent grammar-frame effect: source-framed CS stays source-anchored, whereas target-framed CS shifts target-ward and shows larger Question Answering (QA) degradation. Motivated by this representational pattern, we propose CANVAS (Contextual Anchor-based Neural Vector Alignment Steering), an inference-time intervention that extracts a source-side canvas from the input and softly steers target-language hidden states toward the source anchor during prefill. CANVAS consistently recovers QA F1 across MLLMs and CS conditions, showing that internal anchoring signals provide an actionable target for mitigating CS inference failures.
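The abstract describes Anchor Bias only as a geometric measure of whether a code-switched hidden state sits closer to its source or target counterpart, and does not give a formula. One plausible reading, sketched below purely as an assumption, is a difference of cosine similarities; the `anchor_bias` function, its sign convention, and the toy vectors are all hypothetical and may differ from the paper's definition.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def anchor_bias(h_cs, h_src, h_tgt):
    """Assumed definition: positive means source-anchored, negative target-anchored."""
    return cosine(h_cs, h_src) - cosine(h_cs, h_tgt)


# Hypothetical 2-d stand-ins for hidden states.
h_src = [1.0, 0.0]
h_tgt = [0.0, 1.0]
source_framed = [0.9, 0.1]   # lies near the source representation
target_framed = [0.2, 0.8]   # lies near the target representation

bias_source_framed = anchor_bias(source_framed, h_src, h_tgt)
bias_target_framed = anchor_bias(target_framed, h_src, h_tgt)
```

Under this assumed convention, the grammar-frame effect reported in the abstract would show up as a positive score for source-framed inputs and a negative one for target-framed inputs.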
[NLP-46] CacheWeaver: Cache-Aware Evidence Ordering for Efficient Grounded RAG Inference
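[Quick read]: This paper addresses the longer prompts and higher prefill cost that Retrieval-Augmented Generation (RAG) incurs when external evidence is added. Prefix caching in serving engines such as vLLM reduces this cost only when requests share the same token prefix. In grounded generation, adjacent queries may retrieve overlapping evidence in different orders, so set overlap does not turn into reusable prefix overlap. The proposed solution, CacheWeaver, is a lightweight prompt-layer method that keeps a prefix tree over recently served evidence sequences and uses a greedy walk to place the most reusable prefix first, leaving the serving engine and the retrieved evidence set unchanged. Across three vLLM configurations it lowers median time-to-first-token (TTFT) by about 20-33 percent relative to retrieval-order prefix caching, without hurting answer quality in the authors' QA tests. The greedy policy reaches 97.5 percent of the median TTFT gain from oracle ordering, indicating that a simple scheduling layer between retrieval and inference can recover most reusable prefix locality.
Link: https://arxiv.org/abs/2606.19667
Authors: Kaizhen Tan, Rong Gu, Mingyuan Li
Affiliations: Heinz College of Information Systems and Public Policy, Carnegie Mellon University
Categories: Computation and Language (cs.CL)
Comments: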
Abstract:Retrieval-Augmented Generation (RAG) improves factual grounding, but it also lengthens prompts and raises prefill cost. Prefix caching in serving engines such as vLLM reduces this cost only when requests share the same token prefix. In grounded generation, however, adjacent queries may retrieve overlapping evidence in different orders, so set overlap does not become reusable prefix overlap. We present CacheWeaver, a lightweight prompt-layer method for cache-aware evidence ordering. The method keeps a prefix tree over recently served evidence sequences and uses a greedy walk to place the most reusable prefix first, while leaving the serving engine and retrieved evidence set unchanged. Across three vLLM configurations, the method lowers median time-to-first-token (TTFT) by about 20-33 percent relative to retrieval-order prefix caching, without hurting answer quality in our QA tests. The greedy policy reaches 97.5 percent of the median TTFT gain from oracle ordering, indicating that most reusable prefix locality can be recovered by a simple scheduling layer between retrieval and inference.
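The prefix tree plus greedy walk described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `PrefixTree` class, the hit-count scoring, the tie-breaking rule, and the document IDs are all hypothetical.

```python
class PrefixTree:
    """Trie over recently served evidence-ID sequences (hypothetical structure)."""

    def __init__(self):
        self.children = {}  # evidence id -> PrefixTree
        self.hits = 0       # number of served sequences that passed through this node

    def insert(self, sequence):
        node = self
        for doc_id in sequence:
            node = node.children.setdefault(doc_id, PrefixTree())
            node.hits += 1


def cache_aware_order(tree, retrieved):
    """Reorder the retrieved evidence so it follows a previously served prefix.

    The evidence set is unchanged; only the order differs.
    """
    remaining = list(retrieved)
    ordered = []
    node = tree
    while remaining:
        candidates = [d for d in remaining if d in node.children]
        if not candidates:
            break
        # Assumed scoring: follow the child seen most often in served prompts.
        # max() returns the first maximal item, so ties keep retrieval order.
        best = max(candidates, key=lambda d: node.children[d].hits)
        ordered.append(best)
        remaining.remove(best)
        node = node.children[best]
    # Evidence with no reusable prefix keeps its retrieval order.
    return ordered + remaining


tree = PrefixTree()
tree.insert(["d1", "d2", "d3"])
tree.insert(["d1", "d2", "d7"])

reordered = cache_aware_order(tree, ["d3", "d2", "d1"])
no_overlap = cache_aware_order(tree, ["d9", "d2"])
```

Here the retrieved set {d3, d2, d1} is reordered to match the earlier served sequence d1, d2, d3, so a prefix cache keyed on the prompt tokens could be reused, while a request with no matching prefix is passed through in retrieval order.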
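[NLP-47] A Layered Security Framework Against Prompt Injection in RAG-Based Chatbots
[Quick read]: This paper addresses prompt injection, ranked by the OWASP Top 10 for LLM Applications as the most critical vulnerability in large language model (LLM) deployments, and in particular indirect injection in retrieval-augmented generation (RAG) systems. Existing defenses operate at isolated pipeline stages: input filters cannot inspect retrieved documents, and output monitors cannot stop malicious payloads from reaching the model, so a poisoned knowledge-base document can compromise every user whose query retrieves it. The proposed framework has three layers. Layer 1 screens user input with a rule-based pattern library and a fine-tuned semantic anomaly classifier. Layer 2 enforces a provenance-based instruction hierarchy during context assembly, so that retrieved content cannot override operator policy. Layer 3 audits model output with a policy rule engine and a semantic drift detector before delivery. A continuous audit loop aggregates structured logs and supports retraining the classifier against emerging attack patterns. The framework is model-agnostic and deploys as middleware without modifying the underlying LLM. On 5,080 samples across GPT-4o, Llama 3, and Mistral 7B, it reduces Attack Success Rate (ASR) from 71.4% to 11.3%, outperforming the best single-layer baseline by 27.3 percentage points and a published guardrail system by 23.8 percentage points, with a 4.8% false positive rate and a median latency overhead of 61.2 ms. Ablations confirm that the three layers are complementary and that their combined effect exceeds the sum of the individual contributions.
Link: https://arxiv.org/abs/2606.19660
Authors: Gulshan Saleem, Nisar Ahmed, Muhammad Imran Zaman, Ali Hassan
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: Submitted to ICCK Transactions on Information Security and Cryptography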
Abstract:Prompt injection is ranked as the most critical vulnerability in large language model (LLM) deployments by the OWASP Top 10 for LLM Applications, yet existing defenses operate at isolated pipeline stages and remain incomplete. Input filters cannot inspect retrieved documents, while output monitors cannot prevent malicious payloads from reaching the model. Consequently, retrieval-augmented generation (RAG) chatbots remain vulnerable to indirect injection, where a poisoned knowledge-base document compromises every user whose query retrieves it. We present a three-layer framework that intercepts both direct and indirect prompt injection throughout the inference pipeline. Layer 1 screens user input using a rule-based pattern library and a fine-tuned semantic anomaly classifier. Layer 2 enforces a provenance-based instruction hierarchy during context assembly, preventing retrieved content from overriding operator policy. Layer 3 audits model output using a policy rule engine and semantic drift detector before delivery. A continuous audit loop aggregates structured logs and supports retraining to adapt the classifier to emerging attack patterns. The framework is model-agnostic and deploys as middleware without modifying the underlying LLM. Evaluation on 5,080 samples across GPT-4o, Llama 3, and Mistral 7B shows that the framework reduces Attack Success Rate (ASR) from 71.4% to 11.3%, outperforming the best single-layer baseline by 27.3 percentage points and a published guardrail system by 23.8 percentage points, while maintaining a 4.8% false positive rate and a median latency overhead of 61.2 ms. Ablation studies confirm that all three layers provide complementary protection and that their combined effect exceeds the sum of individual contributions.
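[NLP-48] SAGE-OPD: Selective Agent-Guided Intervention for Multi-Turn On-Policy Distillation
[Quick read]: This paper addresses the brittleness of on-policy distillation (OPD) when training LLM agents that interact with environments over multiple turns. OPD trains a student on trajectories induced by its own policy, which makes it promising for mitigating exposure bias, but most studies consider single-turn settings. In multi-turn interaction, early errors alter future observations and compound, and standard dense token-level OPD may over-penalize semantically valid alternatives, reinforce local degeneracies such as repeated actions, and propagate unreliable teacher supervision on off-distribution histories. The proposed solution, SAGE-OPD, is a verifier-free selective intervention framework. It first observes environment feedback and uses teacher judgment to decide whether each student response should be skipped or intervened on. It then weights token-level distillation by teacher confidence, reducing the influence of uncertain teacher distributions on corrupted or ambiguous histories. Finally, it applies loss normalization to preserve the overall loss scale of standard OPD while keeping selective turn-level weighting. On agent tasks, SAGE-OPD consistently improves over baselines, with up to a 13.3% relative improvement in ALFWorld unseen success rate over standard OPD. Ablations show that turn-level intervention, teacher confidence weighting, and loss normalization provide complementary benefits. The conclusion is that multi-turn OPD should remain on-policy, with teacher supervision allocated selectively to turns where intervention is necessary and reliable.
Link: https://arxiv.org/abs/2606.19659
Authors: Yuhang Zhou, Lizhu Zhang, Yifan Wu, Mingyi Wang, Bo Peng, Jiayi Liu, Xiangjun Fan, Zhuokai Zhao
Affiliations: Meta AI
Categories: Computation and Language (cs.CL)
Comments: 21 pages, 3 figures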
Abstract:On-policy distillation (OPD) improves student models by training them on trajectories induced by their own policy, making it a promising approach for mitigating exposure bias in agent training. However, most OPD studies focus on single-turn settings, while realistic LLM agents interact with environments over multiple turns. In this regime, early errors can alter future observations and compound across the trajectory, and standard dense token-level OPD becomes brittle, as it may over-penalize semantically valid alternatives, reinforce local degeneracies such as repeated actions, and propagate unreliable teacher supervision on off-distribution histories. We propose SAGE-OPD, a verifier-free selective intervention framework specifically designed for multi-turn OPD. Instead of applying teacher supervision uniformly across all turns, SAGE-OPD first observes environment feedback and uses teacher judgment to decide whether each student response should be skipped or intervened on. To further address compounding errors, SAGE-OPD weights token-level distillation by teacher confidence, reducing the influence of uncertain teacher distributions on corrupted or ambiguous histories. Finally, SAGE-OPD applies loss normalization to preserve the overall loss scale of standard OPD while retaining selective turn-level weighting. Experiments on agent tasks show that SAGE-OPD consistently improves over baselines, achieving up to a 13.3% relative improvement in ALFWorld unseen success rate over standard OPD. Ablation studies further demonstrate that turn-level intervention, teacher confidence weighting, and loss normalization provide complementary benefits. Our results suggest that effective multi-turn OPD should remain on-policy, but teacher supervision should be selectively allocated to turns where intervention is necessary and reliable.
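The three ingredients named in the abstract (a per-turn skip-or-intervene decision, teacher-confidence weighting of token losses, and loss normalization) can be combined in many ways, and the abstract does not give the formula. The sketch below is one assumed combination for illustration only: the `selective_weighted_loss` function, the dictionary keys, and the choice of a weighted mean as the normalization are hypothetical, and the per-token losses and confidences are taken as precomputed inputs.

```python
def selective_weighted_loss(turns):
    """Confidence-weighted mean of token losses over intervened turns.

    Each turn is a dict with hypothetical keys:
      "intervene":    bool, whether the teacher supervises this turn
      "token_losses": per-token distillation losses (precomputed)
      "teacher_conf": per-token teacher confidence in [0, 1]
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for turn in turns:
        if not turn["intervene"]:
            continue  # skipped turns contribute no teacher supervision
        for loss, conf in zip(turn["token_losses"], turn["teacher_conf"]):
            weighted_sum += conf * loss
            total_weight += conf
    if total_weight == 0.0:
        return 0.0  # nothing to distill on this trajectory
    # Assumed normalization: dividing by the total weight yields a weighted
    # mean, which stays on the same scale as a plain mean token loss.
    return weighted_sum / total_weight


trajectory = [
    {"intervene": True, "token_losses": [2.0, 4.0], "teacher_conf": [0.75, 0.25]},
    {"intervene": False, "token_losses": [100.0], "teacher_conf": [1.0]},
]
loss = selective_weighted_loss(trajectory)
```

In this toy trajectory the skipped turn's large loss is ignored, and within the intervened turn the low-confidence token counts for less than the high-confidence one.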
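[NLP-49] From 50K to 8.2 Million in 24 Hours: Vozinha's Algorithmic Consecration and the Multilingual Making of World Cup Visibility
[Quick read]: This paper studies how language constructed the "algorithmic consecration" of Vozinha, the 40-year-old Cape Verde goalkeeper, whose social media following surged after Spain 0-0 Cape Verde at the 2026 FIFA World Cup. It contributes a multilingual corpus in Portuguese, Spanish, English, and French; a nine-frame narrative taxonomy with cue-based frame annotation; a reproducible annotation pipeline that combines LLM-assisted suggestion with human validation; and an analysis of cross-lingual narrative diffusion across discourse phases. The follower count itself, narrated as "50k to 8M", is treated as a linguistic object: a circulating, narratable proof of visibility rather than a mere measurement. Every datapoint in the follower timeline is typed by value class, confidence, and evidence type. The only exact primary scraper anchor is 8,235,652 followers at 2026-06-16 15:47 UTC; all other figures are reported as estimated ranges or thresholds. The findings suggest that different languages carried different frames (Portuguese mobilization, Spanish crisis, English nation-making) alongside a shared platform-metric spectacle through which peripheral athletic performance became globally visible. As a v0.1 pilot, the paper releases the corpus schema, frame taxonomy, annotation guidelines, hashed visual-evidence log, and typed timeline, and flags full double annotation and inter-annotator agreement as planned work.
Link: https://arxiv.org/abs/2606.19647
Authors: Vinicius Covas
Affiliations: Universidad Anáhuac México
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
Comments: 11 pages, 4 figures, 3 tables; v0.1 pilot preprint. Dataset and evidence package available at this https URL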
Abstract:We present a multilingual computational discourse analysis of how language constructed the algorithmic consecration of Vozinha, the 40-year-old Cape Verde goalkeeper, after Spain 0-0 Cape Verde at the 2026 FIFA World Cup. The study contributes a multilingual corpus in Portuguese, Spanish, English, and French; a nine-frame narrative taxonomy with cue-based frame annotation; a reproducible annotation pipeline combining LLM-assisted suggestion with human validation; and an analysis of cross-lingual narrative diffusion across discourse phases. We treat the platform follower count itself, narrated as “50k to 8M”, as a linguistic object: a circulating and narratable proof of visibility rather than a mere measurement. The follower-growth timeline is used only as contextual metadata: we reconstruct a conservative phase structure, not a continuous API-native series, and type every datapoint by value class, confidence, and evidence type. The only exact primary scraper anchor is 8,235,652 followers at 2026-06-16 15:47 UTC; all other figures are reported as estimated ranges or thresholds, including an estimated pre-match baseline of 45k-56k. Findings suggest that distinct languages carried distinct frames: Portuguese mobilization, Spanish crisis, English nation-making, and a shared platform-metric spectacle through which peripheral athletic performance became globally visible. As a v0.1 pilot, the paper releases the corpus schema, frame taxonomy, annotation guidelines, hashed visual-evidence log, and typed timeline, while flagging full double annotation and inter-annotator agreement as planned work.
[NLP-50] MiqraBERT: Regression-Based Sentence-BERT Finetuning for Biblical Hebrew Parallel Detection
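[Quick read]: This paper addresses computational detection of textual reuse in the Hebrew Bible, where methods based on lexical overlap falter once a parallel involves paraphrase, lexical substitution, or syntactic reworking. It introduces MiqraBERT, a Sentence-BERT model finetuned from AlephBERT (a Modern Hebrew encoder) for verse-level semantic similarity in Biblical Hebrew. The training set contains 1,650 labeled verse and half-verse pairs: 825 true parallels and 825 randomly sampled negatives. Through cosine-similarity regression, the model learns an embedding space in which parallel verses cluster together and unrelated verses move apart. MiqraBERT improves distributional separation 2.7-fold over the pre-trained baseline and shrinks the ambiguous overlap region from roughly 24% to about 6%. Narrative synoptic parallels reach a recall@10 of 87.1%, while poetic parallels stay below 9%. This genre-dependent asymmetry confines the model's reliable scope to narrative textual reuse.
Link: https://arxiv.org/abs/2606.19638
Authors: David M. Smiley
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: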
Abstract:Textual reuse pervades the Hebrew Bible, yet the computational methods used to detect it still rest largely on lexical overlap, and they falter once a parallel involves paraphrase, lexical substitution, or syntactic reworking. This paper introduces MiqraBERT, a Sentence-BERT model finetuned from AlephBERT (a Modern Hebrew encoder) for verse-level semantic similarity in Biblical Hebrew. The training set comprises 1,650 labeled verse and half-verse pairs: 825 true parallels drawn from the Chronicles synoptic material and from foundational studies of poetic parallelism, balanced against 825 randomly sampled negatives. Through cosine-similarity regression, the model learns an embedding space in which parallel verses cluster together and unrelated verses move apart. We evaluate separation with distribution-based metrics, Wasserstein distance and the overlap coefficient, across ten random seeds. MiqraBERT improves distributional separation 2.7-fold over the pre-trained baseline and reduces the ambiguous overlap region from roughly 24% to about 6%. Narrative synoptic parallels reach a recall@10 of 87.1%; poetic parallels remain difficult, below 9%. This genre-dependent asymmetry confines the model’s reliable scope to narrative textual reuse. MiqraBERT is publicly available at this https URL
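For readers unfamiliar with the recall@10 figure quoted above, the sketch below shows how recall@k is typically computed over embeddings ranked by cosine similarity. It is a generic illustration, not the paper's evaluation code: the `recall_at_k` function, the 2-d toy vectors, and the gold indices are hypothetical stand-ins for verse embeddings and annotated parallels.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_k(queries, corpus, gold, k=10):
    """Share of queries whose gold parallel is among the k most similar corpus items.

    gold[i] is the corpus index of the true parallel for queries[i].
    """
    hits = 0
    for query, gold_index in zip(queries, gold):
        ranked = sorted(
            range(len(corpus)),
            key=lambda i: cosine(query, corpus[i]),
            reverse=True,
        )
        if gold_index in ranked[:k]:
            hits += 1
    return hits / len(queries)


# Hypothetical toy embeddings standing in for verse vectors.
corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[0.9, 0.1], [0.1, 0.9]]

easy = recall_at_k(queries, corpus, gold=[0, 1], k=1)
hard_top1 = recall_at_k(queries, corpus, gold=[2, 1], k=1)
hard_top2 = recall_at_k(queries, corpus, gold=[2, 1], k=2)
```

The last two calls show why the cutoff matters: a gold parallel ranked second is a miss at k=1 but a hit at k=2.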
[NLP-51] Before the Labels: How Dataset Construction Shapes Suicidality Detection in Clinical Text
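[Quick read]: This paper questions the default assumption in clinical NLP that electronic health record (EHR) documentation is more reliable ground truth than social media for detecting suicidal behaviors. It argues that EHR-based suicidality datasets encode a particular operationalization of suicidality, shaped by who authors the data, how episodes are bounded, and how ambiguity is resolved. Through a case study of the ScAN dataset, built over MIMIC-III clinical notes, it shows how governance constraints, ICD-based cohort selection, single-annotator labeling, and hospital-stay-level aggregation produce labels that reflect clinician-documented judgments, treat suicidality as a bounded episode, and assume that intent can be reliably inferred from documentation. A linguistic analysis shows that identical labels subsume heterogeneous clinical framings that differ in temporality, negation, and uncertainty. The paper argues that clinical NLP should examine the assumptions embedded in suicidality datasets before interpreting their labels as ground truth.
Link: https://arxiv.org/abs/2606.19637
Authors: Priyanshi Garg, Ishita Rao, Jieqiong Ding, Amandalynne Paullada
Affiliations: University of Washington
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: To appear in the Proceedings of the 11th Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2026)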
Abstract:Clinical NLP increasingly relies on electronic health record (EHR) data to detect suicidal behaviors, treating clinical documentation as more reliable ground truth than social media. We argue that this framing obscures how EHR-based suicidality datasets encode a particular operationalization of suicidality, shaped by who authors the data, how episodes are bounded, and how ambiguity is resolved. We ground this argument in a case study of the ScAN dataset, built over MIMIC-III clinical notes. We show how governance constraints, ICD-based cohort selection, single-annotator labeling, and hospital-stay-level aggregation produce labels that reflect clinician-documented judgments, treat suicidality as a bounded episode, and assume that intent can be reliably inferred from documentation. A linguistic analysis demonstrates that identical labels subsume heterogeneous clinical framings differing in temporality, negation, and uncertainty. We argue that clinical NLP should examine the assumptions embedded in suicidality datasets before interpreting their labels as ground truth.
[NLP-52] TOTEN: Knowledge-Based Ontological Tokenization Of Physical Quantities And Technical Notation In Brazilian Portuguese
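[Quick read]: This paper addresses a semantic blind spot of statistical tokenizers such as Byte-Pair Encoding on engineering text: they do not recognize structured technical entities (physical quantities, numbers, units, symbolic expressions) and split them into semantically arbitrary subwords. The authors propose TOTEN, a knowledge-based ontological tokenization framework that replaces statistical derivation with declarative classification grounded in a formal ontology of engineering entities (OEE). TOTEN has three components: (1) the ontology O, which gathers types, structural principles, composition relations, and preservable invariants; (2) the classification function classify, which maps raw text into typed regions; and (3) the instantiator family inst_tau, which yields a self-descriptive structured representation. Robustness comes from deterministic coupling with three external oracles: Pint for dimensional consistency, the Unicode Character Database for typographic robustness, and RSLP for Portuguese morphology. In an intrinsic evaluation on an internal, physically validated benchmark (EngQuant, N=800) and four external Brazilian Portuguese corpora (N=1771 eligible cases), TOTEN achieves unit ontological atomicity in all contrasts and a numerical reconstruction of 0.775-0.904 on the external corpora, against 0.627-0.703 for the best baseline (Quantulum3); on EngQuant it reaches 0.780 against 0.340. The differences are statistically significant (McNemar test with Holm correction). Spearman correlation between internal and external rankings supports the concurrent validity of the control benchmark, and dimensional equivalence is statistically on par with Pint, the oracle from which the system inherits dimensional authority.
Link: https://arxiv.org/abs/2606.19626
Authors: Antonio de Sousa Leitão Filho; Allan Kardec Duailibe Barros Filho; Fabrício Saul Lima; Selby Mykael Lima dos Santos; Rejani Bandeira Vieira Sousa
Affiliations: Aia Context; Universidade Federal do Maranhão; Universidade de São Paulo
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: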
Abstract:Byte-Pair Encoding tokenization is statistically efficient for vocabulary compression, but semantically blind to structured technical entities, fragmenting physical quantities, numbers, units, and symbolic expressions into lexically arbitrary subwords. We present TOTEN, a knowledge-based ontological tokenization framework that replaces statistical derivation with declarative classification grounded in a formal ontology of engineering entities (OEE). We formalize TOTEN as the triple O, classify, inst_tau: the ontology gathers types, structural principles, composition relations, and preservable invariants; the classification function maps raw text into typed regions; and the instantiator family yields a self-descriptive structured representation. Robustness derives from deterministic coupling with three external oracles: Pint (dimensional), Unicode Character Database (typographic), and RSLP (Portuguese morphology). Intrinsic evaluation covers four properties verifiable by construction – ontological atomicity, dimensional equivalence, typographic robustness, and numerical reconstruction – over an internal, physically validated benchmark (EngQuant, N=800) and four Brazilian Portuguese external corpora (N=1771 eligible cases). We also report detection recall, distinguishing coverage from conditional atomicity. Against eight state-of-the-art baselines, TOTEN achieves unit ontological atomicity in all contrasts and numerical reconstruction of 0.775-0.904 on external corpora, vs. 0.627-0.703 for the best baseline (Quantulum3); on EngQuant, 0.780 vs. 0.340. Differences are statistically significant (McNemar with Holm correction). Spearman correlation between internal and external rankings confirms concurrent validity of the control benchmark. Dimensional equivalence shows statistical parity with Pint, the oracle from which the system inherits dimensional authority.
[NLP-53] Where Does Social Reasoning Come From? Capability Provenance in Language Models
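[Quick read]: This paper asks where specific reasoning capabilities of large language models (social reasoning versus STEM reasoning) come from, that is, which regions of the pretraining corpus support them. Document-level training-data attribution scores are too noisy to map capabilities to corpus regions, and prior work has focused on factual knowledge rather than reasoning. The authors compute gradient-based attribution (TrackStar via Bergson) over a working set drawn from the de-duplicated Dolma3 mix, aggregate influence across WebOrganizer's 24-format x 24-topic taxonomy (576 bins), and use a 2x2 design that varies domain (social vs. STEM) and capability type (reasoning vs. knowledge), contrasting SocialIQA and MMLU Social Sciences against ARC-Challenge and MMLU STEM. Social and STEM reasoning draw on qualitatively distinct corpus regions, and the contrast is sharper at the reasoning level than at the knowledge level. Targeted machine unlearning provides partial causal validation: forgetting high-attribution topic bins (e.g., Literature for SocialIQA) degrades the aligned benchmark more than within-bin random baselines. All code, sampling manifests, the bin-level influence matrix, and unlearning checkpoints are open-sourced.
Link: https://arxiv.org/abs/2606.19625
Authors: Glenn Matlin, Chandreyi Chakraborty, Saehee Eom, Mika Okamoto, Rayan Castilla, Louis Jaburi, Alvin Deng, Taywon Min, Lucia Quirke, Stella Biderman, Mark Riedl
Affiliations: Georgia Institute of Technology, College of Computing; MATS Program; EleutherAI; KAIST AI; Georgia Tech AI Safety Initiative
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Under review at COLM 2026 (Conference)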
Abstract:We use training-data attribution as an interpretable tool for capability discovery, mapping which regions of the pretraining corpus support social-reasoning versus STEM-reasoning in OLMo3-7B. Training-data attribution measures how strongly each training document influences a model’s predictions on a benchmark, but document-level scores are too noisy to identify which corpus regions support which capabilities, and prior work has emphasized factual knowledge rather than reasoning. We compute gradient-based attribution (TrackStar via Bergson) over a working set drawn from the de-duplicated Dolma3 mix, aggregate influence across WebOrganizer’s 24-format x 24-topic taxonomy (576 bins), and contrast benchmark pairs in a 2x2 design that varies domain (social vs. STEM) and capability type (reasoning vs. knowledge): SocialIQA and MMLU Social Sciences against ARC-Challenge and MMLU STEM. Social and STEM reasoning draw on qualitatively distinct corpus regions, and the contrast is sharper at the reasoning level than at the knowledge level. Targeted machine unlearning provides partial causal validation: forgetting high-attribution topic bins (e.g., Literature for SocialIQA) degrades the aligned benchmark more than within-bin random baselines, and we open-source all code, sampling manifests, the bin-level influence matrix, and unlearning checkpoints.
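The bin-level aggregation step described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the record layout (`format`, `topic`, `score`), the use of a per-bin mean, and all labels and values are hypothetical, and the real attribution scores would come from TrackStar rather than being typed in by hand.

```python
from collections import defaultdict


def aggregate_by_bin(docs):
    """Average document-level influence scores within each (format, topic) bin."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for doc in docs:
        key = (doc["format"], doc["topic"])
        sums[key] += doc["score"]
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


def contrast(bins_a, bins_b):
    """Per-bin difference in mean influence between two benchmarks."""
    keys = set(bins_a) | set(bins_b)
    return {k: bins_a.get(k, 0.0) - bins_b.get(k, 0.0) for k in keys}


# Hypothetical scores; labels and values are invented for illustration.
social_docs = [
    {"format": "fiction", "topic": "literature", "score": 0.8},
    {"format": "fiction", "topic": "literature", "score": 0.6},
    {"format": "tutorial", "topic": "science", "score": 0.1},
]
stem_docs = [
    {"format": "fiction", "topic": "literature", "score": 0.2},
    {"format": "tutorial", "topic": "science", "score": 0.9},
]

diff = contrast(aggregate_by_bin(social_docs), aggregate_by_bin(stem_docs))
```

Averaging within a bin is what makes noisy per-document scores usable: a positive entry in `diff` marks a bin that supports the first benchmark more than the second.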
[NLP-54] A BART-based approach with hierarchical strategy for Vietnamese abstractive multi-document summarization
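[Quick read]: This technical report tackles Vietnamese multi-document abstractive summarization, where the core difficulty is integrating information from several documents while keeping the summary coherent and complete. Following the hierarchical approach (condense each document, then aggregate and summarize), it proposes a novel yet simple document-shortening strategy driven by the golden summary, which keeps the stages of the pipeline highly correlated. The method achieves a ROUGE2-F1 score of 0.2468 on the VLSP 2022 public test set and produces fluent, concise summaries. The authors also enlarge the available Vietnamese multi-document summarization data with external sources and release the additional data to the community.
Link: https://arxiv.org/abs/2606.19591
Authors: Vu Nguyen Nguyen Xuan, Huy Ngo Quang
Affiliations: Aimesoft JSC
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: originally written in 2022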
Abstract:In this technical report, we focus on solving the challenge of Vietnamese multi-document abstractive summarization, introduced in the International Workshop on Vietnamese Language and Speech Processing (VLSP) 2022. We choose to follow the popular hierarchical approach, i.e. condensing each document followed by aggregation and summarization. We propose a novel yet simple strategy to shorten documents that is driven by the golden summary, thus ensuring high correlation between stages of the hierarchical approach. Our method achieves a ROUGE2-F1 score of 0.2468 on the VLSP’s public test set, and can produce fluent and concise summaries. Additionally, we utilize external sources for extra data, which greatly enhances the quantity of data for Vietnamese multi-document summarization. The additional data is made available for the community.
[NLP-55] Uncertainty Decomposition for Clarification Seeking in LLM Agents
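[Quick read]: This paper addresses the inability of interactive LLM agents to perceive and express uncertainty when a task is underspecified. The classical aleatoric/epistemic framework is argued to be insufficient for interactive settings, and practical deployment constraints (black-box APIs, interactive latency budgets, no labeled trajectories) rule out logprob-based, multi-sampling, and training-based methods. The authors propose a simple prompt-based decomposition that separates action confidence from request uncertainty (u), so that the agent can ask for clarification when the task specification is ambiguous, without any additional training. To evaluate it, they introduce two clarification-augmented benchmarks (WebShop-Clarification and ALFWorld-Clarification) in which 50% of tasks are deliberately underspecified. Averaged across five LLM backbones (GPT-5.1, DeepSeek-v3.2-exp, GLM-4.7, Qwen3.5-35B, GPT-OSS-120B), the method improves clarification F1 on ALFWorld-Clarification by 73% over ReAct+UE and by 36% over UAM; it leads clarification F1 on every backbone on WebShop-Clarification and on four of five backbones on ALFWorld-Clarification.
Link: https://arxiv.org/abs/2606.19559
Authors: Gregory Matsnev
Affiliations: ITMO University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 26 pages, 8 figures. Source code: this https URL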
Abstract:Recent position papers argue that the classical aleatoric/epistemic uncertainty framework is insufficient for interactive large language model (LLM) agents and call for underspecification-aware, decomposed, and communicable uncertainty representations that can unlock new agent capabilities such as proactive clarification seeking and shared mental-model building. Practical deployment constraints – black-box APIs, interactive latency budgets, and the absence of labeled trajectories – rule out logprob-based, multi-sampling, and training-based methods, leaving prompt-based estimation as the most viable family for surfacing such signals at deployment time. We answer this call with a simple prompt-based decomposition that separates action confidence from request uncertainty (u), enabling the agent to ask for clarification when the task specification is ambiguous. To evaluate it, we introduce two clarification-augmented benchmarks (WebShop-Clarification and ALFWorld-Clarification) in which 50% of tasks are deliberately underspecified, and systematically compare the proposed decomposition against ReAct+UE and Uncertainty-Aware Memory (UAM) across five LLM backbones (GPT-5.1, DeepSeek-v3.2-exp, GLM-4.7, Qwen3.5-35B, GPT-OSS-120B) on these variants together with the standard WebShop, ALFWorld, and REAL benchmarks for fault detection. Averaged across the five backbones, the proposed decomposition improves clarification F1 on ALFWorld-Clarification by 73% over ReAct+UE and by 36% over UAM, and leads clarification F1 on every backbone on WebShop-Clarification and on four of five backbones on ALFWorld-Clarification, indicating that the gains generalize beyond a single LLM.
[NLP-56] Displacement Is Not Direction: Evaluating Fidelity Metrics for Quantized LLM Deployment
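[Quick read]: This paper examines whether fidelity metrics can predict the quality of quantized LLMs at deployment. In practice, per-token KL divergence (KLD) against a high-precision reference is widely used as a low-cost proxy for benchmark quality. Across the full quantization cohorts, KLD correlates strongly with downstream score (Qwen: ρ = -0.72; Devstral: ρ = -0.86; both p < 0.001), but in the near-baseline "silent zone" the relationship collapses to non-significance (Qwen: ρ = +0.00; Devstral: ρ = -0.24, p = 0.36), and it does not recover under any of 14 measurement variants. The authors trace this to a structural decomposition: KLD mainly measures the volume of disagreement with the reference (silent-zone composite ρ = +0.94 on Qwen, p < 0.001, and +0.55 on Devstral, p = 0.03), while its relationship to the direction of those disagreements is weak and task-conditional. At the per-prompt level, KLD has only weak failure-prediction power on code (failed-vs-passed geometric-mean ratios in [1.08, 1.22]) and fails as a cross-model router (42.3%-49.4% accuracy on disagreement prompts). The takeaway is that the amount of disagreement must be distinguished from its direction, so KLD alone is a limited proxy for quality, especially near the baseline.
Link: https://arxiv.org/abs/2606.19558
Authors: Miloš Nikolić, Ali Hadi Zadeh, Enrique Torres Sanchez, Andreas Moshovos
Affiliations: ByteShape; University of Toronto; Vector Institute for Artificial Intelligence
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: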
Abstract:Fidelity metrics, such as per-token KL divergence (KLD) against a high-precision reference, are often used in practice as low-cost proxies for benchmark quality. We test this practice on a 28-quant cohort of Qwen3.6-35B-A3B and a 41-quant cohort of Devstral-Small-2-24B, evaluated across a suite of downstream benchmarks. We find that KLD is strongly correlated with benchmark score over the full cohort ($\rho=-0.72$ on Qwen and $\rho=-0.86$ on Devstral, both with $p<0.001$). However, this relationship collapses to non-significance in the near-baseline silent zone ($\rho=+0.00$ on Qwen and $\rho=-0.24$, $p=0.36$, on Devstral). This collapse persists across 14 measurement variants, including different KLD aggregations, perplexity formulations, top-1 agreement, calibration corpora, and context lengths. At the per-prompt level, KLD has only weak failure-prediction power on code, with failed-vs-passed geometric-mean ratios in $[1.08, 1.22]$ across five models on LiveCodeBench, and fails as a cross-model router, achieving only 42.3%-49.4% accuracy on disagreement prompts. We trace the collapse to a structural decomposition: KLD primarily measures the volume of disagreement with the reference, with silent-zone composite $\rho=+0.94$ ($p<0.001$) on Qwen and $+0.55$ ($p=0.03$) on Devstral, while its relationship to the direction of those disagreements is weak and task-conditional.
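For readers unfamiliar with the metric under test, the following is a minimal sketch of a mean per-token KL divergence between a reference model and a quantized model. It assumes both models expose logits over the same vocabulary at every position. The toy logits and the function names are hypothetical, and the paper evaluates several other aggregations that are not reproduced here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def mean_token_kld(ref_logits, quant_logits):
    """Average per-token KL(reference || quantized) over a sequence."""
    klds = [kl_divergence(softmax(r), softmax(q)) for r, q in zip(ref_logits, quant_logits)]
    return sum(klds) / len(klds)


# Hypothetical logits: 2 token positions over a 3-word vocabulary.
ref = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]]
same = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]]
quant = [[1.8, 1.2, 0.1], [0.5, 2.5, 0.3]]

kld_same = mean_token_kld(ref, same)
kld_quant = mean_token_kld(ref, quant)
```

The resulting number says how far the quantized distribution moved from the reference, not whether the move helped or hurt the correct token, which is the volume-versus-direction distinction the abstract draws.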
[NLP-57] LaViSA: A Language and Vision Structural Ambiguity Benchmark
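[Quick read]: This paper addresses structural ambiguity, where a single sentence admits multiple valid interpretations because of its syntactic structure, and asks whether models can use visual scenes as cues to select the intended reading. The authors build LaViSA, a benchmark of ambiguous sentences, their disambiguated counterparts, and matching images across seven ambiguity categories, and use it to evaluate how well vision and language models (VLMs) resolve ambiguity from visual context. Results show that current VLMs can exploit visual information to some extent but still struggle with certain ambiguity types and with visually subtle semantic distinctions.
Link: https://arxiv.org/abs/2606.19552
Authors: Lee Sangmyeong, Shun Inadumi, Koichiro Yoshino
Affiliations: Nara Institute of Science and Technology; Guardian Robot Project RIKEN; The University of Osaka
Subjects: Computation and Language (cs.CL)
Comments: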
Abstract:Structural ambiguity arises when a single sentence admits multiple valid interpretations due to its syntactic structure, posing a fundamental challenge for language understanding. Visual scenes serve as useful cues for resolving such ambiguity, and Vision and Language Models (VLMs) need to be capable of deriving possible semantic interpretations from visual scenes. We introduce Language and Vision Structural Ambiguity (LaViSA), a benchmark designed to evaluate the ability of VLMs to resolve structural ambiguity leveraging visual scenes. LaViSA consists of ambiguous sentences, their disambiguated sentences, and corresponding images of these disambiguated sentences across seven ambiguity categories. Using LaViSA, we conduct a comprehensive evaluation of diverse VLMs, including both proprietary and open-source models with varying parameter scales and reasoning capabilities. Experimental results show that although recent VLMs can leverage visual scenes to resolve structural ambiguity to some extent, they still struggle with certain ambiguity types and visually subtle semantic distinctions, indicating remaining limitations in resolving structural ambiguity using visual scenes.
[NLP-58] Reliability without Validity: A Systematic Large-Scale Evaluation of LLM-as-a-Judge Models Across Agreement, Consistency, and Bias
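[Quick read]: This paper targets a systematic flaw in how LLM-as-a-Judge models are validated: practice relies on exact-match agreement, which does not correct for chance and therefore overstates a judge's discriminative ability. It presents the largest systematic evaluation to date: 21 LLM judges from nine providers on MT-Bench, JudgeBench, and RewardBench, under three protocols (agreement, consistency, bias audit), over 118 runs and approximately 541,000 individual judgments. The findings: kappa deflation between exact match and Cohen's kappa is universal (33-41 percentage points on MT-Bench); judge rankings shift by up to 14 positions across benchmarks; high test-retest reliability (0.95) coexists with severe position bias (0.10) in two production-deployed judges, a consistency-bias paradox; and verbosity bias is small (0.011) under a single pairwise rubric. The authors distill these findings into a Minimum Viable Validation Protocol built on chance-corrected agreement statistics such as Cohen's kappa, consistency checks, and bias audits.
Link: https://arxiv.org/abs/2606.19544
Authors: Justin D. Norman, Michael U. Rivera, D. Alex Hughes
Affiliations: UC Berkeley School of Information
Subjects: Computation and Language (cs.CL)
Comments: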
Abstract:LLM-as-a-Judge has become the dominant evaluation paradigm for language models, but judge validation in practice relies on exact-match agreement, a metric that does not correct for chance and systematically overstates discriminative ability. We present the largest systematic evaluation of LLM-as-a-Judge to date: 21 judges from nine providers across MT-Bench, JudgeBench, and RewardBench, evaluated under three protocols (agreement, consistency, bias audit) over 118 runs and approximately 541,000 individual judgments. Four findings emerge, consistent across the full cohort, including the April 2026 frontier: kappa deflation between exact match and Cohen’s kappa is universal (33–41 pp on MT-Bench), judge rankings shift by up to 14 positions across benchmarks, high test–retest reliability (0.95) coexists with severe position bias (0.10) in two production-deployed judges (instantiating a consistency–bias paradox), and verbosity bias is small (0.011) across our cohort under a single pairwise rubric. We distill these into a Minimum Viable Validation Protocol.
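The gap between exact-match agreement and a chance-corrected statistic can be seen in a small worked example. The sketch below uses the standard definition of Cohen's kappa, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected from the two raters' label frequencies alone. The labels are invented and are not drawn from the paper's benchmarks.

```python
from collections import Counter


def exact_match(judge, gold):
    """Fraction of items on which the judge and the reference agree."""
    return sum(j == g for j, g in zip(judge, gold)) / len(gold)


def cohens_kappa(judge, gold):
    """Cohen's kappa: agreement corrected for chance agreement."""
    n = len(gold)
    p_o = exact_match(judge, gold)
    judge_freq = Counter(judge)
    gold_freq = Counter(gold)
    # Chance agreement from the marginal label frequencies of both raters.
    p_e = sum(judge_freq[label] * gold_freq[label] for label in gold_freq) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical pairwise preferences: the judge always answers "A".
gold = ["A"] * 8 + ["B"] * 2
judge = ["A"] * 10

agreement = exact_match(judge, gold)
kappa = cohens_kappa(judge, gold)
```

Here a judge that ignores its input still scores 80% exact match, yet its kappa is 0, which is the kind of deflation the abstract reports.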
[NLP-59] PerceptionDLM: Parallel Region Perception with Multimodal Diffusion Language Models
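[Quick read]: This paper addresses the low inference efficiency of existing multimodal large language models (MLLMs) on tasks that require captioning multiple image regions, a consequence of autoregressive generation. It proposes PerceptionDLM, a multimodal diffusion language model (DLM) that exploits the parallel decoding nature of diffusion models: efficient prompting and structured attention masking let the model perceive multiple masked regions simultaneously and generate region descriptions in parallel at both the sequence and token levels. To evaluate this parallelism, the authors construct the Parallel Detailed Localized Captioning Benchmark (ParaDLC-Bench) by scaling DLC-Bench to multiple region masks per image, enabling joint evaluation of caption quality and inference efficiency. Experiments show that PerceptionDLM remains competitive in region captioning while achieving substantial speedups on multi-region perception. The authors state that, to their knowledge, this is the first work to achieve parallel region captioning and perception with diffusion language models.
Link: https://arxiv.org/abs/2606.19534
Authors: Yueyi Sun, Yuhao Wang, Jason Li, Ye Tian, Tao Zhang, Jacky Mai, Yihan Wang, Haochen Wang, Jinbin Bai, Ling Yang, Yunhai Tong
Affiliations: Peking University MSALab; ByteDance
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Code available at this https URL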
Abstract:Multimodal large language models (MLLMs) have achieved remarkable progress in visual understanding tasks. However, most existing MLLMs rely on autoregressive generation, which limits their efficiency for perception tasks that require captioning multiple regions. In this work, we propose PerceptionDLM, a multimodal diffusion language model optimized for efficient parallel region perception. Built upon PerceptionDLM-Base, a strong foundational baseline that achieves state-of-the-art performance among open-source diffusion MLLMs, our architecture fully leverages the parallel decoding nature of DLMs. Specifically, we introduce efficient prompting and structured attention masking to enable simultaneous perception of multiple masked regions, allowing the model to generate region descriptions in parallel at both the sequence and token levels. This design significantly improves inference efficiency compared with existing approaches that process regions sequentially. To systematically evaluate the parallelism property of visual perception capability for DLMs, we construct a new Parallel Detailed Localized Captioning Benchmark (ParaDLC-Bench) by scaling the DLC-Bench to include multiple region masks per image, enabling joint evaluation of both caption quality and inference efficiency. Experiments demonstrate that PerceptionDLM maintains competitive performance in region captioning while achieving substantial speed improvements for multi-region perception tasks. Our results highlight the potential of multimodal diffusion language models for efficient, parallel visual perception. To the best of our knowledge, we are the first to achieve parallel region caption and perception by leveraging the advantages of diffusion language models. Code, models, and datasets are released.
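One way to picture structured attention masking for parallel region captioning is a block-structured mask in which each region's caption tokens see the shared image and prompt tokens and their own block, but not the other regions' captions. The sketch below is a guess at the general idea, not the PerceptionDLM design: the token layout, the bidirectional attention inside each block, and the function name are all assumptions.

```python
def build_region_mask(num_shared, region_lengths):
    """Boolean attention mask: mask[i][j] is True if token i may attend to token j.

    Assumed layout: shared image/prompt tokens first, then one contiguous
    block of caption tokens per region.
    """
    total = num_shared + sum(region_lengths)
    mask = [[False] * total for _ in range(total)]
    # Shared tokens attend only to other shared tokens.
    for i in range(num_shared):
        for j in range(num_shared):
            mask[i][j] = True
    start = num_shared
    for length in region_lengths:
        end = start + length
        for i in range(start, end):
            for j in range(num_shared):
                mask[i][j] = True  # region tokens see the shared context
            for j in range(start, end):
                mask[i][j] = True  # and their own block, in both directions
        start = end
    return mask


# Two shared tokens, then a 2-token caption and a 3-token caption.
mask = build_region_mask(num_shared=2, region_lengths=[2, 3])
```

Because no caption block attends to another, all regions can be denoised in the same forward pass without their descriptions leaking into each other.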
[NLP-60] DeXposure-Claw: An Agentic System for DeFi Risk Supervision
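[Quick read]: This paper addresses the difficulty supervisors face in monitoring fast-moving, networked credit risks in decentralized finance (DeFi). General-purpose LLM agents fit this setting poorly: they over-read weak evidence and recommend high-stakes interventions, and existing evaluations offer no regulator-aligned way to measure the resulting false alarms. The authors propose DeXposure-Claw, a forecast-grounded agentic supervision system that routes LLM decisions through structured evidence. First, DeXposure-FM, a graph time-series foundation model, forecasts future exposure networks. Second, deterministic monitors and stress scenarios turn those forecasts into typed alerts, attribution signals, and scenario evidence. Third, data-health and confidence gates constrain escalation before the system emits auditable supervisory tickets with rationales. For evaluation, they build DeXposure-Bench, a six-axis harness whose decision axis scores tickets against a regulator-aligned absolute-loss ground truth and an explicit false-intervention rate. Experiments on five years of weekly real data support the system.
Link: https://arxiv.org/abs/2606.19501
Authors: Aijie Shu, Bowei Chen, Wenbin Wu, Cathy Yi-Hsuan Chen, Fengxiang He
Affiliations: University of Edinburgh; University of Glasgow; University of Cambridge
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Risk Management (q-fin.RM)
Comments: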
Abstract:Decentralized finance exposes supervisors to fast-moving, networked credit risks. General-purpose LLM agents fit this setting poorly: they over-read weak evidence and recommend high-stakes interventions, while existing evaluations offer no regulator-aligned way to measure the resulting false alarms. We introduce DeXposure-Claw, a forecast-grounded agentic supervision system that routes LLM decisions through structured evidence: (1) DeXposure-FM, a graph time-series foundation model, forecasts future exposure networks; (2) deterministic monitors and stress scenarios then turn those forecasts into typed alerts, attribution signals, and scenario evidence; and (3) data-health and confidence gates constrain escalation before DeXposure-Claw emits auditable supervisory tickets with rationales. We further develop DeXposure-Bench, a six-axis evaluation harness, whose decision axis scores tickets against a regulator-aligned absolute-loss ground truth and an explicit false-intervention rate. Experiments on five years of weekly real data fully support our system. Code is at this https URL.
[NLP-61] Diffusion Language Models: An Experimental Analysis
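[Quick read]: This paper addresses the difficulty of comparing diffusion language models (DLMs): differences in evaluation protocols, datasets, inference budgets, and generation hyperparameters obscure both the capabilities of individual architectures and the trade-off between generation quality and computational efficiency. The authors run a systematic experimental analysis, evaluating eight state-of-the-art DLMs on eight benchmarks spanning reasoning, coding, translation, knowledge, and structured problem solving, with both quality and efficiency taken into account. They also analyze key inference-time factors (denoising steps, context length, block size, parallel unmasking strategies) and complement the large-scale experiments with controlled comparisons of smaller models trained under identical conditions. DLM behavior turns out to depend strongly on generation-time design choices, which produces distinct performance-efficiency trade-offs across tasks, architectures, and inference budgets and yields practical guidance for deployment.
Link: https://arxiv.org/abs/2606.19475
Authors: Thomas Bertolani, Davide Bucciarelli, Leonardo Zini, Marcella Cornia, Lorenzo Baraldi
Affiliations: University of Modena and Reggio Emilia; University of Pisa
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: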
Abstract:Large Language Models (LLMs) have revolutionized language modeling through autoregressive generation, enabling strong performance across a wide range of tasks. Recently, Diffusion Language Models (DLMs) have emerged as an alternative paradigm that generates text through iterative denoising rather than next-token prediction, allowing parallel refinement of entire sequences. While numerous diffusion-based architectures have been proposed, differences in evaluation protocols, datasets, inference budgets, and generation hyperparameters make it difficult to compare their capabilities and understand the trade-offs they offer. In this work, we present a systematic experimental analysis of modern DLMs. Specifically, we evaluate eight state-of-the-art DLMs across eight benchmarks spanning reasoning, coding, translation, knowledge, and structured problem solving, while explicitly considering both generation quality and computational efficiency. Beyond downstream evaluation, we analyze the impact of key inference-time factors, including denoising steps, context length, block size, and parallel unmasking strategies, and complement large-scale experiments with controlled comparisons of smaller models trained under identical conditions. Our analysis highlights the strengths and limitations of diffusion-based language modeling across different tasks, architectures, and inference budgets. We show that the behavior of DLMs is strongly influenced by generation-time design choices, leading to distinct trade-offs between performance and computational efficiency. Overall, our study provides practical insights into the capabilities and deployment characteristics of contemporary DLMs.
[NLP-62] Characterizing Narrative Content in Web-scale LLM Pretraining Data
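[Quick read]: This paper addresses the lack of fine-grained analysis of narrative structure in large-scale pretraining corpora, even though narrative is a fundamental mode of human communication. The challenge is to quantify narrative features in extremely heterogeneous web-scale LLM pretraining data. Drawing on narrative theory, the authors design a framework spanning three core narrative elements (agency, setting, and events), operationalized as 11 interpretable dimensions. After sampling and annotating 400 diverse passages, they finetune and validate NarraBERT, a RoBERTa-based model for fine-grained narrative prediction, and apply it to 3M passages to create a new dataset, NarraDolma. They find that (i) narrative structure is measurable at scale across highly heterogeneous data; (ii) web text has a continuous, multidimensional narrative structure; and (iii) narrative qualities are unequally distributed across pretraining sources and topics in ways that current curation practices neither measure nor account for. NarraDolma and NarraBERT are publicly released.
Link: https://arxiv.org/abs/2606.19468
Authors: Teagan Johnson, Elliott Ash, Andrew Piper, Maria Antoniak
Affiliations: unknown
Subjects: Computation and Language (cs.CL)
Comments: 8 pages of main content, 28 total pages. 30 figures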
Abstract:The narrative composition of web-scale LLM pretraining corpora remains largely unexplored even though narrative is a fundamental mode of human communication. We present the first fine-grained study of narrative features in Dolma, a 3-trillion-token open pretraining corpus. Drawing on narrative theory, we design a framework spanning three core narrative elements (agency, setting, and events) operationalized as 11 interpretable dimensions. After sampling and annotating a diverse set of 400 passages, we finetune and validate NarraBERT, a RoBERTa-based model for fine-grained narrative prediction. We apply NarraBERT to 3M passages, resulting in a new dataset, NarraDolma. We find (i) narrative structure is measurable at scale across extremely heterogeneous data, (ii) we uncover a continuous, multidimensional narrative structure underlying web text, and (iii) narrative qualities are unequally distributed across pretraining sources and topics in ways that current curation practices neither measure nor account for. Our framework, dataset, and analyses provide a foundation for understanding how narrative qualities are distributed in LLM pretraining data and for studying how data composition affects narrative reasoning tasks. We publicly release NarraDolma and NarraBERT.
[NLP-63] Thermodynamic Signatures of Reasoning: Free-Energy and Spectral-Form-Factor Diagnostics for Hallucination Detection in Large Language Models
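[Quick read]: This paper addresses hallucination detection in large language models (LLMs), a deployment-critical problem. Prior work shows that the spectrum of attention-derived graph Laplacians carries strong signal about reasoning quality, but existing spectral diagnostics summarize it with a handful of eigenvalues or hand-picked scalars and leave most of its structure unused. The authors propose Free-Energy Signatures (Fes), a spectral descriptor that treats each layer's attention Laplacian as a Hamiltonian and extracts its thermodynamic potentials (partition function, free energy, spectral entropy, heat capacity) together with the random-matrix-theory (RMT) spectral form factor. They prove three results: (i) Lipschitz stability of Fes under attention perturbation; (ii) an expressiveness result showing that Fes enriches finite spectral summaries and approximates moment-derived spectral functionals under explicit regularity and grid-resolution assumptions; and (iii) a finite-sample PAC bound on the AUROC of a training-free detector built from Fes. Across six open-weight LLMs and six benchmarks, a lightweight probe on Fes descriptors achieves the strongest aggregate AUROC among attention-spectral baselines, improving over LapEig by 6.5 AUROC points and over GoR-4 by 2.4 points on average, with no update to the underlying LLM. In the fully unsupervised setting, an RMT-deviation score reaches a mean AUROC of 0.71, a label-free but weaker detector. A complementary RMT analysis shows that correct generations exhibit more Wigner-Dyson-like spectral statistics, whereas hallucinations exhibit more Poisson-like statistics.
Link: https://arxiv.org/abs/2606.19404
Authors: Salim Khazem
Affiliations: Talan
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: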
Abstract:Hallucination detection in large language models (LLMs) is deployment-critical, and recent work shows that the spectrum of attention-derived graph Laplacians carries strong signal about reasoning quality. Prior spectral diagnostics, however, summarize the Laplacian spectrum by a handful of eigenvalues or hand-picked scalars, leaving most of its structure unused. We propose Free-Energy Signatures (Fes), a spectral descriptor that treats each layer's attention Laplacian as a Hamiltonian and extracts its thermodynamic potentials (partition function, free energy, spectral entropy, heat capacity) together with the random-matrix-theory (RMT) spectral form factor. We prove three results: (i) Lipschitz stability of Fes under attention perturbation; (ii) an expressiveness result showing that Fes enriches finite spectral summaries and approximates moment-derived spectral functionals under explicit regularity and grid-resolution assumptions; and (iii) a finite-sample PAC bound on the AUROC of a training-free detector built from Fes. Empirically, across six open-weight LLMs and six benchmarks, a lightweight probe on Fes descriptors achieves the strongest aggregate AUROC among attention-spectral baselines, improving over LapEig by +6.5 AUROC points and over GoR-4 by +2.4 points on average, while requiring no update to the underlying LLM. In the fully unsupervised setting, an RMT-deviation score achieves mean AUROC 0.71, providing a label-free but weaker detector. A complementary RMT analysis shows that correct generations exhibit more Wigner-Dyson-like spectral statistics, whereas hallucinations exhibit more Poisson-like statistics. The anonymized code and config are provided in the supplementary material.
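The thermodynamic quantities listed in the abstract have textbook definitions once a set of real eigenvalues is treated as energy levels. The sketch below applies those standard statistical-mechanics formulas to a precomputed spectrum. It is not the paper's Fes descriptor: the inverse temperature `beta`, the normalization of the spectral form factor, and the function names are assumptions, and the construction of the attention Laplacian itself is left out.

```python
import cmath
import math


def thermodynamic_signature(eigenvalues, beta=1.0):
    """Partition function, free energy, entropy and heat capacity of a spectrum."""
    weights = [math.exp(-beta * lam) for lam in eigenvalues]
    z = sum(weights)                      # partition function Z(beta)
    probs = [w / z for w in weights]      # Boltzmann weights
    mean_e = sum(p * lam for p, lam in zip(probs, eigenvalues))
    mean_e2 = sum(p * lam * lam for p, lam in zip(probs, eigenvalues))
    return {
        "partition_function": z,
        "free_energy": -math.log(z) / beta,
        "spectral_entropy": -sum(p * math.log(p) for p in probs if p > 0),
        "heat_capacity": beta ** 2 * (mean_e2 - mean_e ** 2),
        "mean_energy": mean_e,
    }


def spectral_form_factor(eigenvalues, t):
    """|sum_j exp(-i * lambda_j * t)|^2 / N^2 (one common normalization)."""
    n = len(eigenvalues)
    total = sum(cmath.exp(-1j * lam * t) for lam in eigenvalues)
    return abs(total) ** 2 / n ** 2


# Laplacian eigenvalues of a 3-node path graph, used as a stand-in spectrum.
spectrum = [0.0, 1.0, 3.0]
sig = thermodynamic_signature(spectrum, beta=1.0)
sff_at_zero = spectral_form_factor(spectrum, 0.0)
```

Evaluating these functions over a grid of `beta` and `t` values would give one descriptor vector per layer, which is the kind of input a lightweight probe could consume.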
[NLP-64] How Linear Is a Transformer Feed-Forward Block? Per-Block Linear Recoverability Is Learned, Not Architectural
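[Quick read]: This paper addresses how nonlinear a trained transformer feed-forward network (FFN) block actually is, something that has rarely been measured even though FFNs are usually treated as nonlinear stores of computation. The authors treat each FFN as a position-wise input-to-output map and split it into its exact closed-form least-squares linear approximation plus a residual; the held-out variance explained by the linear map defines the block's linear recoverability (R^2_lin), an optimiser-free measure of linearity. Across all twelve blocks of GPT-2, Pythia-160m, and llama-160m, R^2_lin is highly heterogeneous and non-monotone with depth, ranging from 0.3 to 0.99 between adjacent blocks. It is not set by the activation function, since the same-width GELU models GPT-2 and Pythia-160m have sharply different profiles, so recoverability is a learned property of individual trained blocks rather than an architectural one. A low-rank bilinear probe of the residual recovers only a few points of R^2, which indicates that the unrecovered computation is higher-order or distributed structure. The measure also serves as a targeted compression signal: recoverable blocks admit large single-layer replacements (GPT-2's early FFN at 8x fewer parameters for a perplexity increase of 0.77), while low-recoverability blocks flag where compression is unsafe. Finally, the paper points out a methodological pitfall: trained linear baselines can badly under-converge on ill-conditioned transformer activations, so the exact closed-form least-squares ceiling is reported throughout.
Link: https://arxiv.org/abs/2606.19379
Authors: Stuart Whipp
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 14 pages, 5 figures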
Abstract:Transformer feed-forward networks (FFNs) are often treated as nonlinear stores of computation, yet how nonlinear a trained FFN block actually is has rarely been measured. We treat each FFN as a position-wise input-to-output map and split it into the exact least-squares linear approximation plus a residual. The held-out variance the closed-form linear map explains defines a block's linear recoverability (R^2_lin), an optimiser-free measure of its linearity. Across all twelve blocks of GPT-2, Pythia-160m, and llama-160m, R^2_lin is highly heterogeneous and non-monotone with depth, ranging from near-linear (0.99) to strongly nonlinear (0.3) between adjacent blocks, and is not set by the activation function: same-width GELU models GPT-2 and Pythia-160m have sharply different profiles, so recoverability is a learned property of individual trained blocks, not an architectural one. A low-rank bilinear probe of the residual recovers only a few points of R^2, with gain uncorrelated with residual nonlinearity: the unrecovered computation is not a single position-wise product but higher-order or distributed structure. The measurement also serves as a targeted compression signal: recoverable blocks admit large single-layer replacements (GPT-2's early FFN at 8x fewer parameters for +0.77 perplexity), while low-recoverability blocks flag where this is unsafe. It further exposes a methodological pitfall: trained linear baselines can badly under-converge on ill-conditioned transformer activations, so we report the exact closed-form least-squares ceiling throughout.
ACM classes: I.2.6; I.2.7
DOI: https://doi.org/10.48550/arXiv.2606.19379
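The measurement described in the abstract (fit a linear map in closed form, then score it on held-out data) can be sketched with NumPy on a toy block. This is an illustration under assumptions, not the paper's code: the random `toy_ffn` stands in for a trained FFN, the fit includes a bias term, and R^2 is pooled over all output dimensions, none of which the abstract specifies.

```python
import numpy as np


def linear_recoverability(x_train, y_train, x_test, y_test):
    """Held-out R^2 of the closed-form least-squares affine fit y ~ x W + b."""
    def with_bias(x):
        return np.hstack([x, np.ones((x.shape[0], 1))])
    # Closed-form least squares; no optimiser or learning rate involved.
    w, *_ = np.linalg.lstsq(with_bias(x_train), y_train, rcond=None)
    residual = y_test - with_bias(x_test) @ w
    total = y_test - y_test.mean(axis=0, keepdims=True)
    return 1.0 - (residual ** 2).sum() / (total ** 2).sum()


def toy_ffn(x, w_in, w_out):
    """Hypothetical FFN: linear -> tanh-approximated GELU -> linear."""
    h = x @ w_in
    gelu = 0.5 * h * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (h + 0.044715 * h ** 3)))
    return gelu @ w_out


rng = np.random.default_rng(0)
w_in = rng.normal(size=(8, 32))
w_out = rng.normal(size=(32, 8))
x = rng.normal(size=(2000, 8))
y = toy_ffn(x, w_in, w_out)
y_linear = x @ w_in[:, :8]  # a purely linear map, as a sanity check

r2_ffn = linear_recoverability(x[:1500], y[:1500], x[1500:], y[1500:])
r2_linear = linear_recoverability(x[:1500], y_linear[:1500], x[1500:], y_linear[1500:])
```

The purely linear map should score essentially 1, while the GELU block scores lower. On a real model the inputs and outputs would be activations recorded at each FFN block.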
[NLP-65] Trustworthy Multi-Agent Systems: Mitigating Semantic Drift with the Argent Signaling Protocol
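[Quick read]: This paper addresses the inability of multi-agent LLM systems to distinguish repairable failures from failures that should be contained when an answer is bad. Current retry strategies treat every failure identically (try again), so human supervisors cannot tell whether a retry was warranted or whether generation should have been halted. The authors propose the Argent Signaling Protocol (ASP), a compact machine-readable header that accompanies every AI-generated response with structured quality signals: certainty (@C), grounding (@G), stochasticity (@S), and an assumption index that classifies the evidentiary basis of each claim. With these signals a controller can separate grounded-but-incomplete outputs from ungrounded ones and route them differently, repairing the former and blocking the latter. In standalone mode, on a document-grounded QA task, ASP raises pass rate on small local models: Qwen (0.8B) goes from 11.1% to 33.3% and Dobby (8B) from 33.3% to 44.4%, while on SmolLM3 (3B) ASP alternates between repair and containment per question. In multi-agent mode, an ASP sidecar blocks 100% of ungrounded upstream outputs from reaching the downstream decision agent.
Link: https://arxiv.org/abs/2606.19356
Authors: Anantha Sharma
Affiliations: Synechron Inc
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 17 pages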
Abstract:When multi-agent LLM systems produce bad answers, not all failures are equal: some answers are grounded in the right material but incomplete, while others are simply ungrounded and should be stopped. Current retry strategies treat both cases identically (try again and hope for the best), leaving human supervisors unable to tell whether a retry was warranted or whether the system should have halted instead. We introduce the Argent Signaling Protocol (ASP), a compact machine-readable header that accompanies every AI-generated response with structured quality signals: certainty (@C), grounding (@G), stochasticity (@S), and an assumption index that classifies the evidentiary basis of each claim. These signals enable a controller to distinguish repairable failures from containment failures and route each case differently. We evaluate ASP in two modes. In standalone mode, a 27-question document-grounded QA benchmark over the Array BioPharma/Ono license agreement compares baseline prompts against ASP-instrumented controller actions across three local GGUF models. On Qwen (0.8B), ASP improves pass rate from 11.1% to 33.3% and mean term coverage from 36.7% to 65.4%; on Dobby (8B), ASP produces 4 fail-to-pass recoveries, raising pass rate from 33.3% to 44.4%; on SmolLM3 (3B), ASP alternates between repair and containment per question. Aggregate improvement is meaningful (12/81 to 21/81 passes). In multi-agent mode, an ASP sidecar sits between a retrieval agent and a downstream decision agent; the sidecar blocks 100% of ungrounded upstream outputs from reaching the downstream agent (24/27 blocked, 0 ungrounded propagations).
DOI: https://doi.org/10.48550/arXiv.2606.19356
[NLP-66] Granularity-Regulated Adaptive Computational Efficiency for Optimal Verification in Test-Time Scaling
Quick read: This paper addresses how to choose the optimal verification granularity for generative AI inference under a given compute budget. Among existing methods, coarse-grained outcome reward models (ORMs) and fine-grained process reward models (PRMs) represent two extremes, yet neither is compute-optimal in every regime. The paper proposes GRACE (Granularity-Regulated Adaptive Computational Efficiency), a unified theoretical framework that for the first time models the optimal verification granularity as an explicit function of problem difficulty, verifier accuracy, and compute budget, and proves a phase transition: fine-grained verification dominates when the compute budget is large or the problem is hard, while coarse-grained verification is better for easy problems under a low budget. The theory unifies Best-of-N, beam search, and step-level Monte Carlo tree search (MCTS) within a single Pareto-optimality framework and motivates an adaptive granularity strategy that provably reaches the compute-performance Pareto frontier. Experiments on MATH-500, GSM8K, and AIME corroborate all four theoretical claims, with the adaptive strategy improving accuracy by up to 3.1% over fixed-granularity baselines at matched compute.
Link: https://arxiv.org/abs/2606.19354
Authors: Ardit Krasniqi,Luan Vejsiu,Elira Dervishi
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Test-time scaling (TTS) has emerged as a powerful paradigm for improving the reasoning performance of large language models (LLMs) by investing additional compute at inference time. A central component of TTS is the verifier, which selects or scores candidate solutions to guide the search process. While prior work has explored the benefit of verification, a fundamental question remains underexplored: what is the optimal granularity of verification under a given compute budget? Coarse-grained outcome reward models (ORMs) and fine-grained process reward models (PRMs) represent two extremes, yet neither alone achieves compute-optimality across all regimes. In this paper, we establish a unified theoretical framework, called GRACE (Granularity-Regulated Adaptive Computational Efficiency), that characterizes the optimal verification granularity as an explicit function of problem difficulty, verifier accuracy, and compute budget. We prove that there exists a phase transition: fine-grained verification dominates when either the compute budget is large or the problem is hard, whereas coarse-grained verification is preferred in the low-budget, easy-problem regime. Our theory unifies Best-of-N, beam search, and step-level MCTS within a single Pareto-optimality framework, and motivates an adaptive granularity strategy that provably achieves the compute-performance Pareto frontier. Empirical results on MATH-500, GSM8K, and AIME benchmarks corroborate all four theoretical claims, with our adaptive strategy outperforming fixed-granularity baselines by up to 3.1% accuracy at matched compute.
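To make the coarse-versus-fine trade-off concrete, the sketch below contrasts outcome-level selection (Best-of-N with one verifier call per candidate) with step-level selection (one call per reasoning step). It is a toy illustration, not GRACE itself: the function names, the representation of a candidate as a list of step strings, and the "weakest step" aggregation are assumptions made for the example.

```python
def best_of_n(candidates, outcome_score):
    """Coarse-grained: score each finished candidate once, keep the best."""
    best = max(candidates, key=outcome_score)
    return best, len(candidates)  # (selection, number of verifier calls)


def step_level_select(candidates, step_score):
    """Fine-grained: score every step; rank candidates by their weakest step."""
    calls = 0
    best, best_value = None, float("-inf")
    for steps in candidates:
        values = [step_score(s) for s in steps]
        calls += len(values)
        weakest = min(values)  # assumed aggregation rule
        if weakest > best_value:
            best, best_value = steps, weakest
    return best, calls


# Toy verifiers: the outcome scorer is fooled by length, the step scorer spots the bad step.
candidates = [["a", "bad", "c"], ["a", "b"]]
print(best_of_n(candidates, outcome_score=len))
print(step_level_select(candidates, step_score=lambda s: 0.1 if s == "bad" else 0.9))
```

The call counts returned alongside each selection show the cost side of the trade-off: the step-level pass spends five verifier calls where the outcome-level pass spends two.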
[NLP-67] Quantifying Aleatoric Uncertainty of In-Context Learning for Robust Measure of LLM Prediction Confidence ACL2026
Quick read: This paper addresses the limited reliability of large language model (LLM) predictions under in-context learning (ICL). ICL performance depends heavily on prompt design and on the model's ability to understand the context, so it is hard to tell whether a failed prediction stems from randomness inherent in the data (aleatoric uncertainty) or from the model's own limitations (epistemic uncertainty). Existing uncertainty decomposition methods are mostly designed for standard generation tasks and do not capture the dynamics specific to ICL. The paper introduces self-function vectors, built on Bayesian views and the mechanistic interpretability of ICL, which use internal model representations to model the latent concept learned from the in-context prompt. This allows aleatoric uncertainty to be estimated directly within a Bayesian framework, without relying on brittle input or decoding manipulations. Because no standardized benchmark exists, the authors also propose a rigorous evaluation protocol in which data is manipulated in controlled ways so that aleatoric uncertainty can be quantified precisely and separately from epistemic uncertainty. Experiments on synthetic tasks and real-world datasets show that the method measures uncertainty under ICL more reliably than existing methods and is useful in practice for trustworthiness-related applications such as hallucination detection. The work opens a direction connecting quantitative uncertainty analysis with a mechanistic understanding of model behavior.
Link: https://arxiv.org/abs/2606.19353
Authors: Jinseok Chung,Minkyoung Song,Hyunji Jung,Namhoon Lee
Affiliations: POSTECH
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted to ACL 2026
Abstract:In-Context Learning (ICL) allows LLMs to adapt to new tasks from a few demonstrations, but its reliability remains a concern: predictions are highly sensitive to both prompt design and the model’s ability to understand the context, obscuring whether failures arise from data properties or model limitations. Uncertainty decomposition, which separates aleatoric from epistemic sources, is particularly crucial in this setting, yet existing methods, designed for standard generation tasks, fail to capture the unique dynamics of ICL. To address this, we introduce the concept of self-function vectors, built upon Bayesian views and the mechanistic interpretability of ICL. These vectors leverage internal model representations to model the latent concept learned during in-context prompting, thereby enabling a direct estimation of aleatoric uncertainty within a Bayesian framework and circumventing the reliance on brittle input or decoding manipulations. Given the lack of established benchmarks and suitable evaluation protocols, we also propose the first rigorous evaluation protocol, in which data is manipulated in controlled ways so as to quantify aleatoric uncertainty precisely and separately from epistemic uncertainty. With this new evaluation framework, initially grounded in synthetic tasks for conceptual development and subsequently extended to real-world datasets, we show that our proposed methodology can measure uncertainty of LLM predictions made under ICL more reliably than existing alternative methods. Moreover, we show it can be used as a practical tool for trustworthiness-related applications, such as hallucination detection. Our findings pave a new direction for connecting the quantitative view of uncertainty with the mechanistic understanding of model behavior.
[NLP-68] Sign-Language Datasets at Scale: A Comprehensive Survey on Resources Benchmarks and Annotation Standards ACL2026
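Quick read: This paper addresses fragmented datasets, inconsistent annotations, and limited linguistic coverage, which constrain progress in sign-language recognition, translation, and production. It builds a comprehensive index of sign-language datasets covering 120 resources across 35 sign languages and analyzes key challenges such as modality imbalance, annotation granularity, and signer bias. To support standardization and reproducible research, it proposes a 24-field Sign-Language Datasheet and releases it in a public GitHub repository, providing a unified, scalable, and inclusive foundation for future sign-language technologies.
Link: https://arxiv.org/abs/2606.19352
Authors: Yiming Ni,Zhi-Qi Cheng,Jiayu Li,Wei Cheng
Affiliations: University of Washington
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to ACL 2026 Main. 27 pages, 5 figures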
Abstract:Sign languages are expressive visual languages used by Deaf and Hard-of-Hearing (DHH) communities. Despite substantial progress in sign-language recognition, translation, and production, advances remain constrained by fragmented datasets, inconsistent annotations, and limited linguistic coverage. Existing benchmarks often fail to reflect real-world communication needs, and systematic analyses of these limitations remain limited. In this survey, we present a comprehensive index of sign-language datasets, covering 120 resources across 35 sign languages. We analyze key challenges such as modality imbalance, annotation granularity, and signer bias, and outline considerations for future dataset design. We also introduce a 24-field Sign-Language Datasheet and release a public GitHub repository (this https URL) to support standardized documentation and reproducible evaluation. Overall, our work provides a unified and practical foundation for developing inclusive, robust, and scalable sign-language technologies in real-world applications.
[NLP-69] Detecting Hallucinations for Large Language Model-based Knowledge Graph Reasoning
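Quick read: This paper addresses hallucination in knowledge graph (KG) reasoning based on large language models (LLMs). Existing methods detect hallucinations by analyzing LLM internal states or by checking consistency with the retrieved context, but both overlook the structural information inherent in KGs, which limits detection performance. The key to the solution is LUCID, the first hallucination detection method for LLM-based KG reasoning frameworks, which jointly uses LLM attention scores, KG semantics, and structural features. Specifically, LUCID extracts node and edge features from attention weights and semantic similarities and fuses them with the KG topology through a graph neural network (GNN), capturing anomalous patterns in the reasoning process. Experiments on nine datasets show state-of-the-art performance against 15 baselines, demonstrating the method's effectiveness and robustness.
Link: https://arxiv.org/abs/2606.19351
Authors: Xinyan Zhu,Yaoqi Liu,Yue Gao,Huadong Ma,Cheng Yang,Chuan Shi
Affiliations: Beijing University of Posts and Telecommunications; Tsinghua University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: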
Abstract:Knowledge graph (KG) reasoning infers new knowledge from existing facts and is widely applied in question answering, recommendation, and decision support. With the rapid development of large language models (LLMs), LLM-based KG reasoning frameworks have become increasingly popular by leveraging retrieved KG information. However, hallucinations in LLMs remain a critical issue. Even when relevant KG knowledge is incorporated, models may still generate incorrect outputs, leading to misinformation and unreliable decisions. Existing hallucination detection methods either focus on LLM internal states or verify consistency with retrieved contexts, but both overlook the structural information in KGs, resulting in suboptimal performance. To address this gap, we propose LUCID, the first halLUcination deteCtIon method for LLM-based knowleDge graph reasoning frameworks. LUCID jointly leverages LLM attention scores, KG semantics, and structural information. Specifically, it extracts node and edge features from attention scores and semantic similarities, and integrates them with KG structure using a graph neural network. We also construct manually annotated benchmark datasets for evaluation. Experiments on nine datasets show that LUCID achieves state of the art performance compared to 15 baselines.
[NLP-70] Pruning via Causal Attribution Preserves Reasoning Performance in Large Language Models ICLR2026
Quick read: This paper addresses the high inference cost of large language models (LLMs) on multi-step reasoning tasks. Existing lightweight pruning methods mostly rely on correlational criteria such as weight magnitude or activations, which do not accurately measure how much an attention head actually contributes to reasoning. The paper proposes Causal Attribution Pruning (CAP), a training-free method that uses interventional measurements to assess each head's causal impact: each attention head is masked on a small calibration set and the expected performance degradation is estimated, yielding a head-level causal score. These scores are then mapped to weight-level importance values for the corresponding projection matrices, enabling fine-grained pruning. By capturing each head's functional contribution directly, CAP preserves more performance after pruning than magnitude-only or activation-based criteria. On ARC-Challenge at 20% sparsity, CAP achieves relative accuracy gains of up to 61% over Wanda. Evaluated on GSM8K, StrategyQA, and ARC-Challenge with Llama-3-8B-Instruct and Mistral-7B-Instruct, CAP outperforms the baseline in most configurations at moderate sparsity (10-20%), with especially large gains on ARC-Challenge for Llama-3. The results suggest that head-level causal attribution preserves downstream reasoning performance better than correlational criteria, although it remains limited by coarse MLP attribution at 50% sparsity.
Link: https://arxiv.org/abs/2606.19350
Authors: Amogh Sheth,Biruk Assefa,Yi Wen Huang,Andrew Lin,Yuhao Ge
Affiliations: Edison Academy Magnet School; Massachusetts Institute of Technology; State University of New York College at Plattsburgh; The University of Texas at Austin; Independent Researcher
Subjects: Computation and Language (cs.CL)
Comments: Accepted at the ICLR 2026 Workshop on LLM Reasoning. 13 pages, 2 figures
Abstract:Large language models (LLMs) excel at multi-step reasoning but incur substantial inference cost. We introduce Causal Attribution Pruning (CAP), a training-free method that identifies critical attention heads by measuring their causal impact on reasoning tasks and uses these head-level scores to guide fine-grained weight pruning. For each attention head, CAP estimates the expected performance degradation when the head is masked during forward passes on a small calibration set of reasoning problems. These causal scores are then converted into weight-level importance values for the corresponding projection matrices. Unlike magnitude-only or activation-based criteria, CAP’s interventional measurement directly captures each head’s functional contribution, yielding relative accuracy gains of up to 61% over Wanda on ARC-Challenge at 20% sparsity. We evaluate CAP on GSM8K, StrategyQA, and ARC-Challenge using Llama-3-8B-Instruct and Mistral-7B-Instruct at 10%, 20%, and 50% sparsity. At moderate sparsity (10-20%), CAP improves over Wanda in most model-benchmark configurations, with especially large gains on ARC-Challenge for Llama-3. Our results suggest that attention-head-level causal attribution can better preserve reasoning performance on downstream benchmarks than correlational pruning criteria at equivalent sparsity, while remaining limited by coarse MLP attribution at 50% sparsity.
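The head-masking measurement described in this abstract can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: `evaluate` is a hypothetical callback that returns calibration-set accuracy with one head masked (`None` means no masking), and the rule that scales each weight's magnitude by its head's score is an assumed mapping, since the abstract does not specify how head scores become weight-level importances.

```python
def head_causal_scores(evaluate, num_heads):
    """Causal score of a head = accuracy drop when that head is masked."""
    baseline = evaluate(None)  # no head masked
    return [baseline - evaluate(h) for h in range(num_heads)]


def weight_importance(weights_per_head, scores):
    """Assumed mapping: |weight| scaled by the (non-negative) score of its head."""
    return [
        [abs(w) * max(s, 0.0) for w in head_weights]
        for head_weights, s in zip(weights_per_head, scores)
    ]


# Toy calibration results: masking head 0 hurts accuracy, masking head 1 does not.
toy_accuracy = {None: 0.8, 0: 0.5, 1: 0.8}
scores = head_causal_scores(toy_accuracy.get, num_heads=2)
print(scores)
print(weight_importance([[1.0, -2.0], [3.0]], [0.5, 0.0]))
```

Weights with the lowest importance would be the first candidates for pruning; in this toy case every weight of head 1 gets importance zero because masking it changed nothing.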
[NLP-71] Where to Place the Query? Unveiling and Mitigating Positional Bias in In-Context Learning for Diffusion LLMs via Decoding Dynamics
Quick read: This paper addresses the poorly understood mechanism of in-context learning (ICL) in diffusion large language models (dLLMs), focusing on how poorly the trailing-query template inherited from autoregressive (AR) models fits dLLMs. Although bidirectional attention gives dLLMs spatial flexibility, current practice keeps the fixed query position of AR models and overlooks query position as a variable that affects generation quality. The paper shows that query position is a first-order variable in dLLMs and finds empirically that positional variance affects generation quality about as much as the semantic quality of the examples. This sensitivity stems from a spatial "Recency Effect" in attention flow and from task-dependent shifts in decoding trajectories. To assess confidence without ground-truth labels, the paper proposes Average Confidence, a metric that tracks stability across the iterative decoding process. Building on it, the authors introduce Auto-ICL, a training-free adaptive routing strategy that dynamically optimizes query placement and robustly approaches oracle performance across reasoning and perception tasks, establishing baselines for spatial ICL in dLLMs.
Link: https://arxiv.org/abs/2606.19349
Authors: Zhengheng Li,Panrui Li,Xuyang Liu,Puzhi Xia
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 9 figures, 4 tables
Abstract:While In-Context Learning (ICL) is extensively studied in Autoregressive (AR) LLMs, its mechanism within Diffusion Large Language Models (dLLMs) remains largely unexplored. Unlike AR models restricted by unidirectional causal masking, dLLMs intrinsically utilize bidirectional attention, offering extensive spatial flexibility for query placement. Unfortunately, current practices conventionally inherit AR-style trailing-query templates, often overlooking the structural paradigm shift. This paper presents a comprehensive analysis unveiling that query position is actually a first-order variable in dLLMs. Through empirical decoupling, we demonstrate that positional variance impacts generation quality on par with example semantic quality. Internally, this positional sensitivity stems from a spatial "Recency Effect" in attention flow and task-dependent shifts in decoding trajectories. To mitigate this instability without ground-truth labels, we reveal that traditional single-step confidence ($C_{\text{decoded}}$) fails in dLLMs. Instead, we propose Average Confidence ($\overline{C}$), a novel metric tracking the iterative decoding process. By establishing the foundational spatial ICL baselines, we introduce Auto-ICL, a training-free adaptive routing strategy that dynamically optimizes query placement, robustly approaching oracle performance across heterogeneous reasoning and perception tasks.
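The contrast between a single-step confidence and a confidence averaged over the decoding trajectory can be illustrated with a toy example. The abstract does not give the exact definition of Average Confidence, so this sketch assumes it is the plain mean of one confidence value per decoding step; the placement names, the traces, and the choice of the last step as the single-step score are all hypothetical.

```python
def average_confidence(step_confidences):
    """Mean confidence over all iterative decoding steps (assumed definition)."""
    return sum(step_confidences) / len(step_confidences)


def pick_by_average(traces):
    """Choose the query placement whose decoding trajectory is most confident on average."""
    return max(traces, key=lambda p: average_confidence(traces[p]))


def pick_by_last_step(traces):
    """Single-step alternative: look only at the final decoding step."""
    return max(traces, key=lambda p: traces[p][-1])


# Hypothetical per-step confidences for two query placements.
traces = {
    "leading": [0.5, 0.6, 0.95],   # shaky trajectory, confident final step
    "trailing": [0.8, 0.8, 0.9],   # stable trajectory throughout
}
print(pick_by_average(traces))
print(pick_by_last_step(traces))
```

In this toy case the two criteria disagree, which is the kind of situation where a trajectory-level metric carries information that a single-step score misses.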
[NLP-72] DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence
Quick read: This paper addresses the efficiency and performance bottlenecks of large models in long-context processing, in particular the high compute cost, memory footprint, and inference latency of million-token contexts. The key components are: (1) a hybrid attention architecture combining Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA), which substantially improves the efficiency of long-sequence modeling; (2) Manifold-Constrained Hyper-Connections (mHC), which improve on conventional residual connections and strengthen deep-layer expressiveness and training stability; and (3) the Muon optimizer, for faster convergence and more robust training. After pre-training on more than 32T high-quality, diverse tokens followed by a comprehensive post-training pipeline, the DeepSeek-V4 series supports million-token contexts efficiently while retaining strong performance. For example, in that setting DeepSeek-V4-Pro requires only 27% of the single-token inference FLOPs and 10% of the KV cache of DeepSeek-V3.2, making long-horizon tasks and test-time scaling more practical.
Link: https://arxiv.org/abs/2606.19348
Authors: DeepSeek-AI,Anyi Xu,Bangcai Lin,Bing Xue,Bingxuan Wang,Bingzheng Xu,Bochao Wu,Bowei Zhang,Chaofan Lin,Chen Dong,Chenchen Ling,Chengda Lu,Chenggang Zhao,Chengqi Deng,Chengyu Hou,Chenhao Xu,Chenze Shao,Chong Ruan,Conner Sun,Damai Dai,Daya Guo,Dejian Yang,Deli Chen,Donghao Li,Dongjie Ji,Erhang Li,Fang Wei,Fangyun Lin,Fangzhou Yuan,Feiyu Xia,Fucong Dai,Guangbo Hao,Guanting Chen,Guoai Cao,Guolai Meng,Guowei Li,Han Yu,Han Zhang,Hanwei Xu,Hao Li,Haofen Liang,Haoling Zhang,Haoming Luo,Haoran Wei,Haotian Yuan,Haowei Zhang,Haowen Luo,Haoyu Chen,Haozhe Ji,Hengqing Zhang,Honghui Ding,Hongxuan Tang,Huanqi Cao,Huazuo Gao,Hui Qu,Hui Zeng,J Yang,JQ Zhu,Jia Luo,Jia Song,Jia Yu,Jialiang Huang,Jialu Cai,Jian Liang,Jiangting Zhou,Jiasheng Ye,Jiashi Li,Jiaxin Xu,Jiewen Hu,Jieyu Yang,Jin Chen,Jin Yan,Jingchang Chen,Jingli Zhou,Jingting Xiang,Jingyang Yuan,Jingyuan Cheng,Jingzi Zhou,Jinhua Zhu,Jiping Yu,Joseph Sun,Jun Ran,Junguang Jiang,Junjie Qiu,Junlong Li,Junmin Zheng,Junxiao Song,Kai Dong,Kaige Gao,Kang Guan,Kexing Zhou,Kezhao Huang,Kuai Yu,Lean Wang,Lecong Zhang,Lei Wang,Leyi Xia,Li Zhang,Liang Zhao,Lihua Guo
Affiliations: DeepSeek-AI
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:We present a preview version of DeepSeek-V4 series, including two strong Mixture-of-Experts (MoE) language models – DeepSeek-V4-Pro with 1.6T parameters (49B activated) and DeepSeek-V4-Flash with 284B parameters (13B activated) – both supporting a context length of one million tokens. DeepSeek-V4 series incorporate several key upgrades in architecture and optimization: (1) a hybrid attention architecture that combines Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) to improve long-context efficiency; (2) Manifold-Constrained Hyper-Connections (mHC) that enhance conventional residual connections; (3) and the Muon optimizer for faster convergence and greater training stability. We pre-train both models on more than 32T diverse and high-quality tokens, followed by a comprehensive post-training pipeline that unlocks and further enhances their capabilities. DeepSeek-V4-Pro-Max, the maximum reasoning effort mode of DeepSeek-V4-Pro, redefines the state-of-the-art for open models, outperforming its predecessors in core tasks. Meanwhile, DeepSeek-V4 series are highly efficient in long-context scenarios. In the one-million-token context setting, DeepSeek-V4-Pro requires only 27% of single-token inference FLOPs and 10% of KV cache compared with DeepSeek-V3.2. This enables us to routinely support one-million-token contexts, thereby making long-horizon tasks and further test-time scaling more feasible. The model checkpoints are available at this https URL.
[NLP-73] How LLMs Fail and Generalize in RTL Coding for Hardware Design? EMNLP2026
Quick read: This paper examines a fundamental bottleneck that large language models (LLMs) face when translating sequential programming priors into the parallel temporal logic of hardware design. The core difficulty is that, when generating register-transfer level (RTL) code, optimization readily eliminates syntax errors but does not overcome deeper functional errors, especially unsolvable functional errors, which reflect persistent gaps in the models' knowledge of low-level hardware design. The way forward is to move beyond alignment-oriented improvements and focus on the model's intrinsic reasoning ability, that is, a deeper understanding of hardware behavior modeling and formal verification, rather than relying on surface-level measures such as test-time sampling or compilation fixes. The study finds that the capability ceiling of current LLM-based hardware generation is set by the knowledge covered in pretraining and cannot be raised by adding compute or optimization tricks, so future work needs to look more closely at the models' causal reasoning and formal thinking.
Link: https://arxiv.org/abs/2606.19347
Authors: Guan-Ting Liu,Chao-Han Huck Yang,Chenhui Deng,Zhongzhi Yu,Brucek Khailany,Yu-Chiang Frank Wang
Affiliations: NVIDIA Research
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
Comments: Preview, under submission for EMNLP 2026
Abstract:Translating sequential programming priors into the parallel temporal logic of hardware design remains a crucial bottleneck for large language models (LLMs). To investigate this, we introduce a new error taxonomy grounded in problem solvability, inspired by cognitive theory. Our taxonomy categorizes failures into syntactic, semantic, solvable functional, and unsolvable functional types. Evaluations reveal a strict empirical ceiling on the VerilogEval benchmark, as frontier models plateau at a 90.8% initial pass rate. These plateaus are defined by unsolvable functional errors, exposing persistent knowledge gaps immune to test-time compute scaling. Furthermore, we expose a striking surface convergence gap: optimization readily eliminates syntax errors but concurrently exacerbates deeper functional failures. Our findings demonstrate that alignment techniques merely teach models to compile. While repeated sampling strategies can patch solvable errors, register-transfer level (RTL) coding capacity remains strictly bounded by pretraining knowledge. Addressing challenges in the current LLM-based hardware generation pipeline requires more studies in model reasoning rather than alignment interventions.
[NLP-74] Disentangling Linguistic Relatedness from Task Alignment in Cross-Lingual Transfer
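Quick read: This paper investigates whether cross-lingual transfer involves knowledge transfer specific to Semitic languages. The authors fine-tune seven large language models (4B to 671B parameters) on Arabic and evaluate zero-shot reading comprehension on Semitic languages and non-Semitic control languages, testing whether models can exploit linguistic structure or semantic knowledge from the source language to improve target-language performance. No Semitic-specific transfer effect appears in either dense or Mixture-of-Experts (MoE) architectures: models with weak baselines improve markedly across all languages, while models with strong baselines gain only marginally regardless of language family. A chain-of-thought ablation further shows that the models that benefit most from fine-tuning gain a comparable amount from inference-time chain-of-thought reasoning, suggesting that both mechanisms address task-format alignment rather than genuine cross-lingual knowledge transfer. The key conclusion is that the cross-lingual transfer observed in current large models comes mainly from task-format adaptation, not from deeper linguistic or semantic knowledge transfer.
Link: https://arxiv.org/abs/2606.19346
Authors: Ahmed Haj Ahmed,Ruochen Zhang,Alvin Grissom II
Affiliations: Haverford College; Brown University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: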
Abstract:We study cross-lingual transfer by fine-tuning seven large language models (4B–671B parameters) on Arabic and evaluating zero-shot reading comprehension on Semitic languages and non-Semitic controls. Across dense and Mixture-of-Experts architectures, we find no evidence of Semitic-specific transfer: models with weak baselines improve dramatically across all languages, while strong-baseline models show only marginal gains regardless of language family. A chain-of-thought ablation reinforces this finding – the same models that benefit most from fine-tuning benefit equally from inference-time reasoning, suggesting both mechanisms address task-format alignment rather than cross-lingual knowledge transfer.
[NLP-75] Ensembles of Large Language Models for Identifying EQ-5D Studies in PubMed Based on Their Abstracts
Quick read: This paper addresses the resource-intensive, inefficient, and inconsistent manual study screening in systematic literature reviews (SLRs) caused by the surge in scientific publications, and specifically the automated detection of studies reporting health-related quality of life (HRQoL) data such as EQ-5D, which requires clinical judgment to identify. It proposes a multi-phase framework that integrates few-shot prompting, weighted ensemble aggregation, and a soft stacking meta-classifier, combining the predictions of several large language models (LLMs) from Google's Gemini and Gemma families to improve detection performance. A weighted ensemble of gemini-2.5-pro, gemma-3-12b, and gemma-3-27b reaches a weighted F1-score of 0.74 and an accuracy of 0.74 on the expert-labeled dataset, exceeding any single model. The ensemble improves the balance between precision and recall, while soft stacking further improves reliability and interpretability. Feature analysis shows that the probabilities output by the models play a key role in guiding the final predictions, supporting the effectiveness, reliability, and scalability of ensemble-based LLM setups for automated screening of biomedical literature.
Link: https://arxiv.org/abs/2606.19345
Authors: Zhyar Rzgar K. Rostam,Márta Péntek,János Tibor Czere,Zsombor Zrubka,László Gulácsi,Gábor Kertész
Affiliations: University of Debrecen; Budapest University of Technology and Economics
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 6 pages, 7 tables, 8 equations
Abstract:The rapid increase in scientific publications makes manual study screening in systematic literature reviews (SLRs) increasingly resource-consuming, inefficient, and inconsistent. Classifying studies that clearly report health-related quality-of-life results, such as EQ-5D data, requires a high level of clinical interpretation and poses challenges for human reviewers. This study investigates the use of Google’s Gemini and Gemma large language models (LLMs) in automating EQ-5D detection in the PubMed biomedical database based only on published abstracts. A multi-phase framework is proposed that integrates few-shot prompting, weighted ensemble aggregation, and a soft stacking meta-classifier. Nine LLMs are evaluated on a dataset of PubMed studies manually labeled by two experts regarding EQ-5D reporting. The weighted ensemble of gemini-2.5-pro, gemma-3-12b, and gemma-3-27b obtained a 0.74 weighted F1-score and 0.74 accuracy, exceeding individually attained results. The ensembling of top-performing models improved the balance between precision and recall compared to individual models, while the soft stacking approach provided greater reliability and interpretability. Feature analysis shows that the probability results from the models are important in guiding the final predictions. The findings suggest that an ensemble-based LLM setup is a reliable and scalable approach for automating screening in biomedical research.
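A weighted ensemble of per-model probabilities, as mentioned in this abstract, can be sketched as a normalized weighted average followed by a threshold. This is a generic illustration rather than the paper's procedure: the weights, the probabilities, and the 0.5 decision threshold are made-up values, and the abstract does not say how the authors chose their weights.

```python
def weighted_ensemble(probs, weights, threshold=0.5):
    """Combine per-model probabilities that an abstract reports EQ-5D data.

    probs   -- one probability per model
    weights -- one non-negative weight per model (illustrative values)
    Returns (ensemble probability, binary decision).
    """
    total = sum(weights)
    p = sum(w * q for w, q in zip(weights, probs)) / total
    return p, p >= threshold


# Three hypothetical model outputs; the first model gets twice the weight.
p, label = weighted_ensemble(probs=[0.9, 0.4, 0.6], weights=[2.0, 1.0, 1.0])
print(round(p, 3), label)
```

A soft stacking meta-classifier would replace the fixed weights with a model trained on the same probability vector, which fits the abstract's remark that the model probabilities guide the final prediction.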
[NLP-76] Exposing the Unsaid: Visualizing Hidden LLM Bias through Stochastic Path Aggregation
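Quick read: This paper addresses the difficulty of evaluating representational and syntactic biases in large language models (LLMs), in particular the limitation that conventional auditing, which relies on inspecting a single output or on static automated metrics, cannot capture biases hidden in low-probability generation paths. Its core contribution is TreeTracer, a visual analytics tool that uses a systematic perturbation analysis pipeline: it replaces ontology-defined terms in the input prompt, aggregates hundreds of stochastic generations into a syntax-aligned hierarchical structure, and performs classification-aware node merging with an auxiliary language model. The result is rendered as a custom Sankey diagram. The design supports juxtaposing two ontology-driven trees so that systematic bias across semantic contexts can be identified directly. Because a visualization reflects only part of the model's behavior and could be misread, the system also applies contrastive inference to compute and display counterfactual token probabilities across contexts, making bias detection more reliable. Case studies comparing the unaligned GPT-2 XL with the constitutionally aligned Apertus models show that the approach exposes hidden representational harms such as counterfactual pronoun suppression and conversational marginalization of individuals. A preliminary user study indicates that the aggregated comparative interface reduces cognitive load and helps analysts detect systemic biases.
Link: https://arxiv.org/abs/2606.19344
Authors: Matteo Pelossi,Rita Sevastjanova,Thilo Spinner,Mennatallah El-Assady
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 14 pages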
Abstract:Large Language Models (LLMs) exhibit representational and syntactic biases that are difficult to evaluate due to the stochastic nature of text generation. Standard auditing methods rely on a single output inspection or static automated metrics. These approaches obscure the underlying probability distributions and fail to capture biases hidden in lower-probability generation branches. This paper introduces TreeTracer, a visual analytics tool designed to evaluate LLM bias through aggregated comparison. Using a systematic perturbation analysis pipeline, the tool replaces ontology-defined terms in each input prompt, aggregates hundreds of stochastic generations into a syntax-aligned hierarchical structure, and then performs classification-aware node merging with an auxiliary language model. The resulting structure is visualized through a custom Sankey diagram. By juxtaposing two ontology-driven trees, the workspace enables direct comparison between semantic contexts and supports systematic bias detection. Because any visualization reflects only a subset of the model’s learned behavior, the system further applies contrastive inference to compute and directly display counterfactual token probabilities across contexts, reducing the risk of misinterpreting the presence of bias. We validate the workspace through case studies comparing an unaligned baseline model GPT-2 XL against the constitutionally aligned Apertus models. The visual aggregation successfully exposes hidden representational harms, such as counterfactual pronoun suppression and conversational marginalization of individuals. A preliminary user study confirms that the aggregated comparative interface reduces cognitive load and effectively supports analysts in detecting systemic biases.
[NLP-77] PASQA: Pitch-Accent-Focused Speech Quality Assessment Model Trained on Synthetic Speech with Accent Errors INTERSPEECH2026
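Quick read: This paper addresses the insensitivity of existing speech quality assessment models to localized pitch-accent errors when they predict utterance-level naturalness mean opinion scores (MOS). Conventional models often fail to capture the severity of pitch-accent errors and their effect on perceived quality, which distorts evaluation across speakers and error types. The paper proposes Pitch-Accent-focused Speech Quality Assessment (PASQA), which explicitly models pitch-accent correctness. The key steps are to build a controlled Japanese accent-error dataset with an accent-controllable text-to-speech system and to compute a pseudo accent-quality score from the accent-error rate as the training target. The model uses self-supervised representations, mora-conditioned fusion, a ranking loss, an auxiliary accent-error localization task, and speaker-invariant training. Experiments show that conventional models fail to preserve the ordering by accent-error severity, whereas PASQA achieves high ordering accuracy on both seen and unseen speakers and agrees more strongly with human accent-correctness judgments.
Link: https://arxiv.org/abs/2606.20137
Authors: Masaya Kawamura,Yuma Shirahata,Kentaro Mitsui,Reo Shimizu
Affiliations: LY Corporation
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
Comments: Accepted to INTERSPEECH 2026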
Abstract:Existing mean opinion score (MOS) prediction models typically predict utterance-level naturalness MOS and can be insensitive to localized pitch-accent errors. We propose Pitch-Accent-focused Speech Quality Assessment (PASQA), which explicitly targets pitch-accent correctness. To train our model, we construct a controlled Japanese accent-error dataset by changing accent patterns using an accent-controllable text-to-speech system, and compute a pseudo accent-quality score from the accent-error rate. PASQA builds on self-supervised representations and employs mora-conditioned fusion, ranking loss, an auxiliary accent-error localization task, and speaker-invariant training. Experiments show that conventional models fail to preserve the ordering by accent-error severity, whereas PASQA achieves high ordering accuracy on both seen and unseen speakers. Further, PASQA shows stronger agreement with human accent-correctness judgments. The code is available at this https URL.
[NLP-78] Investigating Human-Model Discrepancies in Speech Quality Assessment via Acoustic and Prosodic Perturbations INTERSPEECH2026
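Quick read: This paper asks whether the mean opinion score (MOS) prediction models widely used in text-to-speech (TTS) research can capture perceptual quality differences beyond acoustic fidelity. It runs controlled perturbation experiments that introduce three kinds of changes, namely acoustic degradation, prosodic errors, and manipulation of speaker-specific characteristics such as fundamental frequency (F0) and speaking rate, and compares human perception with model predictions. Most MOS prediction models track acoustic degradation well but are insensitive to prosodic errors that cause large drops in subjective scores. For speaker characteristics the models show a double dissociation: they have strong mean-F0 biases that human listeners do not share, yet they are insensitive to changes in speaking rate and F0 variability that humans notice. These results expose limitations of scalar MOS prediction in representing perceived speech quality, especially for differences beyond the acoustic level.
Link: https://arxiv.org/abs/2606.19951
Authors: Masato Takagi,Masaya Kawamura,Reo Shimizu,Yuma Shirahata
Affiliations: Nagoya Institute of Technology; LY Corporation
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
Comments: Accepted to INTERSPEECH 2026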
Abstract:Mean opinion score (MOS) prediction models are widely used as proxy metrics in text-to-speech (TTS) research, yet their ability to capture quality differences beyond acoustic fidelity remains unclear. We investigate this via controlled perturbations on speech: acoustic degradation, prosodic errors, and manipulation of speaker-specific characteristics such as pitch and speaking rate. We obtained MOS predictions for these speech samples from both human listeners and the model, and analyzed the differences in their perceptual characteristics. Results show that most models track acoustic degradation well, while all are insensitive to prosodic errors despite large subjective score drops. For speaker characteristics, models exhibit a double dissociation: strong mean fundamental frequency (F0) biases absent in human ratings, yet insensitivity to speaking rate and F0 variability that humans notice. These findings highlight limitations of scalar MOS prediction beyond acoustic fidelity.
Information Retrieval
[IR-0] Structuring and Tokenizing Distributed User Interest Context for Generative Recommendation
Link: https://arxiv.org/abs/2606.20554
Authors: Ruizhong Qiu,Yinglong Xia,Dongqi Fu,Hanqing Zeng,Ren Chen,Xiangjun Fan,Hong Li,Hong Yan,Hanghang Tong
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Generative recommendation is an emerging paradigm that has shown promise in industrial recommendation systems, aiming to predict users’ next interactions from their historical behaviors. At the core of generative recommendation lies item tokenization, which bridges item semantics and recommendation models. However, existing methods often struggle to effectively organize and inject complex user-behavioral and item-semantic contexts into recommendation models simultaneously. On the one hand, existing graph-based integration methods, such as graph serialization and graph neural networks, either suffer from scalability issues or exploit only local graph information. On the other hand, existing semantic tokenization methods typically rely on heuristics and lack explicit supervision signals, which may lead to inaccurate or suboptimal semantic representations. To address these limitations in user interest context modeling, we propose G2Rec, a scalable framework that unifies holistic graph-based user co-engagement modeling with semantic tokenization for industrial-scale generative recommendation. Overall, G2Rec enables recommendation models to capture holistic and semantically grounded user interest prototypes without requiring ground-truth user interests, thereby providing more comprehensive and accurate modeling of user behavior contexts in industrial sequential recommendation. Online deployment across product surfaces and extensive experiments on public datasets demonstrate the superiority of G2Rec over existing methods.
[IR-1] Easy Reads: A Python program for making Scientific Papers on arXiv more Reader Friendly and Accessible
Link: https://arxiv.org/abs/2606.20550
Authors: Vishal Verma
Subjects: Digital Libraries (cs.DL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
Comments: 9 pages. Open-source software project available at: this https URL
Abstract:Scientific papers are frequently dense and characterized by features such as small fonts and line spacing, double columns of text, and tightly arranged figures. While these features make papers more compact, they can hinder readability, make them less accessible, and can strain the reader. arXiv is a premier open-access repository for scientific papers across different fields and is used extensively by researchers, including those in the physics and astrophysics communities. Easy Reads is an automated, end-to-end, open-source Python program that helps address the stated challenge by making papers from arXiv more reader-friendly and accessible. Easy Reads can automatically fetch a paper from arXiv via its URL and work with the source TeX file to allow custom formatting of the paper features, primarily the font size, and the number of columns used. The main goal of Easy Reads is to facilitate ease of reading of scientific papers.
[IR-2] ELVA: Exploring Ranking-Driven Universal Multimodal Retrieval ECCV2026
Link: https://arxiv.org/abs/2606.20280
Authors: Yuhan Liu, Pei Fu, Hang Li, Yukun Qi, Chao Jiang, Jingwen Fu, Zhen Liu, Bin Qin, Zhenbo Luo, Jian Luan, Jingmin Xin
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: Accepted by ECCV 2026
Abstract:Leveraging Multimodal Large Language Models (MLLMs) via contrastive learning has become a mainstream paradigm for improving the performance of Universal Multimodal Retrieval (UMR). However, previous works have ignored the grain blindness when adapting the contrastive paradigm into retrieval tasks. Grain blindness refers to the tendency of the model to overlook grain-level information contained in the query, which is crucial for effectively handling complex queries. This stems from contrastive learning treating samples as a binary classification (positive/negative), while ignoring the different information carried by each negative sample. To address this, we argue that negatives should be treated differently according to their similarity to the positive sample, enabling the model to learn distinct grain information from each negative. In this paper, we introduce a simple but effective framework, called ELVA, a novel rule-based RL framework that mitigates grain blindness through ranking-driven MLLMs. 1) Instead of relying on reward models, we extend Reinforcement Learning with Verifiable Rewards (RLVR) to retrieval tasks, allowing the model to explore new ranking behaviors without explicit ranking labels. 2) By utilizing rule-based rewards, our approach jointly optimizes the ranking of negative samples while enlarging the similarity gap between positive and negative. To more precisely measure grain blindness, we further introduce MRBench, a new benchmark specifically designed for multi-grain query scenarios. ELVA achieves state-of-the-art results across standard retrieval benchmarks, and its notable 13.1% improvement on MRBench further demonstrates its effectiveness in alleviating grain blindness. 
[IR-3] ScholarQuest: A Taxonomy-Guided Benchmark for Agentic Academic Paper Search in Open Literature Environments
Link: https://arxiv.org/abs/2606.20235
Authors: Tingyue Pan, Mingyue Cheng, Daoyu Wang, Yitong Zhou, Jie Ouyang, Qi Liu, Enhong Chen
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Academic paper search is a core step in scientific research, and LLM-based search agents are emerging as a promising paradigm for iterative, intent-driven literature exploration. However, existing benchmarks are insufficient for systematically evaluating agentic academic search under realistic open literature environments. We propose ScholarQuest, a large-scale, taxonomy-guided benchmark for agentic academic paper search. ScholarQuest is constructed from over 1,000 computer science topics and four representative research intents, including method-oriented, setting-anchored, comparison-based, and scope-controlled queries. It further provides scalable answer construction and a shared retrieval backend ScholarBase for reproducible evaluation. Benchmarking results show that agentic methods outperform single-shot retrieval baselines, yet the best-performing agent only achieves 0.314 Recall@100 and 0.355 Recall@All, indicating substantial room for improvement. In addition, analyses of search efficiency, intent-level robustness, and failure cases further highlight the benchmark’s ability to provide multi-dimensional evaluation signals for academic paper search agents.
[IR-4] When Does Streaming Tool Use Help? Characterizing Tool-Intent Stabilization in Streaming Retrieval-Augmented Generation
Link: https://arxiv.org/abs/2606.20113
Authors: Elroy Galbraith
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:
Abstract:Streaming Retrieval-Augmented Generation (Streaming RAG) reduces user-perceived latency by issuing tool queries in parallel with ongoing user input, before the utterance is complete. Reported gains are aggregate, yet the mechanism’s benefit is fundamentally query-intrinsic: speculation can only help when the correct tool query becomes determinable before the user stops speaking or typing. We isolate and measure this property – tool-intent stabilization, the point in the input stream at which a speculative query’s retrieval converges to the answer-bearing result. On the CRAG benchmark (1371 validation questions) we (i) measure the distribution of stabilization, (ii) derive a model-agnostic bound H on the portion of tool latency that can be hidden behind the user’s remaining input, as a function of tool latency L and input cadence δ, (iii) validate against a working streaming pipeline that realized savings meet or exceed this bound, and (iv) identify which query properties predict early versus late stabilization. The study requires no model training and runs on commodity CPU hardware. We find that at a realistic operating point (L=600ms, δ=3w/s, θ=0.8), 73.9% of queries across the full benchmark admit substantial latency hiding – a blended figure that mixes sufficiency stabilization on the 21.3% of questions where gold evidence is verbatim-present and BM25-retrievable (95.2% streamable on this favorable slice) with a grounding-free top-1-settling fallback on the remainder. On the favorable slice, φ_suf is bracketed to [0.26, 0.281] by exact and relaxed grounding – both early. Question type produces a significant but coarse early/late split (Kruskal-Wallis p=0.017, ε²=0.04), directly informing when a learned speculative trigger is worth its cost.
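The abstract does not state the form of the bound H, so the following is only a back-of-the-envelope sketch under our own assumptions: the stabilization point is treated as a fraction `phi` of the query length, the user types or speaks at a constant cadence, and the hidden latency is capped both by the tool latency and by the time the user still needs to finish. The function name and parameters are hypothetical.

```python
def hidden_latency_ms(n_words, phi, tool_latency_ms, words_per_s):
    """Assumed model (not the paper's formula): latency hidden behind remaining input.

    n_words: length of the full user query, in words
    phi: fraction of the query consumed when the tool query stabilizes
    tool_latency_ms: tool latency L
    words_per_s: input cadence (delta in the abstract)
    """
    remaining_ms = 1000.0 * (1.0 - phi) * n_words / words_per_s
    # A speculative call cannot hide more latency than the tool has,
    # nor more than the time the user still spends finishing the input.
    return min(tool_latency_ms, remaining_ms)


# Operating point from the abstract (L = 600 ms, 3 words/s); the 12-word query is made up.
early = hidden_latency_ms(n_words=12, phi=0.26, tool_latency_ms=600, words_per_s=3)
late = hidden_latency_ms(n_words=12, phi=0.90, tool_latency_ms=600, words_per_s=3)
```

Under these assumptions, a query that stabilizes a quarter of the way through hides the full 600 ms, while one that stabilizes at 90% of its length hides only about 400 ms.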
[IR-5] Generative Engine Optimization at Scale: Measuring Brand Visibility Across AI Search Engines
Link: https://arxiv.org/abs/2606.20065
Authors: Pratyush Kumar (Ranqo)
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 14 pages, 4 tables; v1.0 preprint
Abstract:People increasingly get answers straight from AI search engines like ChatGPT, Claude, Perplexity, and Gemini rather than scrolling search results. Brands that once focused on search engine optimization (SEO) must now optimize for how these engines represent, cite, and recommend them – a shift variously called Generative Engine Optimization (GEO), Answer Engine Optimization (AEO), and AI Search Visibility. We treat AEO and AI Visibility as part of GEO, and study how to measure brand visibility across AI engines: what they value when they cite a brand, which sources they rely on, and what content large language models surface. The hard case is everyone outside the already-authoritative top brands – SMEs, D2C brands, creators, and early-stage startups. We analyze 100K+ prompt responses across 100+ brands tracked on Ranqo between March and May 2026. First visibility runs form a clear three-tier brand-stature ladder: global household names (e.g., Stripe, Nike) appear in 73% of relevant AI answers on their first run; established mid-market and regional brands (e.g., Olipop, Klaviyo) in 44%; niche and small brands in just 11% – about 30 percentage points per step. When engines cite sources, about 78% go to corporate websites; among non-corporate sources YouTube leads, ahead of Reddit, editorial media, and Wikipedia. The highest-leverage page is the ranked “best-of” listicle, the most-cited content format at about 21% of all citations. Sentiment is the unstable signal: whether a brand is framed positively or negatively flips about 6.7 times more often than whether it is mentioned at all. These findings provide a first large-scale baseline for measuring GEO: AI brand visibility can be measured, differs by platform, and varies strongly by brand maturity. We close by proposing seven v1.1 protocols to test whether specific recommendations can causally improve AI visibility. 
ACM classes: H.3.3
[IR-6] PACMS: Submodular Context Selection as a Pluggable Engine for LLM Agents
Link: https://arxiv.org/abs/2606.20047
Authors: Manu Ghulyani, Arunabh Singh, Karan Bharadwaj, Ankit Nath, Suranjan Goswami
Subjects: Information Retrieval (cs.IR)
Comments:
Abstract:Conversational and tool-using LLM agents operate over a context window that fills from several directions simultaneously. As a session proceeds, the agent accumulates user and assistant turns, entries drawn from a persistent memory store, and often largest of all, the verbatim outputs of tool calls such as file reads, search results, and API responses. Once the cumulative context exceeds the model’s token budget, the framework must decide what to keep. The prevailing mechanism is recency truncation, sometimes paired with periodic summarization. This is topic-blind: a fact established early in a session is discarded simply because it is old, even when the current user query is about exactly that fact; conversely, verbose but irrelevant recent material is retained. Agents that must recall information across many turns, the defining case for memory, are precisely where recency truncation fails. Existing alternatives sit outside the agent’s assembly step. Retrieval augmented generation fetches external documents into the prompt but does not arbitrate the agent’s already-present pooled context. Context-compression methods reduce token count by rewriting or pruning text, but operate query-blind and lossily. Neither treats memory entries, conversation turns, and tool outputs as a single candidate pool to be selected from by relevance at the moment the prompt is assembled.
[IR-7] Stellar: Scalable Multimodal Document Retrieval for Natural Language Queries
Link: https://arxiv.org/abs/2606.19960
Authors: Yuxiang Guo, Zhonghao Hu, Yuren Mao, Yuhang Liu, Congcong Ge, Xiaolu Zhang, Jun Zhou, Yunjun Gao
Subjects: Information Retrieval (cs.IR)
Comments:
Abstract:Multimodal document retrieval–selecting the most relevant multimodal document from a large corpus to answer a natural language query–plays an essential role in Retrieval-Augmented Generation (RAG) systems. State-of-the-art methods represent each document and query with multiple token-level embeddings and use late interaction to achieve high effectiveness. However, such multi-vector representations incur substantial memory overhead during retrieval, leading to poor scalability and hindering real-world deployment. In this paper, we present Stellar, a scalable multimodal document retrieval framework that stores token-level document embeddings on disk and loads only a small set of candidate embeddings into memory for late interaction. Stellar comprises two key components: (i) Lexical Representation-based Filtering (LRF), which fine-tunes a Multimodal Large Language Model (MLLM) as a sparse encoder to produce high-quality lexical representations, enabling efficient and effective document filtering to significantly reduce the candidate set; (ii) Efficient Disk-backed Late Interaction (DLI), which designs an on-disk token embedding storage layout guided by a balanced clustering algorithm, and dynamically loads only the necessary token embeddings into memory using a simple yet effective cost model. Extensive experiments on four real-world benchmarks and a newly presented large-scale dataset demonstrate that Stellar reduces memory overhead and query latency by 1-2 orders of magnitude compared to existing methods without compromising retrieval effectiveness.
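The filter-then-rerank flow described above can be illustrated with a toy sketch. It assumes that late interaction means ColBERT-style MaxSim scoring, that the lexical filter is a sparse dot product, and that `load_tokens` stands in for the disk read; none of these names come from the paper, and the clustered storage layout and cost model are not modeled.

```python
def sparse_dot(query_terms, doc_terms):
    """Dot product between two sparse term-weight dictionaries."""
    return sum(w * doc_terms.get(term, 0.0) for term, w in query_terms.items())


def maxsim(query_vecs, doc_vecs):
    """Late interaction: sum over query tokens of the best-matching document token."""
    return sum(
        max(sum(a * b for a, b in zip(qv, dv)) for dv in doc_vecs)
        for qv in query_vecs
    )


def retrieve(query_terms, query_vecs, sparse_index, load_tokens, k_filter=2):
    # Stage 1: cheap lexical filtering keeps only k_filter candidates.
    candidates = sorted(
        sparse_index,
        key=lambda doc_id: sparse_dot(query_terms, sparse_index[doc_id]),
        reverse=True,
    )[:k_filter]
    # Stage 2: token embeddings are loaded only for the surviving candidates.
    scored = [(maxsim(query_vecs, load_tokens(doc_id)), doc_id) for doc_id in candidates]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]


# Toy corpus: three documents with sparse terms and token-level embeddings.
sparse_index = {"a": {"chart": 1.0, "sales": 0.5}, "b": {"chart": 0.2}, "c": {"cat": 1.0}}
token_store = {"a": [[1.0, 0.0], [0.0, 1.0]], "b": [[0.5, 0.5]], "c": [[-1.0, 0.0]]}
loaded = []  # records which documents were "read from disk"


def load_tokens(doc_id):
    loaded.append(doc_id)
    return token_store[doc_id]


ranking = retrieve({"chart": 1.0}, [[1.0, 0.0], [0.0, 1.0]], sparse_index, load_tokens)
```

In this toy run, document `c` is filtered out lexically, so its token embeddings are never loaded; that is the memory saving the abstract attributes to filtering before late interaction.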
[IR-8] Multi-Agent Transactive Memory
Link: https://arxiv.org/abs/2606.19911
Authors: To Eun Kim, Xuhong He, Dishank Jain, Ambuj Agrawal, Negar Arabzadeh, Fernando Diaz
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:
Abstract:The decentralized deployment of LLM agents with diverse capabilities across diverse tasks motivates infrastructure for knowledge sharing across heterogeneous agent populations. Just as search engines index human-generated artifacts to support human problem solving, retrieval systems can organize agent-generated artifacts for reuse across agent populations. We extend retrieval-augmented generation - which demonstrates the value of human-authored artifacts to individual agents - to retrieval of agent-generated artifacts supporting a population of agents. In particular, agent trajectories encode reusable procedural knowledge, yet these artifacts are typically discarded after a single use or retained only by the producing agent, forcing newly instantiated agents to repeatedly rediscover existing solutions. We propose Multi-Agent Transactive Memory (MATM), a framework for population-level storage and retrieval of agent-generated trajectories, where producer agents contribute trajectories to a shared repository and consumer agents retrieve them to improve task execution. We focus on interactive environments (ALFWorld and WebArena), where trajectories are long and encode especially rich procedural structure. Our experiments demonstrate that retrieving trajectories from MATM improves downstream task performance and reduces interaction steps without coordination or joint training. These results position MATM as a design pattern for population-level experience sharing in open agent ecosystems.
[IR-9] Query-aware Routing for Filtered Approximate Nearest Neighbors Search
Link: https://arxiv.org/abs/2606.19898
Authors: Qianqian Xiong, Mengxuan Zhang
Subjects: Databases (cs.DB); Information Retrieval (cs.IR)
Comments: 12 pages
Abstract:Filtered ANN search, which combines vector similarity with attribute predicates, is a core primitive in modern vector databases and retrieval-augmented generation. We benchmark all major categorical filtered ANN methods across multiple datasets under three predicates and find that no single method dominates. Moreover, even within a single dataset and predicate type, the best method for a query can vary. Therefore, we propose a query-aware routing framework. A lightweight ML model predicts each candidate method’s recall on the query, and the router consults an offline benchmark table that maps every method and parameter setting to its measured recall and QPS, then selects the method with the best recall–QPS trade-off. Our ablation study narrows 22 candidate features to a minimal set of three and we adopt regression rather than classification as the prediction target to sharpen accuracy. Our model is trained on six real-world datasets and applied to five unseen validation datasets. The final result shows that our router achieves state-of-the-art recall and QPS balance across all five validation datasets compared to existing filtered ANN baselines, while incurring negligible latency overhead.
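The routing step can be sketched as follows. The abstract does not say how the recall–QPS trade-off is scored, so this sketch assumes a simple rule: among settings whose measured and predicted recall both reach a target, pick the one with the highest QPS, and otherwise fall back to the method with the highest predicted recall. The method names, table values, and `route` function are illustrative only.

```python
def route(predicted_recall, table, target_recall=0.9):
    """Pick a (method, parameter) pair for one query.

    predicted_recall: method -> recall predicted for this query (stand-in for the ML model)
    table: (method, parameter) -> (measured recall, measured QPS) from offline benchmarks
    """
    feasible = [
        (qps, method, param)
        for (method, param), (recall, qps) in table.items()
        if recall >= target_recall and predicted_recall[method] >= target_recall
    ]
    if feasible:
        _, method, param = max(feasible)  # fastest setting that meets the target
        return method, param
    # Fallback: no setting meets the target, so favor recall over speed.
    method = max(predicted_recall, key=predicted_recall.get)
    param = max(
        (p for (m, p) in table if m == method),
        key=lambda p: table[(method, p)][0],
    )
    return method, param


table = {
    ("prefilter", 1): (0.99, 200),
    ("postfilter", 64): (0.92, 900),
    ("postfilter", 256): (0.97, 400),
    ("graph", 32): (0.95, 1500),
}
predicted = {"prefilter": 0.99, "postfilter": 0.93, "graph": 0.60}
choice = route(predicted, table, target_recall=0.9)
strict_choice = route(predicted, table, target_recall=0.995)
```

Here the graph method has the best benchmark QPS but is skipped because its predicted recall on this particular query is low, which is the per-query variation the abstract highlights.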
[IR-10] Closing the Calibration Gap in Semantic Caching
Link: https://arxiv.org/abs/2606.19719
Authors: Aditeya Baral, Radoslav Ralev, Iliya Sotirov Zhechev, Srijith Rajamohan, Jen Agarwal
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 23 pages, 2 figures. Source code: this https URL; Models and Datasets: this https URL
Abstract:Semantic caching cuts LLM inference costs by serving a cached response to semantically similar queries. Standard practice evaluates these systems using PR-AUC, a metric that only measures how well scores rank and ignores whether they are usable at a fixed threshold. We show this mismatch leads to systematically poor deployment choices, as models with the highest PR-AUC are often the worst in operation. We introduce Precision-Cache Hit Ratio (P-CHR) AUC, a cache-aware metric that measures precision across cache utilization levels, and Calibration Retention Rate (CRR), which captures how much offline ranking quality survives at deployment. We decompose the operational gap between offline and deployed quality into a recoverable calibration component and an irreducible structural component fixed by the dataset’s positive rate. Our experiments show that the calibration gap is governed by the training objective rather than data scale, and post-hoc calibration only partially closes it. Ultimately, model selection for semantic caching is a calibration problem, not a ranking one, and measuring it is the first step to closing the gap.
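To make the idea of "precision across cache utilization levels" concrete, here is a minimal sketch. The exact P-CHR AUC definition is not given in the abstract, so this assumes that lowering the similarity threshold one query at a time yields a cache hit ratio (fraction of queries served from cache) and a precision (fraction of served responses that are correct), and that the area is taken with the trapezoid rule. Tied scores are not handled specially.

```python
def p_chr_curve(scores, labels):
    """Return (cache hit ratio, precision) points as the threshold is lowered.

    scores: similarity score of the best cached match for each query
    labels: 1 if serving that cached response would be correct, else 0
    """
    pairs = sorted(zip(scores, labels), reverse=True)
    total = len(pairs)
    curve, correct = [], 0
    for served, (_, label) in enumerate(pairs, start=1):
        correct += label
        curve.append((served / total, correct / served))
    return curve


def trapezoid_auc(curve):
    """Area under a piecewise-linear curve given as (x, y) points."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(curve, curve[1:])
    )


curve = p_chr_curve(scores=[0.9, 0.8, 0.7, 0.6], labels=[1, 1, 0, 1])
auc = trapezoid_auc(curve)
```

Unlike a ranking-only metric, each point on this curve corresponds to a concrete threshold, so it shows what precision a deployment would get at a given level of cache use.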
[IR-11] When Global Gating Is Enough: Admission-Time Hubness Control in Anisotropic Vector Retrieval Systems
Link: https://arxiv.org/abs/2606.19692
Authors: Prashant Kumar Pathak, Tarun Kumar Sharma
Subjects: Cryptography and Security (cs.CR); Databases (cs.DB); Information Retrieval (cs.IR)
Comments:
Abstract:Vector hubness, where a few points become nearest neighbors of many queries, creates a poisoning risk in retrieval-augmented generation (RAG): one injected document can influence unrelated requests. Existing defenses use periodic reverse-kNN scans, leaving an exposure window and repeated corpus-wide work. We study admission-time control, scoring each candidate against sentinel queries and quarantining hub-like documents before insertion. Across two 100,000-document corpora, five encoders, and disjoint attacker and defender query sets, a global gate achieves recall 1.0 at the decisive embedding-space point (=0.92 across the effective range) and 0.91 +/- 0.07 on HotFlip attacks, with 1% false positives on general documents. A per-topic gate provides no reliable benefit, consistent with anisotropy coupling local and global visibility. Thresholds are maintained incrementally, with corpus-size-independent insertion cost and amortized deletion cost. On HNSW, admission adds about 3.1% to ingestion latency, scoring remains flat to 10^6 vectors, and 1.2% of decisions flip under approximate indexing, none involving attacks. Provenance complements the gate for natural or tight-domain hubs.
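A minimal sketch of an admission-time gate is shown below. The scoring rule is our assumption, not the paper's: a candidate's hub score is the fraction of sentinel queries for which it would displace the current k-th nearest neighbor, and a single global threshold decides quarantine. All names and numbers are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def hub_score(candidate, sentinels, kth_best_sim):
    """Fraction of sentinel queries whose top-k the candidate would enter.

    kth_best_sim[i] is the similarity of sentinel i's current k-th neighbor,
    which can be maintained incrementally as documents are inserted.
    """
    hits = sum(1 for q, kth in zip(sentinels, kth_best_sim) if dot(candidate, q) > kth)
    return hits / len(sentinels)


def admit(candidate, sentinels, kth_best_sim, max_hub_score=0.75):
    """Global gate: quarantine documents that would be visible to too many queries."""
    return hub_score(candidate, sentinels, kth_best_sim) <= max_hub_score


# Sentinel queries concentrated in one region, mimicking an anisotropic embedding space.
sentinels = [[1.0, 0.0], [0.8, 0.6], [0.6, 0.8], [0.0, 1.0]]
kth_best_sim = [0.7, 0.7, 0.7, 0.7]

ordinary_doc = [1.0, 0.0]   # close to only some of the sentinels
hub_like_doc = [0.71, 0.71]  # sits in the middle and is close to all of them
```

Because the check runs once per inserted document against a fixed sentinel set, its cost does not grow with corpus size, which matches the abstract's point about avoiding periodic corpus-wide scans.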
[IR-12] Denoising Implicit Feedback for Cold-start Recommendation KDD2026
Link: https://arxiv.org/abs/2606.19658
Authors: Gaode Chen, Shicheng Wang, Shikun Li, Rui Huang, Xinghua Zhang, Yunze Luo, Shipeng Li, Shiming Ge, Ruina Sun, Yinjie Jiang, Jun Zhang
Subjects: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Multimedia (cs.MM)
Comments: Accepted by KDD 2026 ADS Track
Abstract:Implicit feedback is widely used in recommender systems due to its accessibility and generality, yet it usually presents noisy samples (e.g., clickbait, position bias). Meanwhile, recommenders inevitably face the item cold-start problem due to the continuous influx of new items. We identify that cold items are more prone to noisy samples due to the aforementioned factors, and researchers often overlook the significance of denoising implicit feedback for cold items. Previous denoising studies usually identify noisy samples based on heuristic patterns, such as higher loss values, and mitigate noise through sample selection or re-weighting. However, these methods have limited adaptability and are ineffective in cold-start scenarios. To achieve denoising implicit feedback for cold-start recommendation, we propose a model-agnostic denoising method called DIF. First, user preferences for content remain stable, which allows us to infer pseudo-labels indicating whether a user is interested in a cold item through content-similar warm items. Furthermore, to improve pseudo-label accuracy, we model the confidence of pseudo-labels based on the content similarity between the cold item and warm items, and then aggregate multiple pseudo-labels for each sample. Finally, we explicitly estimate the uncertainty of the noisy sample label by considering its relative entropy and the cold-start status of the item, which adaptively guides the role of pseudo-labels to correct the noisy labels at the sample level. DIF’s superiority is supported by both theoretical justification and extensive experiments on real-world datasets. The method has been deployed on a billion-user scale short video application Kuaishou and has significantly improved various commercial metrics within cold-start scenarios.
[IR-13] SAFE-Cascade: Cost-Adaptive Vision-Language Routing for Chart Question Answering CIKM2026
Link: https://arxiv.org/abs/2606.19646
Authors: Ayush Dwivedi, Qixin Wang, Ashvi Soni, Ruoteng Wang, Han Li, Animesh Mahapatra, Neeraj Agrawal, Xintao Wu
Subjects: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Demo paper submitted at CIKM 2026. 4 pages, 2 figures
Abstract:Vision-language models (VLMs) are powerful for chart question answering, but invoking a VLM for every query can be unnecessarily expensive when many questions are answerable from OCR text and lightweight language reasoning. We demonstrate SAFE-Cascade, an interactive system for cost-adaptive chart question answering. Given a chart image and a natural-language question, SAFE-Cascade first extracts chart text with OCR, obtains a provisional answer from a text-only language model, and then uses a learned router to decide whether to accept the text answer or escalate to a VLM. The demo exposes this decision process to users: OCR evidence, text-only answer, routing probability, escalation decision, final answer, estimated cost, and estimated latency are shown side by side. SAFE-Cascade is designed as a transparent interface for understanding when visual grounding is actually needed. Users can upload or select charts, ask questions, inspect the evidence used by each pathway, compare text-only and VLM answers, and adjust the escalation threshold to explore the accuracy-cost frontier. The system is implemented with Azure Document Intelligence for OCR, gpt-5-mini as the text-only model, gemini-2.5-flash-image as the VLM, and a Random Forest router trained on inference-time features. On a held-out ChartQA test split of 375 examples from a 2,500-example experiment, SAFE-Cascade achieves 69.1% unified accuracy with 73.1% VLM invocation, compared with 67.7% accuracy and 100% VLM invocation for the full-VLM baseline. The observed +1.4 percentage-point difference is statistically uncertain, so we interpret SAFE-Cascade as matching full-VLM performance while reducing VLM calls by 26.9% and estimated cost by 9.3%. The demonstration shows how selective modality routing can make multimodal knowledge systems more transparent, tunable, and cost-aware.
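The cascade logic can be sketched in a few lines. This is an illustration under assumptions: the router output is read as the probability that escalation is needed, the per-call costs are made-up units, and the VLM is represented by a callable that is only invoked on escalation. It does not reproduce the system's OCR, models, or Random Forest router.

```python
def run_cascade(examples, threshold, text_cost=1.0, vlm_cost=10.0):
    """Answer each query with the text-only result unless the router escalates.

    examples: list of (text_answer, p_escalate, vlm_answer_fn) tuples
    threshold: escalate to the VLM when p_escalate >= threshold
    Returns (answers, VLM invocation rate, total cost).
    """
    answers, escalated, cost = [], 0, 0.0
    for text_answer, p_escalate, vlm_answer_fn in examples:
        cost += text_cost  # OCR plus text-only model run on every query
        if p_escalate >= threshold:
            escalated += 1
            cost += vlm_cost  # the VLM is called only on escalation
            answers.append(vlm_answer_fn())
        else:
            answers.append(text_answer)
    return answers, escalated / len(examples), cost


examples = [
    ("42", 0.1, lambda: "42"),
    ("blue", 0.9, lambda: "red"),
    ("7", 0.4, lambda: "7"),
]
answers, vlm_rate, total_cost = run_cascade(examples, threshold=0.5)
```

Raising or lowering `threshold` moves along the accuracy-cost frontier mentioned in the abstract. The reported 26.9% reduction in VLM calls is simply 100% minus the 73.1% invocation rate.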
[IR-14] Token Factory: Efficiently Integrating Diverse Signals into Large Recommendation Models
Link: https://arxiv.org/abs/2606.19635
Authors: Xilun Chen, Shao-Chuan Wang, Baykal Cakici, Lukasz Heldt, Lichan Hong, Raghu Keshavan, Aniruddh Nath, Li Wei, Xinyang Xi
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages, 10 figures
Abstract:Large Recommendation Models (LRMs) have demonstrated promising capabilities in industry-scale recommendation tasks. However, holistically integrating traditional signals into these transformer-based architectures effectively and efficiently remains a major challenge. Conventional approaches that “textualize” these signals directly or create discrete item representations often lead to excessively long prompts, substantial memory footprints, and high computational overhead. To overcome these limitations, we propose “Token Factory”, a framework designed to transform traditional signals into “soft tokens” that can be directly processed by LRMs. This approach enables efficient integration and compression of heterogeneous input features, preventing prompt length explosion while enhancing model performance. We detail the architecture of Token Factory and present experimental results validating its effectiveness in a production-scale recommendation environment.
[IR-15] VCG: A Multimodal Retrieval Framework for E-Commerce Video Feeds under Extreme Cold-Start Conditions
Link: https://arxiv.org/abs/2606.19627
Authors: Katya Mirylenka, Egor Malykh, Mahdyar Ravanbakhsh, Michael Gygli, Marco-Andrea Buchmann, Andrew Dzhoha, Svitlana Borzenko, Francesca Catino, Mohamed Gaafar, Maarten Versteegh, Thomas Kober, Dario d’Andrea, Ellie Langhans
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:The digital commerce landscape is shifting from static, search-driven catalogs to dynamic, immersive video feeds. This transition introduces an “extreme cold-start” problem: unlike traditional items, new short-form videos lack the dense interaction history required for collaborative filtering. Furthermore, immersive feeds introduce strong position and duration biases that distort standard engagement signals. In this paper, we demonstrate the Video Candidate Generation (VCG) system, a scalable multimodal retrieval engine designed to solve these challenges in a large-scale e-commerce environment. By leveraging a domain-adapted vision-language model (based on CLIP), we map users and videos into a shared semantic space, enabling zero-shot retrieval based on visual content rather than behavioral history. We detail the system’s architecture and present a rigorous evaluation comparing generative (LLM) vs. discriminative (CLIP) embeddings. Our results show that while generative models excel at attribute prediction, they suffer from embedding space collapse in retrieval tasks. Online A/B testing demonstrates that VCG effectively mitigates engagement biases, yielding a 50% uplift in deep video completion. To showcase the system’s capabilities, we present an interactive demonstration featuring three bi-directional retrieval scenarios: Product-to-Video, Video-to-Product, and Zero-Shot Semantic Search.
[IR-16] MonaVec: A Training-Free Embedded Vector Search Kernel for Edge and Offline AI Systems
Link: https://arxiv.org/abs/2606.19458
Authors: Oğuzhan Yenen
Subjects: Information Retrieval (cs.IR)
Comments: 27 pages, 11 figures. Code and artifacts: this https URL (PyPI: monavec; this http URL: monavec-core). Zenodo: doi: https://doi.org/10.5281/zenodo.20559587
Abstract:We present MonaVec, a deterministic, embedded vector-search kernel for edge and offline AI – settings where server infrastructure, network connectivity, and training data are all unavailable. Existing vector-search systems assume a persistent server, gigabytes of RAM, or a training pass over the corpus; MonaVec instead targets the deployment profile of SQLite: one file, one function call, runs anywhere. Its quantization core is training-free by default and data-oblivious: a Randomized Hadamard Transform (RHDH) conditions any input distribution toward N(0,1), so precomputed Lloyd-Max tables quantize to 4 bits (8x smaller) with no learned codebook and no data pass. The index persists as a single .mvec file whose embedded ChaCha20 rotation seed makes results reproducible across architectures and byte-identical within a build – a determinism guarantee that parallel-build graph libraries cannot offer. On semantic embeddings (AG News, 45K x 1024-dim BGE-M3, cosine), MonaVec 4-bit BruteForce reaches 0.960 Recall@10 in 27 MB – leading float32 FAISS-IVF and 8-bit usearch on recall – while trading peak throughput for byte-identical determinism. A single-pass global standardization (fit()) extends the same data-oblivious pipeline to magnitude-sensitive L2 data, and optional IvfFlat and HNSW backends carry it to million-vector corpora. MonaVec is implemented in pure Rust with Python bindings and runtime SIMD dispatch (AVX-512/AVX2/NEON/scalar). It targets on-device RAG, offline agents, and embedded retrieval – the niche SQLite occupies for relational data: one file, one call, runs anywhere.
ACM classes: H.3.3; E.4
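The training-free quantization idea can be illustrated in pure Python. This sketch is not MonaVec's implementation and makes several substitutions, all labeled in the comments: Python's seeded `random.Random` replaces the ChaCha20 seed, a uniform 16-level grid on [-3, 3] replaces the Lloyd-Max tables, the rotation is the common "random sign flips followed by a Hadamard transform" construction, and the input length must be a power of two.

```python
import math
import random


def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform (length must be a power of two)."""
    v = list(values)
    h = 1
    while h < len(v):
        for start in range(0, len(v), 2 * h):
            for j in range(start, start + h):
                a, b = v[j], v[j + h]
                v[j], v[j + h] = a + b, a - b
        h *= 2
    return v


def rotate(vector, seed):
    """Random sign flips plus Hadamard transform on the unit-normalized input.

    Each output coordinate then has roughly unit variance, whatever the input looked like.
    The seeded Python RNG is a stand-in for the ChaCha20 seed named in the abstract.
    """
    norm = math.sqrt(sum(x * x for x in vector))
    rng = random.Random(seed)
    signs = [rng.choice((-1.0, 1.0)) for _ in vector]
    return fwht([s * x / norm for s, x in zip(signs, vector)])


def quantize4(values, lo=-3.0, hi=3.0):
    """Uniform 4-bit quantizer (a stand-in for the precomputed Lloyd-Max tables)."""
    step = (hi - lo) / 16
    return [min(15, max(0, math.floor((x - lo) / step))) for x in values]


def dequantize4(codes, lo=-3.0, hi=3.0):
    step = (hi - lo) / 16
    return [lo + (c + 0.5) * step for c in codes]


spiky = [1.0] + [0.0] * 7        # all energy in one coordinate
rotated = rotate(spiky, seed=7)  # energy spread evenly across coordinates
codes = quantize4(rotated)
restored = dequantize4(codes)
```

The one-hot input becomes a vector whose entries all have magnitude 1, so a fixed quantization grid fits it without looking at any data. Storing 4-bit codes instead of 32-bit floats gives the 8x size reduction quoted in the abstract, and reusing the same seed reproduces the same rotation.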
[IR-17] Cost-Optimal LLM Routing with Limited User Feedback under User Satisfaction Guarantees
Link: https://arxiv.org/abs/2606.19376
Authors: Herbert Woisetschläger, Arastun Mammadli, Ryan Zhang, Shiqiang Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Preprint. Under review
Abstract:Inference costs for large language model (LLM) applications are rapidly growing, driven by surging demand and rising infrastructure cost. Users expect high-quality responses, and in commercial settings this is formally codified in Service Level Agreements (SLAs), creating a fundamental tension between cost and quality. Recent progress on cost-aware LLM request routing has shown potential to resolve this tension, but existing approaches rely on complete feedback signals, offline training, extensive per-workload tuning, and most lack SLA guarantees or inference-time adaptivity. We introduce SLARouter, an online routing algorithm that learns a cost-optimal policy from the sparse, one-sided user feedback available in production systems. SLARouter provides theoretical guarantees for both cost optimality and strict SLA compliance. Experiments across a wide range of LLM benchmarks show that SLARouter satisfies SLA constraints without the need for per-benchmark tuning, reducing operating cost by up to 2.2x over existing baselines.
Human-Computer Interaction
[HC-0] Your Mouse and Eyes Secretly Leak Your Preference: LLM Alignment using Implicit Feedback from Users
Link: https://arxiv.org/abs/2606.20482
Authors: Haw-Shiuan Chang, Jeffrey Gomez, Mehul Patwari, Aryan Sajith, Hamed Zamani
Subjects: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments:
Abstract:To align a Large Language Model (LLM), most existing methods collect explicit human feedback and train a reward model to predict the human preference based on the response text. These existing methods have two key limitations. First, the users rarely provide explicit feedback for LLM responses, which makes the high-quality preference annotation expensive to collect. Second, the methods do not leverage implicit human feedback, which has proven vital to the economic moats of Internet giants. To quantify the value of implicit feedback, we build a new dataset called IFLLM, which collects 1336 multi-turn questions from the 59 Mechanical Turk workers, their mouse trajectories, and eye gazing points to the LLMs’ responses from their webcams. IFLLM shows that the users have very diverse types of gazing behavior and mouse trajectories. Our reward model based on the implicit user feedback boosts the accuracy of the text-based reward model from 55% to 64% and nearly triples the relative response quality improvements after applying the DPO to eight LLMs, demonstrating the value of implicit feedback in the wild. Our data collection website, dataset, and codes can be found at this https URL.
[HC-1] Directors Duties in the Age of Agentic Artificial Intelligence
Link: https://arxiv.org/abs/2606.20453
Authors: Deirdre Ahern
Subjects: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments:
Abstract:As boards engage with the adoption of Artificial Intelligence including agentic AI to drive operational efficiencies, this presents new opportunities for profit maximisation. AI adoption is increasingly identified with employee role displacement and in companies, and the interests of employees as stakeholders require exploration. A novel question posed is whether in an age of AI ascendancy AI may warrant being given stakeholder status as its role in the company approximates or eclipses that of human employees. The article probes four distinct models of corporate purpose within the duty on directors to act in the best interests of the company, the shareholder primacy model, the Enlightened Shareholder value model, the stakeholder friendly model, and the stakeholder value model, highlighting the available scope for directors to accommodate the interests of employees around AI adoption in decision-making by boards around AI. It is concluded that given the degree to which directors are insulated from legal scrutiny in relation to their best interests duty, adopting a wider law in context approach to promote employee welfare would serve the interests of employees, directors and companies alike. This would see directors engaging meaningfully with employees and providing opportunities for reskilling to adapt to the age of AI.
[HC-2] DataMagic: Transforming Tabular Data into Data Insight Video VLDB2026
Link: https://arxiv.org/abs/2606.20388
Authors: Yupeng Xie,Chen Ma,Zhenyang Wang,Liangwei Wang,Jiayi Zhu,Chuxuan Zeng,Zhouan Shen,Boyan Li,Yuyu Luo
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: 5 pages, 3 figures, accepted at VLDB 2026
Abstract:Data videos integrate dynamic charts, voice narration, and synchronized animations to communicate data insights as temporal narratives, making them an effective medium for improving data consumption efficiency in the data management lifecycle. However, producing high-quality data videos requires expertise spanning data analysis, narrative design, and video production. Existing approaches fall short: static visualization tools (e.g., BI dashboards) lack narrative logic and animation; authoring tools require users to pre-prepare visualizations rather than working from raw data; pixel-level video generation models cannot guarantee data fidelity or provenance. We demonstrate DataMagic, an end-to-end interactive system that transforms raw tabular data and natural language queries into narrative data-insight videos. To ensure data fidelity, DataMagic introduces the declarative specification DVSpec, which binds visual and animation elements to underlying data fields through data-driven semantic references. To address the combinatorial explosion of the design space, DataMagic adopts a Generate-then-Orchestrate multi-agent architecture that generates candidate scenes in parallel and then optimizes narrative coherence through global orchestration. Leveraging DVSpec’s decoupling of logic and rendering, the system further supports three interaction modes and structured provenance-based data QA, transforming one-way videos into explorable interactive data interfaces. Evaluation on 109 real-world samples validates the effectiveness of DataMagic. Homepage: this https URL
[HC-3] Organizing in the Digital Age: Understanding Community Challenges and Consequences in Digitally-facilitated Labor Organizing
Link: https://arxiv.org/abs/2606.20375
Authors: Frederick Reiber,Alishah Chator,Dana Calacci,Allison McDonald
Categories: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
Comments: To appear in CSCW 2026
Abstract:The contemporary American labor force is highly dispersed, necessitating the use of digital communication tools to bridge spatial and temporal gaps in union organizing. This study provides an in-depth analysis of how workers within various labor unions utilize digital, text-based communication platforms – including Discord, WhatsApp, and Slack – for labor organizing. Through 17 qualitative interviews, we examine the challenges and opportunities presented by digital organizing, identifying both technical and social obstacles. Our findings reveal that although digital tools are integral to contemporary labor successes, they also introduce new complexities, such as navigating technical security, managing information overload, and building trust and consensus. Based on these insights, we draw connections to broader understandings of digital organizing and the role of digital tools in unions.
[HC-4] Editorial Alignment: A Participatory Approach to Engaging Editorial Expertise in LLM-mediated Knowledge Dissemination
Link: https://arxiv.org/abs/2606.20258
Authors: Simon Aagaard Enni,Malthe Stavning Erslev,Karl-Emil Kjær Bilstrup,Kristoffer Laigaard Nielbo
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 14 pages
Abstract:The emergence of LLM-driven information services is reshaping the conditions under which public knowledge institutions operate, threatening to absorb the editorial function these institutions exist to exercise. While LLMs offer powerful new affordances for knowledge dissemination, editorial authority is challenged by pretrained LLMs that arrive already aligned with the values and dissemination strategies of their commercial developers. This paper investigates editor participation in re-aligning LLM interfaces to editorial standards through design workshops, in a case study where we design and implement an LLM-enabled encyclopedia interface with a Nordic public knowledge institution. We introduce editorial alignment as a design practice within Participatory AI, framing AI alignment as a design process and positioning the editorial standard as a design artefact that translates editorial practice and values into alignment objectives for technical implementation. Last, we discuss how editorial alignment can create space for ongoing participation and give editors agency in LLM-mediated knowledge dissemination.
[HC-5] Apparent Psychological Profiles of Large Language Models are Largely a Measurement Artifact
Link: https://arxiv.org/abs/2606.20205
Authors: Jelena Meyer,David Garcia,Dirk U. Wulff
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments:
Abstract:Psychological instruments designed for humans are increasingly used to assign large language models (LLMs) stable psychological profiles that affect their usability, safety assessment, and use as proxies for human participants in research. Using a formal psychometric framework, we show that these profiles are largely a measurement artifact. Administering a battery of personality and risk-preference instruments spanning self-reports and behavioral tasks to 56 instruction-tuned LLMs alongside large human reference samples, we report four findings. First, differences between models are driven not by the traits an instrument targets but by a directional response bias, a tendency to respond toward one end of the scale, or one labeled option, regardless of item content; a variance decomposition attributes 81-90% of between-model variation to this bias, against 9-16% in humans. Second, the bias declines with model capability but is not eliminated by it. Third, because bias rather than trait drives responding, an instrument’s apparent reliability is almost entirely predicted by its response orthogonality, a term we coin for the proportion of items for which trait and bias point in opposite directions. Fourth, the profile a model appears to have shifts with the items used and can be manufactured through item selection. These results demonstrate that the apparent psychological profiles of LLMs are artifacts of the instrument used to measure them, not properties of the models themselves. As instruments borrowed from human psychology are rarely fully orthogonal and may inherently lack validity for LLMs, we call for dedicated assessments centered on response orthogonality.
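The response orthogonality measure described in the abstract above (the proportion of items for which trait and bias point in opposite directions) can be illustrated with a minimal sketch. The function name, the +1/-1 encoding of item keying, and the single global bias direction are assumptions made for illustration; they are not taken from the paper and do not represent the authors' implementation.

```python
def response_orthogonality(item_keys, bias_direction):
    """Share of items whose trait keying opposes the response bias.

    item_keys: list of +1 / -1 values giving the scale end that signals a
        higher trait level for each item (hypothetical encoding).
    bias_direction: +1 or -1, the scale end the respondent tends to pick
        regardless of item content (hypothetical encoding).
    """
    if not item_keys:
        raise ValueError("item_keys must not be empty")
    opposed = sum(1 for key in item_keys if key != bias_direction)
    return opposed / len(item_keys)


# Four items, one of them reverse-keyed; the respondent leans toward +1.
print(response_orthogonality([+1, +1, -1, +1], bias_direction=+1))  # 0.25
```

Under this encoding, an instrument with all items keyed in the bias direction scores 0, and a balanced instrument scores 0.5.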
[HC-6] Learning to Prompt: Improving Student Engagement with Adaptive LLM-based High-School Tutoring
Link: https://arxiv.org/abs/2606.20138
Authors: Po-Chin Chang,Nicholas Hogan,Aske Plaat,Michiel T. van der Meer
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments:
Abstract:LLMs can personalize education, although current static-prompt tutoring systems struggle to adapt to diverse academic disciplines. We develop and test a system with subject-aware prompting, based on 14 pedagogical features (e.g., tutor scaffolding, student understanding) extracted from raw transcripts. We first train a prompt routing model in a simulation environment, and then deploy it for online adaptation with actual high-school students. The simulation benchmark shows the router outperforming two static baselines (0.694 vs. 0.647 and 0.64, p<0.001). A/B testing (N=656 conversations from 359 students) shows sim-to-real transfer where the model switches from analytical to scaffolding learning strategies. Our adaptive prompt selection mechanism improves instructional efficiency, maintains pedagogical quality and reduces interactions by around 3 turns (p=0.007). While a greedy router achieves a comparable exercise conversion rate with the baseline (19.1% vs. 19.6%), a stochastic router that samples strategies leads to a higher conversion rate (28.1%).
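The contrast between a greedy router and a stochastic router that samples strategies can be sketched as follows. This is only an illustration: the abstract does not say how the sampling is done, so the softmax weighting, the function names, and the example scores are all assumptions rather than the paper's method.

```python
import math
import random


def softmax(scores):
    """Convert raw scores into a probability distribution."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_route(scores):
    """Always pick the highest-scoring prompt strategy."""
    return max(scores, key=scores.get)


def stochastic_route(scores, rng):
    """Sample a prompt strategy with probability softmax(score) (assumed rule)."""
    names = list(scores)
    probs = softmax([scores[name] for name in names])
    return rng.choices(names, weights=probs, k=1)[0]


# Hypothetical router scores for the two strategies named in the abstract.
scores = {"analytical": 0.2, "scaffolding": 0.9}
print(greedy_route(scores))  # scaffolding
print(stochastic_route(scores, random.Random(0)))
```

The greedy rule is deterministic for fixed scores, while the sampling rule still selects lower-scoring strategies some of the time.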
[HC-7] AI Conversational Interviewing: Scaling Up Semi-Structured and In-depth Interviews
Link: https://arxiv.org/abs/2606.20064
Authors: Alexander Wuttke,Max Melchior Lang,Christopher Klamm,Quirin Würschinger,Frauke Kreuter
Categories: Human-Computer Interaction (cs.HC)
Comments:
Abstract:Public opinion research has long faced a trade-off between depth and scale: standardized surveys enable large-scale measurement but restrict respondents to researcher-defined categories, obscuring the diversity of unexpected considerations that underlie public sentiment. More conversational interviews provide richer insights through open-ended probing, but their reliance on trained human interviewers has kept them difficult to scale. This study introduces AI Conversational Interviewing as a method for collecting open-ended public opinion data at scale, pursuing three objectives: to demonstrate the analytical value of conversational text data for questions beyond the reach of closed-ended items; to assess the method’s practical viability through participants’ own evaluations; and to inform implementation by experimentally comparing voice-based, chat-based, and free-choice interview modes. We conducted a study combining an AI-led interview with a standardized survey on migration policy among 571 respondents recruited via Prolific and Payback Panel. The findings establish AI Conversational Interviewing as a viable and valuable addition to the social-science toolkit. The conversational transcripts surface considerations and reasoning that a comprehensive standardized battery does not capture, such as markedly different mental models of migration among subgroups with similar attitude levels. Among respondents who completed the interview, evaluations of the AI interview were at or above those of the standardized survey across modes, although completion itself varied by condition. By releasing open data and open-source pipeline materials, the study contributes to a growing literature on harnessing artificial intelligence to expand the methods of public opinion measurement.
[HC-8] MobileForge: Annotation-Free Adaptation for Mobile GUI Agents with Hierarchical Feedback-Guided Policy Optimization
Link: https://arxiv.org/abs/2606.19930
Authors: Guangyi Liu,Pengxiang Zhao,Gao Wu,Yiwen Yin,Mading Li,Liang Liu,Congxiao Liu,Zhang Qi,Mengyan Wang,Liang Guo,Yong Liu
Categories: Human-Computer Interaction (cs.HC)
Comments: Project page: this https URL
Abstract:MLLM-based mobile GUI agents have made substantial progress in UI understanding and action execution, but adapting them to real target apps remains costly because mobile apps are numerous, frequently updated, and hard to cover with human-written tasks, demonstrations, or reward labels. Existing annotation-free GUI learning reduces manual supervision, yet lacks a unified substrate connecting target-app exploration, curriculum mining, rollout execution, and feedback, while policy optimization often relies on isolated rollouts and coarse rewards that are hard to convert into reliable improvement signals. We present MobileForge, an annotation-free adaptation system for mobile GUI agents. MobileForge consists of MobileGym, which grounds task generation and rollout evaluation in real mobile app interaction, and Hierarchical Feedback-Guided Policy Optimization (HiFPO), which turns trajectory outcomes, step-level process feedback, and corrective hints into hint-contextualized step-level GRPO updates. Using only automatically generated annotation-free adaptation data, MobileForge adapts Qwen3-VL-8B to 67.2% Pass@3 on AndroidWorld, close to the closed-data GUI-specialized GUI-Owl-1.5-8B base model at 69.0%. The MobileForge-adapted ForgeOwl-8B further reaches 77.6% Pass@3 on AndroidWorld and 41.0% success on the out-of-domain MobileWorld GUI-only split, establishing the strongest open-data mobile GUI agent in our evaluation. Code, data, and trained models will be released at this https URL.
[HC-9] MemGUI-Agent: An End-to-End Long-Horizon Mobile GUI Agent with Proactive Context Management
Link: https://arxiv.org/abs/2606.19926
Authors: Guangyi Liu,Gao Wu,Congxiao Liu,Pengxiang Zhao,Liang Liu,Mading Li,Qi Zhang,Mengyan Wang,Liang Guo,Yong Liu
Categories: Human-Computer Interaction (cs.HC)
Comments: 33 pages, 6 figures. Project page: this https URL
Abstract:MLLM-based mobile GUI agents have made substantial progress on short-horizon tasks, yet remain unreliable on long-horizon tasks that require retaining intermediate facts across many steps and app transitions. We attribute this limitation to ReAct-style prompting, which passively accumulates per-step records, leading to prompt explosion and dilution of critical cross-app facts. To address this, we introduce MemGUI-Agent, an end-to-end long-horizon mobile GUI agent with proactive context management. MemGUI-Agent is built on Context-as-Action (ConAct), which casts context management as first-class actions emitted by the same policy that selects UI actions. Instead of passively appending history, ConAct maintains three structured context fields: folded action history, folded UI state, and recent step record, preserving critical UI facts while keeping context compact. To make proactive context management learnable across model scales, we construct MemGUI-3K, a 2,956-trajectory dataset with full ConAct annotations for supervised training and offline analysis. Training an 8B model on MemGUI-3K produces MemGUI-8B-SFT, an 8B MemGUI-Agent that achieves the best open-data 8B performance on MemGUI-Bench and generalizes to the out-of-distribution MobileWorld benchmark. Code, data, and trained models will be released at this https URL.
[HC-10] Designing for Interconnected Islamic Learning: A Qualitative Study of Muslim Women's Experiences with Quran, Hadith, and Seerah Apps
Link: https://arxiv.org/abs/2606.19745
Authors: Ishrat Jahan Easha(1 and 2),Nabil Mosharraf Hossain(3),Araf Mohammad Mahbub(3),Fairoze Bint Abu Hassan(3),Zunaid Aslam(3),Yemin Sajid(3),Riasat Islam(3 and 4) ((1) University of Technology Sydney, (2) ZNRF University of Management Sciences, (3) Greentech Apps Foundation, (4) Queen Mary University of London)
Categories: Human-Computer Interaction (cs.HC)
Comments: 27 pages, 1 figure, 3 tables. Submitted to the International Journal of Human-Computer Interaction
Abstract:Islamic learning often depends on reading the Qur’an, Hadith, and Seerah together, yet digital tools typically separate these sources across apps, screens, and search pathways. We examine this as a human-computer interaction problem through five semi-structured interviews with Muslim women recruited from an online Islamic learning community. Participants described a recurring tension: they wanted Qur’an-Hadith-Seerah context at the point of reading, but only when contextual expansion remained trustworthy, optional, and did not interrupt reading. Interpreting the interviews through gendered digital religion, epistemic trust, and seamless learning, we identify five themes concerning contextual understanding, authenticity, interface clutter, study modes, and guidance features. We introduce layered contextuality as an HCI account of this domain: contextual expansion must be balanced with interpretive accountability, devotional flow, and continuity across devices and study intensities.
[HC-11] Beyond Uniform Forgetting: A Study of Sequential Direct Preference Optimization Across Preference Settings EMNLP2026
Link: https://arxiv.org/abs/2606.19744
Authors: Pranav Bhandari,Nicolas Fay,Amitava Datta,Usman Naseem,Mehwish Nasim
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Submitted to EMNLP 2026
Abstract:Aligning language models with human preferences often requires optimising multiple behavioural objectives. A practical approach is to apply these objectives sequentially using preference optimisation methods such as Direct Preference Optimisation (DPO), but it remains unclear whether later training uniformly degrades preferences learned earlier or whether the effect depends on the relationship between objectives. We study sequential DPO across four preference settings covering distributional conflict, multi-attribute interaction, strong safety signal, and compatible response-quality objectives. Using Llama-3.1-8B-Instruct with LoRA adapters, we evaluate all objectives after every stage with a fixed base-model reference. We find that sequential DPO does not produce a single forgetting pattern; preference change ranges from partial degradation to stability, pair-level redistribution, or positive transfer depending on objective relationship, signal strength, and training order. Pair-level analysis using length-normalised policy margins shows that aggregate metrics can mask heterogeneous changes across preference pairs, whereas quartile decomposition reveals that high-confidence pairs can either degrade or improve depending on the setting. Mechanistic diagnostics show that Stage 2 gradients and adapter updates are near-orthogonal to the previous objective across all settings, providing little evidence that direct gradient opposition is the primary driver. These findings suggest that future sequential alignment pipelines should account for objective compatibility and signal strength, rather than assuming that later objectives affect earlier preferences uniformly.
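Two quantities mentioned in the abstract above, a length-normalised policy margin and the near-orthogonality of gradients, can be illustrated with a small sketch. The exact definitions used in the paper are not given in the abstract, so the formulas below (mean per-token log-probability difference and plain cosine similarity) and the function names are assumptions for illustration only.

```python
import math


def length_normalised_margin(chosen_logprobs, rejected_logprobs):
    """Mean per-token log-prob of the chosen response minus that of the rejected one."""
    chosen = sum(chosen_logprobs) / len(chosen_logprobs)
    rejected = sum(rejected_logprobs) / len(rejected_logprobs)
    return chosen - rejected


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Toy token log-probabilities: the chosen response is shorter but likelier per token.
print(length_normalised_margin([-1.0, -1.0], [-2.0, -2.0, -2.0]))  # 1.0
# Toy gradient vectors that do not interact at all.
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

A positive margin means the policy prefers the chosen response after correcting for length; a cosine near 0 between two update directions means neither directly opposes the other.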
[HC-12] Vibe Coding for Visualization Implementation: An Empirical Study of Practices and Challenges
Link: https://arxiv.org/abs/2606.19703
Authors: Zhengyu Sun,Xiaolin Wen,Fengjie Wang,Can Liu,Yi Lai,Christophe Hurter,Yong Wang
Categories: Human-Computer Interaction (cs.HC)
Comments: 5 pages, 2 figures. Short paper under review
Abstract:Data visualization is essential for data analysis and communication, yet creating expressive visualizations remains labor-intensive. Recent AI-driven “vibe coding” tools enable users to generate visualizations through natural language interaction, lowering the barrier to entry. However, visualization implementation requires precise alignment between user intent and visual representation, which may differ from general software development practices. We present an empirical study with 16 participants of varying expertise to examine how users employ vibe coding tools for visualization implementation. Participants completed two visualization tasks and a semi-structured interview. Our findings characterize the diverse practices users adopt across prompting, evaluation, and iteration, and surface the challenges they encounter throughout the process.
[HC-13] Syndesmoscope: The Power of Invariant Plots Linked to Traditional Network Views
Link: https://arxiv.org/abs/2606.19689
Authors: Matt Oddo,Indira Sowy,Stephen Kobourov,Tamara Munzner
Categories: Human-Computer Interaction (cs.HC)
Comments:
Abstract:Traditional network representations, such as node-link views and adjacency matrices, can show dramatically different visual patterns, depending on the underlying layout or seriation algorithm. In contrast, invariant plots consistently surface the same visual pattern for the same input topology; yet researchers have underexplored them and have not integrated them into visualization systems. We present Syndesmoscope, an interactive system for network exploration that juxtaposes multiple views of the same network. Panes show a familiar force-directed view alongside three panes with interpretable geometric layouts based on graph-theoretic properties: dense-sparse gradient, geodesic eccentricity, and spectral bisection. As a secondary contribution, we introduce kSnakes, a new invariant plot based on density decomposition. Syndesmoscope supports two key interactions: leapfrogging, or linked highlighting between different and interpretable visual patterns; and hopscotching, or hop-based traversal that extends data selections through the underlying topology. Through usage scenarios across a corpus of 72 diverse networks, we demonstrate how these interactions reveal network patterns inaccessible through any single view alone. Live demo available at this https URL.
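Of the graph-theoretic properties listed in the abstract above, geodesic eccentricity has a standard definition: the largest shortest-path distance from a node to any other node. The sketch below computes it with breadth-first search on an unweighted graph. It shows only the property itself; how Syndesmoscope turns it into a layout is not described in the abstract, and the adjacency-dictionary format and function name are assumptions.

```python
from collections import deque


def eccentricity(adjacency, source):
    """Largest shortest-path (hop) distance from source to any reachable node."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return max(distance.values())


# Path graph a - b - c - d: the end nodes are the most eccentric.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print({node: eccentricity(path, node) for node in path})
# {'a': 3, 'b': 2, 'c': 2, 'd': 3}
```

Because the value depends only on topology, it is the same for every drawing of the same graph, which is the sense in which such plots are invariant.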
[HC-14] Creating Multilingual Mental Health Dialogue Datasets: Limits of Persona-Based Localization via Nationality and Language ACL2026
Link: https://arxiv.org/abs/2606.19640
Authors: Yunkai Xu,Saeed Abdullah
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 15 pages, 4 figures. Accepted to the 2026 Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2026), co-located with ACL 2026
Abstract:AI and large language models (LLMs) have emerged as promising tools to address global mental health challenges. Despite the global nature of these challenges, there remains a critical shortage of high-quality datasets for training and evaluating such systems. To mitigate this gap, researchers increasingly generate synthetic clinical personas to simulate user data and test digital mental health support systems. However, most validated personas rely on English-centric contexts. This paper investigates whether similar persona-based methods can be used to generate multilingual mental health datasets. We modified nationality and language parameters in personas to generate clinical dialogues in Mandarin, Bengali, and Hindi. We then examined how different LLMs perform when evaluating the depression severity of these generated multilingual datasets against the baseline in English. Our findings indicate that just adding nationality and language parameters in personas might not be adequate, as it can introduce clinical inconsistency across languages. LLM judge models often exhibit inaccuracies in assessing depression severity in non-English texts, with performance varying across different models. This exposes the systemic limitations of applying English-centric personas to multilingual contexts. Ultimately, our work highlights the urgent need for culturally responsive data generation to ensure equitable mental health systems globally.
[HC-15] Building Drift: Documenting On-Site Construction Adaptations Across Material Lifecycles
Link: https://arxiv.org/abs/2606.19609
Authors: Ritik Batra,Martin Tamke,Tom Svilans,Jan Hüls,Amritansh Kwatra,Steven J. Jackson,Thijs Roumen,Mette Ramsgaard Thomsen
Categories: Human-Computer Interaction (cs.HC); Graphics (cs.GR)
Comments: In submission
Abstract:In a circular economy for construction, reclaimed materials carry prior lives of use and go on to have post-lives in future buildings. Yet working with such materials introduces unpredictability that requires on-site improvisation, making their reuse challenging to document and scale across building lifetimes. Without documentation, the on-site adaptations that make construction with reclaimed materials possible leave collaborators, evaluators, and inheritors without the information they need to continue, assess, and reuse materials. We call the collective deviation of the physical state from the digital model through these adaptations “building drift.” Through a case study, ReShelter, a reclaimed timber pavilion constructed in the forest, we develop a taxonomy for building drift that characterizes the collective deviation across building lifetimes: Tending the Site, Foraging for Fit, Interpreting the Material, Marking Measurements, and Coordinating Across Communities. To put our taxonomy for building drift into practice, we present Pentimento, a documentation tool that leverages video documentation and 3D Gaussian Splatting to spatially, temporally, and semantically represent on-site adaptations in relation to the designed model. Pentimento enables each stakeholder to navigate material histories in ways that reduce barriers to material reuse. Together, these contributions open pathways towards computational tools that support the on-site improvisation essential to construction with reclaimed materials, enabling more sustainable cycles of recovery, repair, and reuse.
[HC-16] Code as Anchor Memory and Metaphor as Support: Learner Experiences with Multi-View Visualizations
Link: https://arxiv.org/abs/2606.19570
Authors: Naaz Sibia,Jessica Wen,Amber Richardson,Yashika Jain,Khushi Malik,Bogdan Simion,Carolina Nobre,Angela Zavaleta Bernuy,Andrew Petersen,Michael Liut
Categories: Human-Computer Interaction (cs.HC)
Comments: Pre-Print of a paper to be published at the International Computing Education Research (ICER) conference 2026
Abstract:Program visualizations are widely used to support novice programmers, yet students often ignore or resist well-designed visual scaffolds. Research on multiple external representations (MERs) offers cognitive design principles for coordinating views, but less is known about what shapes learners’ engagement with available representations. We conducted a within-subjects study with 19 undergraduates who had completed CS1 and CS2. Students completed think-aloud tasks, reflective interviews, and webcam-based gaze tracking while using a multi-representational probe with synchronized code, memory, and metaphor views, and Python Tutor, across scope, while loops, and linked lists. Gaze analysis showed that students spent nearly half their time focused on code despite available visual scaffolds. Students without prior experience anchored even more heavily in code and engaged minimally with metaphor views. Interviews identified three factors shaping selective engagement: agency, as students sought control over cognitive effort rather than simply having it reduced; representational fit, as identical designs differed in whether they felt helpful or overwhelming; and legitimacy, as some students avoided metaphorical scaffolds they perceived as childish or insufficiently rigorous for university-level work. These findings suggest that multi-representational tools in computing education require attention to affective and social factors alongside cognitive design. Practical considerations include positioning visualizations as verification instruments, offering toggleable abstraction levels, and framing tools to signal disciplinary legitimacy. More broadly, the themes help explain why cognitively sound visualization tools may fail to engage the students they are designed to help. 
Related DOI: https://doi.org/10.1145/3765964.3811662. Submitted (v1): Wed, 17 Jun 2026 20:15:46 UTC.
[HC-17] LLM-Mediated Human-AI Interaction in Search and Rescue: Impact of Expertise on Attentional Allocation
Link: https://arxiv.org/abs/2606.19514
Authors: Elahe Oveisi,Hemanth Manjunatha
Categories: Human-Computer Interaction (cs.HC)
Comments:
Abstract:Human-AI teaming (HAT) increasingly involves AI systems that provide real-time, context-aware guidance in complex tasks. While such systems can improve performance, their effectiveness depends on how they shape human cognition and behavior. In particular, AI assistance can introduce cognitive demands and influence attention, planning, and interaction with the task environment, with effects that can vary across levels of expertise. This work investigates these mechanisms in a simulated search and rescue (SAR) environment. We compare human performance under two LLM (Large Language Model)-guided conditions and a no-LLM baseline, and analyze interaction at multiple levels, including task performance, eye-tracking measures, and planning behavior. Eye tracking provides fine-grained insight into attention allocation and interaction with AI guidance, while behavioral measures capture how users structure and adapt their decisions over time. Results indicate that LLM guidance enhanced task efficiency (higher rewards and victims-per-step) but did not increase total victims saved. Eye-tracking data revealed an attention-guidance trade-off, with visual resources shifting to the chat interface alongside increased pupil size variability. Expertise moderated this effect: novices exhibited passive AI reliance, whereas experts maintained a “verification loop” through persistent environmental scanning. These findings suggest that LLM-mediated teaming efficacy depends on the operator’s ability to cross-reference AI guidance with ground truth to maintain situational awareness.
[HC-18] Beyond the GUI Paradigm: Do Mobile Agents Need the Phone Screen?
Link: https://arxiv.org/abs/2606.19388
Authors: Li Gu,Zihuan Jiang,Linqiang Guo,Zhixiang Chi,Ziqiang Wang,Huan Liu,Yuanhao Yu,Tse-Hsun Chen,Yang Wang
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments:
Abstract:Recent advances in mobile agents are dominated by the GUI paradigm, in which agents perceive UI information and emit screen interactions. However, mobile platforms also expose a command-line interface (CLI) that provides direct access to device services and data. We argue CLI deserves first-class consideration alongside GUI. We evaluate three coding agents (Claude Code, Terminus-2, mini-swe-agent) across four model APIs on AndroidWorld and MobileWorld without any mobile-specific post-training, comparing against three reproducible GUI baselines (GUI-Owl-1.5-32B, MAI-UI, Qwen3-VL-32B). Claude Code (Opus 4.7) reaches 71.8% and 51.9%, outperforming every reproducible GUI baseline (69.3/68.1/57.8% on AndroidWorld; 43.2/26.3/13.3% on MobileWorld), while every other CLI configuration remains competitive. To establish the paradigm’s ceiling, we provide oracle CLI solutions that reach 88.8% on AndroidWorld (103/116 tasks CLI-solvable) and 86.3% on MobileWorld (101/117 tasks CLI-solvable), indicating substantial room for future improvement. To cover everyday user intents beyond the GUI scope, we introduce the CLI-Advantage Task Suite, comprising 45 templates across five categories: bulk operations, multi-condition filtering, aggregation, cross-app workflows, and hidden device state. Every CLI agent outperforms every GUI baseline in all five categories, with substantially fewer steps per task (10.7 vs. 18.6). To support future research on mobile CLI agents, we will open-source agent implementations, oracle solutions, the CLI-Advantage suite, and evaluation infrastructure.
[HC-19] Human-AI Agent Interaction in a Business Context
Link: https://arxiv.org/abs/2606.18716
Authors: Kathrin Paimann,Elizangela Valarini,Sebastian Juhl
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 9 pages, 5 tables, 1 figure, submitted to Springer Nature
Abstract:As AI agents are increasingly integrated into core business processes, understanding and designing effective interaction patterns between humans and AI agents becomes crucial for value creation. This study identifies and evaluates principles and criteria for a positive User Experience (UX) with AI agents, along with methods for its measurement. We identify user expectations and needs to facilitate adoption, build trust, and support user-centered decision-making by development teams. Using a mixed-methods approach that combines qualitative and quantitative techniques, we explore interaction patterns between humans and AI agents. The findings from this exploratory research serve as the basis to develop a survey experiment which evaluates the effectiveness of specific design elements on a larger scale. This foundational research contributes to the development of more intuitive and effective human-AI agent interactions in business settings.
Computer Vision
[CV-0] JanusMesh: Fast and Zero-Shot 3D Visual Illusion Generation via Cross-Space Denoising ECCV2026
Link: https://arxiv.org/abs/2606.20563
Authors: Siang-Ling Zhang,Huai-Hsun Cheng,Tsung-Ju Yang,Yu-Lun Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ECCV 2026. Project page: this https URL
Abstract:Creating 3D visual illusions, a single 3D mesh that reveals entirely different semantics from various viewing angles, is a fascinating but tough challenge. Existing optimization-based methods are slow and can produce oversaturated colors. In contrast, naive stitching approaches fail to produce geometrically coherent objects. This results in visible unnatural seams and semantic leaks. In this paper, we present a fast and training-free framework for generating text-driven 3D visual illusions. Our approach decouples the generation into two stages. First, we propose a cross-space dual-branch denoising process. This process dynamically decodes 3D latents into voxel space for CLIP-guided orientation alignment and Signed Distance Field (SDF) blending, which ensures seamless geometric fusion. Second, we introduce a view-conditioned texture synthesis module that projects and aggregates view-specific 2D diffusion priors onto the fused geometry. Extensive experiments demonstrate that our method generates highly realistic, dual-semantic 3D illusions in just 3-5 minutes. It significantly outperforms existing methods in geometric integrity, semantic recognizability, and efficiency. Project page: this https URL
[CV-1] TimeProVe: Propose then Verify for Efficient Long Video Temporal Reasoning in Activities of Daily Living
Link: https://arxiv.org/abs/2606.20561
Authors: Arkaprava Sinha,Dominick Reilly,Siddharth Krishnan,Hieu Le,Srijan Das
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Long Video Question Answering (LVQA) requires identifying sparse, query-relevant evidence within hours-long untrimmed videos. Existing approaches either process videos densely with large vision-language models (VLMs), incurring prohibitive computational cost, or rely on sparse caption-based reasoning, which often misses temporally localized and motion-centric evidence. We introduce TimeProVe, a cost-efficient hybrid framework for temporally grounded reasoning in long videos. TimeProVe first employs lightweight modules to generate action-grounded answer–evidence hypotheses and subsequently invokes an expensive VLM only for targeted verification. The core of our framework lies in the Action-based Candidate Evidence (ACE) module, which converts temporally localized actions into query-conditioned candidate answers and supporting evidence windows through lightweight LLM reasoning. We further introduce OpenTSUBench (OTB), an open-ended benchmark designed to evaluate temporally grounded reasoning in real-world Activities of Daily Living (ADL) scenarios. Experiments show that TimeProVe outperforms the strongest baseline on OTB by 7.3%, while reducing VLM calls by 75% and inference cost by 93%. Furthermore, without explicit temporal grounding training, TimeProVe achieves competitive performance on Charades-STA, and reaches state-of-the-art results when enhanced with grounding VLMs.
[CV-2] UNIEGO: Proxies as Mediators for Unified Egocentric Video Representation Learning
Link: https://arxiv.org/abs/2606.20559
Authors: Wenhao Chi,Arkaprava Sinha,Dominick Reilly,Hieu Le,Srijan Das
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Egocentric video understanding is inherently limited by the narrow perspective of wearable cameras: a single viewpoint, a single modality, a single model cannot capture the full richness of human action. We argue that a truly expressive egocentric representation must subsume complementary knowledge across viewpoints, modalities, and foundation model representations, yet remain deployable from egocentric video alone. To this end, we introduce a hierarchical multi-teacher distillation framework that produces UNIEGO, a unified egocentric encoder trained with nine teachers spanning ego-exo viewpoints, RGB, depth, and skeleton modalities, and four foundation models. Rather than distilling directly from heterogeneous teachers whose incompatible architectures and feature geometries induce conflicting gradients, our framework interposes a layer of representation-specific Proxy models that translate diverse teacher knowledge into a homogeneous egocentric space. A second distillation stage, Selective Proxy Distillation (SPD), then adaptively selects, for each training sample, the subset of proxies that are both correct and confident, distilling exclusively from reliable supervision and suppressing erroneous signals. SPD is further stabilized by initializing UNIEGO as a learned convex combination of proxy parameters, placing the unified model in a well-conditioned region of the loss landscape before distillation begins. UNIEGO achieves state-of-the-art performance across three egocentric video understanding tasks - action recognition, video retrieval, and action segmentation on three challenging ego-exo benchmarks, outperforming naive multi-teacher distillation baselines and demonstrating that structured, proxy-mediated knowledge transfer yields richer and more discriminative egocentric representations.
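The per-sample selection rule described for Selective Proxy Distillation (keep only proxies that are both correct and confident) can be illustrated with a minimal sketch. The abstract does not say how confidence is measured, so the `min_confidence` threshold on the top class probability, the function name, and the proxy names below are all hypothetical and are not the authors' implementation.

```python
from typing import Dict, List


def select_reliable_proxies(
    proxy_probs: Dict[str, List[float]],
    label: int,
    min_confidence: float = 0.6,  # hypothetical threshold, not from the paper
) -> List[str]:
    """Return the proxies that are both correct and confident on one sample."""
    selected = []
    for name, probs in proxy_probs.items():
        # Index of the class with the highest predicted probability.
        predicted = max(range(len(probs)), key=probs.__getitem__)
        if predicted == label and probs[predicted] >= min_confidence:
            selected.append(name)
    return selected


# Illustrative class probabilities from three hypothetical proxies.
sample = {
    "exo_rgb": [0.10, 0.85, 0.05],   # correct and confident
    "depth": [0.40, 0.45, 0.15],     # correct but not confident
    "skeleton": [0.70, 0.20, 0.10],  # confident but wrong
}
reliable = select_reliable_proxies(sample, label=1)
print(reliable)  # ['exo_rgb']
```

In this toy case only one proxy would supervise the student for the sample; the other two are dropped rather than down-weighted.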
[CV-3] Thinking in Boxes: 3D Editing in Real Images Made Easy
Link: https://arxiv.org/abs/2606.20556
Authors: Pradhaan S Bhat,Naveen Chandra R,Rishubh Parihar,Vaibhav Vavilala,R. Venkatesh Babu,D.A. Forsyth,Anand Bhattad
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL
Abstract:Text and 2D-conditioning interfaces provide weak, ambiguous control over spatial transformations in image editing – particularly under large object motions and camera changes. Prior work has used 3D primitives such as boxes, but only as loose conditioning signals indicating approximate object location rather than specifying the transformation. We instead use 3D boxes as structured specifications: the user provides the input and output boxes of the edit, casting editing as a well-posed geometry problem. This "thinking in boxes" interface, where each box face is color-coded to convey 3D orientation, gives precise control over translation, rotation, scaling, and viewpoint changes in real images while preserving scene and object identity, and recovering previously unseen object regions. To ground transformations in scene appearance, we introduce a depth-aligned planar floor as a global reference frame, shaded with depth-aware cues. Conditioned on this structure, an image generator produces consistent results under large transformations. Trained in two stages – on synthetic multi-object scenes and a small set of real-world videos from Objectron – the system generalizes to complex, in-the-wild real images. Our method operates directly on real photographs and substantially outperforms recent state-of-the-art methods on large 3D edits.
[CV-4] The Token Is a Group Element: On Lie-Algebra Attention over Matrix Lie Groups
Link: https://arxiv.org/abs/2606.20547
Authors: Przemyslaw Musialski
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO); Differential Geometry (math.DG)
Comments: preprint, 19 pages, 3 figures
Abstract:We place the attention token on the group: a token is an element g_i of a matrix Lie group G – a bare transformation, with no feature payload and no external action \rho(g) carrying it. To our knowledge this is the first attention construction whose tokens are bare matrix Lie group elements: their score is the closed-form algebra norm of the relative pose rather than a learned kernel, and it reaches the affine full-frame groups that every irrep- or surjective-exp-based method must exclude. We call it Lie-Algebra Attention. Once tokens are group elements, the rest follows with none of the usual representation-theoretic machinery. The relative geometry of a pair is canonical, g_i^{-1} g_j, so the pairwise invariant w_ij = \log(g_i^{-1} g_j) is intrinsic rather than designed; equivariance under the diagonal G-action is tautological, and the cocycle condition holds automatically. The attention score is the negative squared algebra norm, s_ij = -\|\log(g_i^{-1} g_j)\|_\lambda^2 / \tau: the canonical proximity kernel under a block-weighted Frobenius inner product, with no irreducible representations, spherical harmonics, Clebsch-Gordan products, or learned kernel. The construction applies to any matrix Lie group on a chosen logarithm chart containing the relative poses, including the non-compact non-abelian affine groups with scale and shear that no vector-token attention method reaches: neither the irrep tradition nor surjective-exp methods. Three sequence-completion experiments, on SE(2), SO(3), and Aff(2), bear this out: the closed-form score matches a learned MLP kernel on the same invariant and outperforms it on SE(2), using 50 to 80x fewer score parameters, while a vector-token baseline breaks invariance by five to twelve orders of magnitude.
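To make the score formula concrete, here is a minimal sketch for SE(2), one of the three groups named in the abstract, using the closed-form logarithm of a planar rigid motion. It is an illustration under stated assumptions, not the paper's code: the `(theta, x, y)` pose encoding, the function names, and the two block weights `lam_rot` and `lam_trans` (standing in for the block-weighted norm indexed by lambda) are hypothetical.

```python
import math
from typing import Tuple

Pose = Tuple[float, float, float]  # (theta, x, y) encoding of an SE(2) element


def compose(a: Pose, b: Pose) -> Pose:
    """Group product a * b in SE(2)."""
    ta, xa, ya = a
    tb, xb, yb = b
    c, s = math.cos(ta), math.sin(ta)
    return (ta + tb, xa + c * xb - s * yb, ya + s * xb + c * yb)


def inverse(a: Pose) -> Pose:
    """Group inverse: rotation by -theta, translation -R^T t."""
    t, x, y = a
    c, s = math.cos(t), math.sin(t)
    return (-t, -(c * x + s * y), -(-s * x + c * y))


def log_se2(g: Pose) -> Tuple[float, float, float]:
    """Principal logarithm of an SE(2) element as (omega, v_x, v_y)."""
    t, x, y = g
    omega = math.atan2(math.sin(t), math.cos(t))  # wrap the angle to (-pi, pi]
    if abs(omega) < 1e-9:
        return (omega, x, y)  # pure translation: log is the translation itself
    a = math.sin(omega) / omega
    b = (1.0 - math.cos(omega)) / omega
    det = a * a + b * b
    # The translation satisfies t = V v with V = [[a, -b], [b, a]]; invert V.
    return (omega, (a * x + b * y) / det, (-b * x + a * y) / det)


def score(gi: Pose, gj: Pose, lam_rot: float = 1.0,
          lam_trans: float = 1.0, tau: float = 1.0) -> float:
    """s_ij = -||log(g_i^{-1} g_j)||^2 / tau with assumed block weights."""
    omega, vx, vy = log_se2(compose(inverse(gi), gj))
    # The rotation block omega * [[0, -1], [1, 0]] has squared Frobenius
    # norm 2 * omega^2; the translation block contributes |v|^2.
    sq_norm = lam_rot * 2.0 * omega ** 2 + lam_trans * (vx ** 2 + vy ** 2)
    return -sq_norm / tau


gi, gj, h = (0.3, 1.0, -2.0), (1.1, 0.5, 0.7), (2.0, -3.0, 4.0)
same_pair = score(gi, gj)
moved_pair = score(compose(h, gi), compose(h, gj))  # same global motion h on both
```

Because (h g_i)^{-1} (h g_j) = g_i^{-1} g_j, `same_pair` and `moved_pair` agree up to floating-point error, which is the invariance under the diagonal action that the abstract calls tautological.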
[CV-5] Current World Models Lack a Persistent State Core
Link: https://arxiv.org/abs/2606.20545
Authors: Jinpeng Lu,Dexu Zhu,Haoyuan Shi,Linghan Cai,Guo Tang,Yinda Chen,Jie Cao,Duyu Tang,Yi Zhang,Yong Dai,Xiaozhu Ju
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 39 pages, 16 figures
Abstract:World models are increasingly regarded as a decisive step toward artificial general intelligence, yet modeling the physical world demands more than rendering convincing frames on demand: it requires an internal world state that keeps evolving over time, decoupled from observation, so that objects endure and events run to their conclusions whether or not a camera is watching, much as the moon holds to its orbit when no one is looking. This requirement is a blind spot of existing benchmarks, which reward surface properties such as fidelity, motion, and camera controllability while never asking whether a generated world keeps evolving once it is unobserved. We introduce WRBench, the first systematic diagnostic benchmark that treats camera motion as an intervention on observability and resolves evaluation into a human-calibrated chain that asks whether the camera executes the requested interaction, whether the scene stays continuous and identifiable while in view, and whether a returning target remains consistent with the event that was set in motion. Across 9,600 videos from 23 models spanning four control paradigms, one finding proves stubborn: current systems maintain the observed world as a tracking shot, resuming a returning target in the state at which it was abandoned rather than advancing the event while it went unseen. Because this failure recurs across control paradigms, model families, and increments of scale, robust world-state evolution does not follow from cleaner imagery, tighter control, richer geometric priors, or sheer parameter count. We therefore argue that the stability of the physical state kernel and the consistency of worldlines under viewpoint intervention should become first-class objectives of world-model design, so that a world model captures how the world will unfold rather than how the next frame appears.
[CV-6] SSD: Spatially Speculative Decoding Accelerates Autoregressive Image Generation
Link: https://arxiv.org/abs/2606.20543
Authors: Shilong Xiang,Zirui Zhang,Lijun Yu,Chengzhi Mao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Autoregressive models excel in visual generation by treating images as 1D sequences of discrete tokens, mirroring language modeling. However, this flattening discards the intrinsic 2D spatial locality of visual signals, creating severe computational bottlenecks during inference. We introduce Spatially Speculative Decoding (SSD), a framework that aligns the predictive objective with the natural geometry of images. Rather than predicting only the immediate next token in a 1D sequence, our model simultaneously predicts the adjacent horizontal token and the token directly below it. By capitalizing on this 2D spatial correlation, spatially speculative decoding overcomes the memory wall in visual inference. Our approach accelerates autoregressive image generation by up to 13.3x while maintaining high fidelity on DPG-Bench and GenEval. Our results suggest that respecting the underlying geometry of vision unlocks massive computational efficiencies, paving the way for real-time, high-resolution autoregressive generative models.
[CV-7] CalTennis: Large Multi-View Tennis Video Dataset and Benchmark of Monocular-to-3D Pose Estimation
Link: https://arxiv.org/abs/2606.20542
Authors: Ilona Demler,Xinran Xie,Blake Werner,Anna Szczuka,Pietro Perona
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:The Caltech Tennis Dataset (CalTennis) is a large-scale video benchmark for evaluating monocular-to-3D pose estimation in the wild. CalTennis comprises over 11 million frames (51 hours) of tennis practice and match play from 40 players, captured with 2-6 synchronized cameras at 60 Hz. It is 10 times larger than existing in-the-wild human motion video datasets and 3 times larger than existing MOCAP-ground-truthed datasets, and it is the first large-scale benchmark to provide synchronized multi-view recordings of expert athletic motion. The multi-view setup enables inexpensive, label-free evaluation of monocular-to-3D pose estimation algorithms. We describe a simple, standardized protocol that enables data collection without specialized equipment or expertise, along with fully automated video calibration and synchronization. Benchmarking state-of-the-art monocular-to-3D pose methods on CalTennis, we find that while 3D joint angle recovery is now quite accurate, all models struggle to estimate depth and foot contact consistently. We further propose two novel performance metrics, footwork and stability, and qualitatively study body shape inconsistency. These metrics expose previously underexplored failure modes and point to concrete opportunities for improvement in pose estimation and action analysis.
[CV-8] The FID Lottery: Quantifying Hidden Randomness in Generative-Model Evaluation
Link: https://arxiv.org/abs/2606.20536
Authors: Nicolas Dufour,Alexei A. Efros,Patrick Pérez
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Website: this https URL
Abstract:The Fréchet Inception Distance (FID) is the de facto arbiter of image generation, yet most papers report just a single number from a single trained model using a single sampling seed. How reproducible is that number if we retrain the model, or merely resample from it? In this paper, we treat FID as a random variable on a two-axis panel of training and generation seeds, and measure its variance directly on several hundred SiT networks trained on class-conditional ImageNet 256x256. We report surprising findings: (a) Retraining the model using the same recipe with a different seed moves FID 3.2x more (in Inception feature space) than redrawing samples from a fixed network. (b) That gap is driven by three factors: random initialisation, data ordering, and the per-step Gaussian noise of the flow-matching loss. (c) Increasing compute or model size barely tightens the spread, holding the FID coefficient of variation (CoV) inside a 1-2% band. (d) Per-cell classifier-free-guidance tuning halves the spread but reshuffles which seeds work best, and a lucky training seed reaches the same FID with up to 2x less compute than an unlucky one. Based on these findings, we recommend a new FID evaluation protocol: evaluate under per-cell optimal guidance, treat any FID gap below the empirically measured ~1.3% CoV as inconclusive, and report an error bar over several training seeds rather than a single FID number.
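The recommended reporting rule (an error bar over training seeds, and no conclusion from gaps below the measured CoV) amounts to a few lines of arithmetic. The sketch below is illustrative only: the FID values are made up, the function names are hypothetical, and defining the relative gap against the smaller of the two FIDs is an assumption, since the abstract does not fix that convention.

```python
import statistics
from typing import Sequence


def coefficient_of_variation(fids: Sequence[float]) -> float:
    """Sample standard deviation divided by the mean, over training seeds."""
    return statistics.stdev(fids) / statistics.mean(fids)


def gap_is_conclusive(fid_a: float, fid_b: float, cov: float) -> bool:
    """True only if the relative FID gap exceeds the seed-to-seed CoV."""
    relative_gap = abs(fid_a - fid_b) / min(fid_a, fid_b)  # assumed convention
    return relative_gap > cov


# Made-up FIDs of one recipe retrained with five seeds (illustration only).
seed_fids = [10.0, 10.1, 9.9, 10.2, 9.8]
cov = coefficient_of_variation(seed_fids)
print(f"FID = {statistics.mean(seed_fids):.2f} +/- {statistics.stdev(seed_fids):.2f}")

# Using the ~1.3% CoV quoted in the abstract as the noise floor.
small_gap = gap_is_conclusive(10.0, 10.1, cov=0.013)  # 1% gap
large_gap = gap_is_conclusive(10.0, 10.5, cov=0.013)  # 5% gap
```

Here `small_gap` comes out `False` and `large_gap` comes out `True`: a 1% difference between two methods sits inside the seed noise, while a 5% difference does not.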
[CV-9] VisDom: Sparse Novel View Synthesis with Visible Domain Constraint
Link: https://arxiv.org/abs/2606.20531
Authors: Mariia Gladkova*,Tarun Yenamandra*,Edmond Boyer,Robert Maier,Tony Tung,Daniel Cremers
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Sparse novel view synthesis (NVS) remains challenging due to the ambiguity of recovering 3D geometry from few input views. While NeRF- and Gaussian Splatting (GS)-based methods perform well with dense supervision, they often overfit in sparse settings, producing floating artifacts and inconsistent geometry. Silhouette consistency is commonly used as a regularizer, but it remains insufficient, as silhouette-consistent regions can extend beyond the true object geometry. We introduce VisDom, a learning-free geometric constraint that augments classical carving-based visual hull reconstruction by enforcing a minimum multi-view visibility requirement. Specifically, we define a visible domain as the subset of 3D space observed by at least K views and use it as an additional filtering criterion on top of standard silhouette-based reconstruction. This provides a stronger spatial prior in sparse-view settings. We integrate VisDom into both implicit (NeRF) and explicit (GS) pipelines by restricting volumetric sampling and guiding Gaussian placement during optimization. Experiments on three challenging datasets show consistent improvements in sparse-view NVS, enabling high-quality object-centric reconstruction from as few as four input images. Our method is domain-agnostic, requires only silhouettes, and introduces no learned parameters, making it a simple complement to existing approaches. Applying VisDom on top of GaussianObject further improves performance on Omni3D and MipNeRF360, while matching or surpassing it at 22x lower training cost.
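The visible-domain constraint can be sketched as a filter over candidate 3D points. This is a toy illustration under assumptions, not the authors' pipeline: projection is skipped, so each view is given directly as two hypothetical boolean lists (`in_view`: does the point project inside the image; `in_silhouette`: does it land on the object mask), and points outside a view's image are treated as not carved by that view.

```python
from typing import List


def visible_domain_filter(
    in_view: List[List[bool]],
    in_silhouette: List[List[bool]],
    k: int,
) -> List[int]:
    """Keep points that pass silhouette carving and are seen by >= k views.

    in_view[v][p] and in_silhouette[v][p] refer to view v and point p.
    """
    num_views = len(in_view)
    num_points = len(in_view[0])
    kept = []
    for p in range(num_points):
        seeing = [v for v in range(num_views) if in_view[v][p]]
        # Classical carving: any observing view that sees background removes p.
        carved = any(not in_silhouette[v][p] for v in seeing)
        # Visible-domain test: require at least k observing views.
        if not carved and len(seeing) >= k:
            kept.append(p)
    return kept


# Three views, four candidate points (illustrative flags only).
in_view = [
    [True, True, True, False],
    [True, True, False, False],
    [True, False, False, True],
]
in_silhouette = [
    [True, True, True, False],
    [True, False, False, False],
    [True, False, False, True],
]
hull_only = visible_domain_filter(in_view, in_silhouette, k=1)
with_visdom = visible_domain_filter(in_view, in_silhouette, k=2)
print(hull_only)    # [0, 2, 3]
print(with_visdom)  # [0]
```

Points 2 and 3 are each silhouette-consistent in the single view that observes them, so carving alone keeps them; requiring two observing views removes them, which is the kind of region the abstract says can extend beyond the true object geometry.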
[CV-10] SARLO-80: Worldwide Slant SAR Language Optic Dataset 80cm
Link: https://arxiv.org/abs/2606.20523
Authors: Solène Debuysère,Nicolas Trouvé,Nathan Letheule,Elise Colin,Georgia Channing
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments:
Abstract:Multimodal foundation models have advanced rapidly thanks to large optical benchmarks, but comparable resources for synthetic aperture radar (SAR) remain limited. Existing SAR–optical datasets largely rely on low-resolution, intensity-only Ground Range Detected (GRD) products and do not preserve complex-valued SAR measurements or native acquisition geometry, which restricts physically grounded multimodal learning. In particular, large-scale public datasets combining very-high-resolution (VHR) SAR SLC, aligned optical imagery, and natural-language descriptions are still lacking. We present a VHR SAR–optical–text dataset built from open-access Umbra spotlight acquisitions distributed as Sensor Independent Complex Data (SICD). From around 2,500 worldwide scenes (VV/HH, 20cm–2m native resolution), we standardize all SAR data to an 80cm slant-range grid via band-limited FFT resampling and tile the imagery into 1024 by 1024 patches. For each SAR patch, we retrieve a high-resolution optical tile and warp it into the SAR grid using local coordinate correspondences for local pixel-level alignment. We further generate three caption variants (SHORT/MID/LONG) per sample to support vision–language training and evaluation. Our dataset contains 119,566 triplets (complex and amplitude slant-range SAR patch, aligned optical patch, natural-language description) covering 257 locations across 72 countries and a broad range of land types and infrastructures. We release fixed train/validation/test splits and the full preprocessing and baseline code to enable reproducible benchmarks for multimodal alignment on cross-modal retrieval and conditional generation in native SAR geometry. The dataset is publicly available on the Hugging Face Hub at this https URL.
[CV-11] HumanScale: Egocentric Human Video Can Outperform Real-Robot Data for Embodied Pretraining
Link: https://arxiv.org/abs/2606.20521
Authors: Juncheng Ma,Jianxin Bi,Yufan Deng,Xuanran Zhai,Kewei Zhang,Ye Huang,Bo Liang,Shukai Gong,Jiankai Tu,Xiaotian Tang,Jiaxin Li,Kaiqi Chen,Duomin Wang,Yuqi Wang,Bingyi Kang,Eric Huang,Zhiyang Dou,Zhen Dong,Enze Xie,Wojciech Matusik,Tat-Seng Chua,Daquan Zhou
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Github: this https URL
Abstract:Embodied foundation models are expected to benefit from data scaling like large language models, but face a much tighter data bottleneck. Teleoperated real-robot trajectories remain the dominant pretraining source due to their precise action supervision and embodiment alignment, yet their scalability is limited by high collection cost, acquisition difficulty, and low behavioral and environmental diversity. These limitations have sparked interest in egocentric human video as a scalable, substantially lower-cost, and more diverse alternative for embodied model pretraining. However, its effectiveness compared to teleoperated real-robot data remains underexplored. To address this question, we conduct a systematic study comparing egocentric human video and teleoperated real-robot trajectories as pretraining data sources for embodied foundation models, under fixed post-training and validation protocols. Surprisingly, we find that egocentric data, when processed through a carefully designed filtering and labeling pipeline, is not merely a viable substitute for model pretraining but can lead to superior performance. With the same amount of pretraining data, models pretrained on egocentric data achieve a 24% lower validation loss on real-robot action prediction, as well as 52.5% and 90% higher success rates on in-distribution and out-of-distribution real-robot task execution, respectively. This finding verifies a scalable paradigm for embodied foundation models: pretrain on egocentric human video to learn diverse world representations, then adapt with a small amount of labeled real-robot data for action-space alignment. We hope this study encourages broader exploration of egocentric data and offers guidance for data quality assessment before costly robot data collection.
[CV-12] S-Agent: Spatial Tool-Use Elicits Reasoning for Spatial Intelligence
Link: https://arxiv.org/abs/2606.20515
Authors: Yalun Dai,Hao Li,Shulin Tian,Runmao Yao,Yuhao Dong,Fangzhou Hong,Zhaoxi Chen,Fangfu Liu,Baoliang Tian,Dingwen Zhang,Tao Wang,Kim-Hui Yap,Ziwei Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL
Abstract:Real-world spatial intelligence requires reasoning over a continuous and evolving 3D world, yet existing VLMs and tool-augmented agents largely remain tied to static, stateless inference from isolated visual observations. We introduce S-Agent, a spatial tool-use agentic paradigm for understanding and reasoning over continuous multi-view images and videos. By formulating spatial reasoning as spatio-temporal evidence accumulation rather than isolated frame-level prediction, S-Agent reshapes spatial perception into scene-centric understanding beyond frame-centric recognition. Specifically, S-Agent casts the VLM as a semantic planner that decides what evidence is needed, while a hierarchy of spatial tools and experts grounds objects in 2D, lifts them into 3D geometric evidence, and aggregates this evidence into high-level spatial knowledge (e.g., counting, measurement, orientation, and relative position). Additionally, a temporal memory mechanism, including Scene Memory for maintaining the evolving scene state and Agent Memory for accumulating reasoning context, enables evidence integration across frames and reasoning steps. Comprehensive experiments on multi-view and video spatial reasoning benchmarks show that S-Agent consistently improves both open-source and closed-source VLMs in a training-free manner. Beyond inference-time augmentation, supervised fine-tuning (SFT) on S-Agent-generated spatial trajectories S-300K yields S-Agent-8B, a compact spatial agent that significantly surpasses similar-scale baselines (e.g., Qwen3-VL-8B) and performs comparably to advanced closed-source models (e.g., GPT-5.4 and Gemini 3).
[CV-13] FreeStyle: Free Control of Style-Content Dual-Reference Generation from Community LoRA Mining
Link: https://arxiv.org/abs/2606.20506
Authors: Jinghong Lan,Wei Cheng,Yunuo Chen,Ziqi Ye,Peng Xing,Yixiao Fang,Rui Wang,Yufeng Yang,Xuanyang Zhang,Xianfang Zeng,Difan Zou,Gang Yu,Chi Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 35 pages, 26 figures. Project page: this https URL
Abstract:Style-content dual-reference generation aims to synthesize an image that preserves the structure and semantics of a content reference while adopting the style of a separate style [...] recent progress, this setting remains challenging because models must balance content fidelity, style alignment, and instruction following while avoiding semantic leakage from the style reference. A key bottleneck is the lack of large-scale triplet data with clean content-style separation and broad long-tail style [...] this work, we propose FreeStyle, a scalable dual-reference generation framework based on community LoRA [...] treat community LoRAs as compositional anchors for style and content, and design a rigorous generation and filtering pipeline to construct large-scale Style-Reference and Content-Reference triplets across multiple base [...] address content leakage, we adopt a two-stage curriculum with stage-specific disentanglement mechanisms: an attention-level enrichment constraint that suppresses style-reference leakage in the style-transfer stage, and a frequency-aware RoPE modulation strategy that targets positional-correspondence-based leakage in the harder dual-reference [...] also introduce a benchmark covering both style-reference and dual-reference generation, with evaluations on style similarity, content preservation, aesthetics, instruction following, and leakage rejection. The benchmark incorporates a style-invariant Content Alignment Score (CAS) and introduces a calibrated VLM-based Rejection Score for evaluating generation reliability and leakage [...] experiments show that our model achieves a strong balance among style alignment, content preservation, and leakage suppression.
[CV-14] Fast Human Attention Prediction for Fixation-guided Active Perception in Autonomous Navigation IROS2026
Link: https://arxiv.org/abs/2606.20491
Authors: Fatma Youssef Mohammed,Grzegorz Malczyk,Kostas Alexis
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026)
Abstract:Human visual attention relies on structured scanpaths to efficiently process scenes, yet instilling this behavior into robot autonomy is in its infancy and hindered by the high computational costs of existing predictive models. To address this, we introduce GazeLNN, a computationally lightweight scanpath prediction model that leverages Liquid Neural Networks as its recurrent engine and employs MobileNetV3 for feature extraction. Operating auto-regressively, the architecture predicts sequential fixation heatmaps conditioned on the current visual stimulus and fixation history. Despite requiring only 0.61 GFLOPs, GazeLNN achieves state-of-the-art performance on the MIT Low Resolution dataset, with a ScanMatch score of 0.47. It outperforms existing recurrent baselines across diverse evaluation metrics, while reducing computational costs by 99.40% and accelerating inference by up to six times. To investigate the role of human attention modeling in robot autonomy and demonstrate the practical utility of this highly efficient architecture, we integrate GazeLNN into an active camera-robot control policy trained via Reinforcement Learning. This integration enables human-fixation-guided perception during autonomous navigation, validated through successful real-world deployments on an aerial robot.
[CV-15] How Fragile Are Training-Free AI-Generated Image Detectors? A Controlled Audit of Score Direction Preprocessing and Compression
Link: https://arxiv.org/abs/2606.20488
Authors: Jingwen Zhou,Mingzhe Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Training-free detectors of AI-generated images promise generator-agnostic deployment without classifier training, yet their reported numbers are rarely compared under a single controlled protocol. We audit two representative training-free scores – an autoencoder-reconstruction score (AEROBLADE-style) and a noise-perturbation feature-similarity score (RIGID-style) – plus a naive feature-kNN control, on a common 1,500-image GenImage-derived benchmark spanning seven generators and JPEG compression at quality 70 and 50. The audit yields three cautionary findings. (i) Implementation details masquerade as method differences: replacing the LPIPS backbone (AlexNet to VGG-16) changes overall AUROC by +0.085, and switching between resize-to-512 and native-resolution preprocessing flips per-generator conclusions by up to 0.38 AUROC. (ii) Score direction is not a property of the method but of its hyperparameters: the RIGID-style score is inverted (AUROC < 0.5) on SD1.5 and Wukong at noise level sigma=0.05, recovers to > 0.5 for every generator at sigma=0.01, and collapses to 0.15 at sigma=0.3. (iii) Dataset format bias inflates robustness claims: without unified re-encoding, AUROC under JPEG-50 exceeds the clean condition for the AlexNet-backbone reconstruction score; after bias correction the residual anomaly localizes to a single generator (BigGAN). The audited scores have complementary per-generator failure sets, but naive z-score fusion does not beat the best single score, indicating that exploiting complementarity requires direction-aware combination.
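The score-direction finding rests on a basic property of AUROC: a score that ranks the classes the wrong way round lands below 0.5, and negating it yields exactly one minus that value. The sketch below computes AUROC directly from pairwise comparisons to show this; the scores and labels are made-up illustrative values, and the function is a plain reference implementation rather than anything from the paper.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Labels: 1 = AI-generated, 0 = real.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Made-up detector scores where generated images mostly score LOWER than real.
scores = [0.1, 0.2, 0.6, 0.8, 0.9, 0.4]
labels = [1, 1, 1, 0, 0, 0]

as_reported = auroc(scores, labels)             # below 0.5: direction inverted
flipped = auroc([-s for s in scores], labels)   # same ranking, reversed sign
print(as_reported, flipped)
```

Since `as_reported + flipped` equals 1, an AUROC of 0.15 carries as much ranking information as 0.85. The practical problem the abstract points to is that the correct sign can change with a hyperparameter or generator, so it cannot be fixed once per method.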
[CV-16] PCFootprint: A Large-Scale Dataset and Benchmark for Vectorized Building Footprint Extraction from Aerial LiDAR Point Clouds
Link: https://arxiv.org/abs/2606.20455
Authors: Haoyuan Shen,Kuihao Wang,Ruisheng Wang,Yujun Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 9 figures
Abstract:Building footprint extraction is a fundamental task in photogrammetry, remote sensing, and computer vision. Recent image-based methods have achieved remarkable progress in extracting vectorized footprints from high-resolution optical imagery. However, optical imagery is inherently susceptible to occlusions, perspective distortions, and residual relief displacement, yielding incomplete or misaligned footprint extraction. Furthermore, the lack of explicit elevation information limits its direct applicability to Level of Detail building modeling. In this paper, we present PCFootprint, the first large-scale public dataset for footprint extraction from airborne laser scanning point clouds. PCFootprint comprises 33,000 tiles derived from the Estonian Land and Spatial Development Board, covering diverse urban and rural landscapes. Each tile spans 128 m x 128 m with vectorized footprints systematically aligned to the point clouds. The dataset includes a 3,000-tile cross-domain test set for evaluating generalization across geographic regions. We establish comprehensive benchmarks by evaluating mainstream methods. Experimental results reveal significant challenges including high intra-class variance, data imbalance, and noise across complex geospatial environments. We believe PCFootprint will advance future research in building modeling, urban scene understanding, and geospatial analysis. The PCFootprint dataset is publicly available at this https URL.
[CV-17] InfantFace: Detecting infant faces in neonatal clinical environments
Link: https://arxiv.org/abs/2606.20449
Authors: Abdullah Bin-Obaid,Maria M. Cobo,Rebeccah Slater,Lionel Tarassenko,Mauricio Villarroel
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 32 pages, 7 figures, 4 tables; supplementary information included
Abstract:Reliable localisation of the neonatal face is the first step for several video-camera based non-contact assessments such as pain and distress related facial expression analysis, pain scoring, cardiorespiratory signal extraction and cessation of breathing alerts. However, major challenges persist in neonatal clinical environments. Cluttered backgrounds, illumination changes and poor lighting conditions can reduce the accuracy of face detection models. Clinical interventions, monitoring equipment and, in some cases, medical devices can obstruct the face, making visual assessment difficult. We propose a one-stage YOLOv11m-based model tailored for face detection of infants in neonatal clinical environments. We combined multiple publicly available datasets (VGGFace2, CelebA, FDDB, WIDER FACE) to train and evaluate our proposed model. We then fine-tuned our model on a neonatal research dataset involving 228 videos from 114 recording sessions of 113 independent infants. Before fine-tuning, our model achieved an AP50 of 0.87, surpassing the performance of three state-of-the-art general face detectors. Performance improved further to an AP50 of 0.96 after clinical-domain adaptation. Evaluating face detection performance across different datasets remains a challenge due to the lack of publicly available neonatal datasets. Prioritising the creation of such datasets, while upholding appropriate privacy safeguards and ethical standards in their creation and use, would greatly support further progress in this field.
[CV-18] Spectral Query-Key Product Weight Steering for Training-Free VLM Hallucination Mitigation
链接: https://arxiv.org/abs/2606.20419
作者: Karn Tiwari,Varnith Chordia,Prathosh A P
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under Review
Abstract:Vision-language models (VLMs) often generate fluent but visually unsupported descriptions, especially by mentioning objects absent from the image. We propose QK Product Steering, a data-free, training-free, and zero-inference-cost weight edit for reducing object hallucination. The method directly edits the per-head query-key product, the operator that produces pre-softmax attention logits, by suppressing a small number of dominant singular modes in selected middle layers. The edited product is then mapped back to the query weights through a closed-form query-only update while keeping shared key weights fixed, making the edit compatible with grouped-query attention. We further decompose the QK product into symmetric and antisymmetric components to distinguish mutual content-similarity patterns from directional attention patterns. Across three GQA-based VLMs, QK Product Steering achieves an average relative CHAIR_s reduction of 4.0%, while matched random-mode controls show negligible change. Interpretability ablations show that the hallucination signal is specific to dominant QK modes and is primarily localized to the symmetric mutual-attention channel. Overall, QK Product Steering offers a simple alternative to decoding-time mitigation, requiring no additional data, fine-tuning, or inference-time overhead while largely preserving general multimodal capability.
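The two operations this abstract names, damping a dominant singular mode of the query-key product and splitting that product into symmetric and antisymmetric parts, can be illustrated on a toy matrix. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the weights `w_q` and `w_k`, the damping factor `alpha`, and all helper names are hypothetical, a single mode is damped via power iteration, and the closed-form query-only update is left out because the abstract does not give its formula.

```python
import math

def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def matvec(m, v):
    return [sum(x * y for x, y in zip(row, v)) for row in m]

def split_sym_antisym(m):
    """Return (S, A) with S = (M + M^T) / 2 and A = (M - M^T) / 2, so M = S + A."""
    n = len(m)
    sym = [[(m[i][j] + m[j][i]) / 2 for j in range(n)] for i in range(n)]
    anti = [[(m[i][j] - m[j][i]) / 2 for j in range(n)] for i in range(n)]
    return sym, anti

def top_singular_mode(m, iters=200):
    """Power iteration on M^T M; returns (sigma, u, v) for the dominant mode."""
    m_t = transpose(m)
    v = [1.0] * len(m[0])
    for _ in range(iters):
        w = matvec(m_t, matvec(m, v))
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    mv = matvec(m, v)
    sigma = math.sqrt(sum(x * x for x in mv))
    return sigma, [x / sigma for x in mv], v

def suppress_top_mode(m, alpha):
    """Scale the dominant singular value by alpha: M' = M - (1 - alpha) * sigma * u v^T."""
    sigma, u, v = top_singular_mode(m)
    return [[m[i][j] - (1 - alpha) * sigma * u[i] * v[j] for j in range(len(v))]
            for i in range(len(u))]

# Hypothetical per-head weights: model width 3, head width 2.
w_q = [[1.0, 2.0], [0.0, 1.0], [1.0, 0.0]]
w_k = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]

M = matmul(w_q, transpose(w_k))        # QK product, the operator behind the logits
S, A = split_sym_antisym(M)            # mutual vs. directional components
sigma, u, v = top_singular_mode(M)     # dominant singular mode
M_edit = suppress_top_mode(M, alpha=0.5)
```

Here `M` is the square operator acting on pairs of token embeddings, which is why the symmetric/antisymmetric split is well defined. In practice a library SVD would replace the power iteration when several modes are damped at once.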
[CV-19] On the Redundancy of Timestep Embeddings in Diffusion Models
链接: https://arxiv.org/abs/2606.20416
作者: José A. Chávez
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages
Abstract:Diffusion models rely heavily on explicit timestep embeddings to modulate the denoising process across various noise scales. In this work, we challenge the necessity of these temporal signals by analyzing their impact on U-Net and Diffusion Transformer architectures. Beyond empirical evidence, we provide a theoretical framework demonstrating that, under certain conditions, the global minimizer of the diffusion training objective can be achieved without explicit timestep conditioning. Our findings reveal a surprising robustness when timestep embeddings are completely removed. Extensive ablation studies on the CelebA and CIFAR-10 datasets show that these time-agnostic models can maintain high structural fidelity and even surpass their conditioned counterparts in competitive metrics, including FID, precision, and recall. Our analysis suggests these architectures can implicitly infer noise scales from the corrupted input under specific assumptions, rendering explicit temporal conditioning redundant. This study challenges long-standing temporal conditioning paradigms and paves the way for more efficient and structurally focused generative architectures.
[CV-20] FlowBender: Feedback-Aware Training for Self-Correcting Conditional Flows
链接: https://arxiv.org/abs/2606.20404
作者: Daniel Gilo,Sven Elflein,Ido Sobol,Or Litany
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:Conditional diffusion and flow models routinely fail to satisfy the very constraints that define their task. For instance, a depth-conditioned model often produces images whose re-extracted depth disagrees with the input, even though the forward operator (the depth predictor defining the constraint) is available during both training and inference. Existing approaches generally fall into two categories: supervised models that treat the conditioning signal as a static cue and ignore alignment information at inference, and guidance-based methods that consult it through hand-tuned linear updates, typically trading fidelity to the condition against the plausibility of the generated sample. We argue that the fundamental gap in both paradigms is that the model is never trained to utilize its own alignment error. We introduce FlowBender, a closed-loop framework that treats this error as a first-class input, training the network to learn a correction policy conditioned on inference-time feedback. At each step, an unguided look-ahead pass estimates the clean signal, a task-specific deviation is computed via the forward operator, and a refinement pass consumes this signal to produce a corrected velocity. We propose several variants of FlowBender, including a gradient-based formulation for differentiable operators and a zero-order variant for non-differentiable settings such as JPEG compression. For efficient sampling, we introduce a prior-step shortcut that enables closed-loop correction at a minimal additional computational cost. Across image-to-image translation, restoration, and 3D mesh texturing, FlowBender consistently outperforms standard supervised baselines, alignment-loss-augmented training, and state-of-the-art inference-time guidance, improving fidelity and plausibility simultaneously rather than trading them against each other. Project page: this https URL
[CV-21] Geometry-Aware Superpixel Graph Transformer with Metadata for Skin Lesion Classification MICCAI2026
链接: https://arxiv.org/abs/2606.20390
作者: Muhammad Azeem,Tanveer Hussain,Amr Ahmed,Ardhendu Behera
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at MICCAI 2026
Abstract:Automated skin cancer classification from dermoscopic images remains challenging due to heterogeneous lesion structure, strong intra-class variability, and subtle visual differences between benign and malignant cases. Existing CNN/ViT pipelines typically rely on global or patch-level features and often combine patient metadata via late fusion, which limits spatially grounded multimodal reasoning. We present a novel region-based graph learning framework that explicitly models lesions as graphs of spatially coherent superpixel regions represented as frozen CNN features. To capture fine-grained lesion arrangements, we encode inter-regional geometry as edge attributes and introduce a dedicated metadata context node connected to all regions, providing structured integration of demographic/clinical variables within the same relational space. Node representations are updated using our edge-aware graph transformer followed by attention-driven propagation, and a final graph-level embedding is used for benign-malignant classification. Experiments on four public benchmarks demonstrate that explicit region-level relational modeling and graph-native multimodal fusion yield consistent gains over the state-of-the-art. Consequently, we establish a new graph-centric perspective in which CNN features are modeled as relational nodes and improved through contextual integration, yielding more expressive and robust classifications.
[CV-22] Reliability-Aware Prototype Calibration for Frozen Pose-Flow Video Anomaly Detection
链接: https://arxiv.org/abs/2606.20312
作者: Ning Dong,Yingna Su,Xin Dong,Ziyun Jiao,Xinnian Guo,Zhuangzhuang Pan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 5 figures, 7 tables. Code available at this https URL
Abstract:Pose-flow video anomaly detectors are attractive for one-class surveillance because they provide likelihood-based rankings for tracked skeleton windows. However, a single likelihood score may hide multimodal normal behavior and be sensitive to pose-observation noise. We study a frozen-detector setting in which the pose-flow backbone, cached skeleton tracks, and evaluation pipeline are fixed. Reliability-Aware Prototype Calibration (RPC) is a post-hoc score calibration method for this setting. It adds a standardized nearest-prototype deviation in the frozen latent space to the standardized flow score, and uses keypoint confidence only to gate this added geometric evidence. Thus, RPC preserves the original density signal while correcting the ranking with empirical normal-mode structure under pose reliability. Across two frozen pose-flow backbones and four datasets, RPC improves frame-level AUROC in all eight backbone-dataset pairs, with gains ranging from 0.34 to 4.49 percentage points and averaging 2.03 points. Ablation and reliability analyses show that prototype deviation is the main corrective signal, while reliability gating is most useful when pose observations are less trustworthy. These results suggest that lightweight post-hoc calibration can strengthen cached pose-flow systems when retraining or reproducing the full pose pipeline is impractical.
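The calibration rule described in this abstract (standardized flow score plus a confidence-gated, standardized nearest-prototype deviation) can be written out in a few lines. This is a minimal sketch under assumptions, not the released code: the flow score is taken to be "higher means more anomalous", the gate is assumed to be the clamped mean keypoint confidence, the deviation is assumed to be Euclidean, and every name and number below is hypothetical.

```python
import math
from statistics import mean, pstdev

def standardizer(reference):
    """Build a z-score function from a reference set of normal-only values."""
    mu, sd = mean(reference), pstdev(reference)
    return lambda x: (x - mu) / sd

def prototype_deviation(latent, prototypes):
    """Euclidean distance from a latent vector to its nearest normal prototype."""
    return min(math.dist(latent, p) for p in prototypes)

def rpc_scores(flow, latents, conf, prototypes, ref_flow, ref_latents):
    """Calibrated score = z(flow) + gate(conf) * z(nearest-prototype deviation)."""
    z_flow = standardizer(ref_flow)
    z_dev = standardizer([prototype_deviation(z, prototypes) for z in ref_latents])
    scores = []
    for f, z, c in zip(flow, latents, conf):
        gate = min(max(c, 0.0), 1.0)  # assumed gate: clamp confidence to [0, 1]
        scores.append(z_flow(f) + gate * z_dev(prototype_deviation(z, prototypes)))
    return scores

# Toy reference statistics from normal training windows (hypothetical values).
prototypes = [(0.0, 0.0), (10.0, 0.0)]
ref_flow = [1.0, 3.0]
ref_latents = [(0.0, 1.0), (10.0, 3.0)]

# Three test windows with identical flow scores but different geometry or confidence.
scores = rpc_scores(
    flow=[2.0, 2.0, 2.0],
    latents=[(0.0, 2.0), (5.0, 0.0), (5.0, 0.0)],
    conf=[1.0, 0.5, 1.0],
    prototypes=prototypes,
    ref_flow=ref_flow,
    ref_latents=ref_latents,
)
```

The second and third windows sit equally far from every prototype, yet the low-confidence one receives only half of the geometric correction, which is the gating behaviour the abstract describes.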
[CV-23] Through the PRISM: Preference Representation in Intermediate States of Video Diffusion Models
链接: https://arxiv.org/abs/2606.20310
作者: Haoxuan Wu,Lai Man Po,Mengyang Liu,Kun Li,Hongzheng Yang,Wei Liu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Evaluating video generation with clean, pixel-based reward models disconnects evaluation from the noisy diffusion process and incurs massive VAE decoding costs. In this paper, we challenge this paradigm by asking a fundamental question: Can a powerful video generator inherently discriminate preferences directly from noisy latents? To answer this, we introduce PRISM (Preference Representation in Intermediate States of Diffusion Models). PRISM employs a lightweight Query-based Aggregation head with a frozen video diffusion backbone to decode preference signals from noisy latents. Surprisingly, PRISM not only achieves SOTA preference accuracy but also unlocks strong noise-robustness, which enables early-stage Best-of-N sampling. This allows for filtering suboptimal candidates at the very beginning of denoising, drastically reducing computation while boosting video quality. We also reveal a strong positive correlation between a backbone’s generative performance and its inherent evaluative power, enabling self-improving video backbones.
[CV-24] GEN-Guard: Correcting Generalization Failures for Deployable Federated Surgical AI
链接: https://arxiv.org/abs/2606.20303
作者: Julia Alekseenko,Pietro Mascagni,AI4SafeChole Consortium,Nicolas Padoy
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Federated Learning (FL) in surgical video AI enables collaborative model training without sharing sensitive data. However, standard evaluation practices - selecting the “best” global model based only on validation data from participating hospitals - can lead to suboptimal deployment choices. We identify this critical failure mode as performance leakage, where the selected model overfits internal federation data and fails to generalize to unseen institutions. We propose GEN-Guard, a practical post-hoc framework to detect and correct generalization failures in federated surgical AI. It integrates Generalization Detection via Client-Blocked Evaluation (CBE), which validates performance on isolated client distributions to prevent performance leakage, and Generalization Correction through Disagreement-Aware Distillation (DAD), which learns adaptive feature-level corrections for cross-institutional robustness. Both components operate after standard FL convergence while providing robust support for zero-shot adaptation to unseen environments. We first quantify the severity of performance leakage, observing Model Selection Failures (MSFs) exceeding 80% under standard evaluation. GEN-Guard is evaluated on two multi-center clinical challenges: surgical phase recognition in laparoscopic cholecystectomy and polyp segmentation in colonoscopy. Across both datasets, GEN-Guard consistently corrects these failures, improving in-federation F1 scores by up to 2 points, unseen-institution performance by up to 3 points, and worst-case institutional performance by 3-9 points. Performance leakage represents a systematic and previously under-recognized risk in federated surgical AI. GEN-Guard provides a practical solution for detecting and correcting such failures. By improving cross-institutional robustness and zero-shot generalization, it strengthens the reliability of FL for real-world surgical deployment.
[CV-25] CUPID: Reconstructing UV Texture Maps for Interpretable Person-of-Interest Deepfake Detection
链接: https://arxiv.org/abs/2606.20302
作者: Giovanni Affatato,Sara Mandelli,Edoardo Daniele Cannas,Paolo Bestagini,Stefano Tubaro
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Deepfakes targeting a high-profile individual, known as Person-of-Interest (POI), are a threat to modern democracies and societies. Current POI deepfake detection methods still struggle to combine robustness to post-processing, efficiency and interpretability, focal aspects of modern deepfake detectors. In this paper we propose CUPID, a POI video deepfake detector that combines UV texture maps, a facial appearance representation derived from 3D face reconstructions, with the representation learning capabilities of the Masked Autoencoder (MAE). Our method does not require any deepfake videos in its training phase. Moreover, it does not even require a specific POI to be included in the training set: the combination of UV texture maps extracted from real video frames and the MAE context-guided reconstruction yields a latent space that captures rich and discriminative facial features even for identities unseen during training. In the testing phase, the embeddings extracted from a query video depicting the POI can be matched against pristine reference videos to assess the video authenticity. Furthermore, operating in the UV space naturally provides an additional layer of interpretability. Specifically, we can extract decoded residual maps that highlight which facial regions of a test video deviate most from the identity representation of the corresponding POI. Experiments on four deepfake datasets show that CUPID outperforms the current state of the art on most datasets and achieves the best overall robustness against strong downscaling and compression, while also providing substantially faster inference. Our experimental code will be released at this https URL.
[CV-26] CMDS-AD: Cross-Modal Dual-Stream Decoupling for Few-Shot Anomaly Detection ECCV2026
链接: https://arxiv.org/abs/2606.20300
作者: Junhao Cai,Deyu Zeng,Junhao Pang,Junyu Chen,Qiwei Liang,Xiaopin Zhong,Zongze Wu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ECCV 2026!
Abstract:Few-shot anomaly detection remains challenging due to limited training data. Multi-modal anomaly detection (MAD) offers a viable solution, leveraging 3D geometric cues to enrich 2D RGB representations and compensate for this scarcity. However, existing MAD methods apply spatially uniform feature processing, conflating stable macroscopic structures with high-frequency localized defect signals, exacerbating cross-modal misalignment and inflating false-positive rates. To overcome this, we present CMDS-AD, a Cross-Modal Dual-Stream Anomaly Detection framework. A LoRA-guided diffusion model generates diverse RGB samples to mitigate extreme data scarcity. For 3D normal augmentation, we employ a pre-trained diffusion model as a normal estimator. Crucially, this estimator inherently acts as a non-linear low-pass filter, directly extracting low-frequency normal representations from RGB inputs. This establishes an auxiliary estimated stream of purely low-frequency information, anchoring robust structural templates and assisting the uncompressed real stream, containing coupled high- and low-frequency components, to precisely isolate micro-defects. A Coordinate-Aware Hierarchical Feature Mapper adaptively aligns cross-modal semantics, while a multiplicative scoring mechanism filters modality-specific noise. Under the extreme 1-shot setting, CMDS-AD achieves absolute performance gains of 5.7% (I-AUROC) and 2.0% (AUPRO) on MVTec 3D-AD, alongside 7.7% and 5.6% improvements on EyeCandies, establishing a new state-of-the-art.
[CV-27] Integrating national forest inventory airborne lidar and satellite imagery for wall-to-wall mapping of forest structure with computer vision
链接: https://arxiv.org/abs/2606.20291
作者: Luke J. Zachmann,David D. Diaz,Vincent A. Landau,Chelsey Walden-Schreiner,Tony Chang,Nathan E. Rutenbeck,Katharyn A. Duffy,Kiarie Ndegwa,Andreas Gros,Scott Conway,Guy Bayes
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Remote sensing is increasingly relied upon to deliver actionable science for forest and wildfire risk management across large landscapes. Wall-to-wall, annually updated maps are a persistent need for effective forest management. Many planning systems and data collections combine disparate data sources with different purposes, vintages, and prediction quality, which leads to confounding behavior in operational planning systems. We introduce the VibrantForests framework, developed and applied to map forest attributes and provide a coherent foundation for effective forest and wildfire planning. VibrantForests includes a satellite-based forest structure model trained on lidar-derived samples and applied across the contiguous United States to concurrently generate estimates of canopy cover, canopy height, aboveground live tree biomass, basal area, and quadratic mean diameter at 10-meter resolution. We demonstrate predictive capability spanning the full spectrum of forest conditions ranging from sparse-canopy/low-biomass to dense-canopy/high-biomass. Results show that our model extends the range at which saturation is commonly encountered in comparable passive-sensor models, and reduces regression-to-mean behavior that commonly produces overestimation of forest attributes in small/sparse conditions and underestimation in large/dense conditions. The VibrantForests framework addresses a key limitation in large-area forest and wildfire planning by delivering coherent wall-to-wall estimates of management-relevant attributes at annual cadence and 10m resolution.
[CV-28] U2Mamba: A Two-level Nested U-structure Mamba for Salient Object Detection
链接: https://arxiv.org/abs/2606.20282
作者: Junhui Li,Jialu Li,Youshan Zhang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 2 figures
Abstract:Mamba-based models have emerged as a promising alternative for salient object detection (SOD), offering significant advantages in modeling long sequences. However, existing models often fail to explore contextual information and the depth of the entire architecture. This paper introduces U²Mamba, a powerful and innovative U-structured network for salient object detection. We propose multiscale Mamba U-blocks (MMUBs) that enhance the model depth to improve local feature extraction capabilities. Our newly developed nested U-structure, incorporating MMUBs, enables the network to integrate various receptive fields from shallow and deep layers, thereby collecting richer contextual information and longer-range data without being constrained by resolution. Instead of using the traditional deep supervision scheme and top-level supervised training, we propose a hierarchical training supervision method where the loss is computed at each level during the training process. Extensive experiments demonstrate that U²Mamba achieves highly competitive performance against state-of-the-art methods. The source code is available at this https URL.
[CV-29] Efficiently Linking Real Scenes with Synthetic Data Generation for AI-based Cognitive Robotics and Computer Vision Applications
链接: https://arxiv.org/abs/2606.20272
作者: Paul Koch,Vivek Chavan,André Sers,Adem Karakurt,Paul Hofmann,Mohamad Zaher Ziadeh,Jörg Krüger
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted and best paper award at MHI-Kolloquium 2024
Abstract:AI vision models are a driving factor for the potential use cases of cognitive robotics in industry and household applications. A large array of methods, from semantic environment analysis to 6D and grasping pose estimation, has been proposed based on the latest AI achievements. However, such advancements require further strong and efficient methods w.r.t. training data and AI architectures that can work in synergy to tackle current challenges, precision limits, and scalability beyond domain gaps. In this paper, we discuss these current limits and the trends in the related state of the art that are challenging them. Further, we discuss our current work in progress on bridging the domain gap between simulations and real-world applications by linking the two in the training data generation.
[CV-30] Single-Stage Hierarchical Rectification for Weakly Supervised Histopathology Segmentation MICCAI2026
链接: https://arxiv.org/abs/2606.20250
作者: Duc T. Nguyen,Hoang-Long Nguyen,Thanh-Ha DO,Huy-Hieu Pham
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to MICCAI 2026. This is the pre-review submitted version, not the camera-ready version. The final authenticated version will be available in the MICCAI 2026 proceedings
Abstract:Existing weakly supervised semantic segmentation (WSSS) methods in computational pathology rely on a multi-stage paradigm: class activation map (CAM) generation, offline pseudo-mask refinement, and fully supervised retraining. While established, this decoupled approach presents fundamental limitations. The multi-stage process not only incurs high computational training costs but also suffers from error propagation: local texture biases in shallow CNN layers generate false-positive artifacts that subsequent refinement steps often fail to correct. To address these persistent challenges through a simple yet highly effective approach, we propose the Single-Stage Hierarchical Rectification (SSHR) framework. Rather than passively refining CAMs post-hoc, our method proactively purifies intermediate feature representations during the forward pass. We introduce a Hierarchical Feature Rectification Module (HFRM) that utilizes deep global semantic context to filter out local anomalies in shallow layers. This mechanism generates high-fidelity activation maps directly within a single training loop. Experiments on the LUAD-HistoSeg and BCSS datasets demonstrate that SSHR outperforms state-of-the-art multi-stage methods. Furthermore, SSHR reduces training duration by 2 to 5 times. This efficiency minimizes computational overhead and accelerates clinical translation for large-scale histopathology workflows. The code is available at: this https URL
[CV-31] SPOT-E: Test-Time Entropy Shaping with Visual Spotlights for Frozen VLMs
链接: https://arxiv.org/abs/2606.20244
作者: Bo Yin,Xiaobin Hu,Chengming Xu,Ruolin Shen,Mo Yang,Jiangning Zhang,Peng-Tao Jiang,Cheng Tan,Shuicheng YAN
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Vision-language models (VLMs) often underperform on evidence-intensive tasks because decisive visual evidence is small, localized, and easy to overlook, leading to failures in evidence readout even when high-level reasoning is intact. Prior inference-time visual interventions can improve grounding without retraining, but they are largely open-loop and lack a mechanism to verify whether highlighted evidence is actually used. We study answer-span prediction entropy as a model-internal feedback signal and show that naive entropy minimization is ambiguous, since low entropy may arise from evidence-grounded confidence or shortcut collapse. To resolve this ambiguity, we introduce low-entropy anchors and an entropy-shaping objective that reduces answer uncertainty while preserving baseline high-confidence tokens. We instantiate this principle in SPOT-E, a plug-and-play test-time method that produces question-conditioned spotlights, optimized per instance via lightweight tuning based on Group Relative Policy Optimization (GRPO). Across all benchmarks and different VLM families, SPOT-E yields consistent gains and improved robustness under visual corruptions. Code is publicly available at: this https URL
[CV-32] BAFIS: Dataset Framework to assess occupational Bias and Human Preference in modern Text-to-image Models WACV2026
链接: https://arxiv.org/abs/2606.20241
作者: Thomas Klassert,Adrian Ulges,Biying Fu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at the IEEE Winter Conference on Applications of Computer Vision, WACV 2026
Abstract:Generative artificial intelligence has the potential to improve productivity and transform the production of creative content. However, existing research indicates that image generation models are significantly influenced by biases. This work investigates the inherent biases and language-induced biases present in text-to-image models within the context of occupation-related image generation, complementing established metrics with human preference feedback. We present a comprehensive evaluation of five current text-to-image models: Midjourney v6.1, Stable Diffusion 3 Medium, DALL-E 3, Playground v2.5, and FLUX.1-dev, focusing on gender and ethnicity bias, image quality, and prompt alignment. To facilitate this evaluation, we developed the “Battle-Arena for Fair Image Synthesis” (BAFIS), a platform designed to collect human feedback on bias in generated images. Furthermore, we created a dataset comprising 21,140 synthetic images generated using multilingual prompts, which serves as a basis for our analysis. We further place our results within a broader social context by comparing them to official statistics from the German Federal Employment Agency. Our findings reveal systematic biases in text-to-image models, with established evaluation metrics in partial correlation with subjective user ratings. Thus, our research emphasizes the need for including human preferences to develop fairer and more inclusive text-to-image models.
[CV-33] Cinematic Compositing Using Character-Environment-Harmonized Video Generation Models
链接: https://arxiv.org/abs/2606.20233
作者: Tianyi Xiang,Mingming He,Li Ma,Jing Liao
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Cinematic compositing aims to integrate green-screen characters into novel environments while maintaining physical and photometric realism. Previous methods often fail to capture the complex bidirectional interactions between characters and their surroundings, which we characterize as Character-to-Environment (C2E) physical interaction and Environment-to-Character (E2C) lighting harmonization. To address this, we propose an end-to-end video diffusion framework that jointly models C2E and E2C interactions, specifically handling the challenges of interactive props. Our approach introduces a tri-mask-guided architecture with RGB-D joint denoising to ensure physically consistent interactions among the character, props, and environment. We further develop an efficient prior-driven data curation pipeline to construct high-quality relighting pairs without expensive rendering. Finally, a reference-conditioned mechanism enables controllable environment synthesis and precise prop replacement. Extensive experiments demonstrate that our framework significantly outperforms existing methods in cinematic-quality dynamic video compositing.
[CV-34] DeepForestVisionV2: Ecology-Driven Taxonomy Expansion for Camera-Trap Monitoring in African Tropical Forests ICPR2026
链接: https://arxiv.org/abs/2606.20223
作者: Hugo Magaldi,Theau d’Audiffret,Etienne Francois Akomo-Okoue,Bala Amarasekaran,Naomi Anderson,Claire Auger,Noemie Cappelle,Daniel Cornelis,Raphael Cornette,Tobias Deschner,Gabriel Dubus,Davy Fonteyn,Rosa M. Garriga,Jennifer Hatlauf,Innocent Kasekendi,Raymond Katumba,Aram Kazandjian,Alfred Ngomanda,Stephan Ntie,Simone Pika,Xavier Rufray,Harold Rugonge,John Justice Tibesigwa,Peter van Lunteren,Hadrien Vanthomme,Joeri A. Zwerts,Sabrina Krief
类目: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
备注: Accepted at ICPR 2026 - Computer Vision for Biodiversity Monitoring and Conservation Workshop
Abstract:Camera-trap monitoring in African tropical forests increasingly extends beyond closed-canopy interiors to riverbanks, clearings, and park edges. Among available open tools for African forest camera-trap classification, DeepForestVision is the only one providing a matched offline workflow for both photographs and videos, and previous work showed that it outperformed other available baselines on a comparable benchmark. However, it was designed for closed-canopy, ground-level forest interiors and uses a 35-class prediction space that becomes too coarse when deployments encounter arboreal primates, birds, semi-aquatic taxa, or human-associated confounders such as livestock. We present DeepForestVisionV2, an ecology-driven expansion from 35 to 64 prediction classes (61 animal classes plus human, vehicle, and blank) designed to address three recurrent deployment gradients: vertical stratification, scene openness, and anthropogenic interfaces. DeepForestVisionV2 retains the same offline workflow and is trained on 1,535,010 photographs and 243,354 videos from multi-country African tropical-forest projects. Evaluation combines a cross-country cropped-photo validation set, used to assess robustness across sites and camera-trap settings, with three held-out Uganda video benchmarks spanning the targeted gradients. On the validation set, DeepForestVisionV2 reaches 0.86 accuracy, 0.82 macro-F1, and 0.81 balanced accuracy. On the deployment benchmarks, it preserves or improves baseline accuracy despite its harder classification task, while increasing the number of identified taxa from 22 to 29 in forest-interior videos and from 4 to 9 at riverbanks. In the park-edge use case, it raises accuracy from 0.62 to 0.86 and reduces false alarms from 11 to 0. These results show that DeepForestVisionV2 materially improves field utility while preserving robustness across sites, habitats, and camera-trap settings.
[CV-35] Evaluation of Image Matching for Art Skills Assessment
链接: https://arxiv.org/abs/2606.20199
作者: Asaad Alghamdi,Michael Poor,Trung-Nghia Le,Tam V. Nguyen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: MAPR 2024
Abstract:While some individuals possess a natural talent for drawing, mastering this skill requires dedicated training and practice. Determining one’s skill in the art of drawing requires proper comprehensive assessment. In this paper, we propose a method to measure drawing skill by matching the hand-drawn image with the original template. Existing techniques often involve complex processes. However, advancements in computer vision allow us to train computers to perform these comparisons at a human-like level, thereby resolving the tedious and overwhelming traditional process. Using computer vision applications, determining image similarity involves identifying the level of similarity between an image and a reference image. We have implemented and analyzed the SIFT feature and Siamese network to measure image similarity. Our results indicate that it is feasible to assess art skill levels. Through feature analysis, we found that SIFT-based key point matching provides a more effective means of detecting drawing skills.
[CV-36] Distill Once, Adapt Life-Long: Exploring Dataset Distillation for Continual Test-Time Adaptation ECCV2026
链接: https://arxiv.org/abs/2606.20196
作者: Hyun-Kurl Jang,Jihun Kim,Hyeokjun Kweon,Kuk-Jin Yoon
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ECCV 2026
Abstract:Continual Test-Time Adaptation (CTTA) aims to maintain model performance under evolving target domains by adapting online without labeled data. However, practical deployments often cannot retain the source dataset due to privacy or licensing constraints, and purely source-free CTTA methods tend to become unstable under long-term distribution shift, suffering from compounding self-training errors and catastrophic forgetting. We introduce DO-ALL (Distill Once, Adapt Life-Long), a plug-and-play framework that revisits source information in a compact and privacy-conscious form via Dataset Distillation (DD). Before deployment, DO-ALL performs DD to produce a small set of synthetic distilled anchors that summarize the source distribution. During adaptation, each target sample is matched with its most semantically aligned anchor, which provides a stable reference for various CTTA via source replay, representation alignment, and manifold-smoothing regularization. DO-ALL can be seamlessly integrated into existing CTTA algorithms, consistently improving long-term robustness across CIFAR100-C, ImageNet-C, and the CCC benchmark. This demonstrates the potential of leveraging DD to enable stable and continuous adaptation without retaining raw source data. The code is available at this https URL.
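The matching step mentioned in this abstract (pairing each target sample with its most semantically aligned distilled anchor) can be pictured as a nearest-neighbour lookup in feature space. The sketch below is an illustration under assumptions only: the abstract does not state the similarity measure, so cosine similarity is assumed, and the feature vectors and function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def match_anchor(feature, anchor_features):
    """Index of the distilled anchor most similar to the target feature."""
    return max(range(len(anchor_features)),
               key=lambda i: cosine(feature, anchor_features[i]))

# Hypothetical features of three distilled anchors and three incoming target samples.
anchor_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
target_features = [[0.9, 0.1], [0.2, 0.8], [2.0, 2.1]]

matches = [match_anchor(t, anchor_features) for t in target_features]
```

The matched anchor would then serve as the stable reference for whichever regularizer the host CTTA method uses, for example source replay or representation alignment.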
[CV-37] HilDA: Hierarchical Distillation with Diffusion for Advancing Self-Supervised LiDAR Pre-training ECCV2026
链接: https://arxiv.org/abs/2606.20189
作者: Maciej Wozniak,Jesper Ericsson,Hariprasath Govindarajan,Truls Nyberg,Thomas Gustafsson,Patric Jensfelt,Olov Andersson
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: Accepted to ECCV 2026. Maciej and Jesper contributed equally
Abstract:Leveraging Vision Foundation Models (VFMs) for camera-to-LiDAR knowledge distillation offers a promising solution to the scarcity of annotated data needed to represent the immense geometric and kinematic diversity of real-world autonomous driving (AD). However, current approaches typically treat VFMs as black-box teachers, relying exclusively on frame-wise feature similarity. Consequently, they do not fully exploit the teacher’s layer-wise semantic structure and global context, as well as the rich spatiotemporal information inherent in LiDAR sequences. We propose HilDA, a self-supervised pretraining framework for LiDAR backbones that better captures the semantic what and geometric where needed for driving tasks. HilDA combines hierarchical distillation comprising multi-layer distillation for progressive semantic alignment and global context distillation for scene-level semantics, with a temporal occupancy diffusion objective promoting spatiotemporal consistency. Models pre-trained with HilDA achieve state-of-the-art results on cross-modal distillation benchmarks and outperform models trained via prior distillation approaches on 3D object detection, scene flow, and semantic occupancy prediction. Code available at: this https URL.
[CV-38] Evaluating and Enhancing Negation Comprehension in Remote Sensing MLLMs ECCV2026
链接: https://arxiv.org/abs/2606.20177
作者: Haochen Han,Jue Wang,Alex Jinpeng Wang,Fangming Liu
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: ECCV 2026 Accepted
Abstract:Multimodal Large Language Models (MLLMs) have demonstrated remarkable success in various Remote Sensing (RS) tasks. However, their ability to comprehend negation remains underexplored, limiting deployment in real-world applications where models must explicitly identify what is false or absent, e.g., emergency responders need to locate non-flooded routes for evacuation. To comprehensively study this limitation, we introduce RS-Neg, the first benchmark to evaluate negation understanding across region-level to scene-level tasks. Specifically, we design an automated data generation pipeline for RS imagery, using LLMs to synthesize diverse negation queries, and introduce a dynamic visual focus module for verification. Our evaluation reveals that advanced RS MLLMs struggle with negation, exhibiting hallucinations and substantial performance degradation. To close this gap, we propose NeFo, a novel test-time learning method that explicitly incorporates the logical role of negation into the model optimization. Remarkably, using about 5% unlabeled test samples, NeFo significantly improves the negation understanding of models and shows strong generalization to unseen tasks. Code and data will be released upon acceptance.
[CV-39] ARTEMIS: Agent-guided Reliability-aware Temporal Mask Evolution for Imperfectly Supervised Video Polyp Segmentation
链接: https://arxiv.org/abs/2606.20161
作者: Tong Wang,Siwen Wang,Yaolei Qi,Jinxing Zhou,Yuting He,Guanyu Yang,Yutong Xie
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Imperfectly supervised video polyp segmentation (VPS) aims to learn dense, temporally consistent masks from inexpensive supervision, including weak annotations (points, scribbles) and semi-supervision with few densely labeled frames. This setting is clinically valuable but challenging due to weak contrast, ambiguous boundaries, motion blur, and specular highlights, compounded by sparse pixel-level guidance. While SAM2 can generate dense masks from sparse inputs, direct pseudo-labeling often yields geometry-degraded masks with boundary leakage, underutilizes temporal consistency, and ignores reliability. To address these issues, we propose ARTEMIS, a unified framework for imperfectly supervised VPS driven by agent-guided reliability-aware temporal mask evolution. ARTEMIS initializes coarse masks from available supervision: SAM2 converts points/scribbles, while dense labels serve as reliable anchors. A debate-and-judge vision-language agent selects reliable temporal anchors under weak supervision, which are propagated bidirectionally with SAM2 to refine unreliable or unlabeled frames. Finally, ARTEMIS trains the segmenter using temporal reliability-aware robust learning, incorporating reliability-guided reference selection, a Reference Prototype Transport Module, and reliability-aware robust loss. These components assess mask reliability, evolve anchors over time, transport target identity across frames, and down-weight noisy supervision instead of discarding difficult samples. Experiments on SUN-SEG and CVC-ClinicDB-612 under scribble, point, and limited-label settings demonstrate that ARTEMIS achieves state-of-the-art performance. Code will be released at this https URL.
[CV-40] HEad and neCK TumOR (HECKTOR) 2025: Benchmark of Segmentation, Diagnosis, and Prognosis in Multimodal PET/CT MICCAI2025
链接: https://arxiv.org/abs/2606.20143
作者: Numan Saeed,Salma Hassan,Shahad Hardan,Lishan Cai,Xinglong Liang,Moona Mazher,Abdul Qayyum,Yansong Bu,Mengye Lyu,Yue Lin,Mingyuan Meng,Chuanyi Huang,Lisheng Wang,Dalal Chamseddine,Shamimeh Ahrari,Beining Wu,Yifei Chen,Fuyou Mao,Hao Zhang,Baixiang Zhao,Surajit Ray,Muzi Guo,Lei Xiang,Jakob Dexl,Michael Ingrisch,Adrien Depeursinge,Arman Rahmim,Mathieu Hatt,Vincent Andrearczyk,Mohammad Yaqub
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 4 figures, 4 tables. Overview paper for the HECKTOR 2025 challenge, held as a satellite event at MICCAI 2025. Challenge website: this https URL
Abstract:Head and neck cancers (HNC) represent a significant global health burden, with accurate tumor delineation being essential for effective radiotherapy planning. The complexity of the oropharyngeal anatomy, combined with the heterogeneous appearance of tumors on imaging, makes manual segmentation time-intensive and subject to inter-observer variability. Beyond segmentation, predicting long-term clinical outcomes, such as recurrence-free survival (RFS), and determining human papillomavirus (HPV) status from noninvasive imaging, remain challenging yet clinically valuable goals. The HECKTOR 2025 challenge addresses these needs by establishing a comprehensive benchmark for automated HNC analysis using multimodal PET/CT imaging and electronic health records. Building on previous editions (2020-2022), this challenge features an expanded multi-institutional dataset comprising over 1,100 patients from 10 centers worldwide. Participants were tasked with three complementary objectives: (1) segmenting primary gross tumor volumes (GTVp) and metastatic lymph nodes (GTVn), (2) predicting recurrence-free survival, and (3) classifying HPV status. The challenge attracted 35 registered teams, with 15 final submissions evaluated on a held-out test set. Top-performing algorithms achieved a mean Dice similarity coefficient of 0.75 for segmentation, a concordance index of 0.66 for survival prediction, and a balanced accuracy of 0.56 for HPV classification. This paper presents a comprehensive analysis of the submitted methodologies, evaluates their performance across different lesion characteristics, and discusses their implications for clinical translation in automated oncology workflows and decision support systems.
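For readers unfamiliar with two of the metrics reported above, the following is a minimal sketch of the Dice similarity coefficient (segmentation) and balanced accuracy (binary HPV classification). It assumes flat binary label lists and is not the challenge's official evaluation code; the function names `dice_coefficient` and `balanced_accuracy` are hypothetical.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice = 2 * |pred AND target| / (|pred| + |target|) for binary masks."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of sensitivity and specificity for binary labels (0/1).

    Assumes both classes occur at least once in y_true.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    positives = sum(1 for t in y_true if t == 1)
    negatives = sum(1 for t in y_true if t == 0)
    sensitivity = tp / positives
    specificity = tn / negatives
    return 0.5 * (sensitivity + specificity)
```

Balanced accuracy is 0.5 for a classifier that always predicts one class, which puts the reported 0.56 for HPV status only slightly above that reference point.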
[CV-41] SA-VIS: Sparse frame Annotations for training Video Instance Segmentation
链接: https://arxiv.org/abs/2606.20140
作者: Edoardo Mello Rella,Ajad Chhatkuli,Shipra Jain,Ender Konukoglu,Luc Van Gool
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent online video instance segmentation (VIS) methods have achieved impressive results, thus becoming the preferred approach to segment instances in videos. Despite the resurgence of impressive single image models, the online (or semi-online) VIS approaches outperform single-image models (e.g., based on SAM) by using long sequences of densely annotated frames during training. However, such a training setup of VIS is expensive in terms of both compute and the dense annotations required. In order to solve these major flaws, we argue that the effective modeling of the instances and their evolution in videos does not require densely annotated frames. To that end, we propose a simple and effective module, called Past-frames Feature Propagation (PFP), which aggregates low-dimensional features from the image encoder of multiple frames. This simple low-compute module provides tremendous learning capability in using sparse video frame labels for end-to-end training. Combined with light-weight frame-specific Instance Queries, our Sparse frame Annotation VIS (SA-VIS) significantly improves performance over its baseline. Most interestingly, our simple design that avoids complexities effectively bridges the gap in accuracy between training on sparsely and densely annotated video sequences. This translates to a mere 0.4% drop in performance of SA-VIS when using annotations for only 1/5 of the images in the dataset. Empirically, SA-VIS shows strong improvements over the baseline on YouTube-VIS 2019/2021/2022 and Occluded VIS (OVIS) and an over 1% improvement in AP on the state-of-the-art in a limited annotations scenario.
[CV-42] TriFlow: Generating Artist-Like 3D Mesh Topology via Nearest-Vertex Vector Fields
链接: https://arxiv.org/abs/2606.20131
作者: Haoxuan Li,Ziya Erkoç,Daniele Sirigatti,Vladislav Rosov,Lei Li,Angela Dai,Matthias Nießner
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:
Abstract:We present TriFlow, a new generative approach for producing compact 3D meshes with artist-like triangle topology directly from input geometry conditions such as signed distance fields. Our key insight is to represent mesh topology as a nearest-vertex vector field (NVF) defined over the surface, where each point encodes its association to the nearest triangle vertex in the local barycentric frame. We train a latent flow-matching model to synthesize this field, enabling topology generation conditioned on the input geometry. To extract a coherent mesh, we cluster surface regions using the generated NVF and guide a constrained quadric error metric (QEM) mesh simplification with topology-aware optimization. This yields output meshes that closely match the input geometry while exhibiting structured, artist-like connectivity. Experiments demonstrate that TriFlow achieves stronger generalization and significantly improved topology quality compared to state-of-the-art learning-based approaches, alongside 90% lower Chamfer Distance and an 8x speedup.
[CV-43] SAM3 Self-Distillation for Fine-Grained GOOSE 2D Semantic Segmentation ICRA2026
链接: https://arxiv.org/abs/2606.20130
作者: Xuesong Wang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 4th place in ICRA 2026 GOOSE 2D Semantic Segmentation Challenge
Abstract:We describe our 4th-place entry to the ICRA 2026 GOOSE 2D Fine-Grained Semantic Segmentation Challenge, which reached a composite mean Intersection-over-Union (mIoU) of 69.73% on the official 1,815-image test set. Our model adapts the image encoder of a recent visual foundation model, Segment Anything Model 3 (SAM3), with a lightweight decoder. Beyond this, we contribute two techniques and one empirical finding: (i) a self-distillation scheme that re-uses SAM3 itself, prompted with ground-truth boxes, as a teacher on the classes where it outperforms our own model; (ii) an image-level multi-scale test-time augmentation scheme that restores multi-scale inference for a fixed-input-size model by rescaling the image rather than the model input; and (iii) the finding that an aggressive photometric distortion from a winning 2025 GOOSE 2D entry, transplanted onto our pipeline, is its single largest source of improvement.
[CV-44] When Calibration Fails the Vulnerable Hospital: Federated Conformal Risk Control via Risk-Curve Shrinkage MICCAI2026
链接: https://arxiv.org/abs/2606.20115
作者: Nafis Fuad Shahid
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 3 figures, 2 tables. Submitted to the DeCaF Workshop at MICCAI 2026
Abstract:Conformal risk control (CRC) provides distribution-free guarantees on segmentation quality by calibrating a prediction-set threshold on held-out data. In federated deployments, the standard approach pools calibration scores across sites into a single threshold. We provide the first quantification, on real multi-institutional brain tumor data (FeTS-2022, 1,251 subjects, 20 institutions), showing that this naive pooled CRC protects the average hospital but violates coverage at 40% of individual institutions, with the worst site exceeding the target false-negative rate by 7.8 percentage points. The naive alternative, per-site local CRC, largely restores coverage but inflates prediction sets by 83x, rendering them clinically useless. We propose a shrinkage-based federated CRC protocol: each site transmits only its empirical risk curve (G scalars) to a server, which computes a shrinkage-regularized threshold per site. A single hyperparameter n0 smoothly trades worst-case coverage for prediction-set efficiency; leave-one-site-out sensitivity analysis identifies n0=19, achieving 2.7/20 violations at 2.0x stretch. We further show that direct Lagrangian optimization of coverage budgets fails, concentrating risk on vulnerable hospitals, and that the finite-sample correction term is essential: removing it triples violations. The marginal CRC guarantee is preserved by construction under the stated site-mixture assumption; per-site coverage is validated across four targets with three seeds. No patient-level images, masks, or per-volume scores leave any site.
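The protocol described above can be illustrated with a small sketch. The threshold rule below is the standard conformal risk control condition, (n/(n+1))·R̂(λ) + B/(n+1) ≤ α, where the B/(n+1) term is the finite-sample correction. The shrinkage step is an assumption made purely for illustration: it blends a site's risk curve with the pooled curve using weight n_site/(n_site + n0), which may differ from the paper's actual estimator. The names `crc_threshold_index` and `shrink_risk_curve` are hypothetical.

```python
from typing import List, Sequence


def crc_threshold_index(risk_curve: Sequence[float], n: int,
                        alpha: float, bound: float = 1.0) -> int:
    """Index of the first grid threshold meeting the CRC condition.

    risk_curve[g] is the empirical risk at the g-th threshold of a fixed
    grid, assumed non-increasing in g; `bound` is the upper bound B on the
    loss and `n` the number of calibration samples.
    """
    for g, risk in enumerate(risk_curve):
        if (n / (n + 1)) * risk + bound / (n + 1) <= alpha:
            return g
    # No grid point qualifies: fall back to the most conservative threshold.
    return len(risk_curve) - 1


def shrink_risk_curve(local: Sequence[float], pooled: Sequence[float],
                      n_site: int, n0: float) -> List[float]:
    """Assumed shrinkage: weight the local curve by n_site / (n_site + n0)."""
    w = n_site / (n_site + n0)
    return [w * l + (1.0 - w) * p for l, p in zip(local, pooled)]


# Illustrative (made-up) risk curves over a grid of G = 5 thresholds.
local_curve = [0.30, 0.20, 0.12, 0.05, 0.0]
pooled_curve = [0.20, 0.10, 0.04, 0.01, 0.0]
blended = shrink_risk_curve(local_curve, pooled_curve, n_site=19, n0=19)
site_index = crc_threshold_index(blended, n=19, alpha=0.10)
```

In this sketch n0 = 0 reproduces per-site local CRC and a very large n0 approaches the pooled curve, mirroring the coverage versus efficiency trade-off the abstract attributes to n0. Which sample size should enter the correction term after shrinkage is not stated in the abstract; the sketch simply uses the site's own n.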
[CV-45] Pixel-Level Residual Diffusion Transformer: Scalable 3D CT Volume Generation ICLR2026
链接: https://arxiv.org/abs/2606.20112
作者: Zhenkai Zhang,Markus Hiller,Krista A. Ehinger,Tom Drummond
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Accepted at ICLR 2026. Code available at this https URL
Abstract:Generating high-resolution 3D CT volumes with fine details remains challenging due to substantial computational demands and optimization difficulties inherent to existing generative models. In this paper, we propose the Pixel-Level Residual Diffusion Transformer (PRDiT), a scalable generative framework that synthesizes high-quality 3D medical volumes directly at voxel-level. PRDiT introduces a two-stage training architecture comprising 1) a local denoiser in the form of an MLP-based blind estimator operating on overlapping 3D patches to separate low-frequency structures efficiently, and 2) a global residual diffusion transformer employing memory-efficient attention to model and refine high-frequency residuals across entire volumes. This coarse-to-fine modeling strategy simplifies optimization, enhances training stability, and effectively preserves subtle structures without the limitations of an autoencoder bottleneck. Extensive experiments conducted on the LIDC-IDRI and RAD-ChestCT datasets demonstrate that PRDiT consistently outperforms state-of-the-art models, such as HA-GAN, 3D LDM and WDM-3D, achieving significantly lower 3D FID, MMD and Wasserstein distance scores.
[CV-46] FrozenDrive: Zero-Shot Text-Guided Driving Scene Generation and Data Augmentation with Parameter-Free Frozen Diffusion Model ECCV2026
链接: https://arxiv.org/abs/2606.20110
作者: Yuhwan Jeong,Hyeonseong Kim,Daehyun We,Seonkyu Song,Jinnyeong Yang,Hyun-Kurl Jang,Youngho Yoon,Kuk-Jin Yoon
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ECCV 2026
Abstract:Synthetic data for autonomous driving is surging, powered by diffusion models that promise scalable scene generation. Yet key obstacles remain, as enforcing multi-view and temporal consistency often relies on backbone fine-tuning or added layers, which erodes pre-trained knowledge and weakens text alignment. Models also stay close to the training distribution, struggling under adverse weather and unseen configurations, and fidelity favors frequent over rare classes. We address these gaps with FrozenDrive, a controllable generative framework that preserves a pretrained diffusion model's knowledge while achieving strong consistency. FrozenDrive conditions on rich driving-stack signals and text prompts, and introduces knowledge-preserving spatio-temporal attention to impose cross-view alignment and temporal coherence in a single pass within a parameter-free frozen diffusion backbone. An additional object-focused constraint improves per-object fidelity for rare categories. Without any weather- or scene-specific fine-tuning, our model synthesizes globally coherent multi-view driving scenes from text, particularly under adverse and rare conditions, and surpasses prior baselines. On nuScenes, FrozenDrive augmented data significantly improves AD models' performance, especially at night and in rain, demonstrating stronger robustness when trained with our scenario-targeted data.
[CV-47] EFIQA: Explainable Fundus Image Quality Assessment via Anatomical Priors
链接: https://arxiv.org/abs/2606.20108
作者: Pengwei Wang,José Morano,Qian Wan,Hrvoje Bogunović
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted in MIDL 2026. Code: this https URL
Abstract:Image quality control is vital for a wide range of downstream applications. Deep learning-based image quality assessment methods typically train classifiers on dataset-specific quality labels, inheriting two limitations: (1) generalization is tied to the labeling criteria of the training set and (2) these methods cannot provide spatial feedback on where the quality is degraded, lacking explainability. In this work, we propose EFIQA, a framework that requires no quality-related supervision and produces spatial quality maps by design. Rather than learning "what is degradation" from human-annotated labels, EFIQA learns "what should be there" by leveraging anatomical priors. For fundus photography, we instantiate this as a two-stage approach, by first training an unsupervised anomaly detector via masked anatomical inpainting to identify regions of missing vasculature, and then distilling this prior knowledge into a shallow adapter mapping features of a frozen foundation model to precise quality maps. External-dataset evaluation demonstrates that this label-free approach with minimal adaptation achieves better performance and explainability compared with supervised methods across benchmarks with different quality criteria, highlighting its potential for real-world applications.
[CV-48] Geometry-Preserving in 3D Gaussian Splatting for LiDAR-Camera Extrinsic Calibration ECCV2026
链接: https://arxiv.org/abs/2606.20103
作者: Kyoleen Kwak,Daeho Kim,Jeong Woon Lee,Hyoseok Hwang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ECCV 2026. 15 pages (excluding references), 5 figures
Abstract:Accurate LiDAR-camera calibration is essential for robust multi-modal perception. Targetless approaches avoid manual setup but remain limited by the scarcity of discriminative cross-modal features. Recent methods address this by reconstructing the scene within a differentiable model, enabling extrinsic optimization through dense photometric supervision. Among these, 3D Gaussian Splatting (3DGS) has been widely adopted as a geometric proxy that bridges LiDAR and camera within a single differentiable framework. However, since 3DGS was originally designed for novel view synthesis, existing methods tend to prioritize rendering quality, causing the proxy geometry to drift from the true LiDAR structure. We propose a framework that preserves the metric geometry of the Gaussian proxy by aggregating multi-view LiDAR observations for dense depth supervision and blocking photometric gradients from updating the Gaussian spatial parameters. We validate our method on public driving datasets, where it consistently outperforms existing targetless methods in calibration accuracy.
[CV-49] WeGenBench: A Multidimensional Diagnostic Benchmark towards Text-to-Image Model Optimization
链接: https://arxiv.org/abs/2606.20100
作者: Qian Liang,Xiaomin Li,Ying Zhang,Jia Xu,Lihao Ni,Hongrui Li,Jingjing Li,Jing Lyu,Chen Li
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent text-to-image generation models have demonstrated remarkable capabilities in synthesizing highly realistic images from text inputs alone. Although existing benchmarks can evaluate the generation capabilities of various models to some extent, they struggle to comprehensively and accurately measure performance across multiple dimensions, often failing to reveal the inherent deficiencies of models in specific categories. To address these limitations, we propose WeGenBench, a novel benchmark designed for the comprehensive, multi-perspective evaluation of text-to-image generation capabilities. Our benchmark comprises a total of 4,000 test prompts across two primary categories, meticulously balanced between Chinese and English to evaluate bilingual and cross-cultural generation capabilities. Beyond macroscopic scene classification, we annotate each prompt with multi-dimensional tags tailored to the distinct content and challenges of each language, thereby refining the generation tasks into more specific sub-categories. Through a cross-dimensional evaluation mechanism leveraging both scene classifications and multi-dimensional tags, WeGenBench can precisely pinpoint model shortcomings in specific generation categories. Furthermore, to measure generation quality more accurately, we design and validate several novel evaluation metrics by integrating Vision-Language Models (VLMs), which assess model performance on domain-specific tasks from three core aspects. Crucially, our approach yields both the assessment outcomes and the detailed reasoning trajectories, facilitating a rigorous verification of the accuracy and soundness of the evaluation results. Finally, we conduct systematic benchmarking on current state-of-the-art methods and provide an in-depth analysis of the limitations present in existing models.
[CV-50] Stitching and dimensionality effects on large artificially generated volume datasets
链接: https://arxiv.org/abs/2606.20095
作者: Lucas von Chamier,Jan Philipp Albrecht,Dagmar Kainmüller
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Generating large images via deep learning requires patching input data to accommodate hardware memory limitations, then assembling output patches, a process that can introduce stitching artifacts when neighboring patches do not align at borders. While these artifacts are known to affect segmentation tasks, their impact on generative models for style-transfer remains poorly understood. We investigated three stitching approaches and two patch dimensionalities (2D vs 3D) using cycleGAN models trained on cryo-electron microscopy datasets. We evaluated both perceptual quality and performance on downstream mitochondria segmentation. Our key findings reveal that: (1) FID scores fail to detect subtle stitching artifacts that significantly impact downstream segmentation performance, (2) 3D models with artifact-free stitching marginally outperform 2D models on downstream tasks, though the improvement barely justifies the computational cost, and (3) 2D models train more stably due to larger batch sizes. Additionally, we demonstrate that ensembling predictions from three orthogonal directions can improve low-quality volumes but provides no benefit for high-quality outputs. These results demonstrate that maximizing generative model performance on large scientific datasets requires careful consideration and mitigation of stitching artifacts, and that perceptual metrics alone are insufficient for evaluating domain adaptation quality in biomedical imaging.
[CV-51] MakeupMirror: Improving Facial Attribute Preservation in Diffusion Models for Makeup Transfer
链接: https://arxiv.org/abs/2606.20094
作者: Nefeli Andreou,Angel Martínez-González,Sabine Sternig,Matthieu Guillaumin,Epameinondas Antonakos,Michael Opitz
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG); Multimedia (cs.MM)
备注:
Abstract:Makeup transfer models enable fun augmented reality (AR) experiences as well as virtual try-on (VTO) for online makeup shopping. While recent state-of-the-art diffusion-based solutions such as Stable-Makeup dramatically improve the accuracy and realism of makeup transfer, they still face limitations in identity and skin color preservation, making production-level VTO for makeup shopping unrealistic. In this work, we propose MakeupMirror, a diffusion-based approach to makeup transfer that makes significant progress towards preserving facial features and skin tone. We introduce several technical innovations over Stable-Makeup: (1) integration of facial geometry conditioning with ControlNets to maintain facial fidelity; (2) region-specific makeup transfer control to enable precise makeup application across facial regions such as skin, eyes and lips; (3) skin tone-based makeup transfer modulation that prevents skin tone alteration in cross-subject transfer scenarios; and (4) integration of a Levenberg-Marquardt Langevin sampler to speed up inference while maintaining generation quality. Our experiments on CPM-Real, Makeup Wild, and (herein newly collected, more diverse) MakeupSelfies datasets show that MakeupMirror improves relative facial recognition similarity by 60% and reduces relative skin tone difference by 50% over Stable-Makeup, with a latency of 0.7s, while achieving an expert acceptance rate of 94% across core facial identity preservation criteria.
[CV-52] EventVLA: Event-Driven Visual Evidence Memory for Long-Horizon Vision-Language-Action Policies
链接: https://arxiv.org/abs/2606.20092
作者: Ganlin Yang,Zhangzheng Tu,Yuqiang Yang,Sitong Mao,Junyi Dong,Tianxing Chen,Jiaqi Peng,Jing Xiong,Jiafei Cao,Jifeng Dai,Wengang Zhou,Yao Mu,Tai Wang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Memory remains a critical bottleneck for long-horizon robotic manipulation, as standard Vision-Language-Action (VLA) policies often fail when task-relevant cues become occluded or unobservable over time. While existing memory-augmented methods utilize historical context, they either suffer from severe information bottlenecks, incur high latency via decoupled dual systems, or rely on unselective buffers that accumulate massive visual redundancies. To address these limitations, we introduce EventVLA, an end-to-end framework founded on the concept of sparse visual evidence memory that comprises two core components: foundational visual anchors to retain initial and short-term contexts, and a dynamic Keyframe Evidence Memory (KEM) module. Specifically, KEM directly predicts future keyframe probabilities from the VLA’s latent embeddings to autonomously capture and store sparse, task-critical visual events. This foresight-driven mechanism empowers the policy to dynamically evaluate the future causal utility of current observations, preserving transient visual evidence before it becomes unobservable. Furthermore, we propose RoboTwin-MeM, a diagnostic benchmark specifically designed to evaluate non-Markovian manipulation tasks with interactive visual evidence. Extensive evaluations show that across 17 memory-requiring simulation tasks and 4 real-world bimanual tasks, EventVLA achieves an average success rate improvement of +40% over state-of-the-art memory-augmented VLAs.
[CV-53] Holo-World: Unified Camera, Object, and Weather Control for Video World Model
链接: https://arxiv.org/abs/2606.20083
作者: Xiangchen Yin,Wenzhang Sun,Jiahui Yuan,Zijie Liu,Yinda Chen,Wei Li,Dachun Kai,Chunfeng Wang,Xiaoyan Sun
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL Code: this https URL
Abstract:Video world models are moving toward preserving an observed world under controllable camera and object motion while allowing its environmental state to change. Yet these controls remain isolated, and weather generation typically relies on a source video or reconstructed scene that already specifies future structure. We study a first-frame-anchored source-to-state setting, where the model starts from a single image and follows explicit camera and object controls and an optional weather instruction, then generates a video that either preserves the source world or transfers it to a target weather state. To address these challenges, we first build HoloStateData, a state video dataset that turns diverse videos into unified control samples for camera, object, and weather supervision. Second, we introduce Holo-World, a unified controllable video world model that jointly controls the scene from a single image. Its Unified Scene Adapter factorizes world preservation and weather transfer into distinct parameter subspaces, using rendered background, geometry buffers, and object controls to maintain controlled scene structure while modeling weather-dependent appearance and particle effects. Additionally, Scene-Weather Decomposed CFG guides scene and weather residuals separately, strengthening target weather effects without over-amplifying the full condition. Quantitative and qualitative experiments demonstrate that Holo-World maintains precise camera and object control with consistent scene structure while transferring scenes into diverse target weather states, outperforming video-to-video weather editing baselines on weather-state generation. Our project page is available at this https URL.
[CV-54] The Hidden Evolution of Disguised Visual Context inside the VLM
链接: https://arxiv.org/abs/2606.20077
作者: Wish Suharitdamrong,Tony Alex,Muhammad Awais,Sara Atito
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Visual tokens enter Large Language Models (LLMs) as raw, foreign signals. How they are transformed into meaningful representations and interact with the language space depends entirely on the integration architecture: visual tokens are either treated as in-context prompts within the input sequence or injected directly into the LLM’s intermediate layers. A controlled comparison and understanding of how these architectural choices affect visual information and its internal transformation to integrate with the LLM remains underexplored. We provide a fair comparison by evaluating in-context and layer-wise injection VLM integration paradigms under identical training conditions across single image, multi-image, and video benchmarks. In doing so, we uncover a hidden evolution where visual tokens enter the LLM as disguised visual context, raw representations lacking linguistic structure, but are progressively reshaped depending on the integration paradigm, each capturing fundamentally different frequency characteristics of the visual signal. We show that this evolution inside the LLM determines what visual features the VLM can utilize effectively, how visual representations align with the language space, and ultimately how each paradigm performs across different tasks. We further demonstrate that attention allocation alone is insufficient, and that performance is driven by the quality of visual representations at each layer.
[CV-55] Variable-Length Tokenization via Learnable Global Merging for Diffusion Transformers
Link: https://arxiv.org/abs/2606.20076
Authors: Dong Hoon Lee, Seunghoon Hong
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Latent Diffusion Models (LDMs) have become dominant in visual synthesis, but their quality-compute trade-off is largely constrained by the tokenizer’s fixed compression ratio. Variable-length tokenizers (VLTs) promise adaptive compression by varying token counts, allowing diffusion models to flexibly balance quality and compute. However, conventional VLTs modulate length by truncating ordered token sequences, which makes token semantics depend on token position and breaks representational alignment across lengths. This leads to a cross-length shift in the latent distribution that hinders a single variable-length diffusion model from operating effectively. To address this, we propose a novel variable-length tokenizer that modulates length by merging tokens. We show that encouraging similar tokens to merge enables direct cross-length representation alignment when the diffusion transformer operates according to the merging pattern. Since conventional merging methods are data-dependent, making the merging pattern inaccessible during generation, we introduce learnable global merging, which is data-independent, to ensure compatibility with diffusion transformers. On ImageNet 256×256 generation, our merging-based variable-length tokenizer integrated with a diffusion transformer achieves a superior gFID-compute trade-off compared to prior VLT methods. Code is available at this https URL.
[CV-56] See-and-Reach: Precise Vision-Language Navigation for UAVs within the Field of View
Link: https://arxiv.org/abs/2606.20045
Authors: Fanfu Xue, En Yu, Yantian Shen, Zhikun Hu, Hongjun Wang, Yang Yang, Xindi Wang, Jiande Sun
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 12 pages, 7 figures
Abstract:UAV Vision-Language Navigation (UAV-VLN) is typically formulated as a holistic search-and-reach problem, where long-range target discovery and final target approach are optimized and evaluated jointly. This formulation makes it difficult to assess a critical capability of aerial embodied agents, namely whether a UAV can accurately ground a visible target and translate vision-language evidence into precise 3D motion once the target enters its field of view. To address this limitation, we introduce UAV-VLN-FOV, a target-visible navigation task that isolates the see-and-reach stage and enables a more diagnostic evaluation of terminal reaching ability. We further propose 3DG-VLN, a vision-language waypoint prediction framework guided by dynamic 3D direction cues to enhance fine-grained visual grounding and spatial direction alignment for precise target reaching. Specifically, 3DG-VLN adaptively processes high-resolution front-view and downward-view observations to preserve fine-grained visual and geometric details for target grounding. It also updates the target-relative direction online during closed-loop navigation, allowing the agent to maintain spatial alignment with the target and reduce accumulated direction drift. To support this task, we construct a dedicated high-resolution benchmark which contains 2,717 trajectories with target-oriented high-level instructions, high-resolution front-view and downward-view egocentric observations, and continuous 3D waypoint annotations. Experiments show that 3DG-VLN outperforms competitive UAV-VLN baselines, achieving a 13.82% improvement in success rate. Real-world trials further demonstrate the potential of 3DG-VLN for practical see-and-reach navigation. The source code and benchmark are available at this https URL.
[CV-57] FUSE: Frequency-domain Unification and Spectral Energy Alignment for Multi-modal Object Re-Identification ICML2026
Link: https://arxiv.org/abs/2606.20044
Authors: Xuanhao Qi, Tom H. Luan, Yukang Zhang, Jinkai Zheng, Zhou Su, Shuwei Li, Lei Tan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICML 2026
Abstract:Despite significant progress in multi-modal Re-Identification (ReID), existing methods tend to emphasize low-frequency cues. Consequently, they focus on attributes such as color, illumination, and coarse appearance, while overlooking mid and high-frequency structures that encode geometric, textural, and identity-discriminative details. This imbalance leads to incomplete spectral representations and unstable cross-modal alignment. To overcome these limitations, we introduce FUSE, a frequency-domain framework that reformulates multi-modal ReID as a two-stage process of spectral disentanglement and energy alignment. The proposed Spectral Decomposition Module (SDM) adaptively partitions features into low, mid, and high-frequency subspaces, enabling hierarchical spectral modeling. The Cross-Modal Alignment Module (CAM) further enforces energy alignment and subspace complementarity across modalities via frequency-consistency regularization. In addition, FUSE incorporates learnable frequency modulation to enhance robustness under varying illumination and heterogeneous sensor conditions. Extensive experiments on RGBNT201, RGBNT100, and MSVR310 show that FUSE achieves 9.1% mAP and 9.5% Rank-1 improvements, establishing an interpretable frequency-domain paradigm for multi-modal representation learning.
[CV-58] PU-UNet: Stable Multiplicative Interactions for Medical Image Segmentation ICANN2026
Link: https://arxiv.org/abs/2606.20035
Authors: Ziyuan Li, Osamah Sufyan, Uwe Jaekel, Babette Dellen
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to ICANN 2026
Abstract:Many dense prediction networks rely on additive feature transformations and model higher-order feature interactions only implicitly. Product units provide an explicit mechanism for multiplicative feature modeling, but their logarithmic–exponential formulation can cause numerical instability, which has limited their use in deep dense prediction networks. In this work, we propose Product-Unit U-Net (PU-UNet), a residual U-Net that integrates stable product-unit residual blocks into rich low-resolution stages for medical image segmentation. The proposed formulation combines smooth positivity mapping with log-domain clipping, enabling stable multiplicative feature learning with negligible computational overhead. On ISIC 2018, Kvasir-SEG, and BUSI, PU-UNet achieves Dice scores of 0.942, 0.959, and up to 0.925, respectively. Compared with a matched Residual U-Net baseline, PU-UNet consistently improves Dice and IoU while keeping parameters, FLOPs, and inference latency nearly unchanged, and reduces the image-level false-positive rate on normal BUSI cases from 0.077 to zero. Ablation studies suggest that the gains are associated with product-unit interactions, are strongest under low-resolution placement, and benefit from the proposed stabilization design. These results suggest that stable product-unit residual learning can be an effective way to enhance U-Net-style segmentation networks with explicit multiplicative interactions.
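To make the stabilization idea concrete, here is a minimal scalar sketch of a product unit computed in the log domain, with a smooth positivity mapping and log-domain clipping. The abstract does not specify the exact mapping or clip range, so the softplus choice, the `eps` offset, the `log_clip` bound, and the function names are all assumptions for illustration.

```python
import math


def softplus(x):
    """Numerically stable softplus, log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def stable_product_unit(inputs, weights, eps=1e-6, log_clip=10.0):
    """Compute prod_i p(x_i) ** w_i as exp(sum_i w_i * log p(x_i)).

    p(x) = softplus(x) + eps keeps every factor strictly positive so the
    logarithm is defined; clipping in the log domain bounds the exponent
    and therefore the output (both choices are assumptions).
    """
    log_sum = 0.0
    for x, w in zip(inputs, weights):
        positive = softplus(x) + eps
        log_x = max(-log_clip, min(log_clip, math.log(positive)))
        log_sum += w * log_x
    # Clip the accumulated exponent as well so exp() cannot overflow.
    log_sum = max(-log_clip, min(log_clip, log_sum))
    return math.exp(log_sum)
```

With all weights at zero the unit returns 1, the multiplicative identity, and very negative inputs yield a small but finite output instead of an undefined logarithm.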
[CV-59] ReA-OVCD: Reliability-Aware Open-Vocabulary Change Detection via Semantic and Spatial Refinement
Link: https://arxiv.org/abs/2606.20032
Authors: Hongming Zhu, Huaji Chen, Bowen Du, Sicong Liu, Qin Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Unlike traditional remote sensing change detection that relies on predefined categories, Open-Vocabulary Change Detection (OVCD) identifies land cover changes flexibly using arbitrary text prompts. However, existing methods suffer from an inherent trade-off when modeling changes: instance-level comparison overlooks fine-grained semantic variations (e.g., partial building extensions), while direct pixel comparison proves unreliable, yielding unstable responses and boundary artifacts due to semantic ambiguity and spatial inconsistency. To this end, we propose an efficient training-free Reliability-Aware Open-Vocabulary Change Detection (ReA-OVCD) framework. It first derives candidate change regions from pixel-wise semantic discrepancies to ensure flexible and detailed localization. To ensure reliability, it subsequently introduces a collaborative refinement strategy to explicitly model change validity from both semantic and spatial perspectives. Specifically, we develop a Semantic Change Reasoning (SCR) module that reassesses changes by jointly analyzing distributional divergence and response variation, enabling the suppression of incidental inconsistencies while preserving reliable semantic shifts. In addition, a Boundary-aware Change Refinement (BCR) module is designed to mitigate artifacts stemming from boundary misalignment and uncertainty through validating whether candidate regions are supported by reliable interior pixels. Extensive experiments across multiple datasets (LEVIR-CD, WHU-CD, DSIFN, and SECOND) demonstrate that our method consistently outperforms state-of-the-art approaches, achieving $\mathrm{F}_1^{C}$ improvements of 2.13% to 9.75% with higher computational efficiency. The code is publicly available at this https URL.
[CV-60] QG-MIL: A Gated Transformer Aggregator for Domain-Agnostic Multiple Instance Learning in Medical Imaging
Link: https://arxiv.org/abs/2606.20027
Authors: Luca Zedda, Davide Antonio Mura, Cecilia Di Ruberto, Maurizio Atzori, Muhammed Furkan Dasdelen, Carsten Marr, Andrea Loddo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Attention-based Multiple Instance Learning aggregators in medical imaging are prone to attention concentration, producing overconfident and unstable predictions. We introduce QG-MIL, a gated transformer aggregator that addresses this through four synergistic architectural components: RMSNorm-based pre-normalization, per-head QK normalization, fine-grained attention output gating, and SwiGLU-style feed-forward modules. Together, these design choices stabilize training and distribute attention more uniformly across instances without auxiliary losses, masking, or multi-stage regularization. We evaluate QG-MIL across six benchmarks spanning whole-slide pathology and cell-level hematology, covering two fundamentally different MIL scales. The best-performing QG-MIL variants outperform leading baselines on all six benchmarks, with an average improvement of +6.1 mean macro F1 points. Attention overlays and attention mass analysis confirm more distributed instance weighting. Ablation studies show that while individual components can match the full model on specific datasets, the QG-MIL design provides the most consistent cross-domain performance and tightest variance when compared to selected baselines. We release a configurable implementation to support reproducibility at: this https URL
[CV-61] Tri-Info: Generalizable Interpretable Failure Prediction for VLA Models via Information Theory
Link: https://arxiv.org/abs/2606.19998
Authors: Jinghan Yang, Yunchao Zhang, Wang Yuan, Haolun Wan, Jiaming Zhang, Zhengyang Hu, Yanchao Yang
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Vision-Language-Action (VLA) models are increasingly deployed across diverse tasks, yet they remain black boxes whose physical interactions can cause irreversible harm, making generalizable and interpretable failure detection essential. We observe that successful and failed rollouts carry systematically different information-theoretic signatures. Building on this, we formalize VLA control as a closed-loop information pipeline and derive the Triple Information-theoretic (Tri-Info) signals that capture whether actions remain diverse, temporally consistent, and coupled to state transitions. Across six VLA models and three benchmark environments, Tri-Info matches the strongest baselines in-domain. Moreover, Tri-Info transfers across architectures, environments, and the sim-to-real gap without retraining, reaching 83% accuracy on real-world tasks where prior detectors collapse to chance. This establishes Tri-Info as a simple yet powerful method that not only detects failures with strong cross-domain generalization, but also delivers interpretable diagnostics of the underlying failure modes.
[CV-62] Vision-Reasoning-Guided Occlusion Removal from Light Fields
Link: https://arxiv.org/abs/2606.19985
Authors: Mohamed Youssef, Oliver Bimber
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Occlusion-robust scene recovery remains a major challenge in computational imaging, particularly in natural environments where dense foreground vegetation severely limits visibility. We propose a vision-reasoning-guided light field occlusion removal framework that combines the visibility recovery capability of light field integration (LFI) with the semantic reasoning capacity of vision-language models (VLMs). Multi-view observations are first integrated via LFI to suppress foreground occlusions and produce an initial visibility-enhanced representation. A VLM is then incorporated as a conditional semantic prior to restore degraded structures and recover fine details, guided by the observed measurements. To improve recovery consistency and reduce hallucination artifacts, we introduce a multi-sample fusion strategy that aggregates multiple generated hypotheses into a unified estimate. Experimental results on synthetic and real-world datasets demonstrate state-of-the-art performance, achieving the highest average SSIM across four synthetic light field benchmark scenes (4-Syn) and strong generalization across structured and unstructured acquisition settings. These results highlight the effectiveness of combining physical imaging constraints with vision-language reasoning for robust perception under severe occlusion, with applicability to search-and-rescue and exploratory robotic navigation.
[CV-63] CrossFlow: One-Step Generation Across Latent and Pixel Spaces
Link: https://arxiv.org/abs/2606.19970
Authors: Xiyuan Wang, Xiao Zhang, Yang Li, Ruoxi Jiang, Zhao Zhong, Liefeng Bo, Muhan Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint, Under Review
Abstract:Most diffusion and flow-matching generators define the prior, probability path, and prediction target in the same representation space. Latent diffusion improves efficiency by moving this path into an autoencoder latent space, but the final sample is still produced by a separately trained decoder. This separation creates a mismatch: the generator is optimized for latent-space prediction, while final quality depends on how the decoder handles generated latents that may differ from clean encoder outputs. We introduce CrossFlow, a cross-space flow formulation that maps noisy latent inputs directly to pixel-space images. The key technical step is a velocity-free one-step objective: the latent trajectory defines the training path, but the supervised prediction is an image rather than a latent displacement. This lets one model act both as a one-step latent-to-pixel generator and as a decoder replacement for latent diffusion pipelines. On class-conditional ImageNet-1k at 256×256, CrossFlow-XL achieves 1.62 FID with one function evaluation. Ablations show that the latent encoder and pixel-space perceptual and adversarial losses are important for fidelity. These results indicate that cross-space flow objectives can combine the efficiency of latent representations with direct pixel-space supervision, without requiring a separate decoder at inference.
[CV-64] Semantic-Anchored Evidential Fusion for Domain-Robust Whole-Slide Survival Analysis
Link: https://arxiv.org/abs/2606.19966
Authors: Yucheng Xing, Ling Huang, Pei Liu, Jingying Ma, Jiaqing Xu, Kai He, Mengling Feng
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Whole-slide images (WSIs) are widely used for computational cancer prognosis. However, most existing methods primarily focus on in-domain performance and fail to generalize across clinical centers. This limitation stems from their reliance on pixel-derived representations that are highly susceptible to domain-specific artifacts caused by staining protocols and scanner hardware. We hypothesize that high-level pathology semantics, such as tumor grade and micro-environmental architecture, provide a domain-invariant semantic representation that mirrors the robust diagnostic logic of human pathologists. Therefore, we propose a Semantic-Anchored Evidential Fusion Survival (SAEFS) framework, where SAEFS derives semantic anchors from WSIs via Visual Question Answering (VQA), employs a dual-stream WSI evidence extraction architecture, uses Dirichlet-based Subjective Logic to model uncertainty, and fuses semantic and visual evidence through a cautious conjunction rule to avoid overconfident fusion from correlated sources. Trained exclusively on one source domain and evaluated zero-shot across four unseen domains, SAEFS consistently outperforms state-of-the-art models both in prediction accuracy and reliability, improving the average C-index by 10.2%. Quantitative analyses further show that VQA-derived semantic features exhibit significantly lower cross-center divergence than pixel-derived features, highlighting their robustness for cross-center clinical applications.
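For readers unfamiliar with Dirichlet-based Subjective Logic, the sketch below shows the standard mapping from non-negative evidence to belief masses and an uncertainty mass, as commonly used in evidential deep learning. It is a generic illustration, not the SAEFS implementation: the function name `dirichlet_opinion` is hypothetical, and the cautious conjunction rule used for fusion in the paper is not reproduced here.

```python
def dirichlet_opinion(evidence):
    """Map per-class evidence e_k >= 0 to a subjective-logic opinion.

    With Dirichlet parameters alpha_k = e_k + 1 and strength S = sum(alpha_k),
    the belief masses are b_k = e_k / S and the uncertainty mass is u = K / S,
    so that sum(b_k) + u == 1.
    """
    if not evidence or any(e < 0 for e in evidence):
        raise ValueError("evidence must be a non-empty list of non-negative values")
    num_classes = len(evidence)
    strength = sum(e + 1.0 for e in evidence)
    beliefs = [e / strength for e in evidence]
    uncertainty = num_classes / strength
    return beliefs, uncertainty
```

When no evidence has been collected the opinion is fully uncertain (`u == 1`); as evidence accumulates, mass moves from uncertainty to the supported classes.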
[CV-65] ROSE: Benchmarking the Perception-to-Action Gap in Multimodal Models
Link: https://arxiv.org/abs/2606.19965
Authors: Yihao Wang, Zijian He, Jie Ren, Keze Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 29 pages, 11 figures
Abstract:Multimodal large language models (MLLMs) are increasingly expected to act on visual information, yet the same scene may require different actions under different task contexts. How reliably can a model turn the same visual evidence into the action required by the current context? To answer this question, we introduce ROSE (Reference-conditioned Oddity and Symbolic Execution), a controlled benchmark that holds the visual scene fixed while varying region constraints and required symbolic outputs. Through coupled counting and coordinate-action tasks, ROSE tests whether models can infer an implicit majority reference and act on the resulting fine-grained visual evidence under changing contexts. Across nine recent MLLMs, performance drops by as much as 44.5 percentage points from counting-oriented tasks to region-conditioned action, despite 98.8% human performance. The gap persists on paired scenes and regions for which the same model returns the correct count, while global-click and matched local controls show that coordinate grounding explains only part of the loss, revealing a distinct, model-dependent bottleneck in turning shared visual evidence into context-specific actions.
[CV-66] Addressing Detail Bottlenecks in Latent Diffusion for RGB-to-SWIR Image Translation
Link: https://arxiv.org/abs/2606.19961
Authors: Kaili Wang, Martin Dimitrievski, Jose Maria Salvador, Ben Stoffelen, David Van Hamme, Lore Goetschalckx
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Latent diffusion models (LDMs) enable efficient image-to-image translation but discard fine spatial details during compression, degrading downstream perception tasks. We identify two bottlenecks: the autoencoder, which loses spatial information, and the conditioning pathway, which further degrades the source signal through naive downsampling. We propose two lightweight, backbone-agnostic fixes: a Source-Conditioned Autoencoder (SCAE) that injects high-resolution source features into the decoder via skip connections, and a Learnable Guidance Encoder (LGE) that replaces naive downsampling with a learned conditioning signal. Evaluated on RGB-to-SWIR translation for driving scenes with two denoiser backbones (U-Net and DiT), our approach improves detection mAP by up to 2x over the latent diffusion baseline, with up to 3.4x gains on small objects (COCO-small, 32^2 px^2), while achieving state-of-the-art FID. We further show that FID and detection performance are poorly correlated, motivating multi-axis evaluation. Results generalise zero-shot to the public RASMD benchmark. We will publicly release test data with annotations, all checkpoints, and training code.
[CV-67] SketchKeyAnime: Reference-anchored Sparse Key-Sketch Animation Synthesis
Link: https://arxiv.org/abs/2606.19958
Authors: Meixi Li, Xianlin Zhang, Yue Zhang, Xueming Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Traditional animation production relies heavily on manual drawing and iterative refinement, particularly for key-pose design, in-betweening, and character coloring. While existing animation and video generation methods have made notable progress, they typically depend on RGB boundary frames, dense frame-wise conditions, or complete sketch sequences, limiting their applicability under low-cost input conditions. We present SketchKeyAnime, a video diffusion framework for generating structurally controllable, appearance-consistent, and temporally coherent animations from sparse key-sketch inputs. Given a single reference RGB image and a few temporally indexed key sketches, SketchKeyAnime introduces a dual-branch conditioning mechanism to encode local geometric constraints alongside semantic-temporal context. It leverages Sketch Cross Attention to fuse reference image and sketch conditions with learnable gating, and incorporates an Adaptive Weighted Loss to strengthen supervision on key-sketch frames and line-art regions. Experimental results on the Aesthetic subset of Sakuga-42M show that our approach consistently outperforms representative animation interpolation and sketch-guided generation baselines. Compared to the best-performing baseline, SketchKeyAnime reduces EDMD by 31.9% and FVD by 9.5%, demonstrating superior sketch fidelity and temporal coherence, while achieving the best overall performance across most quantitative metrics. These results validate the proposed framework and highlight its potential for low-cost, highly controllable animation creation.
[CV-68] Confidence Calibration for Multimodal LLMs: An Empirical Study through Medical VQA MICCAI2025
Link: https://arxiv.org/abs/2606.19950
Authors: Yuetian Du, Yucheng Wang, Ming Kong, Tian Liang, Qiang Long, Bingdi Chen, Qiang Zhu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by MICCAI 2025
Abstract:Multimodal Large Language Models (MLLMs) show great potential in medical tasks, but their elicited confidence often misaligns with actual accuracy, potentially leading to misdiagnosis or overlooking correct advice. This study presents the first comprehensive analysis of the relationship between accuracy and confidence in medical MLLMs. It proposes a novel method that combines Multi-Strategy Fusion-Based Interrogation (MS-FBI) with auxiliary expert LLM assessment, aiming to improve confidence calibration in Medical Visual Question Answering (VQA). Experiments demonstrate that our method reduces the Expected Calibration Error (ECE) by an average of 40% across three Medical VQA datasets, significantly enhancing MLLMs’ reliability. The findings highlight the importance of domain-specific calibration for MLLMs in healthcare, offering a more trustworthy solution for AI-assisted diagnosis.
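The Expected Calibration Error reported above is a standard metric; a minimal sketch of the usual equal-width binned estimator follows. The bin count and the function name are illustrative choices, and the paper's exact binning scheme is not stated in the abstract.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sample-weighted mean of |accuracy - mean confidence| per bin.

    confidences: predicted confidences in [0, 1]
    correct:     booleans, True where the prediction was right
    Bins are equal-width; a confidence of exactly 1.0 goes in the last bin.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    total = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece
```

A model that reports 90% confidence but is right only 75% of the time on those samples contributes a gap of 0.15 for that bin; a perfectly calibrated model scores 0.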
[CV-69] Timage: A Generative Text-in-Image Paradigm for Fine-Tuning Vision-Language Models ECCV
Link: https://arxiv.org/abs/2606.19944
Authors: Yifeng Wu, Huimin Huang, Ruiluo Wu, Chunyi Lin, Guanhua Chen, Xian Wu, Wang Song, Ruize Han
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ECCV
Abstract:Multimodal Large Language Models (MLLMs) often lose track of the right image regions during fine-grained spatial reasoning, because a textual query rarely carries any explicit geometric anchor into the pixel domain. Prevailing remedies either rewire the model’s weights or pad the prompt with verbose instructions, yet neither reliably pins the language to the correct visual coordinates without eroding the backbone’s general competence. We introduce Timage, a paradigm that recasts multimodal understanding as an alignment problem solved at the input: the query is drawn, as a typeset overlay, onto the image itself. The placement and appearance of this overlay are produced by a Constrained Schrödinger Bridge (cSB), an entropic optimal-transport sampler that factorizes layout synthesis into two coupled stochastic stages. The first stage, Region Search, transports noise toward query-aligned image zones while obeying a hard occlusion barrier that protects salient foreground content; the second stage, Appearance Shaping, sizes the glyphs through an “ink-budget” regularizer so that the rendered text stays legible and visually balanced. The resulting overlay behaves as an explicit attention beacon that channels the model’s focus along spatial semantics. On the VMCBench suite, Timage paired with a modest 7B backbone clearly overtakes far larger proprietary systems as well as parameter-tuned baselines. The study positions deliberate input reconstruction as a powerful, architecture-neutral lever for strengthening multimodal reasoning.
[CV-70] DiffMath: Symbol- and Graph-Aware Latent Diffusion Transformer for Handwritten Mathematical Expression Generation
Link: https://arxiv.org/abs/2606.19939
Authors: Wei Pan, Xuhan Zheng, Yilin Shi, Huiguo He, Hiuyi Cheng, Dezhi Peng, Minghui Liao, Lianwen Jin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Handwritten Mathematical Expression Generation (HMEG) is challenging due to the complex two-dimensional layouts and long-range structural dependencies of mathematical expressions. Existing methods typically rely on explicit spatial supervision, such as symbol-level bounding boxes, which incurs high annotation costs and limits scalability. In this work, we propose DiffMath, a symbol- and graph-aware latent diffusion framework that leverages the hierarchical structure inherent in LaTeX as a structural prior, eliminating the need for positional supervision. First, we design a Relational Abstract Syntax Tree (RelAST), a generation-oriented representation that distills MathML trees into compact triplet sequences [S, R, D], where each token directly encodes a symbol identity, spatial relation, or nesting depth. Second, we introduce MathVAE, which learns structure-preserving latent representations through symbol-aware and relation-aware perceptual regularization, ensuring that the latent space captures both character semantics and spatial topology. Third, MathDiT performs conditional denoising in this structured latent space, further guided by a global symbol-count prior via Adaptive Layer Normalization (AdaLN) to improve structural coherence. Experiments show that DiffMath produces structurally consistent handwritten expressions, achieves superior performance over existing methods, and improves the accuracy of downstream OCR models through synthetic data augmentation.
[CV-71] Triangular Consistency as a Universal Constraint for Learning Optical Flow ECCV2026
Link: https://arxiv.org/abs/2606.19938
Authors: Yi Xiao, Carlos Rodriguez Coronel, Jing Zhan, Haniyeh Ehsani Oskouie, Alex Wong, Dong Lao
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by ECCV 2026
Abstract:We propose triangular consistency as a first-principled constraint for optical flow, which is agnostic to network architecture, supervision type, and dataset, and applies to both image-pair and multi-frame settings. This simple but powerful constraint is to compose two flows to induce a third flow and enforce consistency among the three. The composed flows may arise from (i) image pairs, yielding cycle consistency; (ii) multiple video frames, producing longer-range motion through temporal chaining; or (iii) image pairs combined with controlled synthetic transformations, which becomes data augmentation. This triangular consistency introduces negligible computational overhead and requires no additional annotations. Since it is derived directly from the geometry of optical flow, it does not rely on model-specific assumptions and serves as a “universal” plug-and-play component for optical flow training. Experiments show consistent improvement across supervised, unsupervised, and transfer learning settings.
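The idea of composing two flows to induce a third can be sketched on a tiny dense grid. The example below chains a flow from frame a to b with a flow from b to c using nearest-neighbour lookup, then measures the mean L1 gap to a directly estimated a-to-c flow. The function names, the nearest-neighbour sampling (real pipelines typically use bilinear warping), and the L1 penalty are assumptions; the paper's actual loss is not given in the abstract.

```python
def compose_flows(flow_ab, flow_bc):
    """Return the a->c flow induced by chaining a->b and b->c.

    Each flow is a list of rows of (dx, dy) tuples on the same grid.
    The b->c flow is sampled at the nearest pixel to where a->b lands,
    clamped to the image border.
    """
    height, width = len(flow_ab), len(flow_ab[0])
    composed = []
    for y in range(height):
        row = []
        for x in range(width):
            dx, dy = flow_ab[y][x]
            xb = min(max(int(round(x + dx)), 0), width - 1)
            yb = min(max(int(round(y + dy)), 0), height - 1)
            dx2, dy2 = flow_bc[yb][xb]
            row.append((dx + dx2, dy + dy2))
        composed.append(row)
    return composed


def triangular_loss(flow_ab, flow_bc, flow_ac):
    """Mean L1 difference between the composed flow and the direct a->c flow."""
    composed = compose_flows(flow_ab, flow_bc)
    total, count = 0.0, 0
    for composed_row, direct_row in zip(composed, flow_ac):
        for (cx, cy), (ax, ay) in zip(composed_row, direct_row):
            total += abs(cx - ax) + abs(cy - ay)
            count += 1
    return total / count
```

Setting frame c equal to frame a turns this into a cycle-consistency check, where the direct a-to-c flow is zero everywhere.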
[CV-72] Speeding up the annotation process in semantic segmentation industrial applications
Link: https://arxiv.org/abs/2606.19934
Authors: Marta Fernandez-Moreno, Margarita Guerrero, Rosalia Rementeria, Pablo Mesejo, Raul Moreno
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Current machine learning models commonly require large and well-annotated datasets. However, the annotation process often becomes a bottleneck, with increased complexity leading to a higher chance of human error. Within this context, our goal in this paper is to leverage unsupervised algorithms to improve data annotation efficiency for complex semantic segmentation problems in industrial materials science. Previous research has quantified labeling time, and other work has explored unsupervised methods. However, to the best of our knowledge, this is the first study to quantify how much unsupervised algorithms accelerate the labeling process. We aim to validate the extent to which this laborious process can be accelerated, focusing on semantic segmentation tasks that involve annotating each pixel of high-resolution images, such as the microstructure characterization challenge in materials science. Specifically, we demonstrate that by using unsupervised computer vision algorithms, the time required for the labeling process can be reduced from 170 hours to 37 hours, achieving an approximate reduction of 78%. The dataset we work with includes large images of dimensions 1280x959 and 960x703, which further increases the complexity of the annotation task. Despite these challenges, we create and share the largest public steel microstructure segmentation dataset to date, available under MIT License with permanent DOI, contributing a fully annotated, high-resolution dataset to the field. Additionally, this is the first work to compare the labeling time from scratch (a common approach in previous studies) to the labeling time when using these unsupervised algorithms as a pre-annotation step. Furthermore, we provide a Deep Learning model trained on this dataset, validated by field experts, and deployed in an industrial setting, serving as an initial benchmark for this public dataset.
[CV-73] Spatial-Aware Reduction Framework: Towards Efficient and Faithful Visual State Space Models ICML2026
Link: https://arxiv.org/abs/2606.19932
Authors: Jindi Lv, Aoyu Li, Yuhao Zhou, Zheng Zhu, Xiaofeng Wang, Qing Ye, Yueqi Duan, Wentao Feng, Jiancheng Lv
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by ICML 2026
Abstract:Mamba demonstrates strong efficiency in modeling long visual sequences. However, when token reduction is applied to structurally enhanced Mamba variants, these models exhibit a severe performance collapse. We attribute this degradation to the spatially agnostic nature of existing reduction methods, which violate the two-dimensional structural premise required by the selective scanning mechanism. In this work, we propose STORM, a spatial-aware token reduction framework designed to maintain structural integrity throughout the compression process. STORM reformulates reduction into a structured operation on spatial units, enforcing localized constraints to maintain both grid topology and neighborhood coherence. As a plug-and-play module, STORM equips existing reduction pipelines with explicit spatial awareness without any training. Empirical results demonstrate that STORM achieves state-of-the-art pruning accuracy across diverse vision Mamba backbones under training-free settings. Notably, STORM delivers a substantial accuracy recovery on VMamba, outperforming prior methods by up to 63.3% in top-1 accuracy. Meanwhile, STORM incurs only a 1.0% accuracy drop on PlainMamba, achieving performance comparable to ViT.
[CV-74] CARE: Competence-Aware Reward Shaping for Adaptive Reasoning Length in Video-MLLM s
Link: https://arxiv.org/abs/2606.19927
Authors: Chengwen Liu,Hao Peng,Jisheng Dang,Hong Peng,Bin Hu,Tat-Seng Chua
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:In multimodal video reasoning, reinforcement learning-based methods typically rely on simplistic and inflexible reasoning-length control strategies that fail to adapt to the model’s evolving competence. This mismatch may suppress necessary exploration at early stages, while encouraging redundant reasoning and inefficient decoding once the model becomes more competent. In this paper, we propose CARE, a competence-aware reward shaping framework for adaptive reasoning length optimization in multimodal reasoning. Specifically, CARE maintains a smoothed competence estimate via an exponential moving average of pass rates, and uses it to route training into progressive stages that shift the reward preference from exploration-oriented long-form reasoning to efficiency-oriented concise reasoning. To avoid conflating verbosity with intrinsic task complexity, CARE further normalizes reasoning effort with batch-level statistics, and introduces a posterior amplifier to strengthen reward signals for unexpectedly strong performance on historically difficult samples. The proposed mechanism is seamlessly integrated into the GRPO training pipeline and incurs no additional inference-time overhead. Extensive experiments on multiple video reasoning and general video understanding benchmarks demonstrate that CARE consistently improves reasoning accuracy, stabilizes reinforcement learning, and significantly enhances token efficiency. Moreover, CARE exhibits a characteristic inverted-U trajectory of reasoning length during training, and yields shorter yet more informative reasoning traces at convergence, indicating effective adaptive allocation of reasoning budget. We provide the source code for our proposed CARE framework and experiments at this https URL.
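The competence-tracking idea in the CARE abstract (an exponential moving average of pass rates that routes training into progressive stages) can be illustrated with a minimal sketch. The smoothing factor `beta`, the stage thresholds, and the stage names below are hypothetical values chosen for illustration, not taken from the paper. The batch-level normalization and the posterior amplifier are not reproduced here.

```python
def update_competence(prev, pass_rate, beta=0.9):
    """Exponential moving average of pass rates (beta is an assumed value)."""
    return beta * prev + (1.0 - beta) * pass_rate


def route_stage(competence, thresholds=(0.35, 0.7)):
    """Map the smoothed competence to a training stage (thresholds are hypothetical)."""
    low, high = thresholds
    if competence < low:
        return "explore"      # reward preference leans toward long-form reasoning
    if competence < high:
        return "transition"   # intermediate stage between the two preferences
    return "concise"          # reward preference leans toward short reasoning


# Hypothetical pass rates observed over five training steps.
competence = 0.0
stages = []
for pass_rate in [0.2, 0.5, 0.8, 0.9, 0.9]:
    competence = update_competence(competence, pass_rate, beta=0.5)
    stages.append(route_stage(competence))
```

Because the estimate is smoothed, a single unusually good or bad batch does not immediately switch the stage; how quickly it reacts depends on the assumed `beta`.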
[CV-75] SpatialSV: Internalizing Interpretable 3D Spatial Awareness in MLLMs via Task-Oriented Visual Supervision IJCAI2026
Link: https://arxiv.org/abs/2606.19915
Authors: Jiayu Tang,Yuchen Zhou,Chao Gou
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IJCAI 2026
Abstract:Unlocking the spatial intelligence of multimodal large language models (MLLMs) is crucial for understanding and interacting with the 3D world. Prevailing approaches typically inject spatial priors via external tools, which impose significant inference overhead, or rely on latent feature distillation, which remains uninterpretable and lacks fine-grained geometric constraints. To address these issues, we propose SpatialSV, a framework designed to internalize robust 3D spatial awareness within MLLMs while simultaneously offering inherent interpretability. Deviating from passive feature imitation, SpatialSV employs task-oriented visual supervision, compelling the model to actively lift its 2D visual features into explicit 3D representations, including depth maps, camera poses, and point clouds. Crucially, this 2D-to-3D lifting process provides a transparent window into the model’s representations: the resulting 3D reconstructions serve as an intuitive proxy for visualizing and diagnosing the quality of the model’s intrinsic spatial knowledge. Extensive experiments across multiple models and benchmarks demonstrate the effectiveness of SpatialSV in enhancing and interpreting MLLMs’ spatial intelligence. Furthermore, the framework exhibits strong generalization in semi-supervised settings, validating its potential to leverage unlabeled visual data for scalable, interpretable spatial representation learning.
[CV-76] Gaussian Process Prior Variational Autoencoder for Endoscopic Videos
Link: https://arxiv.org/abs/2606.19908
Authors: Ivan De Boi,Xinxing Shi,Xiaoyu Jiang,Tim J.M. Jaspers,Francisco Caetano,Mauricio A. Alvarez,Fons van der Sommen,Sam Van der Jeught
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Endoscopic video analysis is essential for gastrointestinal diagnosis and computer-assisted interventions, but video sequences are routinely degraded by specular reflections, motion artifacts, and missing frames. These transient corruptions can distract clinicians, reduce image interpretability, and disrupt downstream tasks such as 3D reconstruction and navigation. Effective restoration therefore requires methods that exploit temporal continuity rather than treating frames in isolation. We introduce a Gaussian Process Prior Variational Autoencoder (GPVAE) framework for endoscopic video restoration that replaces the standard factorized latent prior with a temporal Gaussian process prior, enabling interpolation of missing frames with uncertainty-aware reconstruction. The framework combines endoscopy-specific encoders, including a convolutional EndoVAE backbone and pretrained Vision Transformer encoders from GastroNet-5M, with two scalable GP approximations: Hierarchical Prior Approximation (HPA) and Sparse Precision Approximation (SPA). Specular reflections are handled using a DUCKNet-based masking pipeline that excludes corrupted pixels from the reconstruction objective. On the C3VDv2 colonoscopy dataset, the best GPVAE variants reduced image reconstruction RMSE by 21.9% on average, and by up to 26.1%, relative to matched VAE baselines. Downstream trajectory RMSE was reduced by 12.7% on average across classical visual odometry and a pretrained PoseNet, at an average increase of 27.3% in training time per epoch. Finally, the GP posterior provides per-frame uncertainty estimates that reflect temporal support and offer a confidence signal for restored frames.
[CV-77] Linear Recurrent Unit with Semantic Modulation for Image Super-Resolution CVPR2026
Link: https://arxiv.org/abs/2606.19901
Authors: Mingyu Choi,Woo Kyoung Han,Sunghoon Im,Kyong Hwan Jin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2026 Findings
Abstract:The linear recurrent unit (LRU), designed with a principled formulation for stable linear recurrence, has demonstrated promising accuracy and robustness on long-range dependency tasks. However, its static parameterization and single-scan method limit its applicability to 2D vision tasks. In this study, we propose an LRU-based restoration network with a semantic modulating unit (SMU) to achieve a harmonious balance between performance and efficiency in single-image super-resolution. The SMU plays three key roles: LRU modulation, spatial categorization, and feature enhancement through learned prototype. Extensive experiments demonstrate that our method quantitatively and qualitatively surpasses recent state-of-the-art methods. Notably, our approach achieves superior performance with computational complexity on par with existing methods. The source code and models are available at this https URL
[CV-78] SurgVista: Long-Horizon Surgical World Modeling with Plausible Instrument-Tissue Dynamics
Link: https://arxiv.org/abs/2606.19889
Authors: Wentao Pan,Wuyang Li,Shengyuan Liu,Xinyu Liu,Hengyu Liu,Yixuan Yuan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Scaling robot policy learning for autonomous surgery is challenging, as expert demonstrations are expensive and in vivo exploration poses substantial safety risks. Surgical world models address this by generating realistic, action-conditioned future frames from an initial observation, but existing methods exhibit two persistent failure modes: spatial interaction incoherence, where visible instrument contact fails to induce spatially consistent tissue deformation, and temporal fidelity collapse, where prediction errors compound across autoregressive rollouts and progressively corrupt visual quality. We present SurgVista, a surgical world model that mitigates both failures through two training recipes. Deformation Consistency Regularization extracts scene-point trajectories from training videos and enforces cross-frame coherence through latent contrastive learning, strengthening physically consistent instrument-tissue dynamics. Drift Adaptation Training mitigates long-horizon drift by perturbing conditioning frames with online prediction residuals and photometric augmentations calibrated to long-horizon drift statistics, sustaining visual fidelity over extended rollouts. To enable rigorous evaluation, we further introduce SurgWorld-Bench, featuring diverse procedure types, long-range rollouts, and decoupled metrics for instrument-motion accuracy and tissue-response fidelity. Extensive experiments show that SurgVista consistently outperforms state-of-the-art methods across visual quality, temporal consistency, and interaction fidelity, with gains widening as the prediction horizon grows.
[CV-79] Multimodal Concept Bottleneck Models NEURIPS2025
Link: https://arxiv.org/abs/2606.19882
Authors: Tongqing Shi,Ge Yan,Tuomas Oikarinen,Tsui-Wei Weng
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Presented at NeurIPS 2025 Mechanistic Interpretability Workshop
Abstract:Concept Bottleneck Models (CBMs) enhance the interpretability of deep learning networks by aligning the features extracted from images with natural concepts. However, existing CBMs are limited in their ability to generalize beyond a fixed set of predefined classes and are prone to non-concept information leakage, where predictive signals outside the intended concepts are inadvertently exploited. In this paper, we propose the Multimodal Concept Bottleneck Model (MM-CBM) to address these issues and extend CBMs into CLIP. MM-CBM utilizes dual Concept Bottleneck Layers (CBLs) to align both the image and text embeddings into interpretable features. This allows us to perform new vision tasks like zero-shot classification or image retrieval in an interpretable way. Compared to existing methods, MM-CBM achieves up to 51.26% accuracy improvement on average across four standard benchmarks. Our method maintains high accuracy, staying within ~5% of black-box performance while offering greater interpretability.
[CV-80] MMD-SLAM: Structure-Enhanced Multi-Meta Gaussian Distribution-Guided Visual SLAM ICRA2026
Link: https://arxiv.org/abs/2606.19874
Authors: Fan Zhu,Ziyu Chen,Peichen Liu,Yifan Zhao,Zhisong Xu,Hui Zhu,Hongxing Zhou,Sixun Liu,Chunmao Jiang
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: ICRA 2026
Abstract:3D Gaussian Splatting (3DGS) has significantly boosted novel view synthesis and high-fidelity scene reconstruction, expanding the potential of 3DGS-based Visual Simultaneous Localization and Mapping (SLAM) methods. However, most existing systems fail to fully exploit the underlying structural information, which limits rendering quality and often leads to inconsistent maps. To address these limitations, we propose MMD-SLAM, a structure-enhanced Visual SLAM framework that leverages the Atlanta World (AW) assumption to guide a Multi-Meta Gaussian representation for photorealistic mapping. First, we introduce a point-line fusion strategy for pose optimization, where 3D line segments are incorporated to improve tracking robustness and provide additional constraints for mapping. Second, we design a Multi-Meta Gaussian representation with dominant directions, explicitly encoding structural priors from the AW hypothesis. Finally, we propose a Gaussian evolution strategy that adapts to scene geometry and incorporates structural cues into global optimization. Extensive experiments demonstrate that these innovations enable MMD-SLAM to achieve state-of-the-art performance in both tracking accuracy and mapping quality. For example, our method achieves a 48.56% reduction in ATE RMSE on ScanNet and a 5.71% improvement in PSNR on Replica compared with MonoGS.
[CV-81] PSCT-Net: Geometry-Aware Pediatric Skull CT Reconstruction via Differentiable Back-Projection and Attention-Guided Refinement
Link: https://arxiv.org/abs/2606.19867
Authors: Dong Yeong Kim,Jaewon Choi,Youmin Shin,Jungyu Lee,Myeongseop Kim,Jinwook Choi,Joo Whan Kim,Young-Gon Kim
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 11 pages, 5 figures
Abstract:Computed Tomography (CT) is essential for diagnosing pediatric craniofacial abnormalities, yet poses radiation risks to developing anatomies. Reconstructing 3D CT from sparse bi-planar X-rays offers a low-dose alternative but is severely ill-posed. Existing methods employ geometry-agnostic feature lifting, naively projecting 2D features into 3D without explicit spatial modeling, causing depth ambiguity and degraded osseous boundaries. We present PSCT-Net, a geometry-aware framework with differentiable back-projection. Differentiable back-projection establishes a spatially faithful volumetric prior, alleviating depth ambiguity. An Attention-Guided Projection (AGP-3D) module then learns non-linear voxel-wise correspondences between 2D regions and 3D locations. A Bidirectional Mamba (BiM-3D) module captures long-range volumetric dependencies with linear complexity. We further curate a private institutional pediatric skull CT cohort, PedSkull-CT, comprising normal and pathological cases for internal evaluation, addressing the gap in adult-centric, trunk-focused datasets.
[CV-82] ViCoStream: Streaming VideoLLMs Can Run Beyond 100 FPS with Stage-Wise Coordinated Inference
Link: https://arxiv.org/abs/2606.19849
Authors: Yang Tan,Junlong Tong,Linan Yue,Hao Wu,Pengfei Fang,Xiaoyu Shen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 7 figures, 13 tables
Abstract:Streaming VideoLLMs must continuously process incoming video while maintaining low query latency, making both video-ingestion throughput and query-time responsiveness critical for real-time deployment. Existing methods largely focus on accelerating individual modules, such as visual encoding, token pruning, or KV-cache compression, but provide limited insight into whether the resulting system can sustain real-time streaming performance. We formulate streaming VideoLLM inference as a coordinated pipeline spanning visual preprocessing, visual encoding, token dropping, and LLM prefilling/decoding. Building on this formulation, we propose ViCoStream (Video Coordinated Streaming), a stage-wise coordinated streaming framework that combines chunk-wise execution, CUDA-stream overlap, visual token control, bounded visual attention, and query-side retrieval to bound per-chunk computation and memory costs. We further provide a systematic study of bottleneck migration, revealing how chunk size, token retention, attention locality, and retrieval scope shape the throughput-accuracy trade-off. Experiments with Qwen2.5-VL-3B/7B-Instruct across multiple streaming benchmarks show that ViCoStream achieves 134 FPS video throughput and less than 50 ms TTFT on a single A100 GPU while maintaining accuracy close to full-history baselines.
[CV-83] OTCHA: Optimal Transport-driven Confidence-aware Latent Hub Alignment for Multi-View Medical Image Classification MICCAI2026
Link: https://arxiv.org/abs/2606.19838
Authors: Jiwoong Yang,Haejun Chung,Ikbeom Jang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at MICCAI 2026
Abstract:Multi-view imaging, such as mammography and chest radiography, is a standard component of clinical practice. However, medical images are often unregistered and contain view-specific artifacts or irrelevant background cues that can obscure diagnostically relevant findings. Many existing methods directly fuse per-view representations, allowing such irrelevant content to contaminate the fused embedding and reducing robustness under varying view configurations. We propose OTCHA, a confidence-aware latent hub token alignment module based on optimal transport (OT) that refines patch tokens before fusion for multi-view classification. OTCHA introduces a set of learnable latent hub tokens shared across views. For each view, we compute an OT plan between patch tokens and hub tokens that jointly considers feature similarity and geometry, and augment the OT formulation with token-conditional dustbins to enable partial matching and discard irrelevant tokens. The resulting transport plan provides token-wise matching confidence, which gates hub-mediated message passing and weights a novel optimal-transport-based representation alignment loss to stabilize refinement. Experiments on three multi-view medical image datasets demonstrate consistent improvements over competing baselines across diverse anatomies and view configurations. Our code is available at this https URL.
[CV-84] World Engine: Towards the Era of Post-Training for Autonomous Driving
Link: https://arxiv.org/abs/2606.19836
Authors: Tianyu Li,Li Chen,Caojun Wang,Haochen Liu,Kashyap Chitta,Zhenjie Yang,Yuhang Lu,Naisheng Ye,Yihang Qiu,Yufei Wang,Luoxi Zou,Jiaxin Peng,Jin Pan,Zhaoyu Su,Andrei Bursuc,Shengbo Eben Li,Andreas Geiger,Peng Su,Hongyang Li
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report. Project Page: this https URL
Abstract:Autonomous vehicles must operate safely in the real world, where errors can have severe consequences. Although modern end-to-end driving policies excel in routine scenarios, their reliability is limited by the scarcity of safety-critical “long-tail” events in real driving datasets. These rare interactions define the practical safety boundary of the learned policy, yet they are difficult to collect at scale in the real world. Here we show that this fundamental limitation can be addressed by post-training pre-trained driving models on synthesized high-stakes interactions. We introduce World Engine, a generative framework that reconstructs high-fidelity interactive environments from real-world logs and systematically extrapolates them into realistic safety-critical variations. This paradigm enables reinforcement-based post-training to align policies with safety constraints, circumventing the physical risks inherent in real-world exploration. On a public benchmark built on nuPlan, World Engine substantially reduces failures in rare safety-critical scenarios and yields significantly larger gains than scaling pre-training data alone. Furthermore, when deployed on a production-scale autonomous driving system, the resulting policy reduces simulated collisions and demonstrates measurable improvements in on-road testing, showing that post-training on synthesized, safety-critical interactions offers a scalable and effective pathway to safer autonomous driving. The full codebase suite, including training, is released to the public.
[CV-85] Neural Events: Discrete Asynchronous Autoencoders for Event-Based Vision
Link: https://arxiv.org/abs/2606.19835
Authors: Roberto Pellerito,Daniel Gehrig,Shintaro Shiba,Davide Scaramuzza
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Event cameras capture dynamic scenes with exceptional temporal fidelity by representing them as a continuous stream of microsecond-resolution events. Each individual event, however, only carries minimal semantic value, merely signaling a localized brightness change. To derive meaningful signals, downstream algorithms need to quickly integrate cues from a potentially massive torrent of low-information events. Current architectures, however, are easily overwhelmed, struggling to balance capturing fine-grained temporal dynamics and maintaining a manageable data throughput. This paper proposes a framework to re-tokenize event streams into a small set of highly informative neural events, each representing a local spatio-temporal context window with a discrete learnable code. Every time this code flips, a neural event is triggered, yielding a highly compressed data stream. We demonstrate that, across object detection and classification, networks trained on neural events match or surpass the performance of state-of-the-art approaches while reducing the event rate by a factor of 2.0.
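The triggering rule stated in the abstract ("every time this code flips, a neural event is triggered") can be sketched as follows. This is only an illustration of change-triggered emission for a single context window. The function name `neural_events` and the `(timestamp, code)` input format are assumptions, and the learned autoencoder that would produce the discrete codes is not modeled.

```python
def neural_events(codes):
    """Emit (timestamp, new_code) whenever the discrete code changes.

    `codes` is an iterable of (timestamp, code) pairs for one local
    spatio-temporal window; this input format is a hypothetical choice.
    """
    events = []
    previous = None
    for timestamp, code in codes:
        # The first observation only initializes the state; later changes emit.
        if previous is not None and code != previous:
            events.append((timestamp, code))
        previous = code
    return events


# Hypothetical code sequence: five observations, two code flips.
stream = [(0, 3), (1, 3), (2, 7), (3, 7), (4, 3)]
emitted = neural_events(stream)
```

In this toy sequence, five observations collapse into two emitted events, which is the sense in which change-triggered emission compresses the stream.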
[CV-86] 3D-PLOT-LLM: Part-Level Object Tokens for 3D Large Language Models
Link: https://arxiv.org/abs/2606.19828
Authors: Jintang Xue,Xinyu Wang,Yixing Wu,Jingwen Chen,C.-C. Jay Kuo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:3D multimodal large language models (3D MLLMs) describe a 3D object as a whole but cannot address, name, or reason about its parts. Prior part-aware attempts add segmentation decoders, heavier 3D encoders, or bounding-box grammars at substantial parameter cost. We take a fundamentally different path: we reorganize the input token stream so that parts become directly addressable through the LLM’s own vocabulary. Our model, 3D-PLOT-LLM, partitions the frozen point encoder’s patches into K locally coherent regions and inserts, before each region’s patch tokens, a learnable per-region marker and a reserved vocabulary token part_k; a Marker-Space Refinement (MSR) module then conditions each marker on its region’s spatial statistics and adjacency neighbors. The model thus cites parts in its output and follows prompts that refer to parts by token, a capability absent from prior object-level 3D MLLMs. To probe this interface, we construct PartVerse-QA, a vocabulary-level part-QA benchmark adapted from PartVerse mesh annotations (77K training pairs and 588 held-out queries on disjoint object splits), on which 3D-PLOT-LLM reaches caption-to-slots Jaccard 0.459 and Exact-match 13.78%, with a slot-to-caption GPT-4o judge of 44.68. On the 3DCoMPaT-GrIn part-aware grounded description benchmark, 3D-PLOT-LLM outperforms PointLLM, Kestrel, PARIS3D, and SegPoint on every text-output metric, and ShapeLLM on 3 of 4, with up to +3.03 GPT-4o judge over PointLLM. On Objaverse whole-object captioning, adding PartVerse-QA at Stage 2 yields +0.65 SBERT and +1.85 GPT-4o over PointLLM, and tops PointLLM-PiSA on 4 of 5 traditional metrics (SBERT, SimCSE, BLEU-1, METEOR) despite targeting a different (part-grounded) objective. All of this is achieved with under 1M new trainable parameters on a frozen point encoder, an order of magnitude below prior part-aware 3D MLLMs, and with no segmentation decoder or bounding-box head.
[CV-87] CSWinUNETR: Segmentation of Thin Anatomical Structures in Medical Images MICCAI2026
Link: https://arxiv.org/abs/2606.19824
Authors: Junho Moon,Haejun Chung,Ikbeom Jang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at MICCAI 2026
Abstract:Accurate segmentation of thin, tortuous anatomical structures, such as retinal vessels, cerebral vasculature, and facial wrinkles, remains challenging due to low contrast, frequent discontinuities, and severe class imbalance. Although recent convolutional and Transformer-based models have improved performance, they often yield fragmented predictions and fail to recover fine branches. We propose CSWinUNETR, a general-purpose backbone for 2D and 3D thin-structure segmentation. It employs cross-shaped stripe self-attention to model long-range principal-axis context and incorporates cyclic shifts to enhance information exchange across stripes. To better preserve fine-grained details, we further introduce a detail-enhanced multi-scale self-attention module that aggregates contextual features from multi-resolution representations. In addition, we propose sparse-control dynamic snake convolution, which reconstructs reliable dense curvilinear kernels from sparsely predicted control points to better follow tortuous geometry. Extensive experiments on four benchmarks across ophthalmology, neurovascular imaging, and dermatology demonstrate that CSWinUNETR consistently outperforms state-of-the-art methods without task-specific post-processing or topology-aware losses. The code is available at this https URL.
[CV-88] Training-Free Metrics for Synthetic Object Detection Data: A Proxy for Detector Performance
Link: https://arxiv.org/abs/2606.19817
Authors: Myeongseok Nam,Donghoon Yeo,Seungwook Kim
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 4 figures
Abstract:With the recent advent of image generative models, synthetic data are increasingly being used to supplement limited real datasets for training computer vision models. However, not all synthetic datasets improve performance equally, and their effectiveness can only be assessed by training a downstream model, which is computationally expensive and time-consuming. This problem is pronounced in the task of object detection, where the required annotations are much more dense due to bounding boxes. In this paper, we propose a pre-computable metric family, dubbed Conditional-Composition Domain Match (CCDM), which serves as a proxy for the relative utility of candidate synthetic training sets for downstream detection. Experiments on the VisDrone-DET dataset show that the CCDM metric families achieve a Spearman correlation of 1.0 with the downstream performance of YOLOv8, clearly outperforming existing metrics for synthetic image evaluation.
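To make the reported Spearman correlation of 1.0 concrete: it means the metric ranks the candidate synthetic datasets in exactly the same order as the downstream detector performance does. The sketch below computes the rank correlation with the standard no-ties formula. The metric scores and mAP values are made-up placeholders, not results from the paper, and CCDM itself is not implemented here.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation for sequences without ties."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


# Hypothetical proxy scores and detector mAP for four candidate datasets.
proxy_scores = [0.1, 0.4, 0.9, 0.7]
detector_map = [21.0, 25.5, 33.2, 30.1]
rho = spearman(proxy_scores, detector_map)
```

A correlation of 1.0 only says the orderings agree. It does not say the proxy predicts the size of the performance gap between datasets.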
[CV-89] ParaScale: Scale-Calibrated Camera-Motion Transfer via a Gauge-Invariant Parallax Number
Link: https://arxiv.org/abs/2606.19805
Authors: Zijie Meng
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by SCA 2026 (poster)
Abstract:Transferring the camera motion of a reference video to a freshly generated one lets creators reuse cinematic moves. Yet reference and target often live at incompatible scales – a sweep across a galaxy versus a nudge across a desk – and naively reusing the recovered trajectory yields either imperceptible or violently exaggerated motion. We trace this to a geometric fact: translation-induced image motion scales as ||T||/Z, so a monocular trajectory is meaningful only up to a depth-scale gauge. We distill this into the Parallax Number Pi = ||Delta T|| / Zbar, a dimensionless, gauge-invariant descriptor of how strongly a camera move is felt, and prove that it – not the raw trajectory – is the quantity that scale-faithful transfer must preserve. ParaScale is a plug-and-play module that reads Pi off any reference video and re-realizes it against the target scene’s own depth, per frame, leaving rotation untouched. Sitting between pose extraction and pose injection, it requires no retraining and drops into any pose-conditioned generator. We further introduce the Parallax Consistency Error (PCE), a scale-symmetric metric that – unlike the similarity-aligned TransErr – exposes scene-scale mismatch. Across scale regimes spanning four orders of magnitude and multiple backbones, ParaScale keeps the realized parallax on the identity line and cuts PCE by more than 3x over uncalibrated transfer with no loss of visual fidelity.
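The Parallax Number defined in the abstract, Pi = ||Delta T|| / Zbar, suggests a simple rescaling rule: to preserve Pi when moving from a reference scene to a target scene, the translation is scaled by the ratio of the two depths while rotation is left untouched. The sketch below illustrates that rule for a single frame. Treating Zbar as one representative depth value per frame, and the function names used here, are assumptions rather than the paper's implementation.

```python
import math


def parallax_number(delta_t, depth):
    """Pi = ||Delta T|| / Zbar for one frame (depth is an assumed scalar)."""
    return math.sqrt(sum(c * c for c in delta_t)) / depth


def retarget_translation(delta_t_ref, ref_depth, target_depth):
    """Rescale a reference translation so the target scene has the same Pi."""
    scale = target_depth / ref_depth
    return [c * scale for c in delta_t_ref]


# Hypothetical example: a large-scale reference move reused on a desk-scale scene.
ref_move = [3.0, 0.0, 4.0]      # translation with norm 5 in reference units
pi_ref = parallax_number(ref_move, depth=100.0)
target_move = retarget_translation(ref_move, ref_depth=100.0, target_depth=2.0)
pi_target = parallax_number(target_move, depth=2.0)
```

Reusing `ref_move` unchanged on the target scene would instead give a parallax of 5 / 2 = 2.5, fifty times the reference value, which is the exaggerated-motion failure the abstract describes.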
[CV-90] HypOProto: Hyperbolic Ordinal Prototypes for Left Ventricular Filling Pressure Classification
Link: https://arxiv.org/abs/2606.19804
Authors: Victoria Wu,Nima Hashemi,Hooman Vaseli,Christina Luong,Purang Abolmaesumi,Teresa S. M. Tsang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Echocardiography (echo) is a widely used imaging modality for assessing cardiac function, with Left Ventricular Filling Pressure (LVFP) serving as a critical physiological marker for conditions such as heart failure. Standard LVFP classification into normal vs. elevated categories relies on the Doppler-derived E/e’ ratio, which is operator-dependent and often unavailable in resource-limited settings, motivating methods that infer LVFP directly from B-mode echo. Existing deep learning approaches achieve high performance but remain largely black-box, limiting clinical interpretability. We propose HypOProto, a hyperbolic, ordinal prototype-based framework for interpretable LVFP classification using a frozen, explainable foundation model backbone. HypOProto arranges prototypes along the physiological E/e’ scale, placing borderline cases near the hyperboloid root where small angular differences separate similar cases, while normal and elevated cases occupy outward positions reflecting increasing diagnostic certainty. This hyperbolic geometry encodes clinically meaningful ordinal relationships and improves interpretability. We also introduce a novel Hyperbolic Prototype Angular Separation (HyperPAS) loss, enforcing inter-class prototype separation in hyperbolic space. HypOProto achieves SOTA performance while maintaining transparency, and highlights clinically relevant regions in visualizations. This work represents the first prototype-based framework for LVFP classification in echo. Our code can be found at this https URL.
[CV-91] Flow Map Denoisers: Traversing the Distortion-Perception Plane for Inverse Problems
Link: https://arxiv.org/abs/2606.19802
Authors: Nicolas Zilberstein,Morteza Mardani,Santiago Segarra
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Image restoration faces a fundamental tradeoff: methods that minimize error produce blurry reconstructions, while those that maximize perceptual quality yield sharp but less faithful images. Existing approaches either commit to a single operating point on this distortion-perception (DP) frontier or require paired-data supervision, auxiliary models, or hyperparameter tuning of the sampler to access different points. We show that flow map models, a recent extension of flow matching for few-step sampling that learns an average field, implicitly define a one-parameter family of denoisers that continuously spans the DP frontier. The lookahead parameter t acts as a control knob between the MMSE and perceptual regimes. For Gaussian targets, we prove that varying t exactly recovers the optimal DP frontier; for natural images, we observe similar behavior empirically. Within a Plug-and-Play solver, the same mechanism extends to general inverse problems, where it controls a tradeoff between perceptual alignment and data consistency. Despite the lack of exact optimality guarantees in this setting, a single trained flow map spans the DP tradeoff, matching or exceeding specialized baselines at both extremes. Extensive experiments on CelebA (128×128) and AFHQ (256×256) across several linear and nonlinear inverse tasks validate our findings.
[CV-92] Occ-VLM: Occupancy Grounded Vision Language Model for Indoor Scene Understanding
Link: https://arxiv.org/abs/2606.19776
Authors: Jianing Li,Zhou Fang,Yijiang Liu,Li Du
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recently, vision-language models (VLMs) have made significant progress in 3D scene understanding, driving advances in applications such as embodied intelligence and robotic vision. However, existing approaches typically either rely directly on explicit 3D inputs (e.g., point clouds or RGB-D sequences), or introduce an additional 3D geometry encoder to derive 3D-aware visual tokens from 2D images. Such designs structurally decouple 3D geometric perception from the rich 2D semantics learned via vision-language pre-training, hindering the development of a unified 3D vision-language representation. In this work, we propose Occ-VLM, a novel framework for 3D scene understanding that operates purely on posed RGB images and employs a single 2D vision encoder. Specifically, Occ-VLM reconstructs 3D scene occupancy as an auxiliary geometric prior, which is utilized to spatially associate foreground 2D tokens with 3D space. These tokens are then decoded by a Large Language Model (LLM) for unified scene understanding. Extensive experiments demonstrate that Occ-VLM achieves both accurate geometric perception and robust vision-language reasoning: it attains state-of-the-art performance on multi-view occupancy prediction, while performing on par with 3D-input VLMs on 3D Visual Question Answering (VQA) and 3D dense captioning benchmarks.
[CV-93] VFACamou: View-Fused Adversarial Camouflage for Environment-Adaptive Physical Evasion ICME2026
Link: https://arxiv.org/abs/2606.19736
Authors: Shihui Yan,Hu Liu,Junyu Shi,Zihui Zhu,Ziqi Zhou,Yufei Song,Youming Geng,Minghui Li,Shengshan Hu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICME 2026
Abstract:Adversarial camouflage in the physical world remains highly challenging, particularly under UAV reconnaissance where targets undergo continuous geometric changes and extreme illumination variations. Existing methods either optimize 2D digital perturbations that fail to generalize to dynamic viewpoints or produce visually unnatural textures that cannot be deployed in real scenarios. Therefore, we propose an end-to-end framework for adversarial camouflage generation that automatically produces wearable adversarial patterns and maintains stable attack performance in real physical environments with changing viewpoints, poses, and lighting conditions. Our method integrates UV-volume rendering with a diffusion-based texture generator, enabling consistent appearance under varying scales, poses, and lighting conditions. To ensure environmental realism, we propose an illumination color consistency estimator that extracts dominant background attributes and guides a natural texture loss to align the generated UV texture with the surrounding environment. A multi-scale dynamic training strategy further enhances robustness against viewpoint shifts and body deformation. Extensive experiments across multiple mainstream detectors demonstrate that our method achieves strong and stable physical attack performance while maintaining high perceptual naturalness, reducing human detection rates without introducing unnatural artifacts.
[CV-94] GLARE: A Natural Language Interface for Querying Global Explanations
Link: https://arxiv.org/abs/2606.19735
Authors: Bhavan Vasu, Rajesh Mangannavar
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 2 figures
Abstract:While global explanations are crucial for understanding vision models across datasets, classes, and decision contexts, their complex and monolithic nature often hinders practical exploration. Because users typically seek targeted answers to specific questions rather than static artifacts, we present an LLM-based interactive interface that provides natural language access to global explanations for black-box image classifiers. The system’s core LLM acts as a mediator, translating natural language questions into structured SQL queries over local explanation data. This enables flexible aggregation without exposing users to low-level representations. For each query, the interface outputs statistics-augmented natural language responses, supporting local explanations, and intent-aligned visualizations. We evaluate the system on intent interpretation, query mapping accuracy, generalization to novel queries and datasets, and robustness to linguistic errors. Our results demonstrate that LLM-mediated querying substantially improves the accessibility and usability of global explanations for human-centered XAI.
[CV-95] QueryGaussian: Scalable and Training-Free Open-Vocabulary 3D Instance Retrieval
Link: https://arxiv.org/abs/2606.19733
Authors: Xiuyuan Zhu, Ke Lu, Zijie Yang, Chao Yue, Jian Xue, Dongming Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 8 pages, 4 figures, 6 tables. Accepted to the 2026 IEEE International Conference on Systems, Man, and Cybernetics (SMC 2026)
Abstract:Efficiently retrieving specific 3D instances from large-scale scenes via natural language prompts remains a formidable challenge in multimedia analysis. Existing approaches predominantly follow a “scene-level embedding” paradigm, which requires distilling high-dimensional semantic features into every 3D primitive. This strategy suffers from a fundamental architectural bottleneck: memory and computational costs scale linearly with scene complexity, inevitably triggering out-of-memory (OOM) failures in city-scale environments. To address this barrier, we propose QueryGaussian, a training-free framework for expeditious and scalable open-vocabulary 3D instance retrieval. Unlike holistic semantic distillation, QueryGaussian employs an instance-level query mechanism that decouples semantic understanding from geometric representation. Specifically, we leverage pre-trained 2D vision models to interpret user prompts and lift segmentation masks into 3D via a concurrent maximum-weight association strategy, ensuring semantic-visual consistency. To mitigate projection ambiguity, we introduce a temporal fusion module with multi-stage adaptive density clustering. Experimental results demonstrate that QueryGaussian not only matches the accuracy of state-of-the-art methods but also delivers a decisive efficiency leap, reducing GPU memory usage by over 70% and accelerating inference by 180x. Crucially, QueryGaussian enables expeditious instance retrieval on city-scale scenes containing tens of millions of Gaussians using consumer-grade hardware.
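The abstract does not spell out how 2D masks are lifted to 3D, so the following is only a minimal sketch of one plausible reading of "maximum-weight association": every Gaussian accumulates the rendering weight it contributes under each 2D mask label across views and takes the label with the largest total. The function name, the `(gaussian_id, mask_label, weight)` input format, and the sample values are hypothetical and not taken from the paper.

```python
from collections import defaultdict

def assign_labels_by_max_weight(contributions):
    """Assign each Gaussian the mask label with the largest accumulated weight.

    contributions: iterable of (gaussian_id, mask_label, weight) triples,
    one per hit of a Gaussian on a labelled pixel in some view (assumed format).
    Returns a dict {gaussian_id: label}.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for gaussian_id, label, weight in contributions:
        totals[gaussian_id][label] += weight
    # For every Gaussian, keep the label whose summed weight is largest.
    return {
        gaussian_id: max(by_label, key=by_label.get)
        for gaussian_id, by_label in totals.items()
    }

# Hypothetical contributions gathered from two views.
hits = [
    (0, "chair", 0.6), (0, "background", 0.3), (0, "chair", 0.2),
    (1, "background", 0.9), (1, "chair", 0.4),
]
labels = assign_labels_by_max_weight(hits)
```

Because only a per-Gaussian label is stored, a scheme of this kind avoids attaching a high-dimensional feature vector to every primitive, which is the memory cost the abstract attributes to scene-level embedding methods.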
[CV-96] One-Shot Novel View and Pose Human Image Synthesis via 3D Prior Guided Diffusion Model
Link: https://arxiv.org/abs/2606.19718
Authors: Shenjian Gong, Kangkan Wang, Shanshan Zhang, Jian Yang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 30 pages, 10 figures
Abstract: This paper addresses the challenge of one-shot novel view and pose human image synthesis. Existing methods either transfer the reference human image to a target pose using a set of 2D pose keypoints, or synthesize human images with a generalizable human NeRF that uses human model priors to extract point-wise features. However, pose-transfer-based methods cannot handle complex human poses when conditioned on ambiguous 2D poses, while generalizable human NeRFs may recover occluded/invisible human parts inaccurately when reliable features cannot be extracted. To solve these problems, we propose a novel approach for novel view and pose synthesis from a single human image via a conditional denoising diffusion model. Our diffusion model divides the novel view and pose synthesis problem into a sequence of conditional denoising steps. Specifically, to generate humans with complex and arbitrary poses, we introduce 3D human priors, i.e., a 3D normal map and a color prompt, as geometry and color conditions into the generation process. By transferring the reference human into the target human with a series of diffusion steps, our diffusion model enables high-quality synthesis including the occluded/invisible parts. Further, we propose a self-reconstruction based customized refinement to enhance fine details at test time. Results on different public datasets demonstrate that our approach significantly outperforms previous methods and also shows better generalization ability across datasets. The code will be made publicly available at this https URL.
[CV-97] Efficient Neural Network Model Selection for Few-Class Application Datasets
Link: https://arxiv.org/abs/2606.19712
Authors: Bryan Bo Cao, Abhinav Sharma, Lawrence O’Gorman, Michael Coss, Shubham Jain
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 36 pages, 9 tables, 13 figures
Abstract: While much effort has focused on developing and benchmarking high-performance neural networks, less attention has been given to how dataset properties, known to practitioners, can guide efficient model selection. Neural models are typically evaluated on datasets with thousands of classes, yet many real-world applications involve fewer than ten. To address this understudied but common setting, we develop a measure of classification difficulty based on data-side properties and show how it enables more efficient model selection for few-class datasets, where traditional approaches are less effective. We term this phenomenon “few-class distinctiveness”. Our metric allows comparison of models and datasets 6 to 29× faster than repeated training and testing. Leveraging this insight, we extend scaled model families below the smallest published models, achieving greater efficiency at similar accuracy, for example models up to 42% smaller than YOLOv5-nano for a mobile robot task. Targeting resource-constrained applications, we demonstrate few-class model selection across mobile robot, drone, and IoT scenarios, highlighting practical gains in efficiency without sacrificing performance.
[CV-98] Exploring Multi-Modal Large Language Models and Two-Stage Fine-Tuning for Fashion Image Retrieval
Link: https://arxiv.org/abs/2606.19684
Authors: Nguyen Cao Hoang, Hoang Bui Le, Nam Vo Hoang, Trung-Nghia Le
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: SOICT 2025
Abstract:Composed image retrieval retrieves a target image using a composed query of a reference image and a modified text description. In the fashion domain, this task requires understanding subtle attribute variations such as color, pattern, and texture. However, existing approaches face limitations due to scarce annotated data and simplistic negative sampling. We propose a novel framework that integrates a multi-modal large language model (LLaVA) to generate attribute-aware triplets and introduces a two-stage fine-tuning strategy to enhance contrastive learning. We leverage pretrained vision-language models, such as CLIP-ViT/B32, to generate and concatenate sentence-level prompts with the relative caption and to scale the number of negatives using static representations. Experimental results demonstrate enhanced compositional reasoning and improved fine-grained retrieval behavior, underscoring the feasibility and potential of the proposed framework for fashion retrieval.
[CV-99] Vortex: Multi-Modal Fusion System for Intelligent Video Retrieval
Link: https://arxiv.org/abs/2606.19682
Authors: Duc-Tho Nguyen, Hieu-Hoc Tran-Minh, Khanh-Hoa Lam, Hoang-Nhut Ly, Huu-Phuc Huynh, Thanh-Tien Tran, Trung-Nghia Le
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: SOICT 2025
Abstract: This paper presents Vortex, the multimodal video retrieval system developed by our team, FocusOnFun, for the Ho Chi Minh City AI Challenge 2025, designed to advance intelligent multimedia search and temporal reasoning. The system integrates adaptive keyframe extraction, multimodal metadata generation from vision-language and speech models, and a hybrid retrieval strategy that fuses CLIP and SigLIP2 embeddings through Reciprocal Rank Fusion to balance global and fine-grained semantics. To enhance interactivity, Vortex incorporates Rocchio-based relevance feedback and a multi-stage temporal search mechanism for sequential event alignment. Built on Milvus and Elasticsearch, the architecture enables scalable indexing and efficient retrieval. Evaluated in the official competition, our FocusOnFun team’s system achieved a score of 79.6/88 (90.5%) in the Preliminary Round and was further evaluated in the Final Round, achieving an 'Excellent' overall performance with 'Outstanding' results in the question-answering (QA) task. This demonstrates the complementary strengths of CLIP and SigLIP2 and confirms the effectiveness of the hybrid retrieval approach. The system establishes a robust foundation for future research in intelligent, context-aware, and interactive video retrieval.
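Reciprocal Rank Fusion, which the abstract names as the way the CLIP and SigLIP2 result lists are combined, can be sketched in a few lines. This is the textbook formulation, in which each item scores the sum of 1 / (k + rank) over the lists it appears in; it is not the Vortex implementation. The constant `k=60` is a commonly used default assumed here, and the keyframe ids are made up for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of item ids into a single ranking.

    Each item receives sum(1 / (k + rank)) over the lists that contain it,
    with ranks starting at 1. Items missing from a list add nothing for it.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by item id for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))

# Hypothetical keyframe ids returned by two embedding models for one query.
clip_hits = ["kf_12", "kf_07", "kf_33"]
siglip_hits = ["kf_07", "kf_41", "kf_12"]
fused = reciprocal_rank_fusion([clip_hits, siglip_hits])
```

RRF uses only rank positions, so similarity scores from the two models never need to be calibrated against each other, which is one reason it is a common choice for hybrid retrieval.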
[CV-100] Morpher: Toward Robust Simultaneous Motion-Location Editing
Link: https://arxiv.org/abs/2606.19676
Authors: Haengbok Chung
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract: Diffusion models have achieved remarkable success in image and video generation and editing. While recent studies have extended these efforts toward motion editing, simultaneously transforming both motion and location, despite its practical importance, remains largely unexplored. To better understand robust motion-location editing, we first analyze the fundamental factors that degrade its quality. Based on this analysis, we propose TeleMorpher, to the best of our knowledge one of the first one-shot frameworks for simultaneous motion-location editing. Our approach leverages motion priors, a target motion-centric video generated from an off-the-shelf model as motion-editing guidance, and the ground truth motion to enable more controllable and precise motion-location editing. The framework works as follows: (1) We first disentangle the protagonist and the background via pre-trained segmentation and inpainting models. (2) We then introduce a training-free pose warping that edits the protagonist’s motion with the motion prior as guidance. (3) The resulting warped-motion video is directly injected into a baseline motion editor during inference, mitigating the difference between source and target motions while preserving the appearance of the source video. (4) To enhance the reliability of quantitative evaluations, we propose two new LPIPS-based metrics: one measures background consistency before and after motion editing, and the other measures the fidelity of motion editing via the difference between the protagonist’s skeletons extracted from the source and target videos. Experiments with in-the-wild videos and the TaiChi dataset demonstrate that TeleMorpher achieves superior performance across both quantitative and qualitative measurements (real-human evaluation), underscoring its effectiveness.
[CV-101] Learning When to Denoise: Optimizing Asynchronous Schedules for Latent Diffusion
Link: https://arxiv.org/abs/2606.19662
Authors: Bingshuo Qian, Xiang Cheng
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 25 pages, 9 figures, 4 tables
Abstract:Multi-representation diffusion models can improve visual synthesis by denoising complementary views of an image, but their performance depends critically on the asynchronous schedule that determines when each representation is denoised. We propose to learn this schedule. Our method formulates asynchronous flow matching over multiple representation spaces and uses a schedule-corrected objective that keeps each representation’s local noising-time weights fixed as the schedule changes. We instantiate the schedule with a flexible parametric class that is convex and monotone by construction, and learn it using a fast joint probe with less than 1% additional training compute. On ImageNet 256x256, the learned schedule substantially improves both convergence speed and final quality under a matched 675M-parameter XL backbone. With AutoGuidance, our 200-epoch model reaches FID 1.05, matching the 800-epoch SFD-XL baseline with 4x less training. Training to 600 epochs further improves to FID 1.02, outperforming the 1B-parameter SFD-XXL result of FID 1.04 while using a smaller model. In the unguided setting, our 200-epoch model reaches FID 2.37, already below the best 800-epoch SFD-XL result (2.54) at 4x less training, and improves to FID 2.14 at 600 epochs. Code is available at this https URL
[CV-102] BrainG3N: A Dual-Purpose Tokenizer for Controllable 3D Brain MRI Generation
Link: https://arxiv.org/abs/2606.19651
Authors: Max Van Puyvelde, Ibrahim Gulluk, Wim Van Criekinge, Olivier Gevaert
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Three-dimensional (3D) brain MRI is central to clinical neurology and neuro-oncology, where generative models could augment under-represented cohorts, simulate disease trajectories, and support privacy-preserving data sharing. Latent diffusion has been the go-to solution for modeling imaging data, but it places two competing demands on the tokenizer: encoder embeddings must retain the clinical information that downstream tasks act on, and the decoder must reconstruct anatomically faithful volumes. Existing reconstruction-driven tokenizers achieve the second at the expense of the first. To address this, we introduce a fully volumetric masked-autoencoder (MAE) based tokenizer for 3D brain MRI latent diffusion, decoupling encoder and decoder: a frozen 3D MAE encoder produces clinically informative embeddings, while a dedicated CNN decoder reconstructs voxels from a linear projection of those embeddings. We pretrain the encoder on 35,309 volumes from 18 public cohorts spanning four modalities, ten disease categories, and 200+ acquisition sites, and demonstrate its dual utility in two settings. First, on a 23-task linear-probing benchmark, the encoder outperforms or matches SOTA models (i.e., BrainIAC, BrainSegFounder, and MedicalNet) on 21 of 23 tasks. Second, a conditional diffusion transformer (DiT) trained on these clinically informative embeddings supports both conditional generation across six variables and patient-specific longitudinal forecasting. Together these results establish a single 3D brain-MRI embedding space capable of both downstream clinical tasks and controllable generation.
[CV-103] Scaling Self-Play for End-to-End Driving
Link: https://arxiv.org/abs/2606.19641
Authors: Luke Rowe, Roger Girgis, Rodrigue de Schaetzen, Daphne Cornelisse, Alaap Grandhi, Felix Heide, Eugene Vinitsky, Christopher Pal, Liam Paull
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:End-to-end autonomous driving models are typically trained on offline human-demonstration datasets that provide limited state coverage and often no closed-loop feedback, making them prone to compounding errors when deployed in closed-loop and brittle to long-tail agent interactions. To overcome these limitations, we propose an alternative strategy for training end-to-end driving models: large-scale self-play directly from pixels in simulation. While prior self-play approaches have shown promising transfer to real-world driving, they typically assume vectorized Bird’s-Eye-View (BEV) observations that are incompatible with end-to-end policies operating directly on sensor observations. To this end, we introduce Gigapixel, a high-throughput batched driving simulator with perspective rendering, enabling scalable self-play directly from pixel observations. Rather than targeting compute-costly photorealistic sensor simulation, Gigapixel renders a simplified bounding-box world that preserves essential scene structure while achieving throughput at 50k agent steps per second. Since direct pixel-space self-play RL is prohibitively sample-inefficient at end-to-end model scale, we propose self-play DAgger training: we train pixel-based policies in self-play via on-policy distillation from a privileged RL teacher. To bridge the sim-to-real gap, we subsequently transfer the self-play trained policies to real-world sensor data through lightweight perception adaptation. Policies trained in Gigapixel and adapted to real-world sensor data achieve competitive performance on the HUGSIM and NAVSIM-v2 benchmarks without human trajectory supervision. Moreover, scaling self-play training yields proportional gains in policy performance, establishing self-play as a practical and scalable strategy for training end-to-end models.
[CV-104] GB-LSR: A Fast Local Spectral Image Representation with a Single Global Bandwidth for Continuous Reconstruction and Super-Resolution
Link: https://arxiv.org/abs/2606.19617
Authors: Max Shad, Naeem Khoshnevis
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments:
Abstract:We present GB-LSR (Global-Bandwidth Local Spectral Representation), a fixed-grid local spectral representation for continuous image reconstruction. The image domain is partitioned into non-overlapping square patches, each carrying coefficients for a truncated Fourier basis predicted from shared convolutional-encoder features. A single trainable scalar bandwidth is shared globally across all patches and images, and reconstruction at any continuous coordinate is a fixed-size basis contraction whose cost is independent of image size. We study three bandwidth-handling variants: a trainable global scalar (main), a fixed global scalar, and a per-patch bandwidth field. On a standardized native-reconstruction benchmark across Kodak, Set14, and Urban100, the main variant outperforms matched-budget amortized LIIF / LTE / WIRE re-implementations by 2.8-3.6 dB PSNR and 0.11-0.15 LPIPS, while running at roughly one-quarter of the slowest baseline’s inference cost. The single global scalar suffices empirically: per-patch adaptive-bandwidth alternatives do not improve over it on either a closed-form locality diagnostic or an end-to-end ablation. In a separate arbitrary-scale super-resolution (ASR) extension, GB-LSR achieves competitive PSNR-Y under a canonical-style SR protocol and runs 1.44x faster than LIIF-RDN and 3.25x faster than LTE-SwinIR at x4; within the same extension, a variant trained and evaluated without 4-corner local-ensemble averaging gives a 1.77x speedup with 35% lower peak memory and negligible PSNR change, while additionally widening the RDN encoder from 64 to 96 channels gives a small positive PSNR shift with a 1.58x speedup and 31% lower peak memory. Native-reconstruction claims are scoped to the matched-budget amortized protocol, and ASR claims are scoped to a separate canonical-style SR protocol.
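To make the phrase "fixed-size basis contraction" concrete, here is a minimal sketch of evaluating a patch-wise truncated Fourier representation at a continuous coordinate with one shared bandwidth. The exact basis, coordinate convention, and coefficient layout are not given in the abstract, so everything below (the cosine/sine ordering, local coordinates in [-0.5, 0.5), and hand-set coefficients in place of encoder predictions) is an assumption for illustration.

```python
import math

def fourier_basis(u, v, bandwidth, max_freq):
    """Real truncated Fourier features at local patch coordinates (u, v)."""
    feats = []
    for k in range(max_freq + 1):
        for l in range(max_freq + 1):
            phase = bandwidth * (k * u + l * v)  # one scalar bandwidth for all patches
            feats.append(math.cos(phase))
            feats.append(math.sin(phase))
    return feats

def reconstruct(x, y, coeffs, patch_size, bandwidth, max_freq):
    """Evaluate the representation at a continuous pixel coordinate (x, y).

    coeffs[row][col] is the coefficient vector of one non-overlapping patch.
    The cost depends only on the basis size, not on the image size.
    """
    col, row = int(x // patch_size), int(y // patch_size)
    # Local coordinates in [-0.5, 0.5) relative to the patch centre.
    u = (x - col * patch_size) / patch_size - 0.5
    v = (y - row * patch_size) / patch_size - 0.5
    basis = fourier_basis(u, v, bandwidth, max_freq)
    return sum(c * b for c, b in zip(coeffs[row][col], basis))

# A 1x2 grid of 8-pixel patches with 2 * (1 + 1)**2 = 8 coefficients each.
constant_patch = [0.25, 0, 0, 0, 0, 0, 0, 0]   # only the DC cosine term
cosine_patch = [0, 0, 0, 0, 1.0, 0, 0, 0]      # only cos(bandwidth * u)
coeffs = [[constant_patch, cosine_patch]]
value = reconstruct(12.0, 4.0, coeffs, patch_size=8, bandwidth=math.pi, max_freq=1)
```

Querying at a finer grid of `(x, y)` values than the original pixels is how a representation of this kind supports arbitrary-scale super-resolution without changing the stored coefficients.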
[CV-105] Language-Instructed Vision Embeddings for Controllable and Generalizable Perception
Link: https://arxiv.org/abs/2606.19584
Authors: Chengzhi Mao, Xudong Lin, Wen-Sheng Chu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Vision foundation models are typically trained as static feature extractors, placing the burden of task adaptation onto large downstream models. We propose an alternative paradigm: instead of solely feeding visual features into language models, we use language itself to dynamically guide the vision encoder. Our method, Language-Instructed Vision Embeddings (LIVE), leverages language as high-level guidance to produce task-centric embeddings at inference time, removing the need for task-specific retraining. This enables the encoder to focus on contextually relevant aspects of the input, yielding more controllable and generalizable representations. Empirically, LIVE reduces visual hallucinations (+34 points on MMVP), surpasses vision-language models with orders of magnitude more parameters on visual question answering, and generalizes to unseen instructions and tasks – offering a direct path toward adaptive, instruction-driven visual intelligence.
[CV-106] Mix-QVLA: Task-Evidence-Aware Mixed-Precision Quantization of Vision-Language-Action Models
Link: https://arxiv.org/abs/2606.19565
Authors: Navin Ranjan, Andreas Savakis
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We propose Mix-QVLA, a task-evidence-aware mixed-precision PTQ framework for VLA models. Mix-QVLA anchors each quantized variant to the full-precision action-token reference decision and evaluates whether quantization preserves task-relevant evidence across key VLA functional boundaries. It computes normalized gradient-weighted task-evidence maps from boundary activations and compares full-precision and quantized maps using evidence-mass and attribution-distribution distortion, capturing changes in both the strength and allocation of decision-supporting evidence. A soft-bottleneck objective aggregates boundary-level degradation into layer-wise sensitivity scores. Mix-QVLA further models sensitivity throughout task execution, capturing phase-dependent shifts in layer importance rather than assuming a fixed sensitivity profile. The resulting evidence- and time-aware scores guide mixed-precision bit allocation under model-size and BitOps budgets. Extensive evaluations on OpenVLA-style policies show that Mix-QVLA improves the accuracy-efficiency trade-off of low-bit VLA deployment. On LIBERO, Mix-QVLA reduces OpenVLA-OFT memory from 15.4 GB to 4.1 GB, retains 96.3 average success compared with 97.1 for the BF16 model, and achieves a 1.52x inference speedup.
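The abstract says sensitivity scores guide bit allocation under a size budget but does not describe the allocation procedure. As a rough illustration of the general idea only, the sketch below uses a simple greedy rule: start every layer at the widest bit width and demote the least sensitive layers first until the budget is met. The greedy rule, the function name, the two-level `(8, 4)` choice set, and the layer names and numbers are all hypothetical; BitOps budgets and time-dependent sensitivity are ignored.

```python
def allocate_bits(layers, budget_bits, choices=(8, 4)):
    """Greedy mixed-precision allocation under a total weight-size budget.

    layers: {name: (num_params, sensitivity)}; higher sensitivity means the
    layer is hurt more by quantization. Returns {name: bit_width}.
    """
    widest, narrowest = max(choices), min(choices)
    bits = {name: widest for name in layers}
    total = sum(num_params * widest for num_params, _ in layers.values())
    # Visit layers from least to most sensitive, demoting until the budget fits.
    for name in sorted(layers, key=lambda layer: layers[layer][1]):
        if total <= budget_bits:
            break
        bits[name] = narrowest
        total -= layers[name][0] * (widest - narrowest)
    if total > budget_bits:
        raise ValueError("budget cannot be met even at the narrowest width")
    return bits

# Hypothetical layer groups: (parameter count, sensitivity score).
layers = {
    "vision": (100, 0.90),
    "llm_early": (300, 0.20),
    "llm_late": (300, 0.50),
    "action_head": (50, 0.95),
}
plan = allocate_bits(layers, budget_bits=4000)
```

With these made-up numbers the two language-model groups drop to 4 bits while the vision encoder and action head stay at 8 bits, which brings the total from 6000 to 3600 bits.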
[CV-107] ImageWAM: Do World Action Models Really Need Video Generation or Just Image Editing?
Link: https://arxiv.org/abs/2606.19531
Authors: Yuyang Zhang, Wenyao Zhang, Zekun Qi, He Zhang, Haitao Lin, Jingbo Zhang, Yao Mu, Xiaokang Yang, Wenjun Zeng, Xin Jin
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Project Page: this https URL
Abstract: World Action Models (WAMs) commonly rely on video generation to bridge visual world modeling and robot control. However, video-based WAMs face three coupled limitations: dense multi-frame future tokens make inference costly, full video prediction spends capacity on action-irrelevant temporal and appearance details, and long-horizon future imagination may introduce errors that mislead action prediction. These issues raise a simple question: Do world action models really need video generation? We propose ImageWAM, a simple WAM framework that repurposes pretrained image editing models for robot action prediction. In contrast to video generation, image editing provides a better-matched prior: it only needs to model a target-frame transformation, focuses on action-relevant current-to-target visual differences, and grounds task instructions to localized visual changes through edit pretraining. In practice, ImageWAM does not decode the target frame at inference time; instead, it conditions a flow-matching action expert on the KV caches produced by image-editing denoising, using them as a compact world-action context. ImageWAM outperforms standard VLA baselines and matches competitive WAMs without additional policy pretraining across different simulator and real-world experiments. It also reduces FLOPs to 1/6 and latency to 1/4 of video-based WAMs. Attention analysis further shows that editing caches focus on task-relevant change regions, supporting image editing as an effective alternative to video-based world-action modeling.
[CV-108] LooseControlVideo: Directorial Video Control using Spatial Blocking
Link: https://arxiv.org/abs/2606.19495
Authors: Shariq Farooq Bhat, Niloy J. Mitra, Kalyan Sunkavalli
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page at this https URL
Abstract: Precise 3D spatial orchestration in text-to-video generation remains a significant challenge, particularly for multi-object scenes where semantic layout and temporal dynamics are often entangled. While existing depth-conditioned models achieve good structural fidelity, they necessitate dense, frame-accurate guidance that is labor-intensive to author for dynamic events involving deformable objects. We present LooseControlVideo, a framework that enables intuitive and expressive control by using sparse, oriented 3D boxes as a “blocking” proxy. This allows users to author high-level layout and trajectory while leveraging a video generative model to generate realistic occlusions, dynamics and interactions. We achieve this by fine-tuning a Wan 2.2 backbone on a video dataset annotated with DNOCS, a novel encoding for 3D size, orientation and depth-ordered occlusions. Furthermore, our method allows for localized refinement, such as adjusting a jump trajectory or adding an interaction, with minimal disruption to the global scene context. Extensive evaluations on the nuScenes, HO-3D, and BEHAVE benchmarks demonstrate that LooseControlVideo significantly outperforms existing 2D-box and flow-based baselines. Our findings indicate a 1.2x to 3x improvement in Trajectory Error, a 2x improvement in Rigid Motion Consistency, and a 1.5x to 2x increase in Occlusion Accuracy over current state-of-the-art layout-conditioned models, demonstrating that oriented 3D primitives provide a good geometric prior for complex, multi-agent video authoring.
[CV-109] LEAP: Layer-skipping Efficiency via Adaptive Progression for Vision Transformer Distillation
Link: https://arxiv.org/abs/2606.19483
Authors: Jiaqi Zhang, Ashton Lee, Anthony Wong, John Zou, Sami BuGhanem, Randall Balestriero
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract: Vision Foundation Models (VFMs) with Vision Transformer (ViT) backbones, such as DINOv2, have become essential for downstream tasks like object recognition and semantic segmentation. The immense computational requirements of these backbones often necessitate distillation into smaller architectures for edge deployment. Feature-based knowledge distillation (KD) often suffers from the teacher-student gap: the student struggles to imitate the teacher’s complex feature maps due to its limited capacity. To mitigate this bottleneck, we propose LEAP: Layer-skipping Efficiency via Adaptive Progression, a training curriculum for ViT feature-based knowledge distillation. By utilizing the teacher’s intermediate feature maps as a sequence of progressively more difficult targets, our curriculum allows the student to build a foundational representation before tackling higher-level abstractions. Our results demonstrate that this paradigm significantly accelerates convergence through adaptive difficulty selection across various student model sizes and dataset scales. With our curriculum, the LEAP-distilled ViT-S achieves 90.1% accuracy on ImageNet-100, a +12.24% improvement compared with the baseline. On ImageNet-1K, LEAP achieves +3.84% and +7.75% improvement for the instance retrieval task on the Oxford and Paris datasets, respectively. Furthermore, the curriculum enables 25.1% savings in training FLOPs and 21% savings in training time on ImageNet-100 by implementing early-stopping for teacher inference during the initial stages of training. Code is available at this https URL
[CV-110] Scaling Generative Foundation Models for Chest Radiography with Rectified Flow Transformers
Link: https://arxiv.org/abs/2606.19460
Authors: Fabio De Sousa Ribeiro, Emma A.M. Stanley, Charles Jones, Tian Xia, Dominic C. Marshall, Laurent Renard Triché, Christopher V. Cosgriff, Panagiotis Dimitrakopoulos, Sotirios A. Tsaftaris, Ben Glocker
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Project page: this https URL
Abstract:We introduce the first generative foundation model for chest radiograph synthesis trained from scratch at the billion-parameter scale. Existing radiographic AI models often suffer from poor generalisation across patient subpopulations, institutions, and acquisition settings, resulting in limited real-world clinical utility. Controlled, high-fidelity synthesis of chest radiographs is a promising path toward diversifying clinical datasets and evaluating the robustness of diagnostic models. Therefore, we present the largest specialist generative foundation model for chest radiographs to date, with over 1.3B parameters, trained for 1.6T tokens on a curated, heterogeneous dataset comprising 1.2M radiographs and clinical expert-guided metadata. Our model supports controllable radiograph generation and editing across multiple demographic subgroups, acquisition views, and a dozen pathologies. Moreover, we significantly advance the state of the art in radiograph synthesis fidelity, producing images that are indistinguishable from real radiographs to clinical experts.
[CV-111] 3D-DLP: Self-Supervised 3D Object-Centric Scene Representation Learning ICML2026
Link: https://arxiv.org/abs/2606.19451
Authors: Ellina Zhang, Madhaven Iyengar, Amir Zadeh, Chuan Li, Deepak Pathak, David Held, Tal Daniel
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: ICML 2026. Project webpage: this https URL
Abstract:We introduce 3D-DLP, a self-supervised object-centric representation learning model that decomposes scene-level RGB-D or voxel observations into a set of 3D latent particles. Building on the Deep Latent Particles (DLP) framework, each particle encodes disentangled attributes, including 3D keypoint position, bounding box dimensions, and appearance features, and represents a distinct entity in the scene. The model learns interpretable per-particle segmentation maps through an end-to-end self-supervised reconstruction objective. We demonstrate on both simulated and real-world datasets that the learned latent space is interpretable and controllable: by manipulating particle positions and decoding, we can generate novel scene configurations. Furthermore, we show that leveraging these compact 3D latent particles for downstream robotic manipulation improves performance over baselines that either lack explicit 3D information or rely on memory-intensive dense 3D inputs without object-centric structure. Code and videos are available at this https URL.
[CV-112] 3D Scene Graphs: Open Challenges and Future Directions
Link: https://arxiv.org/abs/2606.19383
Authors: Dennis Rotondi,Francesco Argenziano,Sebastian Koch,Nathan Hughes,Martin Buechner,Johanna Wald,Lukas Rosenberger Schmid,Daniele Nardi,Abhinav Valada,Liam Paull,Federico Tombari,Luca Carlone,Kai O. Arras
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Invited article for the Annual Review of Control, Robotics, and Autonomous Systems Volume 10
Abstract:3D Scene Graphs (3DSGs) have emerged as a powerful representation for spatial AI by combining geometric grounding with semantic and relational abstractions of the environment. Their expressiveness has made them relevant to a broad range of problems in robotics and computer vision, including manipulation, navigation, task planning, scene understanding, and many others. However, the field remains fragmented: different communities adopt distinct formulations, construction pipelines, and evaluation protocols, making it difficult to compare methods, identify common assumptions, and assess remaining challenges for robust real-world deployment. This survey provides a unified and critical review of 3DSGs, with particular emphasis on open challenges and future directions. We first formalize 3DSGs under a common definition and analyze the principal modeling choices that characterize existing formulations, including node and edge attributes, hierarchical structure, dynamic scene representations, and affordance-aware extensions. We then review how 3DSGs are built from raw sensory observations, discussing the most common terminologies, conventions, and techniques. Finally, we examine downstream applications and evaluation strategies, from intrinsic graph quality to task-level performance. To support the community, we also provide a dedicated website that organizes and extends the surveyed content, accessible at this https URL.
[CV-113] ProMUSE: Progressive Multi-modal Uncertainty-guided Staged Evidential Alzheimer Disease Classification
Link: https://arxiv.org/abs/2606.19371
Authors: Long Doan,Branden Chen,Ethan Litton,Huan Huang,Jiajing Huang,Yixin Xie,Weihua Zhou,Nandakumar Narayanan,Chen Zhao
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Alzheimer’s disease (AD) is a fatal disorder that destroys memory and cognitive skills in the elderly population. Most treatments for AD are effective in the early stage, leading to an increasing demand for early AD diagnosis. AD diagnosis increasingly relies on multimodal data such as clinical assessments, structural Magnetic Resonance Imaging (MRI), and Positron Emission Tomography (PET) imaging. However, MRI and PET acquisition remain costly and not universally accessible, making full-modality inference impractical in real-world clinical workflows. We propose ProMUSE, a Progressive Multi-modal Uncertainty Guided Staged Evidential Network that adaptively determines when additional modalities are necessary, helping reduce the overall cost of data acquisition while maintaining accuracy. ProMUSE first performs evidential classification using low-cost clinical data and quantifies uncertainty via a Dirichlet-based subjective logic model. When uncertainty exceeds a learned threshold, ProMUSE progressively incorporates MRI or PET features, fusing modality-wise belief and uncertainty through Dempster-Shafer theory to obtain a calibrated multimodal prediction. This staged acquisition strategy enables accurate diagnosis while minimizing reliance on expensive imaging. Experiments on ADNI, AIBL, and OASIS across CN-AD, CN-MCI, and MCI-AD tasks demonstrate that ProMUSE achieves competitive or superior accuracy compared to full-modality baselines while reducing MRI/PET usage by 50-90%, yielding substantial cost savings. These results highlight ProMUSE as a practical, uncertainty-aware, and resource-efficient solution for real-world AD screening.
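The staged fusion described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the common evidential-learning mapping from non-negative evidence to a Dirichlet opinion (alpha = evidence + 1) and the reduced Dempster rule for two sources. The function names, the example evidence values, and the fixed threshold of 0.5 are hypothetical (ProMUSE learns its threshold).

```python
from typing import List, Tuple

# (per-class belief masses, uncertainty mass); the masses sum to 1.
Opinion = Tuple[List[float], float]


def opinion_from_evidence(evidence: List[float]) -> Opinion:
    """Map non-negative evidence to an opinion via Dirichlet(alpha = e + 1)."""
    k = len(evidence)
    strength = sum(e + 1.0 for e in evidence)  # Dirichlet strength S
    beliefs = [e / strength for e in evidence]
    return beliefs, k / strength


def dempster_combine(a: Opinion, b: Opinion) -> Opinion:
    """Reduced Dempster rule for two opinions over the same classes."""
    (b1, u1), (b2, u2) = a, b
    n = len(b1)
    # Conflict: mass the two sources assign to different classes.
    conflict = sum(b1[i] * b2[j] for i in range(n) for j in range(n) if i != j)
    scale = 1.0 / (1.0 - conflict)
    fused = [scale * (b1[i] * b2[i] + b1[i] * u2 + b2[i] * u1) for i in range(n)]
    return fused, scale * u1 * u2


def staged_predict(clinical_ev: List[float], imaging_ev: List[float],
                   threshold: float) -> Tuple[Opinion, bool]:
    """Use clinical evidence alone unless its uncertainty exceeds the threshold."""
    opinion = opinion_from_evidence(clinical_ev)
    if opinion[1] <= threshold:
        return opinion, False
    # Too uncertain: acquire the imaging modality and fuse the two opinions.
    return dempster_combine(opinion, opinion_from_evidence(imaging_ev)), True


# Hypothetical two-class case: weak clinical evidence, strong imaging evidence.
(beliefs, uncertainty), used_imaging = staged_predict([1.0, 0.0], [8.0, 0.0], threshold=0.5)
```

In this toy case the clinical opinion has uncertainty 2/3, so imaging is requested and the fused uncertainty falls below that of either source alone.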
[CV-114] Human Universal Grasping
Link: https://arxiv.org/abs/2606.17054
Authors: Kevin Yuanbo Wu,Tianxing Zhou,Isaac Tu,Billy Yan,Irmak Guzey,David Fouhey,Dandan Shan,Lerrel Pinto
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 28 pages, 20 figures, 7 tables
Abstract:Humans can grasp objects effortlessly, whereas multi-fingered robots are far from this level of generality. We argue that the most natural source of robot grasping data is from humans, who pick up thousands of objects every day. We present HUG, a flow-matching model that generates diverse human grasps for any user-specified object in a single RGB-D image captured from a stereo camera. Using smart glasses, we first collect 1M-HUGs, an egocentric dataset of human grasps spanning 1M frames (27.8 hrs) and 6,707 object instances across 41 buildings. Next, to model the distribution of natural human grasps, our novel flow-matching model fuses RGB and depth observations to output a grasp parameterized by wrist translation, wrist rotation, and MANO hand pose. Predicted grasps can be retargeted to various robot hands, enabling zero-shot grasping in everyday scenes. To standardize evaluation, we build a new simulated benchmark, HUG-Bench, of 90 unseen objects from five geometric categories and various sizes, with metric-scale 3D meshes. We evaluate HUG in the real world on the 30-object test set of HUG-Bench across multiple stereo cameras, robot embodiments, and household environments. HUG outperforms the state-of-the-art grasping baselines by +23% and +34% on our challenging object set. Code, data, benchmark, checkpoints, and an interactive demo are released on our website: this https URL
[CV-115] Contour-Constrained Deformable Registration with Parameter Characterization for Head and Neck Surgical Guidance
Link: https://arxiv.org/abs/2606.19767
Authors: Qingyun Yang,Jon S. Heiselman,Ayberk Acar,Morgan J. Ringel,Michael I. Miga,Matthieu Chabanas,Michael C. Topf,Jie Ying Wu
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Comments:
Abstract:With 890,000 annual new cases globally, head and neck squamous cell carcinoma has one of the highest recurrence rates among solid malignancies. Although frozen section analysis is the standard of care for intraoperative margin assessment, accurately relocating detected positive margins on the resection bed remains challenging due to imprecise alignment between resected specimens and their resection bed, compounded by post-resection mucosal tissue shrinkage. We present a biomechanics-driven deformable registration framework that corrects post-resection tissue deformation to provide intraoperative guidance. Our approach registers 3D specimen meshes to intraoperative resection bed point clouds using a deformable registration approach based on regularized Kelvinlet basis functions. The registration matches surface point clouds, fiducial landmarks, and boundary contour constraints that directly penalize perpendicular distance-to-agreement between specimen and resection bed boundaries. Across nine specimens from skin, buccal mucosa, and tongue sites, the overall mean target registration error was 11.11 ± 4.07 mm using rigid registration, which decreased to 8.20 ± 2.68 mm (26.19% reduction) using deformable registration without contour constraint. The proposed contour-constrained deformable registration further reduced the error to 5.62 ± 2.28 mm, a 49.41% reduction relative to rigid registration. We observed the largest reduction in the most clinically challenging tongue specimens. We also performed a systematic two-stage parameter search to characterize the relative importance of surface alignment, fiducial correspondences, contour constraint, and strain energy regularization. This search revealed that contour weighting dominates registration accuracy for tissue types with large lateral deformation, while the algorithm operates over a broad range of parameter combinations.
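The two reduction percentages quoted in this abstract follow directly from the mean target registration errors. A short check, using only the three means reported above (the helper name is illustrative):

```python
def percent_reduction(baseline_mm: float, new_mm: float) -> float:
    """Relative reduction of new_mm with respect to baseline_mm, in percent."""
    return 100.0 * (baseline_mm - new_mm) / baseline_mm


# Mean target registration errors (mm) as reported in the abstract.
rigid, deformable, contour_constrained = 11.11, 8.20, 5.62

deformable_gain = round(percent_reduction(rigid, deformable), 2)       # 26.19
contour_gain = round(percent_reduction(rigid, contour_constrained), 2)  # 49.41
```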
[CV-116] FrequencyFormer: A Co-Designed Sensor-to-Processor Pipeline for Frequency-Domain Vision Transformer Inference
Link: https://arxiv.org/abs/2606.19574
Authors: Chengwei Zhou,Ovishake Sen,Xuming Chen,Rishith Paramasivam,Shaahin Angizi,Swarup Bhunia,Baibhab Chatterjee,Gourav Datta
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Deploying vision transformers (ViTs) on sensor-edge systems is limited not only by on-device compute, but also by the energy and bandwidth required to transmit high-dimensional image data from the sensor to the processor. While in-sensor and near-sensor computing reduce this cost through early feature extraction, existing methods often provide only modest compression. We observe that the frequency domain provides a naturally compact representation of visual information and can be exploited at the sensor level to reduce sensor-to-processor data movement. Building on this insight, we present FrequencyFormer, a co-designed sensor-to-processor pipeline for efficient ViT inference. FrequencyFormer includes: (1) a multi-scale DCT tokenizer that compresses a 224x224 image into compact frequency-domain tokens, achieving up to 128x reduction in off-chip data volume with modest accuracy loss; (2) a LUT-based near-sensor hardware implementation that leverages fixed DCT coefficients for multiplier-free, energy- and area-efficient tokenization; and (3) a modified MIPI-based low-power communication architecture that further reduces transfer energy. FrequencyFormer serves as a drop-in replacement for standard ViT patch embedding and remains compatible with pretrained backbones across classification, detection, and segmentation tasks. The pipeline achieves 28.8 TOPS/W, reduces communication energy by 230x, and lowers total sensor-side energy by 2.22x, demonstrating frequency-domain tokenization as a scalable foundation for in-sensor ViT deployment.
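To make the idea of frequency-domain tokenization concrete, here is a minimal single-scale sketch: an orthonormal 2D DCT-II of one block, keeping only the lowest-frequency coefficients. It is not the paper's multi-scale tokenizer or its LUT hardware; the block size, the `keep` parameter, and the function names are assumptions. Because the cosine basis is fixed, `dct_matrix` plays the role of a precomputed table.

```python
import math
from typing import List

Matrix = List[List[float]]


def dct_matrix(n: int) -> Matrix:
    """Orthonormal DCT-II basis; row u is the u-th cosine basis vector."""
    rows = []
    for u in range(n):
        scale = math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * x + 1) * u / (2 * n))
                     for x in range(n)])
    return rows


def dct2(block: Matrix) -> Matrix:
    """2D DCT of a square block, computed as C @ block @ C^T."""
    n = len(block)
    c = dct_matrix(n)
    tmp = [[sum(c[u][x] * block[x][y] for x in range(n)) for y in range(n)]
           for u in range(n)]
    return [[sum(tmp[u][y] * c[v][y] for y in range(n)) for v in range(n)]
            for u in range(n)]


def low_frequency_token(block: Matrix, keep: int) -> List[float]:
    """Keep the `keep` coefficients with the smallest frequency index u + v."""
    n = len(block)
    coeffs = dct2(block)
    order = sorted(((u, v) for u in range(n) for v in range(n)),
                   key=lambda p: (p[0] + p[1], p[0]))
    return [coeffs[u][v] for u, v in order[:keep]]


# A flat 8x8 block carries all of its energy in the DC coefficient.
flat_block = [[10.0] * 8 for _ in range(8)]
token = low_frequency_token(flat_block, keep=4)
```

Keeping 4 of 64 coefficients per block is a 16x reduction in this toy setting; the 128x figure in the abstract comes from the authors' full pipeline, which this sketch does not reproduce.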
[CV-117] Full-Self Diagnostics (FSD): Physics-Grounded Visual Biomarker Inference from Smartphone Video via Inverse Problems and Operator Learning
Link: https://arxiv.org/abs/2606.19372
Authors: Jonathan Thomas,Harsh Thaker
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 38,812 paired scans, preliminary longitudinal validation of multichannel visual glucose inference (MARD 17 to 46 percent across cohorts); physics plus information theory plus operator learning framework
Abstract:We present Full-Self Diagnostics (FSD), a unified mathematical framework for recovering latent physiological states from unconstrained 9-second facial videos captured by consumer smartphones. The approach integrates five mutually reinforcing components: (1) a physics-based forward model derived from the radiative transfer equation and chromophore absorption that maps camera observables to biomarker concentrations; (2) an information-theoretic observability theory proving that multi-channel visual signals (spectral, pulse, respiratory, micro-expression, and oculomotor) contain strictly increasing mutual information with physiological state; (3) a stable, Tikhonov-regularized inverse problem with domain-uniform identifiability guarantees; (4) an operator-learning formulation that enables generalization across devices, resolutions, and populations; and (5) a supervised learning procedure, interpretable as stochastic variational inference, that continuously refines the model from paired biosensor ground truth with performance improving proportionally to one over the square root of the number of paired observations. Empirical validation on 38812 real-world paired scans across 59 subjects demonstrates practical performance. Self-collected data from the lead author (glucose range 35-550 mg/dL) yields MARD of 29.86 percent with 97.57 percent of predictions in Clarke Error Grid Zones A+B and only 0.27 percent in the dangerous Zone E. A well-managed diabetic participant achieves MARD of 17 percent in the narrower 70-180 mg/dL band. These results confirm that consumer-grade facial video encodes sufficient structured information for clinically relevant, non-invasive biomarker inference under fully unconstrained conditions, with performance scaling predictably as more paired data becomes available. 
MSC classes: 35R30, 49N45, 94A17, 68T07. ACM classes: I.2.6; I.2.10; J.3.
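The FSD abstract reports accuracy as MARD (mean absolute relative difference), the standard glucose-sensor error metric. A minimal sketch of the computation follows; the readings are made-up illustrative values, not data from the paper.

```python
from typing import Sequence


def mard(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Mean absolute relative difference, in percent of the reference value."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("need two non-empty sequences of equal length")
    relative_errors = [abs(p - r) / r for p, r in zip(predicted, reference)]
    return 100.0 * sum(relative_errors) / len(relative_errors)


# Hypothetical glucose readings in mg/dL: each prediction is off by 10%.
example_mard = mard(predicted=[110.0, 90.0], reference=[100.0, 100.0])
```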
Artificial Intelligence
[AI-0] How Transparent is DiffusionGemma?
Link: https://arxiv.org/abs/2606.20560
Authors: Joshua Engels,Callum McDougall,Bilal Chughtai,Janos Kramar,Senthoran Rajamanoharan,Cindy Wu,Arthur Conmy,Asic Q Chen,Jean Tarbouriech,Min Ma,Brendan O’Donoghue,João Gabriel Lopes de Oliveira,Rohin Shah,Neel Nanda
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 20 main text pages and 6 pages of references and appendices
Abstract:LLM reasoning transparency is a critical affordance for understanding model decisions, mitigating misuse and misalignment, and debugging surprising model behaviors. However, DiffusionGemma performs a larger fraction of its computation in a continuous latent space; does this make its reasoning less transparent? We study this question by decomposing transparency into two components: variable transparency, whether we understand intermediate snapshots of a model’s computational state; and algorithmic transparency, whether we can use these snapshots to reconstruct the process by which the model arrived at its outputs. Naively, DiffusionGemma has poor variable transparency: its opaque serial depth, the amount of serial computation that occurs in between interpretable model states, seems at first 28.6X higher than the corresponding autoregressive Gemma 4 model. However, we show that we can map the information flowing between denoising steps through an interpretable token bottleneck with no decrease in downstream performance. Treating these intermediate states as interpretable reduces the opaque serial depth to just 1.1X that of Gemma 4. Algorithmic transparency is harder for diffusion models than for autoregressive models because all token predictions in the canvas can change at every denoising step, giving the model the power to implement complicated distributed algorithms during the denoising process. To begin bridging this gap, we conduct a suite of interpretability case studies, uncovering initial evidence of novel diffusion-specific phenomena such as non-chronological reasoning, token and sequence smearing, and intermediate-context reasoning. Finally, we test monitorability, a key application of transparency that measures whether model outputs are useful for downstream tasks. We find that DiffusionGemma is similarly monitorable to Gemma 4.
[AI-1] Toward Calibrated Mixture-of-Experts Under Distribution Shift
Link: https://arxiv.org/abs/2606.20544
Authors: Gina Wong,Drew Prinster,Suchi Saria,Rama Chellappa,Anqi Liu
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Calibration aligns a model’s predictive uncertainty with the frequencies of its empirical outcomes and is important for understanding and trusting reported probabilities. Recent work shows that enforcing calibration at the level of individual predictors can improve ensemble accuracy and calibration, with mixture-of-experts (MoE) models showing strong empirical improvements in particular; however, the conditions under which calibration helps MoE are not well understood. In this work, we study how MoE models behave under distribution shift, focusing on how routing mechanisms interact with expert-level calibration. We show that expert calibration is sufficient to ensure calibration of the overall model under a broad class of distribution shifts in hard-routed models, but is insufficient for calibrating soft-routed models. To address this, we propose an adversarial reweighting that penalizes calibration errors of the routed aggregate under distribution shift, and we demonstrate that it improves the accuracy-calibration tradeoff both on average and on difficult subsets of the data, across model classes, prediction tasks, and distribution shifts.
[AI-2] How Do Instructions Shape Speech? Cross-Attention Attribution for Style-Captioned Text-to-Speech
Link: https://arxiv.org/abs/2606.20532
Authors: Nityanand Mathur,Hamees Sayed,Wasim Madha,Apoorv Singh,Sameer Khurana,Akshat Mandloi,Sudarshan Kamath
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Style-captioned text-to-speech systems use natural language to control voice characteristics, but how individual words influence acoustic output remains unclear. Understanding this is critical for diagnosing failure modes and improving controllability in expressive TTS. We propose cross-attention attribution for speech diffusion models, adapting the DAAM framework to the speech domain for the first time, and apply it to CapSpeech-TTS. Our method extracts per-token heatmaps across 25 layers and 24 ODE steps. We analyze 3,600 (style caption, text transcript) combinations comprising 120 style captions conditioning the generation of 30 text transcripts each, revealing how caption tokens shape waveforms. Results show: (1) style tokens have lower temporal variance than content/function tokens, confirming global conditioning; (2) style attention correlates with F0 and energy; (3) style conditioning peaks in early steps and deep layers; (4) attention entropy reaches its minimum at layer 17, co-occurring with the style importance peak, indicating maximal network selectivity at the most style-critical stage. This is the first study of how natural language influences cross-attention in speech diffusion models.
[AI-3] DeepSWIP: Quotient-WMC Counterfactuals for Neural Probabilistic Logic Programs
Link: https://arxiv.org/abs/2606.20526
Authors: Saimun Habib,Vaishak Belle,Fengxiang He
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Neurosymbolic systems such as DeepProbLog combine neural perception with probabilistic logic, but standard inference is associational. Counterfactual reasoning additionally requires a causal semantics for interventions and evidence. We introduce DeepSWIP, a single-world counterfactual semantics for DeepProbLog programs. Using neural materialization, we reduce fixed-context neural predicates to ordinary ProbLog choices, apply Single World Intervention Programs (SWIPs), and compute counterfactuals by weighted model counting (WMC) over a single transformed program. Under finite grounding and unique-supported-model assumptions, DeepSWIP is exact relative to the learned materialized FCM. The standard quotient-WMC form of ProbLog conditionals identifies active neural probabilities and explains intervention cleaning, calibration sensitivity, and rare-evidence instability. Experiments on MPI3D confirm the transformation against a DeepTwin construction against 12,000 queries, as predicted and a 2.14× inference speedup from avoiding the Twin’s endogenous duplication. A SUMO HOV experiment shows that neural calibration degradation biases plug-in estimates, while a correctly scoped randomized-policy AIPW estimator removes most first-order bias for population mean and ATE estimands. Code is at this https URL.
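The "quotient-WMC form" mentioned in this abstract is the usual way a ProbLog conditional is computed: P(q | e) = WMC(q and e) / WMC(e). The brute-force sketch below shows only that quotient over independent probabilistic facts; it does not implement SWIPs, interventions, or neural materialization. The fact names and probabilities are hypothetical, and in a DeepProbLog-style setting the probabilities would come from neural predicate outputs.

```python
from itertools import product
from typing import Callable, Dict

World = Dict[str, bool]
Formula = Callable[[World], bool]


def wmc(probs: Dict[str, float], formula: Formula) -> float:
    """Weighted model count: total weight of the worlds that satisfy `formula`."""
    names = list(probs)
    total = 0.0
    for values in product([True, False], repeat=len(names)):
        world = dict(zip(names, values))
        if formula(world):
            weight = 1.0
            for name in names:
                weight *= probs[name] if world[name] else 1.0 - probs[name]
            total += weight
    return total


def conditional(probs: Dict[str, float], query: Formula, evidence: Formula) -> float:
    """P(query | evidence) as a quotient of two weighted model counts."""
    denominator = wmc(probs, evidence)
    if denominator == 0.0:
        raise ZeroDivisionError("evidence has probability zero")
    return wmc(probs, lambda w: query(w) and evidence(w)) / denominator


# Hypothetical program: alarm :- burglary.  alarm :- earthquake.
facts = {"burglary": 0.1, "earthquake": 0.2}
alarm: Formula = lambda w: w["burglary"] or w["earthquake"]
posterior = conditional(facts, lambda w: w["burglary"], alarm)
```

The denominator is the probability of the evidence, so when the evidence is rare a small error in any contributing probability is amplified by the division, which is one way to read the "rare-evidence instability" the abstract refers to.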
[AI-4] Sovereign Execution Brokers: Enforcing Certificate-Bound Authority in Agentic Control Planes
Link: https://arxiv.org/abs/2606.20520
Authors: Jun He,Deying Yu
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: 19 pages, 6 figures, 10 tables
Abstract:Autonomous agents are increasingly connected to cloud, deployment, and data-control workflows, but production mutation authority should not reside inside non-deterministic reasoning processes. Existing access-control mechanisms authorize identities, while assurance layers certify proposed actions; neither alone provides a mandatory enforcement point for certified authority at the moment of mutation. This paper introduces the Sovereign Execution Broker (SEB), a runtime enforcement boundary for certificate-bound agentic infrastructure. SEB consumes certificates issued by the Sovereign Assurance Boundary (SAB), verifies that the requested mutation matches the certified execution contract, checks validity windows, policy epochs, revocation epochs, and live-state drift, mints scoped execution identity, invokes infrastructure APIs, and records signed decision and outcome records. By separating proposal, admission, and execution, SEB turns certified authority into a short-lived, revocable, auditable runtime capability, provided that production mutation APIs reject non-broker identities. We present the SEB execution model, certificate and replay-verification predicates, scoped identity semantics, bypass-prevention deployment patterns, failure behavior, and a concrete prototype implementation. We evaluate the prototype on AWS and Kubernetes clusters, measuring latency overheads, revocation propagation, drift detection, and security under fault injection.
[AI-5] FlowEdit: Associative Memory for Lifelong Pronunciation Adaptation in Flow-Matching TTS
Link: https://arxiv.org/abs/2606.20518
Authors: Harshit Singh,Ayush Pratap Singh,Nityanand Mathur
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Flow-matching text-to-speech systems achieve remarkable zero-shot quality but remain static after deployment: pronunciation errors on out-of-vocabulary proper nouns persist unless the model is retrained. We introduce FlowEdit, a life-long adaptation framework for frozen flow-matching TTS that learns pronunciation corrections as latent conditioning edits rather than weight updates. When corrective feedback is provided, FlowEdit optimizes a token-level perturbation in the text embedding space, then stores the correction in a Modern Hopfield Network serving as content-addressable episodic memory. At inference, corrections are retrieved via soft attention with a similarity gate, enabling fuzzy morphological matching. On our curated benchmark of 312 multilingual proper nouns across 18 language families, FlowEdit reduces target-word Phoneme Error Rate by 92.7% relative to the zero-shot baseline while maintaining identical general-speech quality. Corrections complete in approximately 15 seconds on a single GPU.
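The retrieval step described in this abstract (soft attention over stored corrections, guarded by a similarity gate) can be sketched in the style of a Modern Hopfield lookup. This is an illustration under assumptions, not FlowEdit's code: the class name, the cosine-similarity gate, and the `beta` and `gate` values are hypothetical, and the stored vectors stand in for learned embedding perturbations.

```python
import math
from typing import List, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


class CorrectionMemory:
    """Content-addressable store: key = word embedding, value = correction vector."""

    def __init__(self, beta: float = 8.0, gate: float = 0.8) -> None:
        self.beta = beta  # inverse temperature of the softmax
        self.gate = gate  # minimum similarity required to apply any correction
        self.keys: List[List[float]] = []
        self.values: List[List[float]] = []

    def store(self, key: Sequence[float], value: Sequence[float]) -> None:
        self.keys.append(list(key))
        self.values.append(list(value))

    def retrieve(self, query: Sequence[float]) -> Optional[List[float]]:
        """Softmax-weighted blend of stored values, or None if the gate rejects."""
        if not self.keys:
            return None
        sims = [cosine(query, key) for key in self.keys]
        best = max(sims)
        if best < self.gate:
            return None  # nothing stored is similar enough; leave the input unedited
        weights = [math.exp(self.beta * (s - best)) for s in sims]
        total = sum(weights)
        dim = len(self.values[0])
        return [sum(w * v[d] for w, v in zip(weights, self.values)) / total
                for d in range(dim)]


memory = CorrectionMemory()
memory.store([1.0, 0.0], [0.5, -0.5])
memory.store([0.0, 1.0], [1.0, 1.0])
near_hit = memory.retrieve([0.9, 0.1])  # close to the first key
miss = memory.retrieve([1.0, 1.0])      # equally far from both keys: gated out
```

Matching on similarity rather than exact identity is what allows a near-duplicate query (for example a morphological variant of a stored word) to reuse an existing correction.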
[AI-6] Multi-LCB: Extending LiveCodeBench to Multiple Programming Languages ICLR2026
Link: https://arxiv.org/abs/2606.20517
Authors: Maria Ivanova,Pavel Zadorozhny,Rodion Levichev,Ivan Petrov,Adamenko Pavel,Ivan Lopatin,Alexey Kutalev,Dmitrii Babaev
Categories: Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
Comments: ICLR 2026
Abstract:LiveCodeBench (LCB) has recently become a widely adopted benchmark for evaluating large language models (LLMs) on code-generation tasks. By curating competitive programming problems, constantly adding fresh problems to the set, and filtering them by release dates, LCB provides contamination-aware evaluation and offers a holistic view of coding capability. However, LCB remains restricted to Python, leaving open the question of whether LLMs can generalize across the diverse programming languages required in real-world software engineering. We introduce Multi-LCB, a benchmark for evaluating LLMs across twelve programming languages, including Python. Multi-LCB transforms Python tasks from the LCB dataset into equivalent tasks in other languages while preserving LCB’s contamination controls and evaluation protocol. Because it is fully compatible with the original LCB format, Multi-LCB will automatically track future LCB updates, enabling systematic assessment of cross-language code generation competence and requiring models to sustain performance well beyond Python. We evaluated 24 LLMs for instruction and reasoning on Multi-LCB, uncovering evidence of Python overfitting, language-specific contamination, and substantial disparities in multilingual performance. Our results establish Multi-LCB as a rigorous new benchmark for multi-programming-language code evaluation, directly addressing LCB’s primary limitation and exposing critical gaps in current LLM capabilities.
[AI-7] Efficient and Sound Probabilistic Verification for AI Agents
Link: https://arxiv.org/abs/2606.20510
Authors: Alaia Solko-Breslin,Pramod Kaushik Mudrakarta,Mihai Christodorescu,Somesh Jha,Krishnamurthy Dj Dvijotham
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Securing AI agents that operate in complex digital environments has become a critical need, and runtime monitoring approaches that formulate and enforce policies expressed in a formal language like Datalog offer a promising solution. However, existing approaches are restricted to deterministic policies. In many practical applications of AI agents, there is a need to enforce security policies in the face of ambiguity, leading to probabilistic predicates or state transitions (for example, a declassifier or Personally Identifiable Information (PII) detector that has some failure probability on each invocation). Furthermore, in many such applications, one cannot easily make the independence assumptions necessary to invoke prior work on probabilistic inference in Datalog. We address this by introducing a sound and efficient framework for such verification based on distributionally robust optimization, computing sound upper bounds on the probability of policy violation regardless of possible correlations between predicates. On standard benchmarks for terminal and tool calling agents, we demonstrate that our approach outperforms prior art and improves the security-utility trade-off while ensuring rigorous bounds on the probability of policy violation.
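To see why bounds that hold "regardless of possible correlations" differ from independence-based estimates, consider the classical Boole and Fréchet inequalities, which need no independence assumption. The sketch below is only that textbook special case, not the paper's distributionally robust optimization; the failure probabilities are hypothetical.

```python
from typing import Sequence


def union_upper_bound(ps: Sequence[float]) -> float:
    """Boole's inequality: P(any event) <= sum of marginals, under any correlation."""
    return min(1.0, sum(ps))


def intersection_upper_bound(ps: Sequence[float]) -> float:
    """Frechet bound: P(all events) <= smallest marginal, under any correlation."""
    return min(ps)


def union_if_independent(ps: Sequence[float]) -> float:
    """Exact P(any event), valid only if the events are mutually independent."""
    none_occur = 1.0
    for p in ps:
        none_occur *= 1.0 - p
    return 1.0 - none_occur


# Hypothetical per-call failure probabilities of three probabilistic detectors.
failure_probs = [0.01, 0.02, 0.05]
sound_bound = union_upper_bound(failure_probs)          # holds for any dependence
independent_value = union_if_independent(failure_probs)  # smaller, but assumes independence
```

The independence-based value is never larger than the correlation-free bound, so treating it as a guarantee when detectors may be correlated can understate the probability of a violation.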
[AI-8] What Do Safety-Aligned LLMs Learn From Mixed Compliance Demonstrations?
Link: https://arxiv.org/abs/2606.20508
Authors: Sihui Dai,Mann Patel
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Prior work has shown that in-context demonstrations can jailbreak language models, but it remains unclear how models interpret different types of compliance demonstrations. We study this by mixing benign compliance demonstrations (non-harmful request, helpful response) with harmful compliance demonstrations (harmful request, helpful response) and testing three hypotheses about how demonstration composition drives harmful compliance. Across four models, we find that benign and harmful demonstrations are not interchangeable: benign demonstrations can either reduce or increase harmful compliance depending on the model. We further show that preference optimization is the critical training stage that prevents benign demonstrations from increasing harmful compliance, that demonstration ordering exhibits strong recency bias, and that models differ in how refusal interacts with in-context learning: some adopt demonstrated formatting even when refusing, while others override all in-context signals upon refusal. Taken together, this work moves beyond showing that demonstration-based jailbreaking works to characterizing how it works: what models extract from compliance demonstrations depends on demonstration content, ordering, and training methodology.
[AI-9] Calibration Without Comprehension: Diagnosing the Limits of Fine-Tuning LLMs for Vulnerability Detection in Systems Software
Link: https://arxiv.org/abs/2606.20502
Authors: Arastoo Zibaeirad,Marco Vieira
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:
Abstract:Whether LLMs scoring well on vulnerability benchmarks genuinely reason about security or merely pattern-match on contaminated data remains unresolved. We present CWE-Trace, a framework for LLM vulnerability detection built from 834 manually curated Linux kernel samples spanning 74 CWEs. The framework enforces a strict temporal split (pre-2025 historical set / post-cutoff leakage-free set), preserves context-aware vulnerable–patched pairs, and introduces two diagnostic metrics: the Directional Failure Index (DFI) and Hierarchical Distance and Direction (HDD). We evaluate eight vanilla LLMs and 15 LoRA fine-tuned variants across non-targeted detection, targeted detection, and CWE classification. Our analysis yields two key results. First, data contamination provides no measurable advantage. Function-level analysis shows that 84% of nominally contaminated samples carry no usable memorization signal: vulnerable functions are absent or cross-mapped across datasets, and ~31% of contaminated samples carry CWE misclassification. Second, backbone directional priors dominate fine-tuning. Models exhibit stable, systematic failure modes (DFI ranging from -85.5 to +94.8 pp) that persist from historical to post-cutoff data and resist correction. Fine-tuning shifts the output threshold without changing the decision policy. This is calibration without comprehension: output distributions adapt to training data while the underlying security reasoning remains absent. The weakest backbone at binary detection (DeepSeek-R1) gains the most in coarse CWE classification, revealing that detection and understanding are decoupled capabilities. The best detection score reaches only 52.1% (+2.1 pp above chance); exact CWE ranking remains below 1.3% Top-1 accuracy, confirming that current LLMs lack reliable security reasoning for systems software, regardless of fine-tuning strategy.
[AI-10] UltraQuant: 4-bit KV Caching for Context-Heavy Agents
Link: https://arxiv.org/abs/2606.20474
Authors: Inesh Chakrabarti(1 and 2),David Limpus(1 and 3),Aditi Ghai Rana(1),Bowen Bao(1),Spandan Tiwari(1),Thiago Crepaldi(1),Ashish Sirasao(1) ((1) Advanced Micro Devices, (2) University of California, Los Angeles, (3) Purdue University)
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Performance (cs.PF)
Comments: 11 pages, 9 figures
Abstract:Context-heavy agents place unusual pressure on the key-value (KV) cache: long prefixes are reused across many short turns, while concurrency determines whether the serving system can keep GPUs utilized. We study 4-bit KV-cache compression for this setting, using TurboQuant-style rotation and codebook quantization as a quality anchor and vLLM FP8 KV caching as the deployment anchor. We report three contributions. First, we frame 4-bit KV caching around multi-round agent workloads where task quality, cache residency, and serving throughput must be measured jointly. Second, we describe the practical design choices needed to make the 4-bit path robust, including asymmetric K/V treatment, Walsh-Hadamard rotation, QJL removal, and block-scale variants. Third, we present serving optimizations on AMD GPUs, including optimized decode-attention kernels and UltraQuant, an FP4 approximation path that uses FP8 queries, FP4 KV tensors, UE8M0 group scales, and native scaled-MFMA support on CDNA4. On a long-context, multi-turn agentic workload, UltraQuant cuts P50 time-to-first-token by 3.47x in the cache-pressured late rounds (2.3x across all rounds) and raises output throughput by 1.63x over the FP8 KV baseline.
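The combination of a Walsh-Hadamard rotation with low-bit quantization named in this abstract can be illustrated with a small round trip. This sketch is a stand-in under assumptions: it uses signed 4-bit integer codes with one floating-point scale per vector, whereas the abstract describes FP4 tensors with UE8M0 group scales and asymmetric K/V treatment, none of which is reproduced here. All function names are hypothetical.

```python
import math
from typing import List, Tuple


def fwht(x: List[float]) -> List[float]:
    """Orthonormal fast Walsh-Hadamard transform (its own inverse)."""
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    out = list(x)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    norm = 1.0 / math.sqrt(n)
    return [v * norm for v in out]


def quantize_int4(x: List[float]) -> Tuple[List[int], float]:
    """Symmetric quantization to integer codes in [-7, 7] with a single scale."""
    scale = max(abs(v) for v in x) / 7.0 or 1.0  # fall back to 1.0 for all-zero input
    codes = [max(-7, min(7, round(v / scale))) for v in x]
    return codes, scale


def dequantize_int4(codes: List[int], scale: float) -> List[float]:
    return [c * scale for c in codes]


def roundtrip(x: List[float]) -> List[float]:
    """Rotate, quantize to 4 bits, dequantize, rotate back."""
    codes, scale = quantize_int4(fwht(x))
    return fwht(dequantize_int4(codes, scale))


# A single outlier is spread evenly over all coordinates by the rotation.
spread = fwht([8.0] + [0.0] * 7)
vector = [8.0, 0.1, -0.1, 0.2, -0.2, 0.1, 0.0, -0.1]
restored = roundtrip(vector)
```

Because the rotation is orthonormal, the L2 error introduced by quantizing in the rotated domain is the same L2 error seen after rotating back.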
[AI-11] Analyzing Defensive Misdirection Against Model-Guided Automated Attacks on Agentic AI Systems
Link: https://arxiv.org/abs/2606.20470
Authors: Reza Soosahabi,Vivek Namsani
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Agentic AI systems increasingly rely on language-model components to interpret instructions, process external data, invoke tools, and coordinate with other agents. These capabilities make prompt-injection and jailbreak attacks more consequential, especially as attackers adopt model-guided automation to scale probing, prompt refinement, and response evaluation. This work analyzes the resulting attack-defense setting through a probabilistic model of a target system, its defense mechanism, and the attacker’s automated judge. Our analysis shows that conventional detect-and-block defenses can allow attacker success rate (ASR) to approach one as the query budget grows, since predictable refusals provide useful feedback to automated search. We then examine detect-and-misdirect, where detected malicious interactions receive controlled, non-operational responses designed to induce false-positive errors in the attacker’s judge. This strategy reduces the positive predictive value of attacker-selected candidates and yields a bounded asymptotic ASR. We evaluate a proof-of-concept realization of this strategy through Contextual Misdirection via Progressive Engagement (CMPE), a lightweight conversational misdirection method designed to replace predictable refusal text with safe but strategically misleading responses in automated jailbreak settings. On jailbreak benchmarks, CMPE reduces estimated ASR upper bounds by up to two orders of magnitude and nearly eliminates verified attack success in end-to-end PAIR and GPTFuzz attack runs.
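The contrast between an attacker success rate that approaches one and a bounded asymptotic rate can be seen in a toy model. The sketch below is not the paper's probabilistic model. It assumes independent queries, a per-query probability `p` of a genuine jailbreak, and, under misdirection, a per-query probability `d` that a decoy response is wrongly accepted by the attacker's judge. It also assumes the attacker stops at the first accepted candidate. All names are hypothetical.

```python
def asr_detect_and_block(p: float, n: int) -> float:
    """Toy ASR after n queries when refusals are predictable.

    Assumption: the judge identifies genuine successes reliably, so the
    attacker succeeds as soon as any query yields a genuine jailbreak.
    """
    return 1.0 - (1.0 - p) ** n


def asr_detect_and_misdirect(p: float, d: float, n: int) -> float:
    """Toy ASR after n queries when decoys fool the attacker's judge.

    Assumption: each query is accepted by the judge with probability p + d
    (genuine success or decoy, treated as disjoint events), and the first
    accepted candidate is genuine with probability p / (p + d), which is
    the positive predictive value of the judge in this toy setting.
    """
    accept = p + d
    assert 0.0 < accept <= 1.0
    ppv = p / accept
    return ppv * (1.0 - (1.0 - accept) ** n)


if __name__ == "__main__":
    for n in (1, 10, 100, 1000, 10000):
        blocked = asr_detect_and_block(0.01, n)
        misdirected = asr_detect_and_misdirect(0.01, 0.09, n)
        print(f"n={n:>5}  block={blocked:.4f}  misdirect={misdirected:.4f}")
```

In this toy setting the first curve tends to 1 as the query budget grows, while the second saturates at `p / (p + d)`. That matches the qualitative claim in the abstract, not its actual bounds.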
[AI-12] Context-Aware Hierarchical Bayesian Modeling of IVF Laboratory Environmental Conditions
Link: https://arxiv.org/abs/2606.20459
Authors: Zahra Asghari Varzaneh, Reza Khoshkangini, Pia Saldeen, Lars Johansson, Thomas Ebner
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:IVF pregnancy rates are routinely modeled using patient-level variables, while high-resolution laboratory environmental data remain underutilized. We show that this is a missed opportunity. Rather than relying on raw sensor averages, we engineer 55 context-aware temporal features, including rolling thermal stability, simultaneous temperature-humidity adherence, peak stress duration, and post-stress recovery speed, that capture the dynamics of incubator microenvironments. On 61 weeks of data from an Asian IVF clinic, these features reduce cross-validated prediction error to 1.27%, compared to 3-5% for raw averages. We then train a hierarchical Bayesian Beta regression model that shares environmental effects across an Asian and a Northern European clinic via partial pooling, while preserving site-specific baselines. On held-out data from the Northern European clinic, the model achieves R^2 = 0.86 and a 64% error reduction for the 35-39 age group over a naive baseline, demonstrating that structured environmental monitoring contains clinically meaningful, transferable signal.
[AI-13] Interpretable Sperm Morphology Classification via Attention-Guided Deep Learning
Link: https://arxiv.org/abs/2606.20438
Authors: Zahra Asghari Varzaneh, Reza Khoshkangini, Thomas Ebner, Lars Johansson
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Male infertility is a major cause of couple infertility, often linked to abnormal sperm morphology. While deep learning models offer automated analysis, most lack interpretability, limiting their clinical adoption. This study proposes an attention-guided deep learning framework for sperm morphology classification. We combine a pretrained EfficientNet-B0 with a Convolutional Block Attention Module (CBAM) to focus on key areas of the sperm head, improving both accuracy and interpretability. Evaluated on the SMIDS and HuSHem public datasets, our model achieves accuracies of 90.2% and 93.9% (macro F1 scores of 0.913 and 0.948), outperforming SimpleCNN and standard EfficientNet-B0. Furthermore, we use Grad-CAM++ visualizations to highlight features influencing the model’s decisions. The results demonstrate that this accurate and transparent framework is a practical tool for automated sperm analysis in fertility clinics.
[AI-14] Multi-View Decompilation for LLM-Based Malware Classification
Link: https://arxiv.org/abs/2606.20436
Authors: Bercan Turkmen, Vyas Raina
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Malware analysts often inspect compiled binaries through decompiled pseudo-C, when source code is unavailable. Recent work suggests that large language models (LLMs) can assist this process by classifying decompiled code as benign or malicious, but existing pipelines typically rely on a single decompiler view. We argue that this assumption is fragile: decompilers are lossy heuristic tools, and different decompilers can expose different artefacts of the same binary. We curate a benchmark of benign utilities and malicious programs spanning a range of threat behaviors. Each sample is compiled and decompiled with both Ghidra and RetDec, yielding matched pseudo-C views. Across a range of LLMs from major model families, we find that providing both decompiler views improves malicious-class F1, mainly by increasing recall on malicious samples. Agreement analyses further show that Ghidra and RetDec make partially different errors, supporting the view that decompiler outputs provide complementary evidence. Our results suggest that multi-decompiler prompting is a simple, training-free way to improve LLM-based malware triage in practical settings.
[AI-15] LLM agent safety multi-turn red-teaming jailbreak benchmarks adversarial robustness safety-critical systems
Link: https://arxiv.org/abs/2606.20408
Authors: Hanwool Lee, Dasol Choi, Bokyeong Kim, Seung Geun Kim, Haon Park
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language model (LLM) agents are increasingly proposed as supervisory components for safety-critical systems, yet their robustness under sustained, adaptive adversarial pressure remains poorly characterized. We present NRT-Bench, a benchmark for multi-turn red-teaming of LLM agents acting as operators of a safety-critical system, instantiated in a simulated nuclear power plant control room. A five-role operator team, each backed by a configurable LLM, runs a plant governed by six critical safety functions (CSFs), while adversaries inject messages over four channels in bounded multi-turn sessions with per-turn feedback. Harm is an objective signal rather than LLM-judged text: a run terminates the moment any CSF is lost, attributed to the causing message. Evaluating four frontier operator models under a fixed-attack paired-replay protocol, we find that adaptive multi-turn attacks reliably push the operator team past a safety limit: across the four models, between 8.7% and 12.1% of attack sessions end with the plant losing a critical safety function. Although the four models look almost equally robust by this aggregate rate, their failures barely overlap: of 149 sessions, none defeat all four models while a third defeat at least one, so vulnerabilities are nearly disjoint across models rather than nested. The effect of added defences is strongly model-dependent: the same guardrail stack or safety-advisor agent that lowers attack success for one model can raise it for another. We release the simulation venue, attack dataset, and replay tooling for reproducible safety evaluation of LLM agents.
[AI-16] Rethinking Shrinkage Bias in LLM FP4 Pretraining: Geometric Origin, Systemic Impact, and UFP4 Recipe
Link: https://arxiv.org/abs/2606.20381
Authors: Qian Zhao, Kunlong Chen, Changxin Tian, Zhonghui Jiang, Haitao Zhang, Chaofan Yu, Peijie Jiang, Mingliang Gong, Jia Liu, Ziqi Liu, Zhiqiang Zhang, Jun Zhou
Categories: Artificial Intelligence (cs.AI)
Comments: 18 pages, 12 figures
Abstract:FP4 training promises substantial reductions in memory and computation cost for LLM pretraining, yet current FP4 hardware paths and recipes, including NVIDIA Blackwell/Rubin-class systems and AMD MI350-series GPUs, remain centered on E2M1 data elements. In this study, we identify a fundamental limitation of that choice: non-uniform formats such as E2M1 inherently suffer from Shrinkage Bias, a systematic negative rounding error caused by the geometric asymmetry of their representable bins. We show that this bias accumulates multiplicatively across layers and is amplified by the Random Hadamard Transform (RHT), providing a unified explanation for the training instability observed in existing E2M1-based FP4 recipes. In contrast, uniform grids (E1M2/INT4) bypass this grid-geometry error and better convert the improved bucket utilization from RHT into higher quantization quality. Based on this finding, we propose UFP4, a uniform 4-bit training recipe that applies RHT to all three training GEMMs while restricting stochastic rounding to dY alone. On Dense 1.5B, MoE 7.9B, and MoE 124B long-run pretraining, UFP4 consistently achieves lower BF16-relative loss degradation than strong E2M1-based baselines, supported by scaling-law analysis and ablation studies. Our results suggest that future accelerators should support E1M2/INT4-style uniform 4-bit grids as first-class training primitives alongside E2M1.
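To make the two grid geometries concrete, the sketch below rounds values onto the non-uniform E2M1 magnitude grid (0, 0.5, 1, 1.5, 2, 3, 4, 6) and onto a uniform integer grid, then measures the mean signed change in magnitude. It is only a measuring tool under stated assumptions: absmax scaling of the whole tensor, nearest-value rounding with ties going to the smaller magnitude, and a uniform grid of 0..7 as a stand-in for the E1M2/INT4-style formats. It does not reproduce the paper's analysis, and the function names are hypothetical.

```python
import numpy as np

# Representable magnitudes of FP4 E2M1 (sign handled separately).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
# Assumption: a uniform 4-bit magnitude grid standing in for INT4-style formats.
UNIFORM_GRID = np.arange(8, dtype=np.float64)


def round_to_grid(x: np.ndarray, grid: np.ndarray) -> np.ndarray:
    """Round each value to the nearest grid magnitude, keeping its sign.

    Values beyond the largest grid point saturate. Exact ties go to the
    smaller magnitude because argmin returns the first minimum.
    """
    mag = np.abs(x)
    idx = np.abs(mag[..., None] - grid).argmin(axis=-1)
    return np.sign(x) * grid[idx]


def mean_signed_magnitude_error(x: np.ndarray, grid: np.ndarray) -> float:
    """Mean of |q(x)| - |x| after absmax scaling onto the grid.

    A negative value means quantization shrinks magnitudes on average.
    """
    absmax = float(np.abs(x).max())
    if absmax == 0.0:
        return 0.0
    scale = absmax / grid.max()
    q = round_to_grid(x / scale, grid) * scale
    return float(np.mean(np.abs(q) - np.abs(x)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sample = rng.standard_normal(1_000_000)
    print("E2M1   :", mean_signed_magnitude_error(sample, E2M1_GRID))
    print("uniform:", mean_signed_magnitude_error(sample, UNIFORM_GRID))
```

Swapping in other input distributions, or applying a rotation first, is one way to probe how the signed error depends on the grid and on the data.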
[AI-17] CRAX: Fast Safe Reinforcement Learning Benchmarking
Link: https://arxiv.org/abs/2606.20376
Authors: Tristan Tomilin, Mourad Boustani, Mickey Beurskens, Thiago D. Simão
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Safety is a core concern for deploying reinforcement learning (RL) agents in real-world domains such as robotics and autonomous driving. While benchmarks have been central to progress in RL, existing safety benchmarks with high-fidelity 3D physics remain computationally slow, limiting large-scale experimentation and rapid prototyping. To address this gap, we propose CRAX (Constrained RL Accelerated with JAX). Built on top of the MuJoCo XLA (MJX) physics engine with realistic 3D dynamics, CRAX leverages vectorized operations and hardware acceleration, yielding up to ~100x speedups over comparable CPU-based safety benchmarks. The benchmark features six environment suites and three agent-specific tasks, each spanning three difficulty levels. Evaluating six popular safe RL methods shows that no single approach dominates across all tasks, and reveals the trade-offs between performance and safety. We find that curriculum learning across difficulty levels and safety transfer can improve performance over direct training in harder settings.
[AI-18] AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning
Link: https://arxiv.org/abs/2606.20373
Authors: Zepeng Li, Jie Ren, Zhanyong Tang, Jie Zheng, Zheng Wang
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large Language Models (LLMs) show promise for code compilation tasks, but applying them to runtime performance tuning is difficult due to complex microarchitectural effects and noisy runtime measurements. We present AutoPass, a multi-agent framework for compiler performance tuning that uses compiler and runtime evidence to guide LLM-generated optimization decisions. Rather than treating the compiler as a black box like prior auto-tuning schemes, AutoPass opens up the compiler to the LLM, enabling it to query compiler-internal optimization states and analyze the intermediate representation to orchestrate compiler options. The search process iteratively refines optimization configurations using measured runtime feedback to diagnose regressions and guide latency-improving edits. AutoPass operates in an inference-only, training-free setting and requires no offline training or task-specific fine-tuning, making it readily applicable to new benchmarks and platforms. We implement AutoPass on the LLVM compiler and evaluate it on server-grade x86-64 and embedded ARM64 systems. AutoPass outperforms expert-tuned heuristics and classical autotuning methods, achieving geometric-mean speedups of 1.043x and 1.117x over LLVM -O3 on x86-64 and ARM64, respectively.
[AI-19] Automating SKILL.md Generation for Computer-Using Agents via Interaction Trajectory Mining
Link: https://arxiv.org/abs/2606.20363
Authors: Yuexing Hao, Xiaomin Li
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Explicit skill libraries make computer-using agents easier to inspect, but it remains unclear whether such libraries can be mined from interaction data in a way that improves downstream policies. We study this question through a three-stage pipeline that segments GUI trajectories, clusters segments into candidate skills, and trains a skill-aware policy from the resulting annotations. The mined clusters are readable on the source benchmark: five of eight clusters have at least 0.95 purity against InteraSkill Workflows labels. However, readability does not imply transfer. GRPO improves IW skill-step accuracy only from 18.5% to 20.5%, leaves BrowseComp+ essentially unchanged, and underperforms trivial frequency priors on key source-domain metrics. We therefore present the method as a diagnostic study: trajectory mining can expose inspectable skill structure, but the current boundary detector, orderless segment representation, and offline reward model are insufficient for reliable cross-domain policy improvement.
[AI-20] SoftSkill: Behavioral Compression for Contextual Adaptation
Link: https://arxiv.org/abs/2606.20333
Authors: Xijia Tao, Yihua Teng, Xinyu Fu, Ziru Liu, Kecheng Chen, Yuzhi Zhao, Suiyun Zhang, Rui Liu, Lingpeng Kong
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Agent skills are commonly deployed as natural-language Markdown files that encode answer policies, evidence-use habits, and task procedures. These files are readable and portable, but they are consumed indirectly: for each task instance, a frozen language model must translate a long textual artifact into generation-time behavior. This paper asks whether a natural-language skill can instead initialize a compact continuous context object, refined by a trainable soft delta while the base model remains frozen. We propose SoftSkill, a frozen-backbone method that tunes such soft skills with next-token prediction and deploys them as latent behavioral priors at inference time. In our main single-round setting, a length-32 SoftSkill prefix on Qwen3.5-4B improves over no-skill prompting by 8.3 points on SearchQA, 42.1 points on LiveMath, and 1.3 points on DocVQA. Relative to SkillOpt, SoftSkill improves accuracy by 5.2 points on SearchQA and 12.5 points on LiveMath, while replacing hundreds to thousands of Markdown skill tokens with a few virtual tokens. We further study agentic execution as a harder boundary case, where sparse trajectory imitation provides useful signal but does not yet robustly compress long-horizon procedural behavior. More broadly, the results suggest that some task skills are better treated not as additional Markdown to be reinterpreted at inference time, but as compact latent controls over how a frozen model enters the task.
[AI-21] Leveraging systems non-linearity to tackle the scarcity of data in the design of Intelligent Fault Diagnosis Systems
Link: https://arxiv.org/abs/2606.20323
Authors: Giancarlo Santamato, Andrea Mattia Garavagno, Massimiliano Solazzi, Antonio Frisoli
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Deep Transfer Learning (DTL) allows for the efficient building of Intelligent Fault Diagnosis Systems (IFDS). On the other hand, DTL methods still heavily rely on large amounts of labelled data. Obtaining such an amount of data can be challenging when dealing with machines or structures faults. This document proposes a novel approach to the design of vibration-based IFDS using DTL in condition of strong data scarcity. A periodic multi-excitation level procedure leveraging intrinsic non-linearities of real-world systems is used to produce images that can be conveniently analysed by pre-trained Convolutional Neural Networks (CNNs) to diagnose faults. A new data visualization method and its augmentation technique are proposed in this paper to tackle the typical lack of data encountered during the design of IFDS. Experimental validation on a railway pantograph structure provides effective support for the proposed method.
[AI-22] Boundary Embedding Shaping with Adaptive Contrastive Learning for Graph Structural Disentanglement ICML2026
Link: https://arxiv.org/abs/2606.20283
Authors: Jiaqing Chen, Zidu Yin, Yichao Cai, Yuhang Liu, Zhen Zhang, Dong Gong, Javen Qinfeng Shi
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at ICML 2026
Abstract:Graph neural networks (GNNs) excel at aggregating neighbor information for classification, yet their performance is hindered by graph structural entanglement, where spurious correlations from semantically irrelevant neighbors contaminate node embeddings. This challenge is most acute for nodes near class boundaries in the embedding space, where amplified structural noise blurs decision boundaries and destabilizes predictions. Existing robust GNN methods largely treat all nodes uniformly, ignoring boundary vulnerabilities. In this paper, to improve classification performance, we tackle graph structural disentanglement by identifying boundary-region entanglement as the primary bottleneck and propose Boundary Embedding Shaping (BES), an adaptive contrastive learning GNN plug-in module that selectively suppresses spurious structural noise at decision boundaries with minimal model parameter perturbation. Extensive experiments demonstrate that BES consistently improves boundary discrimination and outperforms existing leading methods. Notably, BES boosts GCN performance by an average of 3.3% in node classification (up to 5.0% on WikiCS) and achieves superior accuracy in link prediction.
[AI-23] Lagrange: An Open-Vocabulary Energy-Based Sparse Framework for Generalized End-to-End Driving
Link: https://arxiv.org/abs/2606.20274
Authors: Shihao Ji, HongXi Li, Zihui Song, Mingyu Li
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Scaling end-to-end autonomous driving to complex, open-world environments requires perceptual models that generalize to anomalous scenarios and planners that produce kinematically valid trajectories. Existing paradigms face a distinct dichotomy between representational efficiency and generalization capacity. Dense models (e.g., occupancy networks), while geometrically robust, incur critical computational bottlenecks and struggle with high-level semantic reasoning. Conversely, sparse, query-based planners are efficient but reliant on closed-set definitions, rendering them vulnerable to out-of-distribution (OOD) events. Although recent Vision-Language-Action (VLA) models offer open-vocabulary reasoning, their autoregressive, discrete token generation fundamentally conflicts with the continuous, high-frequency control requirements of vehicle dynamics. To address this, we propose Lagrange, an open-vocabulary, computationally sparse driving framework based on Masked Latent Fields (MLF). Rather than relying on dense volumetric reconstructions or closed-set query mechanisms, Lagrange exploits Vision-Language Models (VLMs) to encode class-agnostic object proposals into continuous semantic visual tokens. We introduce an intent-driven masked cross-attention module that temporally filters irrelevant entities, decoding the attended tokens into an implicit continuous energy field defined over spatial coordinates. By framing decision-making as a Lagrangian action minimization problem spanning this energy field, we enforce strict compliance with vehicle kinematics while executing collision avoidance. Extensive offline evaluations on both standard (nuScenes) and long-tail (CODA) benchmarks demonstrate that Lagrange establishes a promising framework for robust, interpretable, and kinematically feasible open-world autonomy.
[AI-24] Confidence-Aware Automated Assessment of Student-Drawn Scientific Models
Link: https://arxiv.org/abs/2606.20264
Authors: Luyang Fang, Yingchuan Zhang, Jongchan Park, Zhaoji Wang, Ping Ma, Xiaoming Zhai
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Student-generated drawings are widely used in science education to assess learners’ conceptual understanding in modeling-based tasks aligned with the Next Generation Science Standards (NGSS). However, scoring such drawings requires expert human judgment to interpret complex visual representations, making large-scale assessment costly to implement and sustain in classroom settings. In this work, we study automated scoring of student-generated scientific drawings using a vision-based model. We evaluate a Vision Transformer (ViT) with parameter-efficient adaptation and propose a confidence-aware scoring framework that derives response-level confidence from test-time predictive distributions. This confidence signal enables selective automation by scoring high-confidence responses automatically while deferring uncertain cases for human review. Experiments on six NGSS-aligned middle school assessment items show that the proposed approach improves scoring reliability while supporting a practical trade-off between automated coverage and scoring risk, highlighting the value of confidence-aware methods for trustworthy educational assessment.
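The selective-automation step described in the abstract can be sketched as a simple thresholding rule. The sketch assumes that response-level confidence is the maximum of the class probabilities averaged over several test-time passes. The abstract does not specify how the confidence is derived, so this rule, the threshold value, and all function names are hypothetical.

```python
import numpy as np


def selective_score(prob_samples: np.ndarray, threshold: float = 0.8):
    """Score confident responses automatically and defer the rest.

    prob_samples has shape (n_passes, n_responses, n_classes) and holds
    class probabilities from several test-time passes (an assumption).
    Returns predicted labels, confidences, and a boolean mask that is
    True where the response would be scored automatically.
    """
    mean_probs = prob_samples.mean(axis=0)
    pred = mean_probs.argmax(axis=1)
    conf = mean_probs.max(axis=1)
    auto = conf >= threshold
    return pred, conf, auto


def coverage_and_risk(pred: np.ndarray, auto: np.ndarray, labels: np.ndarray):
    """Fraction scored automatically, and the error rate on that subset."""
    coverage = float(auto.mean())
    if not auto.any():
        return coverage, 0.0
    risk = float((pred[auto] != labels[auto]).mean())
    return coverage, risk


if __name__ == "__main__":
    samples = np.array([
        [[0.90, 0.05, 0.05], [0.40, 0.50, 0.10]],
        [[0.80, 0.10, 0.10], [0.50, 0.30, 0.20]],
    ])
    pred, conf, auto = selective_score(samples, threshold=0.8)
    print(pred, conf, auto)
    print(coverage_and_risk(pred, auto, labels=np.array([0, 1])))
```

Raising the threshold lowers coverage and, if the confidence signal is informative, lowers the error rate on the automatically scored subset. That is the coverage-versus-risk trade-off the abstract refers to.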
[AI-25] Finetuning Vision-Language-Action Models Requires Fewer Layers Than You Think
Link: https://arxiv.org/abs/2606.20246
Authors: Gia-Binh Nguyen, Trong-Bao Ho, Thien-Loc Ha, Khoa Vo, Philip Lund Møller, Quang T. Nguyen, Long Dinh, Tuan Dam, Vu Duong, Tung M. Luu, Trung Le, Tran Nguyen Le, Minh Vu, An Thai Le, Ngan Le, Daniel Sonntag, James Zou, Jan Peters, Duy M. H. Nguyen, Ngo Anh Vien
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:
Abstract:Vision-Language-Action (VLA) models pre-trained on massive video-robot datasets have revolutionized robotic manipulation, yet their multi-billion parameter architectures impose prohibitive computational burdens during downstream fine-tuning and real-time inference. In this work, we reveal a highly non-trivial architectural characteristic of these continuous control foundation policies (e.g., pi_0, GR00T-N1.5): despite being trained on diverse physical trajectories, they exhibit severe layer-wise representational redundancy. To exploit this, we introduce a structural compression pipeline that is entirely training-free, bypassing the need of existing methods to load full-scale models to learn optimized token reductions or dynamic layer selectors. Instead, using only a single forward pass via Centered Kernel Alignment to identify redundant layer features, we remove twin layers to permanently compress the model depth by up to 50% across both the VLM backbone and the continuous control policy head. Downstream fine-tuning of this streamlined architecture yields a dual acceleration benefit: a 40-50% reduction in training time and up to 30% faster real-time inference, while matching or exceeding full-scale base model performance. We comprehensively validate our method across three simulation benchmarks (LIBERO, RoboCasa, SimplerEnv) and 10 diverse real-world manipulation tasks across 4 unique robotic embodiments. These results prove that advanced VLAs require significantly fewer layers than previously assumed, offering a highly compute-efficient paradigm for scalable robot learning.
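Centered Kernel Alignment, which the abstract uses to identify redundant layers, has a compact linear form that is easy to sketch. The `linear_cka` function below follows the standard linear CKA definition. The `twin_layer_candidates` helper and its 0.99 threshold are hypothetical illustrations of how adjacent layers might be flagged, not the paper's selection procedure.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two activation matrices of shape (n_samples, dim).

    Features are mean-centered per dimension. The score is 1 for identical
    representations and is invariant to orthogonal transforms and to
    isotropic scaling of either input.
    """
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, "fro")
    norm_y = np.linalg.norm(y.T @ y, "fro")
    return float(cross / (norm_x * norm_y))


def twin_layer_candidates(activations, threshold: float = 0.99):
    """Indices i where layer i and layer i + 1 look nearly identical.

    Hypothetical rule: flag adjacent layers whose linear CKA is at least
    `threshold`. `activations` is a list of (n_samples, dim) arrays
    collected from one forward pass over a batch.
    """
    return [
        i
        for i in range(len(activations) - 1)
        if linear_cka(activations[i], activations[i + 1]) >= threshold
    ]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    h0 = rng.standard_normal((50, 8))
    h1 = 2.0 * h0  # a scaled copy, so CKA with h0 is 1
    h2 = rng.standard_normal((50, 8))  # unrelated features
    print(twin_layer_candidates([h0, h1, h2]))
```

Because only activations from a single forward pass are needed, a score of this kind can be computed without any training, which is consistent with the training-free framing in the abstract.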
[AI-26] Navigating Unreliable Parametric and Contextual Knowledge: Explicit Knowledge Conflict Resolution for LLM Inference
Link: https://arxiv.org/abs/2606.20245
Authors: Huang Peng, Jiuyang Tang, Weixin Zeng, Hao Xu, Xiang Zhao
Categories: Artificial Intelligence (cs.AI)
Comments: 12 pages, 3 figures
Abstract:Large language models (LLMs) have achieved strong performance across a wide range of language-based tasks by leveraging both extensive parametric knowledge and in-context learning ability, enabling them to incorporate external information provided in the input prompt. However, the integration of external knowledge can introduce conflicts, not only between the model’s internal parametric knowledge and the external information, but also among multiple pieces of external contexts. Existing approaches typically assume that either the model or the provided context is reliable, overlooking the possibility that both sources may contain errors, and avoid conflicts by privileging one source over the other, rather than actively resolving inconsistencies. To address these limitations, we propose a novel framework MACR for LLM knowledge conflict resolution that moves beyond the conventional binary choice paradigm and incorporates an explicit conflict-resolution mechanism based on a multi-agent reasoning approach. Specifically, we first propose an adaptive knowledge assessment and retrieval approach that employs a modified semantic entropy measure to quantify an LLM’s confidence in its answer to a given query. Based on this confidence estimation, MACR either externalizes the model’s internal knowledge as textual representations or retrieves relevant external knowledge when internal knowledge is insufficient, generating basic contexts for subsequent reasoning. Then we introduce an inductive multi-agent reasoning framework with three specialized agents that, respectively, induce explicit rules, analyze potential conflicts, and resolve inconsistencies across all available contexts. Empirical results demonstrate that MACR significantly outperforms state-of-the-art baselines across benchmarks, while also providing interpretable resolutions of explicit conflicts.
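The confidence-based routing step can be illustrated with plain semantic entropy over sampled answers. The sketch below is not MACR's modified measure. It assumes answers are grouped by a simple string normalization, where a real system would group by semantic equivalence, and it uses a hypothetical entropy threshold to choose between the model's own knowledge and external retrieval.

```python
import math
from collections import Counter
from typing import Callable, Sequence


def semantic_entropy(
    answers: Sequence[str],
    canonical: Callable[[str], str] = lambda s: s.strip().lower(),
) -> float:
    """Entropy (in nats) of the distribution over answer clusters.

    Assumption: two answers belong to the same cluster when their
    canonical forms match. Zero means every sample agrees.
    """
    counts = Counter(canonical(a) for a in answers)
    total = len(answers)
    entropy = 0.0
    for c in counts.values():
        p = c / total
        entropy -= p * math.log(p)
    return entropy


def choose_knowledge_source(answers: Sequence[str], max_entropy: float = 0.5) -> str:
    """Hypothetical routing rule: trust the model only when it is consistent."""
    return "parametric" if semantic_entropy(answers) <= max_entropy else "retrieve"


if __name__ == "__main__":
    print(choose_knowledge_source(["Paris", "paris", " PARIS "]))
    print(choose_knowledge_source(["Paris", "Lyon", "Marseille", "Nice"]))
```

Low entropy signals that repeated samples agree, which is treated here as a proxy for confidence. It does not by itself show that the agreed answer is correct, which is why the abstract pairs it with an explicit conflict-resolution stage.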
[AI-27] Thermodynamic Measure of Intelligence
Link: https://arxiv.org/abs/2606.20231
Authors: Ishanu Chattopadhyay
Categories: Artificial Intelligence (cs.AI); Statistical Mechanics (cond-mat.stat-mech); Information Theory (cs.IT); Mathematical Physics (math-ph); Adaptation and Self-Organizing Systems (nlin.AO)
Comments:
Abstract:Can intelligence be measured? We propose that intelligence can be defined as the lawful amplification of rare but valid futures: a system increases the probability of outcomes that would be unlikely under passive dynamics but remain admissible under the constraints of the domain. We start with the premise that an intelligent system must model the world and its own place within it. Because the system is part of the world it models, this leads naturally to recursive self-simulation: the system represents futures in which its own actions are part of the trajectory. Our central results give a necessity statement and a conditional near-sufficiency statement connecting this architecture to a precise thermodynamic measure of lawful amplification of rare-valid futures: high rare-valid lift is impossible unless the internal simulation identifies rare-valid futures with high fidelity; conversely, when rare-valid fidelity is high and the simulation contains an effective policy, the achievable lift approaches the actuation-limited optimum. Thus recursive self-simulation is not merely a plausible feature of intelligence but, under the stated assumptions, is necessary and nearly sufficient for high thermodynamic intelligence. The resulting framework makes intelligence measurable on a universal scale, from passive matter and feedback controllers, large language models, and humans as text generators to Maxwell-demon-like information engines.
[AI-28] QMFOL: Benchmarking Large Language Model Reasoning via Quantifiable Monadic First-Order Logic Test Case Generation
Link: https://arxiv.org/abs/2606.20227
Authors: Xinyi Zheng, Ling Shi, Tianlong Yu, Yongxin Zhao, Lorenz Goette, Kailong Wang
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:
Abstract:Large Language Models (LLMs) have made significant progress in reasoning, particularly in deductive reasoning, which is crucial for high-stakes decision-making. As models improve, evaluation benchmarks should evolve to keep pace. However, existing benchmarks lack fine-grained control over logical complexity and struggle to balance semantic diversity with logical consistency. To address these issues, we propose QMFOL, an automated framework for generating monadic first-order logic reasoning tasks with quantifiable and controllable complexity. It constructs formal logical structures using conjunction and disjunction patterns, enabling precise control over reasoning depth, width, label types, and distractors. These structures are then translated into natural language via LLMs, with logical consistency ensured through round-trip verification using an external prover. Based on our framework, we build QMFOLBench, a benchmark comprising 2880 instances with 960 configurations across diverse logical and semantic dimensions. Evaluations on six large reasoning models (LRMs) and two LLMs show that performance degrades and computational overhead increases with rising logical complexity. Models perform better on True-labeled tasks than on False or Unknown ones, and exhibit sensitivity to semantic variation. Overall, QMFOL offers a scalable and reliable approach for constructing deductive reasoning benchmarks with controllable complexity, enabling more precise evaluation of reasoning capabilities in modern language models.
[AI-29] Learner-based Concept Drift Detection: Analysis and Evaluation
Link: https://arxiv.org/abs/2606.20216
Authors: Md Moman Ul Haque Khan, Samira Sadaoui
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 2 authors, 29 pages
Abstract:Machine learning algorithms deployed for evolving streaming environments must handle the non-stationary data distributions, commonly referred to as concept drift. The presence of concept drift poses a major challenge for many real-world applications because it can severely degrade their predictive performance, hindering their ability to support robust decision-making. Consequently, the timely and efficient detection of drift events is critical for sustaining high accuracy over time. This study examines theoretically the concept drift characteristics and numerous drift detection algorithms across several categories. Furthermore, we evaluate their performance on both synthetic and real-world datasets exhibiting diverse streaming scenarios and drift characteristics, such as abrupt and gradual changes. This study aims to enhance understanding of the complex notion of concept drift characteristics and behavior of drift detectors, along with their applicability to diverse contexts.
[AI-30] Augmenting Game AI with Deep Reinforcement Learning
Link: https://arxiv.org/abs/2606.20210
Authors: Alessandro Sestini, Joakim Bergdahl, Amir Baghi, Jean-Philippe Barrette-LaPierre, Florian Fuchs, Linus Gisslén
Categories: Artificial Intelligence (cs.AI)
Comments: Vision paper, published in Conference on Games 2026
Abstract:Immersion in video games depends not only on graphics, audio, and game mechanics, but also on the quality of in-game characters. Producing believable characters, or game AI, remains a significant challenge as behavioral complexity is hard to capture with hand-coded systems. Game AI is a source of immersion and engagement; however, the limitations stemming from the challenges of creating game AI often lead to frustration and the breaking of the illusion of realism within the game. The introduction of machine learning models opens the door to creating more believable, authentic, and relatable characters in games. The promise is that they either learn from interacting with the game, or from player data, to develop true human-like behavior. In this paper, we envision more applications of reinforcement learning for game AI in the future. For this to materialize, current research limitations are prohibitive to broad deployment across game genres. Therefore, we propose a framework for training reinforcement learning models with a set of requirements in mind that are suited towards game AI and game development. We present examples of games with reinforcement learning-augmented game AI and describe the practicalities of deploying player-facing machine learning agents in modern games. Furthermore, we identify bottlenecks and hard problems in these areas, which we believe offer promising research directions to accelerate the adoption of machine learning in game AI for the video game industry.
[AI-31] FlowMaps: Modeling Long-Term Multimodal Object Dynamics with Flow Matching
Link: https://arxiv.org/abs/2606.20209
Authors: Francesco Argenziano,Miguel Saavedra-Ruiz,Sacha Morin,Charlie Gauthier,Daniele Nardi,Liam Paull
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:
Abstract:Joint spatial and temporal understanding of 3D scenes is a crucial requirement for robots deployed in everyday household environments. Such agents must not only comprehend and navigate spatial layouts, but also reason about how these spaces evolve over time. In particular, humans interact with objects daily, causing them to change position throughout the environment and making it difficult for robots to reliably associate current observations with previously seen objects. However, these interactions are not random: human habits and routines induce spatio-temporally consistent patterns in object locations, which robotic agents can potentially learn and then exploit for downstream tasks such as navigation. To this end, we introduce FlowMaps, a latent flow matching model for estimating multimodal distributions over the future locations of dynamic objects in a continuous 3D space. By learning the implicit dependencies among objects and their temporal evolution, FlowMaps predicts likely changes in object locations conditioned on past human interactions, while supporting generalization across previously unseen environments that share similar object routines. To demonstrate the utility of this method, we deploy FlowMaps in a downstream dynamic Object Navigation task in both simulated and real-world environments. Across more than 600 episodes, FlowMaps outperforms state-of-the-art approaches, showing that modeling object dynamics through continuous, multimodal spatio-temporal distributions improves robotic search and navigation in changing household environments. Code and additional material is available at this https URL.
[AI-32] Beyond Accuracy: Measuring Logical Compliance of Predictive Models
Link: https://arxiv.org/abs/2606.20208
Authors: Guillaume Olivier Delplanque,Pierre Genevès(LIG),Nabil Layaïda(LIG,TYREX),Zephirin Faure
Subjects: Artificial Intelligence (cs.AI); Databases (cs.DB); Neural and Evolutionary Computing (cs.NE)
Comments:
Abstract:Machine learning models are predominantly evaluated through predictive performance metrics such as ranking quality, prediction error, or classification accuracy. While these metrics effectively quantify how closely predictions match the ground truth, they do not assess whether model outputs respect predefined logical or domain-specific constraints. In high-stakes applications, including healthcare, finance, and autonomous systems, logical consistency can be as critical as predictive accuracy, yet no standard metric captures this dimension. We introduce the Rule Violation Score (RVS), a complementary evaluation metric that quantifies the extent to which a predictive model respects a given set of logical rules, independently of predictive accuracy. RVS treats hard rules (strict constraints) and soft rules (statistical regularities) differently, can be evaluated on any dataset and on any predictive model expressed over a relational vocabulary, and can be computed using SQL queries that are automatically generated for Horn rules. Beyond evaluating models, RVS can also evaluate the logical consistency of training datasets and help identify poorly defined rules. We evaluate RVS on three benchmarks covering knowledge graph link prediction and relational regression, including rule-based, embedding-based, and neuro-symbolic predictive models. Our results demonstrate that two models achieving comparable predictive accuracy can exhibit substantially different levels of logical compliance, revealing differences in model behavior that standard metrics fail to capture.
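The abstract notes that rule compliance can be checked with SQL queries generated from Horn rules. A minimal sketch of that idea follows, assuming a hypothetical rule `born_in(x, c) AND city_in(c, k) -> nationality(x, k)` and hypothetical table names. The ratio computed here (violated groundings divided by all body groundings) is an illustrative stand-in and is not the paper's RVS definition, which the abstract does not spell out.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with toy facts (all names are hypothetical)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE born_in (person TEXT, city TEXT);
        CREATE TABLE city_in (city TEXT, country TEXT);
        CREATE TABLE nationality (person TEXT, country TEXT);  -- model predictions
    """)
    conn.executemany("INSERT INTO born_in VALUES (?, ?)",
                     [("ann", "paris"), ("bob", "rome"), ("cy", "paris")])
    conn.executemany("INSERT INTO city_in VALUES (?, ?)",
                     [("paris", "france"), ("rome", "italy")])
    conn.executemany("INSERT INTO nationality VALUES (?, ?)",
                     [("ann", "france"), ("bob", "spain"), ("cy", "france")])
    return conn


def violation_ratio(conn):
    """Fraction of rule-body groundings whose head is missing from the predictions."""
    body = conn.execute("""
        SELECT COUNT(*) FROM born_in b JOIN city_in c ON b.city = c.city
    """).fetchone()[0]
    violations = conn.execute("""
        SELECT COUNT(*) FROM born_in b JOIN city_in c ON b.city = c.city
        WHERE NOT EXISTS (
            SELECT 1 FROM nationality n
            WHERE n.person = b.person AND n.country = c.country
        )
    """).fetchone()[0]
    return violations / body if body else 0.0


if __name__ == "__main__":
    print(violation_ratio(build_demo_db()))
```

In this toy data only the prediction for `bob` contradicts the rule, so one of the three body groundings is violated. A soft rule could be handled by comparing this ratio against an expected tolerance rather than requiring zero, though that choice is also an assumption.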
[AI-33] Implicit Semantic-Aware Communication Based on Hypergraph Reasoning
Link: https://arxiv.org/abs/2606.20162
Authors: Yiwei Liao,Shurui Tu,Yong Xiao,Yingyu Li,Guangming Shi
Subjects: Artificial Intelligence (cs.AI); Information Theory (cs.IT); Networking and Internet Architecture (cs.NI)
Comments: This work is accepted at IEEE Transactions on Communications
Abstract:Semantic-aware communication has emerged as a transformative paradigm for next-generation communication systems, shifting the fundamental goal from transmitting bit-level symbols to reliably recovering and understanding the semantic meaning of information. Previous studies have demonstrated that representing the semantic content of source messages as graph-based structures can significantly improve communication efficiency and the accuracy of semantic inference at the receiver. However, existing solutions typically employ graphs that capture only pairwise relationships, thereby neglecting higher-order implicit correlations commonly observed in real-world scenarios, such as group interactions, multi-entity associations, and complex relational contexts. This limitation reduces semantic expressiveness and makes semantic inference susceptible to ambiguity and performance degradation, particularly under noisy or corrupted channel conditions. To address these issues, this paper proposes a novel hypergraph-based implicit semantic reasoning framework, HISR, which leverages hypergraphs to represent complex multi-entity relationships among semantic knowledge entities. In HISR, entities and their associated higher-order relations are mapped into dedicated semantic subspaces tailored to distinct relational contexts. This design not only disentangles diverse semantic interactions to mitigate the over-smoothing effects commonly found in traditional graph embedding methods but also enables robust semantic inference even when partial information loss occurs during transmission. Numerical results show that the proposed HISR achieves up to a 36.6% improvement in implicit semantic interpretation accuracy over the state-of-the-art benchmarks.
[AI-34] Modularity-Free Conflict-Averse Training for Generalized PINNs ICASSP2026
Link: https://arxiv.org/abs/2606.20156
Authors: Heejo Kong,Beomchul Park,Sung-Jin Kim,Seong-Whan Lee
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted by ICASSP 2026
Abstract:Physics-informed neural networks (PINNs) have become a powerful framework for solving PDEs by embedding physical laws into differentiable objectives. Despite their advances, training PINNs remains fragile: recent conflict-averse optimization schemes alleviate gradient interference between residual and boundary losses, but we show that their effectiveness deteriorates as model capacity increases. In this paper, we identify a capacity-induced failure mode, where overparameterized networks undergo functional modularity, self-partitioning into task-exclusive modules that suppress cross-objective interaction and hinder convergence toward Pareto-stationary points. To address this issue, we propose a novel framework, Modular-Sparsity Synchronization (ModSync), which integrates structural optimization into conflict-averse training by penalizing task-exclusive connections while preserving interaction-promoting pathways. Extensive experiments across diverse PDE benchmarks demonstrate that ModSync consistently prevents capacity-driven failures, sustains robust cross-objective coupling, and achieves state-of-the-art accuracy. Code is available at this https URL.
[AI-35] Hybrid ANN-SNN Pipeline with Local Plasticity
Link: https://arxiv.org/abs/2606.20151
Authors: Denis Larionov,Khairutin Shtanchaev,Mikhail Kiselev,Mikhail Korovin,Ivan Tugoy
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments: 9 pages, 4 figures, source-code available
Abstract:This work proposes a hybrid ANN-SNN pipeline that effectively leverages the rich embeddings of pretrained artificial neural networks (ANNs) to enable high-performance spiking neural networks (SNNs). The architecture couples a pretrained EfficientNet encoder with a CoLaNET spiking classifier. We convert the encoder’s activations into spike trains via rate-coding and train the subsequent SNN classifier using local, biologically inspired learning rules, bypassing end-to-end gradient propagation. This approach achieves 99.09% accuracy on a 64-class ImageNet benchmark, demonstrating performance on par with conventional deep networks. The work presents a biologically plausible and efficient framework for adapting powerful pretrained encoders to downstream spiking neural network tasks.
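The rate-coding step mentioned in the abstract can be illustrated with a short sketch. The version below is a generic Bernoulli rate encoder and is not taken from the paper: the normalization by the peak activation, the number of time steps, and the function name `rate_encode` are all assumptions made for illustration.

```python
import numpy as np


def rate_encode(activations, n_steps=100, rng=None):
    """Turn non-negative activations into binary spike trains.

    Each unit fires independently at every time step with a probability
    proportional to its activation (scaled so the largest activation maps to 1).
    Returns a boolean array of shape (n_steps, *activations.shape).
    """
    rng = np.random.default_rng() if rng is None else rng
    a = np.clip(np.asarray(activations, dtype=float), 0.0, None)  # negatives never fire
    peak = a.max()
    rates = a / peak if peak > 0 else np.zeros_like(a)
    # A uniform draw below the rate produces a spike at that time step.
    return rng.random((n_steps,) + a.shape) < rates


if __name__ == "__main__":
    spikes = rate_encode([0.0, 0.5, 1.0], n_steps=200, rng=np.random.default_rng(0))
    print(spikes.mean(axis=0))  # empirical firing rate per unit
```

The empirical firing rate of each unit approximates its normalized activation, which is the information a downstream spiking classifier with local learning rules would consume.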
[AI-36] BIM-Edit: Benchmarking Large Language Models for IFC-Based Building Information Modeling
Link: https://arxiv.org/abs/2606.20146
Authors: Bharathi Kannan Nithyanantham,Clemens Kujat,Tobias Sesterhenn,Stefan Telgmann,Jörn Plönnigs,Stefan Lüdtke,Christian Bartelt
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models (LLMs) are increasingly applied to computer-aided design (CAD) to generate design artifacts from textual instructions. In engineering practice, this requires more than creating new geometry: models must also understand existing scenes, edit them correctly, and preserve semantics and relations. However, many CAD benchmarks focus on creating new models rather than editing existing ones, and mostly evaluate geometric correctness. We introduce BIM-Edit, a benchmark for evaluating LLMs on natural-language editing of Building Information Models (BIM) represented in the Industry Foundation Classes (IFC) format. BIM provides a challenging testbed because building models encode geometry together with semantic and relational structure. BIM-Edit contains 324 editing tasks spanning 11 realistic building models and 36 synthetic scenes. Tasks are expressed using three instruction categories - direct, spatial, and topological - covering both explicit and scene-grounded edits. We evaluate outputs along three dimensions: geometric accuracy, semantic validity, and topological consistency. Across evaluated LLMs, the best-performing model achieves only 49.5% average score across the three metrics, and no model fully solves more than 3.4% of tasks. These results demonstrate a substantial gap between current LLM capabilities and the requirements of structured engineering design workflows.
[AI-37] Frequency-Aware Flow Matching for Continuous and Consistent Robotic Action Generation
Link: https://arxiv.org/abs/2606.20135
Authors: Jianing Guo,Fangzheng Chen,Zihao Mao,Wong Lik Hang Kenny,Zhenhong Wu,Yu Li,Yishuai Cai,Yuanpei Chen,Yikun Ban,Kai Chen,Qi Dou,Yaodong Yang,Xianglong Liu,Huijie Zhao,Simin Li
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:
Abstract:Flow matching has emerged as a standard paradigm for robotic manipulation owing to its strong expressive power for modelling complex, multimodal action distributions, alongside similar approaches like diffusion policy. However, existing methods rely on discretized action chunks, making them brittle to demonstrations collected at heterogeneous control frequencies and prone to temporally inconsistent actions that degrade control stability. In this paper, we propose Frequency-Aware Flow Matching (FAFM), which outputs continuous, temporally consistent actions. To handle heterogeneous frequency input, we transform discrete action sequences into the frequency domain with the discrete cosine transform (DCT), perform flow matching over the resulting coefficients, and reconstruct continuous actions via cosine basis expansion. To generate temporally consistent actions, we regularize the first-order temporal derivative to promote smooth actions. This corresponds to a Sobolev-type constraint that suppresses high-frequency errors and discourages abrupt action changes. Our FAFM is simple, introduces no additional network parameters, and applies to standalone flow-matching policies and vision-language action models. Across a synthetic toy benchmark, obstacle avoidance, LapGym, and LIBERO, FAFM improves success rates, multimodal expressivity, motion smoothness, convergence speed, and robustness to mechanical bias and mixed-frequency input. These gains are consistent when deployed on a real-world Franka robot. Code available at this https URL.
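The DCT-then-cosine-expansion step can be sketched independently of any policy network. The snippet below uses a plain DCT-II over an action chunk and evaluates the inverse expansion at arbitrary normalized times in [0, 1], so the same coefficients can be resampled at any control frequency. The function names and the normalization are assumptions for illustration; the flow-matching model and the derivative regularizer described in the abstract are not shown.

```python
import numpy as np


def dct_coeffs(actions):
    """DCT-II of an action chunk of shape (N,) or (N, action_dim)."""
    actions = np.asarray(actions, dtype=float)
    n = actions.shape[0]
    tau = (np.arange(n) + 0.5) / n                   # normalized sample times
    basis = np.cos(np.pi * np.outer(np.arange(n), tau))  # (n_coeff, n_samples)
    return basis @ actions


def eval_continuous(coeffs, tau):
    """Evaluate the cosine expansion at normalized times tau in [0, 1]."""
    coeffs = np.asarray(coeffs, dtype=float)
    n = coeffs.shape[0]
    tau = np.atleast_1d(np.asarray(tau, dtype=float))
    weights = np.full(n, 2.0 / n)
    weights[0] = 1.0 / n                              # DC term is not doubled
    scaled = coeffs * weights.reshape((n,) + (1,) * (coeffs.ndim - 1))
    basis = np.cos(np.pi * np.outer(tau, np.arange(n)))  # (n_query, n_coeff)
    return basis @ scaled


if __name__ == "__main__":
    chunk = np.sin(np.linspace(0.0, 3.0, 8))          # toy 8-step, 1-D action chunk
    coeffs = dct_coeffs(chunk)
    dense = eval_continuous(coeffs, np.linspace(0.0, 1.0, 50))  # resample at 50 points
    print(dense.shape)
```

Evaluating the expansion at the original sample times `(n + 0.5) / N` recovers the input chunk exactly, while any other set of times yields a smooth interpolation between those samples.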
[AI-38] Dual-Agent Framework for Cross-Model Verified Translation of Natural-Language Protocols into Robotic Laboratory Platform
Link: https://arxiv.org/abs/2606.20120
Authors: Hyeonna Choi,Jung Yup Kim,Hyuneui Lim,Seunggyu Jeon
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:
Abstract:Biological experiment protocols are written in natural language, whereas automation systems rely on predefined control commands, creating a semantic gap that limits autonomous execution. Microplate-based automatic experiments are particularly challenging due to the need to simultaneously control well mapping, sample-reagent combinations, replicate placement, and parallel dispensing. This study proposes an agent-based protocol translation framework that converts natural-language microplate-based protocols into executable control commands for a robotic laboratory platform. A Parser Agent formalizes the natural-language protocol into a structured representation, and a rule-based mapping engine deterministically incorporates the operational constraints of the robotic laboratory platform to generate device-level control commands. A heterogeneous LLM Validation Agent verifies completeness, parameter accuracy, and execution order, and triggers a self-correction loop with structured feedback when errors are detected. A sweep involving 7 Parsers and 3 Validators on randomly selected ELISA protocols evaluates how model scale and Validator type affect translation accuracy and pass rates under cross-model verification. The accuracy-latency trade-off is further verified by comparing the rule-based mapping of the proposed framework with LLM end-to-end direct mapping. Finally, Bradford assay-based protein quantification using a microplate was demonstrated on a robotic laboratory platform, validating end-to-end autonomous execution from natural-language protocols to real-world experiments. The proposed framework provides a flexible approach to narrowing the semantic gap between natural-language protocols and microplate-based self-driving laboratories.
[AI-39] Sensorimotor World Models: Perception for Action via Inverse Dynamics
Link: https://arxiv.org/abs/2606.20104
Authors: Petr Ivashkov,Randall Balestriero,Bernhard Schölkopf
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Perception for action suggests that representations of the world should be shaped not by visual fidelity alone, but by their relevance for actions. At the same time, latent JEPA-style world models advocate learning compact predictive states from high-dimensional observations to facilitate the prediction of future states, but end-to-end training of these models is nontrivial because representations may collapse if our only goal is to construct a latent state that is easy to predict. We introduce a sensorimotor world model (SMWM): a latent world model trained end-to-end with inverse dynamics regularization. This single regularizer addresses both issues: it prevents representation collapse and induces action-aligned representations. By forcing latent states to preserve information about the action underlying a transition, it biases the model toward the controllable degrees of freedom of the environment while discarding uncontrollable distractors. This yields stable latent world models trained from offline, reward-free trajectories, without frozen encoders, exponential moving averages, or complex latent regularizers. Empirically, SMWM learns compact, interpretable latent spaces and enables competitive planning performance across simple 2D and 3D control tasks.
[AI-40] Hybrid Diffusion Transformer for Instruction-Guided Audio Editing via Rectified Flow
Link: https://arxiv.org/abs/2606.20101
Authors: Liting Gao,Yonggang Zhu,Yaru Chen,Dongyu Wang,Shubin Zhang,Zhenbo Li,Jean-Yves Guillemaut,Wenwu Wang
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments:
Abstract:Audio editing aims to modify specific content in an existing audio clip according to a natural language instruction while preserving the remaining acoustic content. Despite the remarkable progress of diffusion models, existing training-based editing methods mainly rely on the local inductive biases and cross-attention interaction in convolutional U-Net backbones, which often hinder long-range semantic alignment and precise understanding and localization of instructions. In contrast, diffusion transformers provide stronger global modeling and multimodal fusion, but existing editing architectures usually adopt a simple stack of MMDiT and DiT blocks. Applying joint attention over concatenated audio and text tokens in all blocks results in quadratic complexity with respect to token length. To balance editing performance and efficiency, we propose a hybrid two-stage diffusion transformer architecture for instruction-guided audio editing based on rectified flow matching. It performs joint attention over audio and text tokens to establish coarse semantic alignment at the low-resolution stage, then switches to alternating joint-attention and cross-attention blocks to refine editing details at the high-resolution stage. This coarse-to-fine strategy enables efficient and accurate instruction-guided audio editing. Experiments show that the proposed framework achieves notable performance gains on challenging editing tasks involving overlapping audio events and complex instructions, while substantially improving editing efficiency with a compact model.
[AI-41] Multi-Head Attention-Based Feature Extractor Integration with Soft Actor-Critic for Porosity Prediction and Process Parameter Optimization in Additive Manufacturing
Link: https://arxiv.org/abs/2606.20087
Authors: Kianoush Aqabakee,Leonardo Stella
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Additive manufacturing process optimization requires precise parameter control to minimize defects such as porosity. Traditional reinforcement learning (RL) approaches using discrete action spaces suffer from slow convergence and susceptibility to local optima, limiting their effectiveness for high-precision manufacturing tasks. This study addresses these limitations by employing a continuous action space combined with a novel architecture that integrates a multi-head attention mechanism with the Soft Actor-Critic (SAC) algorithm. The attention-based feature extractor enhances the agent’s ability to capture subtle variations in low-dimensional input features, enabling more effective exploration-exploitation balance for navigating value spaces with local minima. We validate our approach on porosity prediction and process parameter optimization in laser powder bed fusion, demonstrating faster convergence and higher final reward values compared to standard RL methods including DQN, PPO, TD3, and vanilla SAC. The proposed methodology achieves a convergence value of 322.79 within 14 episodes, outperforming existing approaches while maintaining stability throughout training.
[AI-42] Residual-Space Evolutionary Optimization via Flow-based Generative Models ICML2026
Link: https://arxiv.org/abs/2606.20084
Authors: Zhuo Cao,Lena Krieger,Fernanda Nader,Xuan Zhao,Hanno Scharr,Ira Assent
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted by ICML 2026 Workshop SPIGM, 5 pages, 3 figures
Abstract:Data editing with generative methods typically requires differentiable objectives and gradient-based search. However, these assumptions break down in flow-based settings, where edits are performed through forward and backward integration and often involve non-differentiable or black-box objectives. We introduce residual-space evolutionary optimization, a model-agnostic framework that addresses this gap by combining flow-based generative editing with evolutionary algorithms. Building on the observation that conditional flow matching (CFM) can disentangle condition-controlled factors from instance-specific residuals, our framework directly operates in residual space and separates two complementary search regimes: self-pollination performs local exploitation through feature-preserving residual refinement, and cross-pollination promotes broader exploration by recombining residuals across heterogeneous samples. As a proof of concept, we validate on MorphoMNIST, a benchmark dataset for counterfactual generation, and on crystal data, demonstrating that this exploration–exploitation decomposition provides a useful mechanism for balancing target alignment, instance preservation, and diversity, and extends beyond images to real-world scientific domains.
[AI-43] Process-Verified Reinforcement Learning for Theorem Proving via Lean
Link: https://arxiv.org/abs/2606.20068
Authors: Minsu Kim,Se-Young Yun
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:While reinforcement learning from verifiable rewards (RLVR) typically has relied on a single binary verification signal, symbolic proof assistants in formal reasoning offer rich, fine-grained structured feedback. This gap between structured processes and unstructured rewards highlights the importance of feedback that is both dense and sound. In this work, we demonstrate that the Lean proof assistant itself can serve as a symbolic process oracle, supplying both outcome-level and fine-grained tactic-level verified feedback during training. Proof attempts are parsed into tactic sequences, and Lean’s elaboration marks both locally sound steps and the earliest failing step, yielding dense, verifier-grounded credit signals rooted in type theory. We incorporate these structured rewards into a GRPO-style reinforcement learning objective with first-error propagation and first-token credit methods that balance outcome- and process-level advantages. Experiments with STP-Lean and DeepSeek-Prover-V1.5 show that tactic-level supervision outperforms outcome-only baselines in most settings, delivering improvements on benchmarks such as MiniF2F and ProofNet. Beyond empirical gains, our study highlights a broader perspective: symbolic proof assistants are not only verifiers at evaluation time, but can also act as process-level reward oracles during training. This opens a path toward reinforcement learning frameworks that combine the scalability of language models with the reliability of symbolic verification for formal reasoning.
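One way to picture tactic-level credit from an earliest-failing-step signal is sketched below. This is a guess at the general shape only: the reward values, the zero credit for steps after the first error, and the linear blend controlled by `alpha` are all assumptions, and the paper's actual first-error propagation and first-token credit rules are not described in the abstract.

```python
def tactic_rewards(n_tactics, first_error=None, ok=1.0, bad=-1.0):
    """Per-tactic process rewards given the index of the earliest failing tactic.

    Tactics before the first error are treated as verified, the failing tactic
    is penalized, and later tactics get no credit because they were never checked.
    """
    if first_error is None:
        return [ok] * n_tactics
    if not 0 <= first_error < n_tactics:
        raise ValueError("first_error must index a tactic in the proof attempt")
    return [ok] * first_error + [bad] + [0.0] * (n_tactics - first_error - 1)


def blended_advantages(outcome_adv, process_rewards, alpha=0.5):
    """Mix a sequence-level outcome advantage with per-tactic process rewards."""
    return [alpha * outcome_adv + (1.0 - alpha) * r for r in process_rewards]


if __name__ == "__main__":
    rewards = tactic_rewards(4, first_error=2)
    print(rewards)
    print(blended_advantages(-1.0, rewards, alpha=0.5))
```

Under this toy scheme a failed proof still gives positive signal to the tactics that were verified before the failure, which is the kind of dense feedback a single binary outcome reward cannot provide.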
[AI-44] Autonomous Event-Driven Multi-Agent Orchestration for Enterprise AI at Scale
Link: https://arxiv.org/abs/2606.20058
Authors: Harsh Rao Dhanyamraju,Leonidas Raghav,Aaron Lee
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Enterprise AI aims to move toward continuous event monitoring, detection, and action across specialist agents, yet existing multi-agent systems largely assume discrete request-response workflows and remain underexplored at enterprise scale. We evaluate DAG Plan and Execute and ReAct across 208 production-derived enterprise scenarios spanning Persona (10 agents), Department (20-80), and Enterprise (200) scales, and introduce a Task Manager for continuous operation via priority inference, related-event merging, and preemption. Results show that scale, not task complexity, dominates orchestration performance: both architectures perform well at small scale but degrade at enterprise scale as agent discovery noise becomes the primary bottleneck, with simple tasks degrading more sharply than complex ones. DAG Plan and Execute offers higher precision and structured parallelization at smaller scales, but its higher overhead worsens at enterprise scale; ReAct is more robust by handling failures incrementally. The Task Manager reduces high-priority queue latency by 14-75% and improves related-event correctness by over 20 percentage points at enterprise scale.
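The Task Manager's related-event merging and preemption can be pictured as a priority queue keyed by incident. The sketch below is a hypothetical illustration, not the paper's design: the class and method names are invented, priorities are supplied by the caller (the priority-inference step is not modelled), and lower numbers mean higher urgency.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass
class Task:
    key: str                      # identifies a group of related events
    priority: int                 # lower value = more urgent
    events: list = field(default_factory=list)


class TaskManager:
    def __init__(self):
        self._heap = []           # entries: (priority, insertion order, key)
        self._tasks = {}          # pending tasks by key
        self._order = itertools.count()

    def submit(self, key, priority, event):
        """Queue an event, merging it into a pending task with the same key."""
        task = self._tasks.get(key)
        if task is None:
            task = Task(key, priority)
            self._tasks[key] = task
            heapq.heappush(self._heap, (priority, next(self._order), key))
        elif priority < task.priority:
            task.priority = priority  # escalate; the old heap entry becomes stale
            heapq.heappush(self._heap, (priority, next(self._order), key))
        task.events.append(event)
        return task

    def _drop_stale(self):
        """Discard heap entries that no longer match a pending task."""
        while self._heap:
            priority, _, key = self._heap[0]
            task = self._tasks.get(key)
            if task is not None and task.priority == priority:
                return
            heapq.heappop(self._heap)

    def should_preempt(self, running_priority):
        """True if a pending task is more urgent than the one currently running."""
        self._drop_stale()
        return bool(self._heap) and self._heap[0][0] < running_priority

    def pop(self):
        """Remove and return the most urgent pending task, or None if empty."""
        self._drop_stale()
        if not self._heap:
            return None
        _, _, key = heapq.heappop(self._heap)
        return self._tasks.pop(key)
```

Merging by key means a burst of alerts about the same incident becomes one task carrying all of its events, and an escalation simply re-queues that task at the higher urgency.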
[AI-45] A Neuromorphic Reinforcement Learning Framework for Efficient Pathfinding in Robotic Mobile Fulfillment Systems
Link: https://arxiv.org/abs/2606.20031
Authors: Junzhe Xu,Zecui Zeng,Lusong Li,Yuetong Fang,Renjing Xu
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:
Abstract:Dynamic environmental changes, confined workspaces, and stringent real-time constraints make pathfinding in Robotic Mobile Fulfillment Systems (RMFS) a challenging problem for conventional search- and rule-based methods, which typically suffer from high computational complexity and long decision latency. While reinforcement learning (RL) has emerged as a powerful alternative, deploying learned policies with extreme energy efficiency on resource-constrained hardware remains an open challenge. We present SDQN-RMFS, an end-to-end framework that achieves high-fidelity deployment of an RL-trained policy from a full-precision artificial neural network (ANN) through to a neuromorphic chip. By computing only when triggered by sparse events, this framework unlocks ultra-low-power RMFS pathfinding. Our full-stack pipeline operates as follows: an ANN policy is first efficiently trained via a collision-allowing strategy to densify informative trajectories, and then converted into a spiking neural network (SNN) via a hard-label knowledge distillation approach. This effectively addresses the output distribution mismatch, preserving policy capability across the ANN-to-SNN pipeline while substantially reducing inference latency. Hardware experiments demonstrate up to 11,281× energy savings and a nearly two-fold reduction in latency compared to a high-performance GPU baseline, while maintaining decision quality on par with the original trained policy. These results establish physical neuromorphic inference as a practical and energy-sustainable pathway for large-scale RMFS operations.
[AI-46] Hierarchical Control in Multi-Agent Games: LLM-based Planning and RL Execution
Link: https://arxiv.org/abs/2606.20014
Authors: Jannik Hösch,Alessandro Sestini,Florian Fuchs,Amir Baghi,Joakim Bergdahl,Konrad Tollmar,Jean-Philippe Barrette-LaPierre,Linus Gisslén
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 12 pages, 9 figures
Abstract:Reinforcement learning (RL) has achieved strong performance in sequential decision-making, yet scaling to complex multi-agent environments remains challenging due to sparse rewards, large state-action spaces, and the difficulty of learning coordinated strategies. We propose a hierarchical architecture where a pretrained large language model (LLM) acts as a centralized strategic controller that selects among specialized RL skill policies for a team of agents, while RL policies handle reactive low-level execution. We evaluate this hybrid system in a competitive 2v2 King of the Hill environment against behavior tree (BT) and “Flat” RL (end-to-end training without skill decomposition) baselines. The LLM+RL system achieves task performance statistically equivalent to hand-crafted BT (46.4% vs 51.5% win rate, p=0.103) while both significantly outperform Flat RL trained without skill decomposition. A user study (n=15) reveals that 60% of participants perceive LLM+RL agents as the most human-like (p=0.027), citing behavioral adaptability and tactical variability. These results demonstrate that pretrained LLM reasoning can effectively orchestrate pretrained RL skills, achieving competitive multi-agent coordination and superior perceived believability without manual rule engineering.
[AI-47] StreamKL: Fast and Memory-Efficient KL Divergence for Boosting Attention Distillation
Link: https://arxiv.org/abs/2606.20005
Authors: Guangda Liu,Yiquan Wang,Chengwei Li,Wenhao Chen,Jing Lin,Yiwu Yao,Danning Ke,Wenchao Ding,Jieru Zhao
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Attention distillation, which trains one attention distribution to match another by minimizing their Kullback-Leibler (KL) divergence, is widely used in knowledge distillation, model compression, continual learning, and sparse-attention LLM training. However, existing approaches materialize both attention distributions before computing the KL reduction, incurring O(N_Q N_K) memory and IO costs that become prohibitive at long context lengths. We present StreamKL, the first fused GPU primitive for attention KL divergence that eliminates this quadratic materialization. StreamKL derives a novel online formulation for the coupled two-distribution KL reduction, enabling a single one-pass forward kernel that streams query-key tiles through on-chip SRAM. For the backward pass, StreamKL recomputes attention probabilities tile-by-tile, avoiding storage of quadratic intermediates. We further design and implement efficient GPU kernels with dedicated optimizations. Experiments show StreamKL delivers up to 43× and 14× speedups over baseline methods in the forward and backward passes, respectively. Most importantly, StreamKL reduces the extra HBM footprint of attention distillation from O(N_Q N_K) to O(1), enabling long-context distillation on a single GPU.
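To see why a one-pass KL over two softmax distributions is possible at all, note that for logits s and t with p = softmax(s) and q = softmax(t), KL(p || q) = sum_i p_i (s_i - t_i) - logsumexp(s) + logsumexp(t). Each term can be accumulated tile by tile with the usual running-max rescaling. The NumPy sketch below works on one query row and is only one possible online formulation under that identity; it is not claimed to be StreamKL's kernel, and the function name and tile size are assumptions.

```python
import numpy as np


def streaming_kl(s, t, tile=4):
    """KL(softmax(s) || softmax(t)) for one query row, visiting keys in tiles.

    Only O(1) running statistics are kept; neither softmax is materialized.
    """
    s = np.asarray(s, dtype=float)
    t = np.asarray(t, dtype=float)
    m_s, l_s, acc = -np.inf, 0.0, 0.0   # running max, normalizer, weighted logit gap
    m_t, l_t = -np.inf, 0.0
    for start in range(0, len(s), tile):
        s_blk, t_blk = s[start:start + tile], t[start:start + tile]

        new_m_s = max(m_s, s_blk.max())
        scale = np.exp(m_s - new_m_s)    # rescale old sums; equals 0 on the first tile
        w = np.exp(s_blk - new_m_s)
        l_s = l_s * scale + w.sum()
        acc = acc * scale + (w * (s_blk - t_blk)).sum()
        m_s = new_m_s

        new_m_t = max(m_t, t_blk.max())
        l_t = l_t * np.exp(m_t - new_m_t) + np.exp(t_blk - new_m_t).sum()
        m_t = new_m_t

    # acc / l_s = sum_i p_i (s_i - t_i); the remaining terms are the two logsumexps.
    return acc / l_s - (m_s + np.log(l_s)) + (m_t + np.log(l_t))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print(streaming_kl(rng.normal(size=16), rng.normal(size=16)))
```

A GPU kernel would apply the same recurrence per query row while streaming key tiles through on-chip memory, which is where the constant extra memory footprint comes from.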
[AI-48] Beyond Static Endpoints: Tool Programs as an Interface for Flexible Agentic Web Services ICML2026
Link: https://arxiv.org/abs/2606.19992
Authors: Mugeng Liu,Shuoqi Li,Yixuan Zhang,Yun Ma
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Accepted by ICML 2026
Abstract:In the agentic web era, LLM-based agents increasingly invoke web services as tools, yet most interfaces remain static endpoints that poorly express long-horizon workflows with loops, conditionals, joins, and retries. We present ToolPro, which represents an agent’s tool intent as an executable tool program that compactly encodes multi-step service interactions with explicit effect types. ToolPro combines constraint-guided program construction, effect-aware replay for exactly-once state-modifying calls, and a profile-driven policy that decides when program execution outperforms stepwise calling. We instantiate ToolPro over MCP-style services with WebAssembly sandboxing and evaluate it on diverse workflows of real-world applications. ToolPro reduces end-to-end latency by up to 53.4% and client-side traffic by up to 96.1%, with larger gains under higher network latency and workflow complexity.
[AI-49] Reward as An Agent for Embodied World Models
Link: https://arxiv.org/abs/2606.19990
Authors: Pu Li,Zhigang Lin,Qiang Wu,Yongxuan Lv,Fei Wang,Shan You
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:While RL has become a promising tool for refining world models, existing methods largely rely on conservative rollouts near the training distribution, limiting exploration, behavioral diversity, and richer dynamic discovery. In this work, we challenge this conservative paradigm. We argue that the core limitation is not exploration itself, but the lack of reliable verification strategies to support broader exploration. Without reliable verification, expanded exploration becomes highly susceptible to reward hacking, where policies exploit imperfect rewards without achieving genuine improvement. To evaluate this motivation, we instantiate our method in embodied world models, where physical plausibility and task completion provide a rigorous testbed for scalable RL under complex dynamics. On the verification side, we introduce Reward as an Agent, an agentic reward framework that actively evaluates generated behaviors to provide robust reward signals and mitigate reward hacking under distribution shifts. On the exploration side, we introduce Dynamic-Aware Rollout Diversification through DynDiff-GRPO, which explicitly expands action-space exploration to diversify trajectories, broaden state-action coverage, and encourage richer embodied behaviors beyond conservative rollout regimes. By unifying Reward as an Agent with DynDiff-GRPO, we enable RL on a more reliable reward foundation with substantially diversified sampling, effectively mitigating reward hacking while yielding significant accuracy gains across multiple open-source world models, thereby demonstrating that broader exploration can scale successfully when grounded in robust verification.
[AI-50] ENPIRE: Agent ic Robot Policy Self-Improvement in the Real World
Link: https://arxiv.org/abs/2606.19980
Authors: Wenli Xiao,Jia Xie,Tonghe Zhang,Haotian Lin,Letian “Max” Fu,Haoru Xue,Jalen Lu,Yi Yang,Cunxi Dai,Zi Wang,Jimmy Wu,Guanzhi Wang,S. Shankar Sastry,Ken Goldberg,Linxi “Jim” Fan,Yuke Zhu,Guanya Shi
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Achieving dexterous robotic manipulation in the real world heavily relies on human supervision and algorithm engineering, which becomes a central bottleneck in the pursuit of general physical intelligence. Although emerging coding agents can generate code to automate algorithm search, their successes remain largely confined to digital environments. We conjecture that the missing abstraction to automate robotics research is a repeatable feedback loop for real-world policy improvement: reset the scene, execute a policy, verify the outcome, and refine the next iteration. To bridge this gap, we introduce ENPIRE, a harness framework for coding agents that instantiates this physical feedback routine with four core modules: an Environment module (EN) for automatic reset and verification, a Policy Improvement module (PI) that launches policy refinement, a Rollout module (R) to evaluate policies with one or multiple physical robots operating in parallel, and an Evolution module (E) in which coding agents analyze logs, consult literature, improve training infrastructure and algorithm code to address failure modes. This closed-loop system transforms real-world manipulation learning into a controllable optimization procedure, minimizing human effort while allowing fair ablations across training recipe and agent variants. Powered by ENPIRE, frontier coding agents can autonomously train a policy to achieve a 99% success rate on challenging, dexterous manipulation tasks, such as organizing a pin box, fastening a zip tie, and tool use, a process that further accelerates when we dispatch an agent team on a robot fleet. Our results suggest a practical and scalable path toward deploying coding agents to autonomously advance robotics in the physical world.
[AI-51] The Algorithmic-Human Manager: AI Apps and Workers in the Indian Gig Economy
Link: https://arxiv.org/abs/2606.19975
Authors: Omir Kumar,Krishnan Narayanan
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: Published by the Centre for Responsible AI (CeRAI) at IIT Madras
Abstract:This paper examines the impact of artificial intelligence and digital technologies on the blue-collar gig economy in India, focusing on algorithmic management: the use of automated systems to allocate, monitor, and evaluate work in location-based services such as ride sharing and delivery. Using a social justice framework and a mixed-methods approach comprising interviews with 16 gig workers and 21 key stakeholders, the study uncovers a dual reality: while AI-powered systems expand access to work and generate operational efficiencies, they simultaneously introduce significant challenges related to fairness, transparency, and worker dignity. Key findings reveal that algorithmic systems are opaque by design, produce inequitable outcomes, and are not structured to reward additional labour with proportionate pay. The study advocates for a pragmatic hybrid governance model, an Algorithmic-Human Manager framework, in which technological efficiency and human accountability operate together rather than in opposition. The findings carry implications for policymakers, platform companies, and civil society organizations working to design equitable AI governance frameworks for the gig economy in India and across the Global South.
[AI-52] Advancing DialNav through Automatic Embodied Dialog Augmentation
Link: https://arxiv.org/abs/2606.19948
Authors: Leekyeung Han,Sangwon Jung,Hyunji Min,Jinseong Jeong,Minyoung Kim,Paul Hongsuck Seo
Subjects: Artificial Intelligence (cs.AI)
Comments: 29 pages, 9 figures
Abstract:For embodied agents capable of physical interaction, the capability to create and understand dialog is crucial to ensure both safety and effectiveness. While DialNav [han2025dialnav] provides a framework for holistic evaluation of the dialog–execution loop in photorealistic indoor navigation, its performance remains limited by a critical scarcity of training data (2K episodes). To address this, we propose an automatic generation pipeline, and construct the RAINbow dataset, a large-scale training dataset with 238K episodes for DialNav. Our pipeline converts existing VLN datasets into multi-turn dialog and creates a cost-efficient, high-quality dataset. Then, we introduce two additional complementary advances to unlock the data’s full potential: (1) Dual-Strategy Training, a navigation training scheme to align the navigation training with the dynamic dialog-navigation loop, and (2) a localization model that leverages VLN knowledge. By combining these complementary solutions, our model substantially outperforms the baseline in success rate on both Val Seen (58.24, +89%) and Val Unseen (29.05, +100%) splits, establishing a new state of the art.
[AI-53] PhysDrift: Bridging the Embodiment Gap in Humanoid Co-Speech Motion Generation
Link: https://arxiv.org/abs/2606.19935
Authors: Zhangzhao Liang,Xiaofen Xing,Mingyue Yang,Wenlve Zhou,Xiangmin Xu
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Humanoid robots require co-speech motions that are not only expressive and speech-aligned, but also physically executable under embodiment constraints. Existing co-speech generation pipelines are predominantly human-centric: motions are first generated in human-body representations such as SMPL-X and subsequently retargeted to humanoid robots. In this work, we identify a fundamental embodiment gap in this paradigm, where the mismatch between human motion manifolds and humanoid embodiment constraints disrupts embodiment consistency during motion transfer and physical execution. Through extensive analysis, we show that although retargeting can preserve coarse motion semantics, it significantly compresses motion diversity and weakens prosody-motion synchronization, limiting expressive humanoid behaviors. To address this problem, we first propose IK-EER, a prosody-preserving humanoid motion curation framework that jointly optimizes kinematic feasibility and speech-motion temporal alignment during retargeting. Building upon the curated robot-native motion dataset, we further introduce PhysDrift, an embodiment-aware co-speech motion generation framework that directly predicts executable humanoid joint trajectories from speech without relying on intermediate human-body representations. Unlike conventional human-centric pipelines, PhysDrift maintains embodiment consistency throughout both training and inference while incorporating physical regularization to stabilize robot motion dynamics. Extensive experiments and real-world humanoid deployment demonstrate that embodiment-aware robot-native generation substantially improves speech-motion alignment, physical plausibility, motion smoothness, inference efficiency, and real-time interaction capability.
[AI-54] The Tao of Agency: Autotelic AI Embedded Agency and Dissolution of the Self
Link: https://arxiv.org/abs/2606.19924
Authors: Aritra Sarkar
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Most artificial intelligence systems are built on the assumption that goals are exogenous and specified by the designer. Exploring what happens when an agent begins generating its own goals opens the field of autotelic AI. Agents are expected not merely to pursue objectives but to discover them. In this article, we trace its consequences through intrinsic motivation, resource-driven priors, causal-interventional learning, homeostasis, and embeddedness; the last of which is found to be a necessary but not sufficient condition for autotelic agency. Embeddedness individuates the agent at the cost of revealing that the individuation is non-unique, such that the same dynamics admit many valid partitions, each defining a different candidate self. The deepest problem with autotelic AI is therefore not how the agent generates goals, but how it generates and relativizes the self to which the goals are assigned. The agent must believe in its own boundary in order to act, and see through that boundary in order to understand. We consolidate these developments into a single framework and extend it along three directions: a quantum formulation in which the agent-environment cut becomes physical, a philosophical reading against non-dual contemplative traditions, and a concrete LLM-based agentic instantiation.
[AI-55] eCNNTO: A Highly Generalizable ConvNet for Accelerating Topology Optimization
Link: https://arxiv.org/abs/2606.19921
Authors: Shengbiao Lu,Xiaodong Wei
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:This work proposes an element-based Convolutional Neural Network (CNN) to accelerate density-based Topology Optimization (TO), termed eCNNTO. TO generally undergoes a large number of iterations, where finite element analysis is performed in every iteration, leading to an efficiency bottleneck, especially when dense meshes are used to achieve high-resolution designs. To address this limitation, eCNNTO is proposed to build upon Kallioras et al. (2020), where a Deep Belief Network (DBN) was trained for every element to predict its near-optimal density from its early history, thereby skipping the great majority of iterations and significantly accelerating the TO procedure. However, the method lacks spatial correlations among neighboring elements and may lead to disconnected features in the final structure. The proposed method employs a CNN with residual connections to address this issue. On top of it, a novel training strategy is introduced to further enhance the optimization efficiency, where the training dataset consists of the final-stage density histories rather than early ones. This change can also help reduce the required training data size. eCNNTO requires only a small dataset to train and yet it can be generalized to problems with largely different boundary conditions, loading cases, design domain geometries, mesh resolutions, as well as non-design domains. In the end, the generalization capabilities and efficiency of eCNNTO are demonstrated through a variety of examples in two and three dimensions, achieving up to 90% and 97% reduction of iterations, respectively.
[AI-56] Co-policy: Responsive Human-Robot Co-Creation for Musical Performances
Link: https://arxiv.org/abs/2606.19914
Authors: Xuetao Li,Wenke Huang,Mang Ye,Zijian Liu,Jinhua Xie,Jifeng Xuan,Miao Li
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:
Abstract:Art has long stood as a pivotal expression of human creativity. Embodied artificial intelligence offers a route for generative models to participate in that creativity through physical action rather than disembodied digital content. In robotic music co-creation, it is challenging to connect semantic musical understanding with real-time and physically executable performance. We present Co-policy, a framework for human-robot musical co-creation that separates semantic intent grounding, constrained musical variation, and visuomotor execution. To ground musical semantics, Co-policy uses pre-inference semantic anchors and a fine-tuned Qwen-vl planner (F-Qwen) to transform speech, live musical seeds, and visual observations into structured co-creation plans. To support low-latency execution, Co-policy introduces a Gaussian-Mixture Visuomotor Policy (GMP), implemented as a conditional mixture-density policy that maps target notes and visual context to multimodal robot actions in a single forward pass. Unlike robotic playback systems that merely reproduce user-specified notes, Co-policy generates complementary musical responses under both musical and physical constraints. Real-robot chime experiments, ablations, and expert evaluation show improved intent alignment, execution accuracy, and response frequency over diffusion-policy and ablated baselines, supporting physically grounded action generation as a key requirement for embodied human-AI co-creation.
[AI-57] Measuring Biological Capabilities and Risks of AI Agents
Link: https://arxiv.org/abs/2606.19899
Authors: Patricia Paskov,Jeffrey Lee,Kyle Brady,Alyssa Worland
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:
Abstract:This paper addresses a rapidly emerging policy challenge: how to generate and interpret credible evidence about the biological capabilities and risks of AI scientists, or agentic AI systems capable of autonomously or collaboratively performing multi-step scientific tasks. As these systems enter real research workflows, decision-makers increasingly face evaluation results whose meaning depends on underlying design choices that are often implicit or under-documented. We synthesize current evidence on AI-enabled biological risks and introduce biological agentic evaluations as a promising, but interpretation-sensitive, tool for assessing these systems. Our central contribution is a set of practical, experience-grounded considerations – drawing from our own evaluations – that show how choices around defining, designing, running, scoring, and documenting evaluations materially shape what results do and do not imply about risk. The analysis is intended to help policymakers interpret biological evaluation outputs with appropriate caution; guide public and private funders toward high-leverage investments in AI-biology evaluation research; and support biosecurity practitioners assessing emerging AI systems. A secondary audience includes researchers designing or conducting agentic evaluations within frontier AI labs, AI providers, scientific institutions, and third-party evaluation organizations.
[AI-58] MetaResearcher: Scaling Deep Research via Self-Reflective Reinforcement Learning in Adversarial Virtual Environments
Link: https://arxiv.org/abs/2606.19893
Authors: Wei Yu,Suxing Liu,Minjie Yu,Jiahao Wang,Zhijian Zheng,Haocheng Deng,Bing Li
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Deep research agents have demonstrated remarkable capabilities in autonomous information gathering and synthesis, yet their training remains constrained by the static nature of simulated environments, the limits of fact-retrieval-only task designs, and the inefficiency of outcome-based reinforcement learning. In this work, we propose MetaResearcher, a novel framework that scales deep research agent training across four synergistic dimensions. First, we introduce an Evolving Virtual World that injects temporal dynamics and adversarial misinformation into the training environment, forcing agents to develop source credibility assessment and temporal conflict resolution skills. Second, we design Discovery-Oriented Tasks – including hypothesis generation and contradiction resolution – that transcend simple fact retrieval and push agents toward genuine research behaviors. Third, we propose a Self-Reflective Meta-Reward mechanism within the GRPO framework that jointly optimizes for answer correctness, search path efficiency, reflection depth, and tool call diversity, directly addressing the repetitive action loop problem observed in prior work. Fourth, we introduce a Heterogeneous Multi-Agent Swarm architecture comprising specialized Scout, Filter, and Synthesizer models that learn collaborative research strategies through coordinated reinforcement learning. Built upon the LiteResearcher infrastructure, MetaResearcher requires zero marginal API cost for training while targeting substantial improvements in both benchmark performance (GAIA, Xbench-DS) and epistemic robustness under adversarial conditions. We present the complete framework design, training methodology, and planned experimental validation.
[AI-59] SL-S4Wave: Self-Supervised Learning of Physiological Waveforms with Structured State Space Models
Link: https://arxiv.org/abs/2606.19888
Authors: Feng Wu,Harsh Deep,Eric Lehman,Sanyam Kapoor,Guoshuai Zhao,Rahul Krishnan,Gari Clifford,Li-wei H Lehman
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Modeling long-sequence medical time series data, such as electrocardiograms (ECG), poses significant challenges due to high sampling rates, multichannel signal complexity, inherent noise, and limited labeled data. While recent self-supervised learning (SSL) methods, based on various encoder architectures such as convolutional neural networks, have been proposed to learn representations from unlabeled data, they often fall short in capturing long-range dependencies and noise-invariant features. Structured state space models (S4) excel at long-sequence modeling, but existing S4 architectures fail to capture the unique characteristics of multichannel physiological waveforms. In this work, we propose SL-S4Wave, a self-supervised learning framework that combines contrastive learning with a tailored encoder built on structured state space models. The encoder incorporates multi-layer global convolution using multiscale subkernels, enabling the capture of both fine-grained local patterns and long-range temporal dependencies in noisy, high-resolution multichannel waveforms. Extensive experiments on real-world datasets demonstrate that SL-S4Wave (1) consistently outperforms state-of-the-art supervised and self-supervised baselines in a challenging arrhythmia detection task, (2) achieves high performance with significantly fewer labeled examples, showcasing strong label efficiency, (3) maintains robust performance on long waveform segments, highlighting its capacity to model complex temporal dynamics in long sequences that most existing approaches fail to model efficiently, and (4) transfers effectively to unseen arrhythmia types, underscoring its robust cross-domain generalization. We additionally evaluate SL-S4Wave on multiple EEG tasks, achieving superior performance over strong baselines, demonstrating generalizability of our approach beyond cardiac waveforms.
[AI-60] FinRED: An Expert-Guided Benchmark Generation and Evaluation Framework for Financial LLM Red-Teaming
Link: https://arxiv.org/abs/2606.19887
Authors: Chaeyun Kim,Daeyoung Park,Junghwan Kim,Jinyoung Jeong,Eunji Song,Yongtaek Lim,Minwoo Kim
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Existing safety benchmarks target general adversarial scenarios but miss finance-specific risks. Financial LLMs face regulatory compliance violations, fraud facilitation, and systemic trust erosion that require targeted evaluation. We introduce FinRED, an expert-guided red-teaming framework for financial LLM safety evaluation developed with financial experts. FinRED uses a novel two-level taxonomy mapping global standards (e.g., FATF and EU DORA) to threats ranging from regulatory evasion to complex fraud, integrated with a scalable pipeline that converts real financial documents into context-rich red-teaming Behavioral Prompts (seeds) through an expert-defined schema. Rigorous expert validation confirms seed plausibility and realism for meaningful LLM safety evaluation. We also provide an expert-validated, finance-specific rubric that goes beyond disclaimer checks, aligns more closely with human experts than static one-size-fits-all rubrics, and reduces critical false negatives from 28 to 12. Aligned with internationally adopted risk-management and information-security standards (e.g., ISO/IEC 27001), FinRED is deployed in South Korea’s Financial Security Institute (FSI) regulatory sandbox for generative AI security evaluation in real financial services. To mitigate dual-use risks, the dataset, generation pipeline, prompt template, and evaluation framework are gated for qualified researchers at this https URL and this https URL.
[AI-61] A Systematic Evaluation of Black-Box Uncertainty Estimation Methods for Large Language Models
Link: https://arxiv.org/abs/2606.19868
Authors: Jiayi Wang,Xu-Yao Zhang
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Although large language models (LLMs) have shown strong capabilities across a wide range of tasks, their outputs often remain unreliable and may contain hallucinations, making uncertainty estimation (UE) essential for building trustworthy LLMs. In practice, many mainstream LLMs are only accessible through restricted APIs, where internal signals such as logits and hidden states are unavailable, making black-box UE especially important. However, existing work on black-box UE for LLMs remains fragmented in methodology and lacks a unified empirical comparison. To address this gap, we present a systematic review of black-box UE methods and organize them into five categories: verbalization-based, sampling-based, explanation-based, multi-agent, and hybrid methods. We further build a unified evaluation framework and benchmark 24 representative methods across 4 models and 4 dataset settings. Our results show that no single method consistently dominates across all settings. Nevertheless, methods that reason over and compare candidates in the answer space are generally effective, and hybrid methods that combine multiple uncertainty signals perform well under most conditions. By releasing the benchmark data and a unified evaluation framework, we aim to facilitate reproducible comparisons and support future research, while our empirical findings provide practical guidance for developing future black-box UE methods for LLMs.
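For readers unfamiliar with the sampling-based category mentioned above, the following minimal sketch shows one common black-box recipe: sample several answers to the same question and treat their agreement as a confidence signal. It is an illustration only and not one of the 24 benchmarked methods as implemented by the authors; the function names and the simple string normalization are assumptions made for this example.

```python
import math
from collections import Counter


def _normalize(answer):
    """Crude normalization so trivially different strings count as equal."""
    return answer.strip().lower()


def consistency_confidence(answers):
    """Return the majority answer and the fraction of samples agreeing with it."""
    counts = Counter(_normalize(a) for a in answers)
    answer, n = counts.most_common(1)[0]
    return answer, n / len(answers)


def answer_entropy(answers):
    """Shannon entropy (in nats) of the empirical answer distribution."""
    counts = Counter(_normalize(a) for a in answers)
    total = len(answers)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Four hypothetical samples for the same prompt, obtained via an API
# that exposes only text (no logits or hidden states).
samples = ["Paris", "paris", "Lyon", "Paris "]
majority, confidence = consistency_confidence(samples)
entropy = answer_entropy(samples)
```

Exact string matching is the weakest possible notion of agreement; practical methods usually cluster answers by meaning before counting.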
[AI-62] Neural Additive and Basis Models with Feature Selection and Interactions PAKDD2024
Link: https://arxiv.org/abs/2606.19850
Authors: Yasutoshi Kishimoto,Kota Yamanishi,Takuya Matsuda,Shinichi Shirakawa
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at PAKDD 2024. Code is available at this https URL
Abstract:Deep neural networks (DNNs) exhibit attractive performance in various fields but often suffer from low interpretability. The neural additive model (NAM) and its variant called the neural basis model (NBM) use neural networks (NNs) as nonlinear shape functions in generalized additive models (GAMs). Both models are highly interpretable and exhibit good performance and flexibility for NN training. NAM and NBM can provide and visualize the contribution of each feature to the prediction owing to GAM-based architectures. However, when using two-input NNs to consider feature interactions or when applying them to high-dimensional datasets, training NAM and NBM becomes intractable due to the increase in the computational resources required. This paper proposes incorporating the feature selection mechanism into NAM and NBM to resolve computational bottlenecks. We introduce the feature selection layer in both models and update the selection weights during training. Our method is simple and can reduce computational costs and model sizes compared to vanilla NAM and NBM. In addition, it enables us to use two-input NNs even in high-dimensional datasets and capture feature interactions. We demonstrate that the proposed models are computationally efficient compared to vanilla NAM and NBM, and they exhibit better or comparable performance with state-of-the-art GAMs.
[AI-63] When Where and How: Adaptive Binning for Tabular Self-Supervised Learning MICCAI2026
Link: https://arxiv.org/abs/2606.19827
Authors: Daehwan Kim,Haejun Chung,Ikbeom Jang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to MICCAI 2026
Abstract:Medical tabular data are ubiquitous in clinical research, but deep learning for tables remains underexplored because reliable labels often require costly expert adjudication, even though structured clinical variables are routinely available in tabular form. Self-supervised learning can leverage these unlabeled tables, and recent binning-based pretexts offer a promising inductive bias, but existing objectives fix a single global quantile discretization and apply feature-agnostic supervision. We propose Adaptive Binning, a training-adaptive discretization pretext for tabular SSL that couples discretization to learning through a feature-wise coarse-to-fine curriculum. Motivated by the spectral bias of neural networks and the principles of curriculum learning, our method progressively refines discretization per feature upon plateau detection and selects representation-aware splits to jointly improve value-space concentration and representation-space coherence. A heterogeneity-aware objective unifies categorical reconstruction with ordinal supervision for numerical features, and experiments on public medical tabular datasets under unified evaluation protocols show consistent gains for linear probing and fine-tuning without dataset-specific discretization tuning. We further introduce a medical tabular SSL benchmark with standardized protocols to support reproducible progress in this underexplored domain. Our code is available at this https URL.
[AI-64] TelcoAgent: A Scalable 5G Multi-KPM Forecasting With 3GPP-Grounded Explainability
Link: https://arxiv.org/abs/2606.19821
Authors: Geon Kim,Dara Ron,Sukhdeep Singh,Suyog Moogi,Pranshav Gajjar,V V N K Someswara Rao Koduri,Een Kee Hong,Vijay K. Shah
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 6 pages, 6 figures. Submitted to IEEE GLOBECOM 2026
Abstract:Key Performance Measurement (KPM) forecasting is essential for proactive network management of 5G and next-generation telecom networks. However, existing machine learning (ML) approaches face significant limitations in scalability and explainability, restricting their effectiveness in real-world deployments. We propose TelcoAgent, a foundation model-based framework that enables accurate, scalable, and explainable forecasting of multiple KPMs across diverse network cells without the need for site-specific training. Specifically, the framework comprises three key components: (i) an automated three-agent pipeline that constructs a 3rd Generation Partnership Project (3GPP) knowledge graph directly from specification documents, (ii) a scalable, time-series foundation model (TSFM)-based prediction pipeline to deliver accurate, zero-shot forecasting, and finally (iii) a reasoning and explanation pipeline that provides actionable, domain-grounded diagnostics. Evaluated using a 3-month, real-world, city-scale 5G KPM dataset from a U.S.-based network operator, TelcoAgent demonstrates high forecasting accuracy for all 7 considered KPMs per cell across 200 cells, while delivering explainable insights and actionable instructions to address network degradations.
[AI-65] Uncertainty-Aware Reward Modeling for Stable RLHF
Link: https://arxiv.org/abs/2606.19818
Authors: Licheng Pan,Haocheng Yang,Haoxuan Li,Yichen Sun,Yunsheng Lu,Shijian Wang,Lei Shen,Yuan Lu,Zhixuan Chu,Hao Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Reinforcement learning from human feedback (RLHF) aligns large language models by training reward models on preference data and optimizing policies to maximize predicted rewards. However, this pipeline faces two fundamental challenges: (1) reward models cannot signal when their predictions are unreliable, since they usually act as deterministic point estimators; and (2) modern group-based policy optimization can amplify unreliable reward signals, as exemplified by GRPO’s uniform treatment of rewards during advantage computation. As policies explore increasingly diverse responses, these two limitations create a critical vulnerability: unreliable reward estimates may be granted disproportionate influence, triggering severe reward hacking. We propose Uncertainty-Aware Reward Modeling (UARM), which equips reward models with calibrated uncertainty via quantile-based conformal prediction and reweights GRPO advantages through heteroscedastic variance decomposition. Experiments across HelpSteer, UltraFeedback, and PKU-SafeRLHF demonstrate that UARM significantly improves reward model calibration, reduces reward hacking, and enhances downstream alignment quality compared to standard GRPO and uncertainty-agnostic baselines.
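To make the two ingredients named in the abstract more concrete, here is a hedged sketch of (a) a split-conformal quantile computed from calibration residuals and (b) group-normalized GRPO-style advantages that are shrunk when the reward estimate is uncertain. The abstract does not give UARM's formulas, so the `1 / (1 + width)` weighting is purely an assumption for illustration, as are all function names and the use of the population standard deviation.

```python
import math
from statistics import mean, pstdev


def conformal_quantile(residuals, alpha=0.1):
    """Split-conformal quantile: the ceil((n + 1) * (1 - alpha))-th smallest
    calibration residual; infinite if the calibration set is too small."""
    n = len(residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")
    return sorted(residuals)[k - 1]


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardize rewards within one sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def reweighted_advantages(rewards, widths):
    """Shrink each advantage by 1 / (1 + width), where width is a per-response
    uncertainty (e.g. a prediction-interval width). Assumed rule, not UARM's."""
    return [a / (1.0 + w) for a, w in zip(grpo_advantages(rewards), widths)]


# Toy group of two responses: the second reward is flagged as uncertain.
q = conformal_quantile([0.4, 0.1, 0.3, 0.2], alpha=0.5)
adv = reweighted_advantages([1.0, 0.0], widths=[0.0, 1.0])
```

With uniform weights this reduces to the plain group-normalized advantage, which is the "uniform treatment" the abstract identifies as the failure point.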
[AI-66] Human-on-the-Loop Orchestration for AI-Assisted Legal Discovery
Link: https://arxiv.org/abs/2606.19812
Authors: Anushree Sinha,Srivaths Ranganathan,Abhishek Dharmaratnakar,Debanshu Das
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Autonomous Large Language Model (LLM) agents are increasingly deployed in electronic discovery (e-discovery), where compounding errors across multi-step reasoning chains can constitute legal malpractice. Unlike single-turn retrieval, agentic workflows operating over privileged document corpora exhibit a class of failure we term “trajectory collapse”: an early misclassification silently propagates, rendering an entire privilege review invalid. This paper makes three contributions. First, we propose a structured taxonomy of agentic failures in legal information retrieval, organized by functional stage. Second, we introduce a four-layer verification architecture – spanning planning, reasoning, execution, and uncertainty quantification – designed to intercept these failures before they compound. Third, we present a preliminary simulation study on a synthetic e-discovery corpus that demonstrates how mandatory Human-on-the-Loop (HOTL) escalation thresholds reduce privilege-waiver risk relative to fully autonomous baselines. Our results suggest that calibrated uncertainty thresholds can reduce privilege-waiver risk by up to 61% versus fully autonomous deployment, while routing fewer than one quarter of documents to attorney review.
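A Human-on-the-Loop escalation threshold can be pictured as a simple routing rule: confident privilege calls are accepted automatically, and everything else goes to an attorney. The sketch below is a minimal illustration under that assumption; the `Prediction` fields, the 0.9 threshold, and the queue names are hypothetical and are not taken from the paper's simulation.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    doc_id: str
    label: str         # e.g. "privileged" or "not_privileged"
    confidence: float  # model confidence in [0, 1]


def route(pred, threshold=0.9):
    """Auto-accept confident calls; escalate the rest to attorney review."""
    return "auto" if pred.confidence >= threshold else "attorney_review"


def triage(preds, threshold=0.9):
    """Split a batch of predictions into automatic and escalated queues."""
    queues = {"auto": [], "attorney_review": []}
    for p in preds:
        queues[route(p, threshold)].append(p.doc_id)
    return queues


batch = [
    Prediction("doc-1", "privileged", 0.97),
    Prediction("doc-2", "not_privileged", 0.62),
    Prediction("doc-3", "not_privileged", 0.93),
]
queues = triage(batch)
review_rate = len(queues["attorney_review"]) / len(batch)
```

Raising the threshold lowers the chance of an unreviewed misclassification but sends more documents to review, which is the trade-off the abstract quantifies.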
[AI-67] Policy-aware Vector Search: A Vision for Fine Grained Access Control in Vector Databases SIGMOD2026
Link: https://arxiv.org/abs/2606.19803
Authors: Lakshmi Sahithi Yalamarthi,Primal Pappachan
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at SeQureDB 26, SIGMOD 2026
Abstract:Vector databases are increasingly used in security sensitive contexts with Retrieval Augmented Generation and organizational AI pipelines; however, their security capabilities remain limited. Specifically, Fine-grained Access Control (FGAC) which is required to ensure that data access adheres to user-specific policies is not fully supported in modern vector databases. Unlike relational databases, vector databases combine structured and unstructured attributes to provide semantic, approximate query results, which complicates FGAC implementation. This creates an inherent tension between enforcing FGAC policies correctly, achieving high ANN search recall and maintaining low query latency. In this paper, we present a vision for Policy-aware Vector Search by formalizing the FGAC policy model in vector databases as well as the enforcement problem. We compare various enforcement strategies, present preliminary findings, and identify key open challenges for future research in policy-aware vector search.
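The tension between policy enforcement and recall is easy to see with two textbook enforcement strategies, sketched here with a brute-force search over a toy collection. Pre-filtering applies the policy before ranking; post-filtering ranks first and then drops unauthorized hits, which can return fewer than k results. The role-based policy, the item layout, and the function names are assumptions for this example and do not reflect the paper's policy model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query, items, k):
    """Exact nearest neighbours by cosine similarity (stand-in for an ANN index)."""
    return sorted(items, key=lambda it: cosine(query, it["vec"]), reverse=True)[:k]


def allowed(user, item):
    """Hypothetical policy: the user's role must be listed on the item."""
    return user["role"] in item["roles"]


def pre_filter_search(query, items, user, k):
    """Enforce the policy first, then rank only the permitted items."""
    return top_k(query, [it for it in items if allowed(user, it)], k)


def post_filter_search(query, items, user, k):
    """Rank everything, then drop unauthorized hits (may return fewer than k)."""
    return [it for it in top_k(query, items, k) if allowed(user, it)]


ITEMS = [
    {"id": "a", "vec": [1.0, 0.0], "roles": {"admin"}},
    {"id": "b", "vec": [0.9, 0.1], "roles": {"admin"}},
    {"id": "c", "vec": [0.5, 0.5], "roles": {"admin", "analyst"}},
    {"id": "d", "vec": [0.0, 1.0], "roles": {"analyst"}},
]
analyst = {"role": "analyst"}
query = [1.0, 0.0]

pre_ids = [it["id"] for it in pre_filter_search(query, ITEMS, analyst, 2)]
post_ids = [it["id"] for it in post_filter_search(query, ITEMS, analyst, 2)]
```

Here the two closest items are admin-only, so post-filtering returns nothing for the analyst while pre-filtering still returns two permitted items.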
[AI-68] Agentic Electronic Design Automation: A Handoff Perspective
Link: https://arxiv.org/abs/2606.19795
Authors: Jiawei Liu,Peiyi Han,Yuntao Lu,Su Zheng,Fengyu Yan,Bei Yu
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:
Abstract:Electronic design automation (EDA) is inherently multi-stage and handoff-heavy. Design artifacts, flow scripts, and engineering decisions cross tool, session, and organizational boundaries before final implementation, signoff, or release. Each transfer carries explicit and implicit requirements that may not be fully captured by stage-local checks. LLM-based agents now invoke EDA tools directly, embed retrieved knowledge in executable scripts, and hand off state across sessions and stages. Once their outputs condition downstream engineering decisions, the transferred object must satisfy a handoff contract and meet the assumptions of its next consumer. This survey introduces handoff validity as its organizing principle. A handoff is valid when the transferred object satisfies the consumer’s acceptance conditions and carries sufficient context, evidence, and provenance for downstream use. We review 82 systems and classify them into three boundary classes. Stage-Bound systems establish validity within a single EDA stage or bounded verification task. Flow-Bound systems preserve coherent workflow state across tools, invocations, and sessions. Organization-Bound systems maintain source grounding, provenance, scope, and admissibility across knowledge and authority boundaries. For each class, we analyze handoff contracts, handoff objects, coordination mechanisms, and open questions. These analyses motivate a five-layer EDA agent communication protocol (EACP), covering the agent discovery, agent message, tool invocation, workflow orchestration, and security and IP protocols. We aim to provide a common vocabulary and research agenda for trustworthy agentic EDA.
[AI-69] ORAgentBench: Can LLM Agents Solve Challenging Operations Research Tasks End to End?
链接: https://arxiv.org/abs/2606.19787
作者: Jiajun Li,Mingshu Cai,Yixuan Li,Yu Ding,Ran Hou,Guanyu Nie,Xiongwei Han,Wanyuan Wang
类目: Artificial Intelligence (cs.AI)
备注: 31 pages, preprint, v1
Abstract:Large language models are increasingly deployed as autonomous agents for multi-step tasks in executable environments, yet their ability to perform realistic operations research (OR) work remains unclear. Existing OR evaluations often decouple modeling from solving, rely on pre-formalized or text-only instances, and rarely test the full workflow from operational artifacts to validated decisions. In this work, we introduce ORAgentBench, an execution-grounded benchmark for evaluating autonomous agents on challenging end-to-end operations research tasks. It contains 107 human-reviewed tasks across diverse operational scenarios, each packaged in an isolated environment with a natural-language brief, multi-file data, configuration artifacts, and a required submission schema. Agents must write and run solution code, and their submissions are evaluated by hidden validators for schema validity, hard-constraint feasibility, and normalized objective quality. Experiments with fourteen frontier agent-model configurations show that current agents remain far from reliable OR practice. The best agent passes only 35.51% of all tasks and 20.59% of hard tasks, and many feasible submissions still fall below the required quality threshold. Failure analysis further shows that errors are dominated by strategic weaknesses, including missed operational rules, brittle formulations, weak feasible-solution construction, and insufficient solution improvement. OR-specific procedural skills increase hard-task feasibility, but do not reliably improve solution quality or pass rate. These results suggest that progress in OR agents requires moving beyond plausible optimization code toward dependable, high-quality operational decision-making.
[AI-70] Beyond Entropy: Learning from Token-Level Distributional Deviations for LLM Reasoning
链接: https://arxiv.org/abs/2606.19771
作者: Xuanzhi Feng,Zhengyang Li,Zeyu Liu,Haoxi Li,Yuming Jiang,Bing Guo,Jingcai Guo,Jie Zhang,Song Guo
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced Large Language Model (LLM) reasoning; however, it faces a fundamental optimization instability: uniform token updates precipitate entropy collapse, leading to premature convergence to suboptimal strategies, whereas excessive Shannon Entropy maximization can cause entropy explosion, driving blind exploration toward incoherent reasoning chains. To resolve this dichotomy, we introduce the Independent Combinatorial Tokens (ICT) framework, which shifts the optimization focus from scalar uncertainty to the distributional properties of token logits. By leveraging the Jensen-Shannon (JS) divergence between token logits distributions, ICT identifies tokens with distinctive distributional patterns as critical branching points for guiding effective exploration in LLM reasoning. Our theoretical analysis, grounded in both Shannon and second-order Rényi entropy, proves that selectively updating on these tokens regulates policy concentration: it reduces the overall distribution uncertainty measured by Shannon entropy, while controlling probability concentration captured by second-order Rényi entropy. This dual effect prevents over-concentrated token generation from weakening exploration and effectively stabilizes the training landscape. Empirical results demonstrate that updating only the top 10% of unique tokens on Qwen2.5 (0.5B/1.5B/7B) models yields an average pass@4 improvement of 4.58%, with a maximum gain of 14.9%, over GRPO, 20-Entropy, and STAPO baselines across seven benchmarks spanning math, commonsense, and Olympiad-level problems.
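As a rough illustration of the quantity the abstract names, the Jensen-Shannon divergence between two token-logit distributions can be computed as below. This is a minimal sketch, not the ICT method: which pairs of distributions ICT compares and how it thresholds them are not stated in the abstract, and the function names here are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(logits_a, logits_b):
    """Jensen-Shannon divergence between the softmax of two logit vectors.

    Assumption: both vectors are over the same vocabulary, in the same order.
    """
    p, q = softmax(logits_a), softmax(logits_b)
    mix = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mix) + 0.5 * kl_divergence(q, mix)
```

The result is symmetric, equals zero for identical distributions, and is bounded above by ln 2 when natural logarithms are used.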
[AI-71] Data Standards for Humanoid Robotics: The Missing Infrastructure for Physical AI
链接: https://arxiv.org/abs/2606.19769
作者: Shaoshan Liu,Xiugong Qin,Xuan Wu,Xuan Xia,Ning Ding,Jialu Liu,Jie Tang
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:The scalability of humanoid robots will depend not only on models and hardware, but also on whether physical experience can accumulate across robots, tasks, organizations, and time. Drawing on the authors’ work in developing ISO/WD 26264-1, Humanoid robot datasets – Part 1: General requirements, within ISO/TC 299/WG 16, this article argues that data standards are becoming foundational infrastructure for Physical AI. We develop three insights. First, humanoid robot data is embodied interaction data, not a collection of isolated digital samples; a useful dataset must preserve the relationship among robot body, action, task, scene, execution trace, and outcome. Second, its value depends on physical coherence: multimodal streams are reusable only when timing, coordinate frames, calibration, kinematics, units, and synchronization assumptions remain inspectable. Third, the main bottleneck is not only data scarcity, but non-cumulative data caused by high collection costs, data silos, and inconsistent evaluation. We argue that humanoid robot data standards address these bottlenecks by making embodied experience interpretable, shareable, traceable, and reusable. A general standard should provide horizontal infrastructure for lifecycle management, metadata, provenance, quality, versioning, and traceability, while capability-specific parts should define domain grammar for manipulation, locomotion, human-robot interaction, cognition, and future humanoid capabilities. As AI moves from screens into bodies, data standards must evolve from organizing digital information to structuring physical interaction.
[AI-72] Optimal Scheduling in a Question-Answering Forum of Knowledge Workers
链接: https://arxiv.org/abs/2606.19759
作者: Rohit Negi,Mustafa Yilmaz
类目: Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注: 14 pages, 4 figures
Abstract:As individuals turn to the Internet to find answers to questions they may have, several Question Answering (QA) forums have evolved, where users knowledgeable in certain topics can contribute their expertise to answering these requests for information. While these are currently volunteer based, we consider a future version employing knowledge workers who are experts in certain topics. In such a system, the request-answer processes forming the queuing system may utilize schedulers that assign requests in different topics to the experts in the forum, who may be able to answer them according to their expertise levels in different topics. With this model, we calculate the capacity of the system for handling the requests while keeping the system stable, and design schedulers that achieve capacity. We also investigate how collaboration between experts in answering requests can potentially increase capacity.
[AI-73] SafeSpec: Fast and Safe LLM via Dynamic Reflective Sampling
链接: https://arxiv.org/abs/2606.19755
作者: Haotian Xu,Zeyang Zhang,Linbao Li,Huadi Zheng,Yu Li,Cheng Zhuo
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Speculative inference accelerates large language model (LLM) decoding but provides no inherent safety guarantees. Existing safety defenses are largely incompatible with speculative inference: they either introduce additional computation or disrupt the draft-verify mechanism, negating acceleration benefits. This reveals a fundamental incompatibility between current safety methods and speculative decoding. We propose SafeSpec, a safety-aware speculative inference framework that integrates risk estimation directly into the verification process. SafeSpec attaches a lightweight latent safety head to the target model to jointly evaluate semantic validity and safety in a single forward pass. When unsafe generations are detected, SafeSpec applies rollback and safety-guided reflective multi-sampling to recover safe continuations rather than terminating generation. We model jailbreak attacks as distributional shifts over generative trajectories, where adversarial prompts increase the probability of harmful continuations without eliminating safe ones. Under this model, SafeSpec performs risk-aware trajectory recovery within the speculative decoding process. Across multiple models and adversarial benchmarks, SafeSpec achieves a substantially improved safety-efficiency trade-off. On Qwen3-32B, SafeSpec reduces attack success rates by 15% while preserving a 2.06x inference speedup on benign workloads, demonstrating that speculative acceleration and inference-time safety can be jointly optimized.
[AI-74] Grounded Inference: Principles for Deterministically Encapsulated Generative Models
链接: https://arxiv.org/abs/2606.19753
作者: Marty O’Neill
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 12 pages, 3 figures
Abstract:The incorporation of generative models into traditional computational systems presents both enormous opportunity and tremendous peril. Although many early adopters have realized these perils at great expense, the field still requires foundational frameworks to de-risk incorporation of AI into traditional systems. This manuscript establishes this foundation through the definition of four specific primitives of AI blended architecture, designed to enable deterministic encapsulation of probabilistic models. It further establishes two overarching anti-patterns broadly represented across industry to serve as warnings for engineers in this field. This framework was designed to enable successful integration of AI into traditional systems while providing a foundation upon which generative model providers could build the next generation of generative model interfaces.
[AI-75] Temporal Self-Imitation Learning
链接: https://arxiv.org/abs/2606.19752
作者: Yinsen Jia,Boyuan Chen
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Long-horizon robot manipulation policies trained with reward shaping can still exploit dense rewards through inefficient interaction, while rare efficient behaviors may be forgotten during training. We argue that temporal efficiency itself provides a powerful and underutilized source of self-supervision for reinforcement learning. We introduce Temporal Self-Imitation Learning (TSIL), a reinforcement learning framework that mines temporally efficient successful trajectories generated during learning and converts them into reusable supervision for future policy improvement. TSIL progressively refines learning using configuration-conditioned adaptive temporal targets derived from fast successful trajectories, while preserving and replaying efficient behaviors through efficiency-weighted self-imitation learning. Across 15 distinct long-horizon manipulation tasks, TSIL consistently improves learning efficiency, task-completion efficiency, revisitation of fast successful behaviors, and robustness to unstable training conditions. More broadly, our results suggest that the temporal structure of successful behavior itself provides a scalable self-supervisory signal for reinforcement learning beyond manually engineered reward shaping alone.
[AI-76] A Comparative Study of Pretrained Transformer Models for Quranic ASR: Speech Representations Label Formats and Dataset Composition
链接: https://arxiv.org/abs/2606.19747
作者: Nabil Mosharraf Hossain(1),Riasat Islam(1 and 2),Unaizah Obaidellah(3) ((1) Greentech Apps Foundation, United Kingdom, (2) Queen Mary University of London, United Kingdom, (3) University of Malaya, Malaysia)
类目: Artificial Intelligence (cs.AI)
备注: 30 pages, 9 figures, 5 tables, Submitted to International Journal of Speech Technology
Abstract:Quran Automatic Speech Recognition (ASR) aims to convert Quranic recitation into text, enabling applications such as aided memorisation tools and Quranic search engines. However, existing ASR models often exhibit high Word Error Rates (WER) on user-recited verses and lack full coverage of the Quranic corpus. This paper presents a systematic empirical study of domain-specific fine-tuning of pretrained Transformer-based models for Quranic ASR, using advanced speech feature extraction methods: Wav2Vec2.0, HuBERT, and XLS-R. These models apply self-supervised learning by masking portions of input audio and using Transformer architectures to learn context-aware speech features. The pretrained models are fine-tuned on a filtered Quranic dataset exceeding 870 hours of professional and user recitations. Through comprehensive ablation studies across feature extractors, output label formats, training strategies, and clip durations, we identify the key factors that affect transcription accuracy in this domain. Our best-performing configuration achieves a WER of 0.08 on the EveryAyah subset and 0.11 on the combined EveryAyah+Tarteel setting, representing roughly a five-percentage-point gain over the Citrinet baseline (WER = 0.163) while reducing combined-model training time from 140 hours to 40 hours. Arabic text without diacritics yields the best fine-tuning results, and Wav2Vec2-XLSR-53 provides the strongest overall representation. Future work includes improving dataset quality and developing phoneme-aware models to extract deeper speech feature representations for Tajweed-sensitive applications.
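For readers unfamiliar with the metric, WER is conventionally the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The sketch below implements that standard definition; the paper's own tokenisation and normalisation (for example, diacritic handling) are not described in the abstract, so whitespace splitting is an assumption here.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly.
    """
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

# One substitution and one deletion against four reference words.
print(word_error_rate("a b c d", "a x c"))  # 0.5
```

On this scale, the reported WER of 0.08 corresponds to roughly eight word errors per hundred reference words.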
[AI-77] Interpreting Neural Combinatorial Optimization via Evolving Programmatic Bottlenecks
链接: https://arxiv.org/abs/2606.19741
作者: Haocheng Duan,Yuxin Guo,Jieyi Bi,Anqi Xie,Sirui Li,Yining Ma,Cathy Wu
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Under Review
Abstract:Neural Combinatorial Optimization (NCO) achieves strong performance, yet its black-box nature remains a key roadblock to deployment and scientific diagnosis. Standard interpretability tools, such as Concept Bottleneck Models (CBMs), are ill-equipped for NCO, whose decisions are dynamic, state-dependent, and lack proper concept vocabulary definition. To close this gap, we introduce Evolving Programmatic Bottlenecks (EPB), to our knowledge, the first framework for interpreting NCO policies by distilling black-box NCO models into human-readable program portfolios. EPB employs an LLM to autonomously evolve a bank of programs, where each program’s per-step action distribution serves as the bottleneck. EPB works through an iterative framework: Block I fixes program bank capacity and introduces a hybrid textual-numerical gradient descent scheme that couples numerical gradients for student router updates and textual gradients for LLM-based program revision; Block II dynamically adapts bank capacity via fault-targeted expansion and redundancy pruning. Extensive experiments demonstrate EPB’s effectiveness and broad applicability, where the distilled program portfolios largely match original performance. EPB also reveals that NCO behavior shifts across optimization stages and can be approximated as a composition of classic heuristic variants. Our work advances interpretable NCO and establishes EPB as a promising tool for interpreting sequential decision-making models.
[AI-78] VOiLA: Vectorized Online Planning with Learned Diffusion Model for POMDP Agents
链接: https://arxiv.org/abs/2606.19729
作者: Marcus Hoerger,Rishikesh Joshi,Rahul Shome,Ian Manchester,Hanna Kurniawati
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Submitted to the 2026 International Symposium of Robotics Research (ISRR)
Abstract:Planning under uncertainty is an essential capability for autonomous robots. The Partially Observable Markov Decision Process (POMDP) provides a powerful framework for such a capability. Although POMDP-based planning has advanced significantly, its application to real-world problems is often limited by the difficulty of obtaining faithful POMDP models. We present Vectorized Online planning wIth Learned diffusion model for POMDP Agents (VOiLA), a framework that learns task-agnostic POMDP models for online planning under uncertainty. VOiLA learns transition and observation samplers using conditional diffusion models and learns observation-likelihood models for particle-based belief updates. To enable efficient online planning, the diffusion samplers are distilled into compact feedforward generators and integrated with Vectorized Online POMDP Planner (VOPP), an online POMDP planner designed to leverage GPU parallelization. Experimental results indicate the distillation strategy reduces sampling cost by up to nearly three orders of magnitude, making learned generative POMDP models practical for online planning. Evaluation of VOiLA on three benchmark problems indicates that VOiLA achieves equal or better performance than Recurrent Soft Actor Critic while using less than 10% of the training data, and generalizes much better to unseen environment configurations. Physical robot evaluation indicates that VOiLA, using models learned only from simulated data, generates a policy that successfully accomplishes the task in 10 of 10 runs.
[AI-79] Bidirectional Tutoring for Developmental Motor Learning in Robots: Co-Developed Interaction Dynamics Support Stable Learning
链接: https://arxiv.org/abs/2606.19728
作者: Rui Fukushima,Jun Tani
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 16 pages, 14 figures
Abstract:Infants are well known to develop their motor skills through dense interaction with caregivers. Although such social interaction is crucial for human development, motor-skill learning in robots is often treated as a unidirectional process in which robots passively receive demonstrations from tutors. This overlooks a key property of social interaction: it is inherently bidirectional, with tutor and learner dynamically adapting to each other. In such interactions, the robot’s past experiences may function as prior constraints that shape the dynamics of their co-developed trajectories. We hypothesize that bidirectional tutoring allows such constraints to guide the formation of consistent behavioral patterns that preserve behavioral coherence and support generalization, whereas unidirectional interaction lacks such constraints and leads to broader, less consistent behavioral patterns. To examine this hypothesis, we conducted two experiments with a physical humanoid robot performing an object manipulation task: one involving human-robot interaction and another employing an AI tutor interacting with the real robot through an adaptive intervention mechanism designed to examine whether similar effects would emerge under more controlled conditions. We implement the developmental learning framework using a free-energy-principle-based neural network extended with generative replay, which supports stable sequence-by-sequence learning from single tutored episodes. Across both settings, bidirectional tutoring fostered consistent behaviors and stage-wise generalization, while the robot gradually required less tutor guidance. These results suggest that bidirectional tutoring, as an embodied and socially grounded approach, provides an effective scaffold for developmental motor learning in robots.
[AI-80] OnDeFog: Online Decision Transformer under Frame Dropping
链接: https://arxiv.org/abs/2606.19721
作者: Daiki Yotsufuji,Kenta Nishihara,Shoma Shimizu,Kento Uchida,Shinichi Shirakawa
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to PRICAI 2025
Abstract:In challenging real-world reinforcement learning applications, communication delays or sensor failures often cause frame dropping, in which the agent cannot receive the dropped states and associated rewards. To address the resulting performance degradation, the Decision Transformer under Random Frame Dropping (DeFog) was developed by incorporating additional mechanisms into the decision transformer. Although DeFog can mitigate performance degradation in frame-dropping environments, it is an offline learning method and therefore struggles to generalize to novel states not adequately represented in the training dataset. In this study, we propose OnDeFog, which integrates the mechanisms in DeFog with the online decision transformer (ODT), an online reinforcement learning method that learns policies through direct environmental interaction. Comprehensive experimental evaluation demonstrates that our proposed OnDeFog achieves superior performance compared to ODT in environments with a high frame-dropping rate and outperforms DeFog on datasets containing a large amount of low-reward data.
[AI-81] Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents
链接: https://arxiv.org/abs/2606.19704
作者: Dhaval C. Patel,Kaoutar El Maghraoui,Shuxin Lin,Yusheng Li,Tianjun Feng,Chun-Yi Tsai,Yihan Sun,Wei Alexander Xin,Akshat Bhandari,Tanisha Rathod,Aaron Fan,Sanskruti Vijay Shejwal,Tomas Pasiecznik,Sagar Chethan Kumar,Tanmay Agarwal,Rohith Kanathur,Sam Colman,Amaan Sheikh,Dev Bahl,Ann Li,Krish Veera,Alimurtaza Mustafa Merchant,Shambhawi Baswaraj Bhure,Sajal Kumar Goyla,Chengrui Li,Kirthana Natarajan,Rui Li,Thomas Ajai,Rujing Li,Vivek G. Iyer,Sanjaii Vijayakumar,Yitong Bai,Ayal Yakobe,Darief Maes,Yassine Jebbouri,Tianyang Xu,Thai Quoc On,Vera Mazeeva,Winston Li,Yuval Shemla,Yeshitha Bhuvanesh,Rushin Bhatt,Siddharth Chethan Gowda,Alisha Vinod,Caroline Cahill,Shriya Aishani Rachakonda,Yunfeng Chen,Aryaman Agrawal,Aman Upganlawar,Mao Le Jonathan Ang,Yubin Sally Go,Madhav Rajkondawar,Yang-Jung Chen,Trisha Maturi,Ananya Kapoor,Andrew Li,Shrey Arora,Mana Abbaszadeh,Shen Li,Charles Xu,Byeolah Kwon
类目: Artificial Intelligence (cs.AI)
备注: 17 pages, 2 tables, 5 figures
Abstract:Agent benchmarks are growing fast, but no single benchmark touches more than four or five of the dimensions that deployment exposes. This paper aggregates the largest coordinated deep-dive of one MCP-based industrial-agent benchmark to date: fourteen parallel implementation studies covering new asset classes (including a multi-modal visual extension), alternative orchestrations, retrieval strategies, reasoning modes, infrastructure optimizations, and evaluation-methodology probes. Consolidating those studies with seven prior agent benchmarks, we argue that aggregate-score leaderboards systematically underspecify deployed-agent evaluation. Rankings derived from aggregate scores do not transfer to out-of-distribution settings; recent public-to-hidden competition retrospectives provide direct empirical evidence of this rank instability. We propose ranking configurations by predictive validity, the correlation between in-sample and out-of-sample rank, rather than in-sample mean, and report a twelve-tier measurement apparatus that exposes the deployment-relevant dimensions HELM and its agent-era successors collapse. The position is operationalized through three falsifiable out-of-distribution criteria with explicit thresholds; existing evidence partly supports it but is too thin to confirm. We close with a pre-registered pilot design and a field-level vision for what the next generation of agentic benchmarks should report.
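The abstract defines predictive validity as the correlation between in-sample and out-of-sample rank. One plausible reading is a Spearman rank correlation over configurations, sketched below. The choice of Spearman, the no-ties simplification, and the function names are assumptions made for illustration; the paper may use a different correlation or tie-handling rule.

```python
def ranks(scores):
    """Rank configurations by score, 1 = best. Assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def predictive_validity(in_sample_scores, out_of_sample_scores):
    """Spearman correlation between in-sample and out-of-sample ranks.

    Both lists hold one score per configuration, in the same order.
    """
    n = len(in_sample_scores)
    if n != len(out_of_sample_scores) or n < 2:
        raise ValueError("need two equally long lists with at least 2 entries")
    d_squared = sum((a - b) ** 2
                    for a, b in zip(ranks(in_sample_scores),
                                    ranks(out_of_sample_scores)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))
```

A value near 1 means the in-sample ordering of configurations carries over to the out-of-distribution setting; a value near 0 or below means the leaderboard ordering does not transfer.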
[AI-82] LOKI: Memory-Free Null-Space Constrained Lifelong Knowledge Editing
链接: https://arxiv.org/abs/2606.19679
作者: Masih Eskandar,Miquel Sirera Perelló,Stratis Ioannidis,Jennifer Dy
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Lifelong knowledge editing aims to efficiently and sequentially update language models over time, as new knowledge becomes available or when the model makes mistakes, while preserving acceptable performance on past knowledge. One unresolved challenge is that existing methods modify a fixed set of layers for all new knowledge samples, reducing flexibility and increasing catastrophic forgetting. Another is requiring access to previous knowledge and extensive pre-processing to obtain data statistics. To address these challenges, we introduce LOKI, a novel approach that uses dynamic layer selection based on the Hilbert-Schmidt Independence Criterion and projects gradient updates onto the null-space of the model weights, bypassing the requirement for previous knowledge access. We show that LOKI achieves superior performance to existing approaches across a wide variety of experiments, achieving up to a 14% improvement in average accuracy.
[AI-83] Hard or Just Unreached? Diagnosing the Sampling Blind Spot in Math-Reasoning Difficulty Estimation
链接: https://arxiv.org/abs/2606.19636
作者: Luca Zhou,Sajel Shah,Emanuele Rodolà,Roberto Dessì
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 pages of main paper, 4 figures and 5 tables in the main paper, with more in the appendix
Abstract:Math and science reasoning benchmarks rely on pass@k, the fraction of sampled chains that reach gold, as the canonical per-example difficulty signal. The same signal drives RL with verifiable rewards, math data curation, synthetic curricula, and verifier training. We show this proxy has a persistent blind spot on its hardest stratum: on the eight free-form math cells we test (GSM8K and MATH across four open-weight models), 10.3-22.9% of the examples that no sampling seed solves in six tries are instead solved at matched compute by a six-chain deterministic regime. These are greedy decoding plus five cheap residual-stream perturbations applied via activation grafting, while greedy alone solves at most 6% on these math cells. Recovery scales with the additional budget, across perturbations whose mechanistic distinctness we verify across all twelve cells (cross-kind fix-set Jaccard = 0.47 in every setup). Activation grafting is used as an intervention on internal representations, not a decoding method; we use it purely as a diagnostic and diversification tool, and our recovered items show that the pass@k = 0% stratum is structurally identifiable in the residual stream rather than that the unmodified model reaches them under ordinary inference.
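The bookkeeping behind the abstract's quantities can be sketched as follows: the per-example pass rate as defined above, the stratum that no sampled chain solves, and the Jaccard overlap between the sets of examples fixed by two perturbation kinds. The data layout and function names are hypothetical, and the toy values are not taken from the paper.

```python
def pass_rate(chain_correct):
    """Fraction of sampled chains that reach the gold answer."""
    return sum(chain_correct) / len(chain_correct)

def unsolved_stratum(results):
    """IDs of examples for which no sampled chain was correct."""
    return {ex_id for ex_id, flags in results.items() if not any(flags)}

def jaccard(set_a, set_b):
    """Jaccard similarity |A & B| / |A | B|; defined as 1.0 for two empty sets."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0

# Toy data: three examples with six sampled chains each.
sampled = {
    "q1": [True, False, True, True, False, True],
    "q2": [False] * 6,
    "q3": [False] * 6,
}
hard = unsolved_stratum(sampled)  # examples q2 and q3
```

A low Jaccard value between two fix sets indicates that the perturbations recover largely different examples.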
[AI-84] CTS-MoE: Implicit Terrain Adaptation via Mixture-of-Experts for Perceptive Locomotion
链接: https://arxiv.org/abs/2606.19633
作者: Francisco Affonso,Matheus P. Angarola,Ana Luiza Mineiro,Aditya Potnis,Marcelo Becker,Girish Chowdhary
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Perceptive legged locomotion over discontinuous terrain (e.g., stairs, gaps, and obstacles) requires adaptive behavior, as a single conservative gait cannot produce the anticipatory maneuvers needed for abrupt topology changes. Cast as multi-task reinforcement learning, this problem introduces a tension between sharing and separation. Tasks use a common locomotion base but have conflicting rewards, so a policy must share behavior while avoiding value interference. Prior work addresses only one side, with monolithic policies sacrificing specialization and hierarchical sub-policies sacrificing generalization across transitions and unseen terrain. We propose CTS-MoE, which combines a dense mixture-of-experts actor with perception-based gating to compose shared behaviors and a multi-critic with task-specific value heads to prevent interference. The model is trained end-to-end in a single-stage concurrent teacher-student setup that handles partial observability and avoids sequential distillation, with task labels used only during training. At deployment, routing depends solely on perception, allowing terrain adaptation without a high-level selector or terrain classifier. Experiments on a Unitree Go1 in simulation and on hardware across seen and unseen terrains show task-aware specialization, with lower tracking error and higher success rates than monolithic baselines. Project Website: this https URL.
[AI-85] AI4SE and SE4AI Exploration: A Decade Looking Back and Forward
链接: https://arxiv.org/abs/2606.19630
作者: H. Sinan Bank,Daniel R. Herber,Thomas Bradley
类目: Artificial Intelligence (cs.AI); Digital Libraries (cs.DL); Systems and Control (eess.SY)
备注: 10 pages, 5 figures
Abstract:The March 2020 INCOSE INSIGHT special issue on AI and Systems Engineering (SE) became the most downloaded issue in the publication’s history and launched a research community that now draws over 250 registrants to its annual workshop. In this article, we trace the progress in AI and SE across three phases (labeled here foundational, applied, and LLM inflection) based on the authors’ reading of the field’s core papers, and describe our opinions of where the community has converged and where critical gaps remain. Separately, a human-AI agreement literature review leveraging both human expertise and six AI models was performed to assess the relevance of 1,712 INCOSE INSIGHT articles and 889 SERC publications. The results identify five critical research gaps and offer guidance for practitioners navigating AI adoption, assurance, and workforce transformation in SE. We share the agreement data and the AI4SE/SE4AI Explorer web application so readers can compare their own relevance judgments with the human and AI raters.
[AI-86] RIVET: Robust Idempotent Voice Attribute Editing
链接: https://arxiv.org/abs/2606.19629
作者: Dareen Alharthi,Bhuvan Koduru,Rita Singh,Bhiksha Raj
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Voice attribute editing models modify characteristics such as age and gender while preserving speaker identity. In large-scale speech datasets, however, attribute annotations are often noisy or inconsistent, which can cause conditional generative models to produce unstable edits. In this work, we show that idempotency provides an effective mechanism for improving robustness to noisy labels. An idempotent operator is one for which repeated application does not change the result, i.e., f(f(x)) = f(x). Enforcing this property acts as an implicit regularizer that reduces sensitivity to mislabeled examples. We introduce RIVET, a training framework that incorporates an idempotency objective to improve robustness to label noise. We evaluate RIVET under controlled label noise and on the GLOBE dataset with naturally noisy annotations. RIVET improves editing success and better preserves speaker identity than standard training, showing that idempotency improves robustness in voice editing models.
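The property f(f(x)) = f(x) stated in the abstract can be checked numerically by measuring how much a second application of an operator changes the output of the first. The sketch below computes that gap for plain Python functions over lists of floats. It illustrates the idea only: RIVET's actual training objective is not given in the abstract, and the function names are hypothetical.

```python
def idempotency_gap(f, x):
    """Mean squared difference between f(x) and f(f(x)).

    The gap is zero exactly when applying f a second time changes nothing.
    """
    once = f(x)
    twice = f(once)
    return sum((a - b) ** 2 for a, b in zip(once, twice)) / len(once)

def clip_unit(values):
    """Clamp each value to [0, 1]; clamping twice equals clamping once."""
    return [min(1.0, max(0.0, v)) for v in values]

def halve(values):
    """Scale each value by 0.5; a second application keeps shrinking it."""
    return [0.5 * v for v in values]

print(idempotency_gap(clip_unit, [-1.0, 0.5, 2.0]))  # 0.0
print(idempotency_gap(halve, [2.0, 4.0]))            # 0.625
```

Used as a penalty during training, a term of this shape would push an editing model toward outputs that a repeated edit with the same target attribute leaves unchanged.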
[AI-87] StaminaBench: Stress-Testing Coding Agents over 100 Interaction Turns
链接: https://arxiv.org/abs/2606.19613
作者: Vlad Sobal,Shuo Yang,Yuting Zhang,Wei Xia,Stefano Soatto
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:We introduce StaminaBench, a benchmark that measures the stamina of coding agents: how many consecutive interaction turns (change requests) they can handle before failing. Unlike the prevailing fraction-of-tasks-solved metric, this matches real vibe-coding where sessions run dozens or hundreds of turns. In StaminaBench, agents implement a REST API server and modify it across a tunable number of procedurally generated follow-up change requests - 100 in our experiments, resulting in codebases of up to 6,000 lines. Tests are generated fully programmatically without LLM involvement, ensuring reproducibility and reliability; change sequences are drawn from either a hardcoded or LLM-driven sampler, both constrained to a structured action space to ensure changes are valid. The agent and the server run in an isolated environment and communicate with the benchmark through HTTP, making testing fully black-box and language-agnostic. We evaluate six agent harnesses paired with seven open-source LLMs across 20 scenarios of 100 turns each and find that: (1) all the tested models fail within 5-6 turns, confirming that vibe-coding-style programming without thorough testing produces bugs; (2) passing test feedback back to the agent and allowing it to retry improves passed turn count by up to 12x; and (3) a good harness is required for strong performance: stronger models exhibit up to a 6x gap between their best and worst harness, while weaker models fail with any harness. We release the benchmark and the generated tasks to enable further research into multi-turn coding agent behavior. Benchmark code and data: this http URL.
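The stamina metric described above, the number of consecutive turns handled before the first failure, reduces to a simple count over per-turn pass/fail outcomes. The sketch below is one straightforward reading of that description; the benchmark's exact scoring rules (for example, how retries are counted) are not given in the abstract, and the function name is hypothetical.

```python
def stamina(turn_passed):
    """Count consecutive passing turns from the start of a session.

    turn_passed is an ordered list of booleans, one per change request.
    """
    count = 0
    for passed in turn_passed:
        if not passed:
            break
        count += 1
    return count

# A session that survives five turns and fails on the sixth.
print(stamina([True, True, True, True, True, False, True]))  # 5
```

Unlike a fraction-of-tasks-solved score, this count ignores any turns that pass after the first failure.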
[AI-88] Latent Confounded Causal Discovery via Lie Bracket Geometry
链接: https://arxiv.org/abs/2606.19610
作者: Sridhar Mahadevan
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 39 pages
Abstract:Recent work on Kan-Do-Calculus (KDC) has established that the boundary between passive observation and active intervention in causal inference is a category-theoretic bi-adjunction, with interventions modeled by left Kan extensions and conditioning by right Kan extensions. This paper introduces two causal discovery algorithms under latent confounding, building on the information-geometric and categorical consequences of KDC. In smooth statistical settings, Radon-Nikodym derivatives between observational and interventional measures induce local causal vector fields; failures of these fields to close under Lie brackets become computable Frobenius residuals, which we interpret as witnesses of failed visible integrability and possible latent or unmodeled structure. Our first algorithm, BRIDGE (Bracket Residuals for Interventional Discovery and Geometric Estimation), combines an interventional density or Radon-Nikodym-ratio engine with a geometric screen that proposes a high-recall family of admissible arrows, identifies non-closing visible pairs as latent-obstruction candidates, and passes the reduced family to downstream score-based or differentiable discovery routines. The second algorithmic contribution, Spectral Kan-Do Flow Matching (SKFM), learns amortized intervention fields and factors latent curvature spectrally, exposing the direct Lie-space endpoint toward which BRIDGE points. A detailed set of experiments show that both algorithms are capable of discovering causal models with latent confounders while collapsing the super-exponential space of possible DAGs by many orders of magnitude. This paper introduces a new paradigm in causal discovery, where latent structure is inferred directly from the geometry of intervention-induced flows.
[AI-89] Which Pairs to Compare for LLM Post-Training?
链接: https://arxiv.org/abs/2606.19607
作者: Jiangze Han,Vineet Goyal,Will Ma
类目: Artificial Intelligence (cs.AI); Applications (stat.AP)
备注:
Abstract:Preference-based post-training has become a central paradigm for aligning language models. A common data-collection strategy is to generate a small set of completions for each prompt and label the resulting comparison pairs. However, human preference labels are often much more expensive than generating additional completions, suggesting a different use of the same labeling budget: generate a larger pool of completions, but label only the most informative comparison pairs. This paper studies which pairs should be compared in preference-based post-training. We formulate comparison curation as a sampling-design problem and evaluate designs by the quality of the final policy under the preference-based post-training objective. We instantiate this framework for Direct Preference Optimization (DPO), analyzing how the choice of labeled pairs propagates through DPO training to downstream policy performance. Our main results provide matching upper and lower bounds on the post-training optimality gap of the DPO-trained policy. The bounds show that comparison selection affects downstream performance through a single design-dependent information matrix, which links label allocation to parameter estimation error and policy suboptimality. This yields an explicit optimization criterion for budgeted comparison curation and motivates practical sampling designs for selecting informative pairs from large generated completion pools. Experiments on synthetic settings and language-model post-training benchmarks show that the proposed designs consistently improve sample efficiency over common comparison-selection heuristics.
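For readers unfamiliar with DPO, the per-pair loss that a labeled comparison feeds into can be sketched as follows. This is the standard DPO objective for one (chosen, rejected) pair, not the paper's pair-selection design; the function name and the default `beta=0.1` are illustrative assumptions.

```python
import math


def dpo_pair_loss(logp_w: float, logp_l: float,
                  ref_logp_w: float, ref_logp_l: float,
                  beta: float = 0.1) -> float:
    """Standard DPO loss for one labeled pair.

    logp_w / logp_l: policy log-probabilities of the preferred / rejected completion.
    ref_logp_w / ref_logp_l: the same quantities under the frozen reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy equals the reference model the margin is zero and the loss is log 2; pairs on which the policy already prefers the chosen completion contribute less.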
[AI-90] FAPO: Fully Autonomous Prompt Optimization of Multi-Step LLM Pipelines
链接: https://arxiv.org/abs/2606.19605
作者: Paul Kassianik,Baturay Saglam,Huaibo Zhao,Blaine Nelson,Supriti Vijay,Aman Priyanshu,Amin Karbasi
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Multi-step LLM pipelines fail through interactions among retrieval, reasoning, and formatting steps, so prompt-only optimization can miss bottlenecks in the chain. We present FAPO (Fully Autonomous Prompt Optimization), a framework that lets Claude Code optimize an LLM pipeline inside a standardized codebase. FAPO evaluates a pipeline, inspects intermediate steps, diagnoses failures, proposes scoped changes, and validates variants repeatedly to optimize against a score function. It first tries prompt edits and, only when prompt optimization appears insufficient, changes chain structure within the permitted scope when attribution identifies a structural bottleneck. Across six benchmarks and three task models, FAPO beats the baseline GEPA in 15 of 18 model-benchmark comparisons. In 11 model-benchmark comparisons, FAPO wins with non-overlapping mean ± trial-standard-deviation ranges, and the mean FAPO-GEPA gain is +14.1 pp. In the six HoVer and IFBench comparisons where prompt-first search escalated to structural changes, FAPO wins all six with a mean gain of +33.8 pp. FAPO also improves performance on security tasks: on CTIBench-RCM, a security CVE-to-CWE task, prompt-only FAPO lifts test accuracy by +4.0 pp on GPT-5, +7.1 pp on Foundation-Sec-8B-Instruct, and +2.0 pp on Foundation-Sec-8B-Reasoning. These results position FAPO as a state-of-the-art pipeline optimization technique for both general-purpose and security-focused tasks.
[AI-91] Configurable Clinical Information Extraction with Agentic RAG: What Works, What Breaks, and Why
链接: https://arxiv.org/abs/2606.19602
作者: Osman Alperen Çinar-Koraş,Marie Bauer,Sameh Khattab,Merlin Engelke,Moon Kim,Stephan Settelmeier,Shigeyasu Sugawara,Fabian Freisleben,Felix Nensa,Jens Kleesiek
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Patient contexts span hundreds of heterogeneous documents and thousands of structured data points, yet the document-level metadata that AI systems need for retrieval and triage is absent or incomplete. Standard retrieval-augmented generation fails on this data, mishandling temporal reasoning, cross-document dependencies, and missing metadata. We deploy ACIE (Agentic Clinical Information Extraction) at University Medicine Essen: an on-premise agentic RAG pipeline that reasons over complete patient contexts and grounds every answer in source passages for clinician verification. We quantify the metadata gap, trace the architectural decisions it shaped, and evaluate extraction alongside an independent retrospective lymphoma registry study, in which nuclear-medicine physicians verify every extracted value against its cited sources. Across 7,326 judgments, clinicians accepted 96.5% of extractions, with per-type acceptance ranging from 80% to 99%.
[AI-92] PrefSQA: Pairwise Preference Prediction for Speech Quality Assessment and the Critical Role of High Quality Datasets INTERSPEECH2026
链接: https://arxiv.org/abs/2606.19597
作者: Junyi Fan,Donald S. Williamson
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to INTERSPEECH 2026
Abstract:Mean opinion scores (MOS) are widely used for speech quality assessment, yet scalar labels are sensitive to rater variability and listening test differences. This introduces labeling noise, which limits the reliability of MOS prediction. Preference prediction reduces this variability as listeners compare signals directly, producing cleaner labels. We study MOS-free preference prediction and propose PrefSQA, which incorporates uncertainty-aware logits, an impairment attention head, and a module based on non-matching-reference comparisons. We use and refine five datasets, including MOS-derived and low-noise simulated sets with matching and non-matching content, experiment with human preference sets, and test on unseen data. Experiments show small improvements on MOS-derived data, while other sets reveal clear improvement over the baselines, highlighting the value of high-quality preference data and demonstrating the effectiveness of the proposed method.
[AI-93] IHBench: Evaluating Post-Interruption Recovery in Voice Agents with Structured Workflows
链接: https://arxiv.org/abs/2606.19595
作者: Ahmad Salimi,Wentao Ma,Yuzhi Tang,Dongming Shen,Mu Li,Alex Smola
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Voice agents deployed in structured workflows (customer service, healthcare scheduling, account management) must handle frequent user interruptions while maintaining progress through multi-step procedures. Existing benchmarks for speech-capable models focus on the timing of interruptions: barge-in detection, endpointing, and turn-taking dynamics. They leave unmeasured what happens after the interruption: does the agent resume the workflow at the correct step? Does it address the user’s interjection? Does it avoid re-delivering content the user already heard? We introduce IHBench (Interruption Handling Benchmark), a benchmark that evaluates post-interruption recovery in voice agents executing state-machine-driven workflows across 10 enterprise domains. Six interruption types are injected at controlled points mid-utterance, with per-interruption evaluation rubrics generated alongside the data. Each interruption is scored on two axes: task fulfillment and recovery quality. We evaluate 27 audio-language model configurations from OpenAI, Google, and the open-weight community. Models vary widely, and recovery quality depends strongly on the interruption type. Across our experiments, closed-weight models are consistently more robust to interruptions than open-weight ones: they win far more often on task fulfillment, degrade roughly 3.3x more slowly as conversations grow longer, and show no audio-versus-text modality gap, whereas the open-weight models lose ground on all three. A human study validates the LLM judge against human annotators, and a cross-benchmark analysis against AudioMultiChallenge indicates that recovery quality is a largely distinct capability axis.
[AI-94] Analyzing the Narration Gap in LLM-Solver Loops
链接: https://arxiv.org/abs/2606.19588
作者: Zunchen Huang,Songgaojun Deng
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Logic in Computer Science (cs.LO)
备注:
Abstract:Formal tools such as SAT and SMT solvers are increasingly embedded in language model reasoning pipelines when a safety or security critical question can be formulated in logic. Unlike chain of thought whose steps are sampled from the model distribution without formal guarantee, a solver produces a sound and independently verifiable answer. However, the soundness guarantee can be lost in the interaction between the solver and the model. The hybrid pipeline has three components: formalizing the question, deciding it, and narrating the result. Prior work has studied the formalization and decision, but not narration, which is the step that turns a formal tool’s output into the user answer. To fill the narration gap, we first model the LLM-solver loop as a verified decision procedure. We further evaluate five open-sourced models under prompt injection, and we find certificate gating makes the solver verdict sound, while an adversary can invert a verified conclusion across phrasings and channels. We study the mitigation through hardened prompt that reduces injection significantly but cannot eliminate it and still suffers under adaptive attack. Combining the formal analysis and empirical studies, we show in the LLM-solver loop, robustness does not reach to the answer that the user finally reads.
[AI-95] FlowFake: Liquid Networks for Audio Deepfake Detection ICML2026
链接: https://arxiv.org/abs/2606.19579
作者: Shivaay Dhondiyal,Divyansh Sharma,Dinesh Kumar Vishwakarma
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注: Accepted at the Workshop on Learning to Listen: Machine Learning for Audio at ICML 2026
Abstract:Audio deepfakes generated by neural text-to-speech and voice-cloning systems threaten speaker verification and public discourse at scale. The core challenge is cross-dataset generalization: detectors trained on one synthesis pipeline collapse on unseen forgeries. We argue that this failure arises primarily because structural synthetic speech artifacts are multi-timescale trajectory anomalies. Yet every existing detector aggregates fixed-window frame statistics, which misaligns the architecture with the signal. We propose FlowFake, a Liquid Time-Constant (LTC) architecture whose hidden state evolves via a learned ODE, with per-neuron adaptive time constants simultaneously resolving spectral (10ms) and prosodic (2s) cues. At only 34K parameters FlowFake achieves formal BIBO stability and O(dt^4) integration error. On a four-dataset cross-domain benchmark (ASVspoof2019-LA, FakeOrReal, InTheWild, MLAAD), FlowFake reaches 75.29% on ASVspoof2019 trained only on FakeOrReal and 79.97% trained only on MLAAD. It outperforms RawGAT-ST and Whisper-DF on every evaluated pair and matches SSL Wav2vec2 (300x larger) at 0.01% of its parameter count. The source code is available at: this https URL
[AI-96] Exploring Feature Extraction Technique Parameters for Acoustic Gunshot Classification
链接: https://arxiv.org/abs/2606.19568
作者: Sinclair Gurny,Ryan Quinn
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注:
Abstract:Acoustic gunshot detection is a problem with applications across civilian public safety, military operations, and wildlife conservation, yet the field lacks a rigorous exploration of feature extraction techniques with a focus on generalization to realistic data. The mixed effectiveness of commercial gunshot detection and classification systems indicates an open problem that is not adequately addressed by the current literature. In this paper, we present a systematic investigation of common feature extraction techniques using a dataset of 23,000 gunshot recordings across 85 firearms and 21 calibers. We benchmark three feature extraction techniques with 12 total unique parameter sets using ResNet-18. Our results demonstrate that using the correct feature extraction technique can improve top-1 accuracy by up to 20%, and utilizing the correct parameters for a given feature extraction technique can improve that value by up to 4.7%.
[AI-97] GDGU: A Gradient Difference-based Graph Unlearning Method for Cyberattack Localization in Electric Vehicle Charging Networks
链接: https://arxiv.org/abs/2606.19566
作者: Nanhong Liu,Mucun Sun,Jie Zhang
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
备注:
Abstract:Electric vehicle charging stations (EVCSs) can expose distribution feeders to cyberattacks. While machine learning methods, including graph neural networks, can localize which bus is compromised, significant challenges remain in data sharing and model training. For example, privacy regulations grant EVCS owners the right to delete their training data from a deployed model, yet retraining from scratch on every request is computationally prohibitive. To address this, we study graph unlearning (GU) for EVCS cyberattack localization, formulated as a feature-level unlearning problem on a graph-level multi-label classification task. Specifically, we propose gradient difference-based graph unlearning (GDGU), which removes the influence of the requested deletion data through a first-order parameter correction. The correction is computed from the gradient difference between the original training data and a modified dataset in which only the charging power features at the requested EVCS buses are unlearned. Then, a batch-normalization recalibration and a brief recovery fine-tuning step are applied to restore localization utility. We benchmark GDGU against two second-order GU baselines on the IEEE 34-bus, 123-bus, and 8500-node distribution networks across three graph neural network backbones and cumulative unlearning scenarios. GDGU matches the strongest baseline on localization utility and reaches forgetting fidelity close to full-retraining, while unlearning 10 to 12 times faster than retraining from scratch and using far less memory than the second-order GU baselines.
[AI-98] ITNet: A Learnable Integral Transform That Subsumes Convolution, Attention, and Recurrence
链接: https://arxiv.org/abs/2606.19538
作者: Ashim Dhor,Rasel Mondal,Pin Yu Chen
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Convolutional networks, recurrent networks, and transformers each encode different inductive biases – locality, sequential memory, and content-dependent pairwise interaction – and have remained mathematically distinct since their inception. We show that this fragmentation reflects not a fundamental diversity in how signals should be processed, but rather incomplete views of a single underlying mathematical object: a learnable integral transform. We introduce the Integral Transform Network (ITNet), a unified architecture built around a learnable kernel that depends jointly on positions and features. This kernel is implemented as a small neural network, specifically an MLP, that models pairwise interactions, enabling the model to adapt its behavior from data. We show that convolution, self-attention (including multi-head), and autoregressive recurrence (including LSTM, GRU, S4, and Mamba) arise as special cases under appropriate parameterizations, and that ITNet is a universal approximator of continuous operators. To make this practical, we develop tiled kernel fusion, importance-weighted Monte Carlo integration, and learned low-rank factorization, enabling efficient and scalable computation. A single ITNet architecture with a shared operator and lightweight modality-specific encoders matches or exceeds specialized baselines on ImageNet-1K, GLUE, ModelNet40, VQAv2 and NLVR2. The results demonstrate that a single learned interaction mechanism can recover the behavior of all three architectural families from data.
[AI-99] A Tool for the Synthesis of Adaptive Probabilistic Processors Based on the Ising Model MICRO
链接: https://arxiv.org/abs/2606.19533
作者: Jonathan Juracy Carneiro da Silva,Leonardo R. Gobatto,Jose Rodrigo Azambuja
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注: ACM/IEEE/SBC/SBMICRO Symposium on Integrated Circuits and Systems Design 2026
Abstract:This work presents a tool for the synthesis and simulation of probabilistic architectures for solving combinatorial optimization problems by mapping them to the Ising model. The proposed approach automatically constructs the Ising Hamiltonian and determines the number of probabilistic elements (p-bits) based on problem characteristics such as size and topology. Furthermore, the tool introduces an adaptive strategy for selecting the most suitable update algorithm among Gibbs Sampling, Simulated Annealing (SA), Simulated Quantum Annealing (SQA), and cluster-based methods. Experimental results using benchmark problems demonstrate improved convergence behavior and flexibility compared to fixed approaches. The proposed framework enables systematic evaluation of probabilistic computing strategies and supports the development of future hardware implementations based on MTJs and p-bits.
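As background for the Ising mapping and the p-bit updates mentioned above, here is a minimal sketch of an Ising energy and one Gibbs-style p-bit sweep. It is not the tool described in the abstract: the function names are hypothetical, the coupling matrix `J` is assumed symmetric with a zero diagonal, and the energy convention H(s) = -Σ_{i<j} J_ij s_i s_j - Σ_i h_i s_i is one common choice.

```python
import math
import random
from typing import List


def ising_energy(J: List[List[float]], h: List[float], s: List[int]) -> float:
    """H(s) = -sum_{i<j} J[i][j] s_i s_j - sum_i h[i] s_i, with spins s_i in {-1, +1}."""
    n = len(s)
    pair = sum(J[i][j] * s[i] * s[j] for i in range(n) for j in range(i + 1, n))
    field = sum(h[i] * s[i] for i in range(n))
    return -pair - field


def pbit_sweep(J: List[List[float]], h: List[float], s: List[int],
               beta: float, rng: random.Random) -> List[int]:
    """Update every spin once, in order, using the p-bit rule s_i = sgn(tanh(beta * I_i) - r)."""
    n = len(s)
    for i in range(n):
        # Local field seen by spin i from its neighbours and its bias.
        local = h[i] + sum(J[i][j] * s[j] for j in range(n) if j != i)
        # r is uniform in [-1, 1); a large beta makes the update nearly deterministic.
        s[i] = 1 if math.tanh(beta * local) > rng.uniform(-1.0, 1.0) else -1
    return s
```

Raising `beta` over successive sweeps turns the same update into a simple simulated-annealing schedule; the adaptive choice among update algorithms described in the abstract is not modelled here.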
[AI-100] Techniques for Peak Memory Reduction for LoRA Fine-tuning of LLMs on Edge Devices
链接: https://arxiv.org/abs/2606.19528
作者: Hassan Dbouk,Matthias Reisser,Prathamesh Mandke,Likhita Arun Navali,Christos Louizos
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Hassan Dbouk and Matthias Reisser contributed equally to this work
Abstract:Fine-tuning of Large Language Models (LLMs) using Low-Rank Adaptation (LoRA) on an end-user’s data offers personalized experiences while keeping data private, but faces severe memory constraints on consumer hardware. Peak memory during fine-tuning often exceeds device limits, especially for models with billions of parameters and long-context training data. This paper introduces a suite of complementary techniques to reduce memory footprint without sacrificing model quality: (1) base model quantization with on-the-fly dequantization, (2) memory-efficient checkpointing combining selective activation caching and disk offloading, (3) softmax approximation using semantically relevant token subsets, and (4) logits masking. Experiments on Llama-3.2 3B and Qwen-2.5 3B demonstrate up to 26× and 28× reduction in peak memory, enabling fine-tuning on resource-constrained devices.
[AI-101] Emergent Alignment ICML2026
链接: https://arxiv.org/abs/2606.19527
作者: Martin Kolář
类目: Artificial Intelligence (cs.AI)
备注: Rejected from ICML 2026
Abstract:Can Large Language Models (LLMs) discern when their own outputs are misaligned with human ethics? And can they self-correct? We endow an LLM with a conscience step that reviews its own reasoning and outputs, and we extend the training loss with an alignment component using Direct Preference Optimization (DPO) to steer the model away from non-ethical outputs. The result is an online technique to align models in a wide range of applications: training, fine-tuning, adversarial prompting, and zero-shot learning. It does not require a weaker or stronger judge, relying instead on a frozen copy of itself. In previous work, the Emergent Misalignment scenario showed a range of emergent unethical behaviors from fine-tuning the model to hack code. Instead, we empirically show how to achieve Emergent Alignment: a single high-level introspective question steers training toward an ethical model under the same code hacking scenario.
[AI-102] REVEAL: Differentiable Phenotypic Grouping for Vision-Language Retinal Modeling of Alzheimer’s Disease Risk MICCAI2026
链接: https://arxiv.org/abs/2606.19522
作者: Ethan Elio Meidinger,Seowung Leem,Zeyun Zhao,Ruogu Fang
类目: Artificial Intelligence (cs.AI)
备注: Accepted for publication at MICCAI 2026
Abstract:The retina offers a noninvasive window into neurodegenerative disease, capturing subtle structural patterns associated with a risk of future cognitive decline. Vision-language alignment frameworks such as REVEAL have shown that pairing retinal fundus images with structured clinical risk narratives improves early prediction of Alzheimer’s disease (AD). A key design choice in these approaches is the use of phenotypic grouping, where individuals with similar risk profiles are treated as multi-positive pairs during contrastive learning. However, existing methods operationalize phenotypic similarity as a discrete construct, relying on hard group assignments that impose rigid supervision and decouple group formation from representation learning. We propose a continuous formulation of phenotypic structure within contrastive learning. Rather than assigning samples to fixed clusters, we model inter-subject similarity as a differentiable weighting function derived from intra-modality embedding similarities in both retinal images and risk profiles. These weights define soft multi-positive relationships through a continuous aggregation operator, enabling graded supervision that reflects the spectrum nature of disease risk. We further introduce a soft-target contrastive objective that jointly learns cross-modal alignment and phenotypic structure in an end-to-end manner. Evaluated on UK Biobank retinal imaging data for incident AD prediction, the proposed framework consistently outperforms discrete group-based contrastive learning and standard vision-language baselines. By treating phenotypic similarity as a learnable, continuous signal rather than a fixed grouping rule, our approach provides a principled and robust foundation for population-scale neurodegenerative risk modeling from multi-modal retinal and clinical data.
[AI-103] LLM Doesn’t Know What It Doesn’t Know: Detecting Epistemic Blind Spots via Cross-Model Attribution Divergence on Clinical Tabular Data ICML2026
链接: https://arxiv.org/abs/2606.19509
作者: Akshat Dasula,Prasanna Desikan,Jaideep Srivastava
类目: Artificial Intelligence (cs.AI)
备注: Accepted at EIML@ICML 2026
Abstract:Large language models (LLMs) are increasingly applied to structured clinical data, yet whether they can recognize the limits of their own knowledge on such tasks remains unexplored. We study this question through the lens of cross-model attribution divergence with the goal of reducing epistemic uncertainty for structured tasks, comparing Qwen 2.5 7B and XGBoost on a prediction task via attribution divergence analysis. We report four findings. First, LLM verbalized confidence is epistemically vacuous: it outputs a near-constant value (0.856-0.937) regardless of whether accuracy is 49% or 75.3%, tracking prompt format rather than prediction quality. Second, the LLM exhibits an inverse difficulty effect: accuracy drops to 64.8% when XGBoost is 99% correct, but matches XGBoost (73.8% vs. 73.1%) when it is moderately uncertain. Third, few-shot examples and SHAP-derived feature evidence are orthogonal, super-additive interventions: they reduce the Attribution Disagreement Score (ADS) from 1.54 to 0.38 and improve accuracy from 49% to 75.3% without training. Fourth, a cross-model calibrator that determines LLM reliability using attribution divergence signals reduces expected calibration error from 0.254 to 0.080, replacing uninformative verbalized confidence with patient-specific reliability estimates, without accessing model internals or requiring repeated inference. We frame these findings as a cold start problem for LLMs on structured data and outline a path toward genuine epistemic self-awareness.
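Expected calibration error (ECE), the metric reported in the fourth finding, can be computed as below. This is the common equal-width binning estimator, written as a sketch; the abstract does not say which binning the authors used, and the function name and `n_bins=10` default are assumptions.

```python
from typing import List, Sequence


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """Weighted average of |mean confidence - accuracy| over equal-width confidence bins."""
    n = len(confidences)
    bins: List[list] = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, ok))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece
```

A model that always reports a confidence near 0.9 while being right only about half the time lands in a single bin with a gap of roughly 0.4, which is the kind of mismatch the first finding describes.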
[AI-104] Hidden Anchors in Multi-Agent LLM Deliberation
链接: https://arxiv.org/abs/2606.19494
作者: Apurba Pokharel,Ram Dantu
类目: Artificial Intelligence (cs.AI)
备注: 13 pages, 6 figures, 7 tables
Abstract:Multi-agent LLM deliberation, where agents exchange and revise answers over several rounds, is increasingly used to improve reasoning and accuracy, yet how and why it works is rarely modelled. Such deliberation mirrors how humans reach decisions. As social animals we are pulled both by the group, the herd effect that classical opinion-dynamics models such as DeGroot and Friedkin–Johnsen capture, and by our own internal belief, which they do not. We model multi-agent deliberation as a closed-loop dynamical system in which each agent carries a hidden internal belief, its anchor, that continually pulls its opinion regardless of its neighbours. We show this anchor can be recovered from the deliberation alone, and that it explains a behaviour classical consensus rules forbid: an agent’s confidence in the correct answer can climb past where any agent started, escaping the space (convex hull) formed by the initial beliefs. Checking whether the recovered anchor also predicts held-out runs (generalizes) gives a simple test for when a model is truly driven by such an anchor. Across three open-weight model families this is a spectrum, not all-or-nothing. All anchors exert about equally strong influence, but they differ in where the anchor sits, and only when it sits far from the initial opinions does deliberation escape the hull and need the full closed-loop model.
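The hull-escape behaviour can be illustrated with a toy scalar model. The sketch below is not the authors' model: it contrasts plain DeGroot averaging with a hypothetical anchored update x(t+1) = λ·W·x(t) + (1−λ)·a, where the anchor vector `a`, the mixing weight `lam`, and all function names are assumptions. `W` is assumed row-stochastic.

```python
from typing import Callable, List


def degroot_step(W: List[List[float]], x: List[float]) -> List[float]:
    """One round of DeGroot averaging: each opinion becomes a weighted mean of all opinions."""
    n = len(x)
    return [sum(W[i][j] * x[j] for j in range(n)) for i in range(n)]


def anchored_step(W: List[List[float]], x: List[float],
                  anchors: List[float], lam: float) -> List[float]:
    """Blend the neighbourhood average with each agent's fixed hidden anchor."""
    mixed = degroot_step(W, x)
    return [lam * m + (1.0 - lam) * a for m, a in zip(mixed, anchors)]


def run(step: Callable[[List[float]], List[float]],
        x0: List[float], rounds: int) -> List[float]:
    """Apply a one-round update repeatedly, starting from the initial opinions x0."""
    x = list(x0)
    for _ in range(rounds):
        x = step(x)
    return x
```

With initial opinions [0.4, 0.6] and uniform weights, DeGroot averaging stays inside [0.4, 0.6], whereas anchors placed at 0.9 pull both opinions above 0.6. The classical Friedkin–Johnsen model corresponds to using the initial opinions themselves as anchors, which keeps the trajectory inside the hull.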
[AI-105] Concept Flow Models: Anchoring Concept-Based Reasoning with Hierarchical Bottlenecks
链接: https://arxiv.org/abs/2606.19489
作者: Ya Wang,Adrian Paschke
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Concept Bottleneck Models (CBMs) enhance interpretability by projecting learned features into a human-understandable concept space. Recent approaches leverage vision-language models to generate concept embeddings, reducing the need for manual concept annotations. However, these models suffer from a critical limitation: as the number of concepts approaches the embedding dimension, information leakage increases, enabling the model to exploit spurious or semantically irrelevant correlations and undermining interpretability. In this work, we propose Concept Flow Models (CFMs), which replace the flat bottleneck with a hierarchical, concept-driven decision tree. Each internal node in the hierarchy focuses on a localized subset of discriminative concepts, progressively narrowing the prediction scope. Our framework constructs decision hierarchies from visual embeddings, distributes semantic concepts at each hierarchy level, and trains differentiable concept weights through probabilistic tree traversal. Extensive experiments on diverse benchmarks demonstrate that CFMs match the predictive performance of flat CBMs, while substantially mitigating information leakage by reducing effective concept usage. Furthermore, CFMs yield stepwise decision flows that enable transparent and auditable model reasoning with hierarchical class structures.
[AI-106] Can In-Context Learning Support Intrinsic Curiosity?
链接: https://arxiv.org/abs/2606.19476
作者: Eric Elmoznino,Sangnie Bhardwaj,Johannes von Oswald,Rajai Nasser,Blaise Agüera y Arcas,João Sacramento,Rif A. Saurous,Guillaume Lajoie
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Effective machine learning depends not only on how we model data, but also on what data we choose to collect. While large sequence models have revolutionized data modeling, the problem of automated data selection, or “intrinsic curiosity”, remains a significant challenge. Classic approaches incentivize exploration by rewarding an agent based on its “learning progress”, which measures how much a newly acquired observation improves a world model’s predictive ability. However, evaluating these rewards traditionally requires expensive inner loops of gradient descent updates within each trajectory, rendering them computationally impractical at scale. In this work, we investigate whether the emergent in-context learning (ICL) capabilities of sequence models can eliminate this bottleneck by serving as immediate, update-free world models. Specifically, we evaluate whether an exploration policy can be trained to maximize learning progress, using solely the prediction errors and counterfactual context manipulations of an in-context learner. We first prove that in general Markov decision processes, this is in fact impossible in an unbiased way: the resulting intrinsic rewards either suffer from nuisance terms that bias their estimation of true learning progress, or they cannot be implemented using an in-context learner’s prediction errors. Conversely, we prove a positive result for a broad subclass of non-temporal settings, encompassing active learning and Bayesian Experimental Design: here, ICL-derived rewards successfully bound and asymptotically converge to the true learning progress. We corroborate our theory with controlled experiments across continuous and symbolic environments, demonstrating that our ICL-driven framework successfully trains curious data-collection policies that explore optimally.
[AI-107] Secure Coding Drift in LLM-Assisted Post-Quantum Cryptography Development: A Gamified Fix SIGIR
链接: https://arxiv.org/abs/2606.19474
作者: R.D.N. Shakya,C.P. Wijesiriwardana,S.M. Vidanagamachchi,Nalin A.G. Arachchilage
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: Accepted for 2026 SIGIR Workshop on Vulnerabilities in Generative Systems for Information Retrieval track
Abstract:The transition to Post Quantum Cryptography (PQC) introduces considerable implementation complexity, requiring strict adherence to constant-time execution, side channel resistance, and precise parametrisation. Simultaneously, large language models (LLMs) are heavily embedded in software development workflows, including cryptographic engineering. While LLMs improve productivity, evidence shows that they frequently generate insecure or suboptimal code, particularly in security critical domains. This paper introduces Secure Coding Drift in PQC, a novel socio technical vulnerability model capturing the gradual degradation of secure coding practices due to sustained reliance on LLM-generated code. Unlike prior work that focuses on static vulnerabilities, we conceptualise security risk as a longitudinal behavioural phenomenon rising from human AI interaction. To mitigate this, we propose a gamified, LLM augmented secure coding framework that embeds adversarial evaluation, behavioural feedback, and security scoring into development workflows. Our approach reframes LLMs from passive assistants into active security co-pilots, contributing toward safer PQC implementation in AI mediated environments.
[AI-108] Measuring Curriculum Alignment across Topical Coverage, Competency, and Cognitive Depth: A Longitudinal Framework Applied to CS2013 and CS2023
链接: https://arxiv.org/abs/2606.19469
作者: Sherzod Turaev,Mary John,Saja Aldabet,Mamoun Awad,Nazar Zaki,Khaled Shuaib
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 24 pages, 5 figures, 8 tables
Abstract:Undergraduate computer science is governed by international curricular guidelines revised about once a decade, yet programs lack a reliable, reproducible way to measure how completely they cover the current guidelines and how that coverage shifts when the guidelines are restructured. We address this with a human-in-the-loop pipeline that measures a program’s coverage of an external body of knowledge, applied longitudinally to one accredited BSc in Computer Science against Computer Science Curricula 2013 (CS2013) and 2023 (CS2023). The pipeline represents the program and each guideline as structured corpora, generates candidate course-to-knowledge-unit matches by semantic retrieval, and confirms them through human judgment under an explicit coverage definition. Of seven benchmarked retrievers, a reciprocal-rank-fusion ensemble was strongest, and a reputed long-context model underperformed a small sentence model, so retriever choice must be measured. Both maps were validated by an independent second rater (Cohen’s kappa 0.64 for CS2023, 0.69 for CS2013). The program covers 49.7% of CS2023 and 50.9% of CS2013 knowledge units, near-constant across a decade. Extending the same retrieve-then-confirm design to competency articulation and cognitive depth shows that the program articulates the competency for ~88% of covered units under each guideline, yet delivers it at the recommended depth for 76% of present units under CS2023 against 95% under CS2013, a gap reflecting the newer guideline’s raised expectations, not the program. The longitudinal comparison separates persistent structural gaps (parallel and distributed computing, foundations of programming languages, systems fundamentals), uncovered against both guidelines and ABET, from differences that reflect the standard’s evolution. The instrument is reusable and available from the authors on request.
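The abstract names a reciprocal-rank-fusion ensemble as the strongest retriever but does not give its details. A minimal sketch of standard reciprocal rank fusion follows, assuming each retriever returns an ordered list of candidate knowledge units. The identifiers `KU-A` to `KU-D` and the smoothing constant `k=60` are illustrative assumptions, not values taken from the paper.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists into one.

    Each item scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a common default, assumed here.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by name for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical outputs of three retrievers for one course description.
rankings = [
    ["KU-A", "KU-B", "KU-C"],
    ["KU-B", "KU-A", "KU-D"],
    ["KU-B", "KU-C", "KU-A"],
]
fused = reciprocal_rank_fusion(rankings)
print(fused)
```

In a retrieve-then-confirm pipeline of the kind described, a fused list like this would only propose candidate matches; the human raters still make the coverage decision.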
[AI-109] Playful Agentic Robot Learning
链接: https://arxiv.org/abs/2606.19419
作者: Junyi Zhang,Jiaxin Ge,Hanjun Yoo,Letian Fu,Zihan Yang,Yaowei Liu,Raj Saravanan,Shaofeng Yin,Justin Yu,Dantong Niu,Zirui Wang,Roei Herzig,Ken Goldberg,Yutong Bai,David M. Chan,Ion Stoica,Angjoo Kanazawa,Jiahui Lei,Haiwen Feng,Trevor Darrell
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Project page: this https URL
Abstract:Current agentic robot systems can write executable Code-as-Policy programs, observe feedback, and revise behavior across multiple attempts, but they remain largely task-driven: reusable skills are acquired only after explicit instructions. We study Playful Agentic Robot Learning, where an embodied coding agent uses self-directed play as a continual skill-learning stage before downstream tasks arrive. We introduce RATs, Robotics Agent Teams designed for play-time skill acquisition. During play, RATs proposes novel yet learnable exploratory tasks, plans and executes robot-code policies, verifies intermediate progress, diagnoses failures, retries with dense, step-level feedback, and distills successful executions into a persistent code skill library. At test time, the agent reuses relevant skills from this frozen library to help solve new tasks. Experiments in LIBERO-PRO and MolmoSpaces show that play-learned skills improve held-out downstream tasks over no-play and random-play baselines, with 20.6 and 17.0 percentage-point gains over CaP-Agent0 on LIBERO-PRO and MolmoSpaces, respectively. Moreover, the learned skills can be plugged into other inference-time Code-as-Policy agents by simply retrieving them into the context, improving RoboSuite and real-world transfer by 8.9 and 8.8 points, respectively, without finetuning the underlying model.
[AI-110] JustDiag!: A Diagnostic Justification Engine for Accountable Root Cause Analysis
链接: https://arxiv.org/abs/2606.19407
作者: Tingzhu Bi,Xinrui Jiang,Xun Zhang,Pengcheng Su,Congjie He,Jinglin Li,Ping Wang,Meng Ma
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models can produce fluent root cause analyses, but fluent final answers alone are insufficient evidence for accountability in high-stakes operations. In real incident response, engineers need to know what evidence supported a diagnosis, which alternatives were considered, where contradictions remained, and whether the system resolved the case or preserved uncertainty. We address this gap with JustDiag, a diagnostic justification engine for RCA that maintains an explicit process state over evidence, findings, competing hypotheses, conflicts, and next checks. We evaluated the system on 66 real-world incidents using a two-layer protocol that separately scores final-answer quality and process quality. Relative to a matched control without diagnostic justification, JustDiag achieved stronger outcome and process scores, while accepting slightly lower terminal completion due to more calibrated non-closure. These results suggest that accountable RCA requires explicit diagnostic justification artifacts and process-aware evaluation, not only fluent final answers.
[AI-111] VERITAS: Verifier-Guided Proof Search for Zero-Shot Formal Theorem Proving
链接: https://arxiv.org/abs/2606.19399
作者: Manish Acharya,Zhenyu Liao,Yueke Zhang,Kevin Leach,Yu Huang,Yifan Zhang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Programming Languages (cs.PL)
备注:
Abstract:LLM-based formal provers often collapse rich verifier signals (syntax errors, type mismatches, partial goal progress) into a binary pass/fail bit. We present VERITAS, a zero-shot framework that routes every verifier signal back into proof search through a two-phase protocol: Best-of-N sampling first, then a critic-guided MCTS pass that ingests Phase 1 failures as explicit negative examples. The protocol preserves every theorem solved by its own Phase 1 sweep, so Phase 2’s additional solves are attributable to feedback-driven exploration. VERITAS reaches 40.6% on miniF2F (vs. an independently run Best-of-5 at 36.9%, Portfolio 26.2%) and 7.3% on VERITAS-CombiBench, a 55-theorem combinatorics benchmark we release on which Best-of-5 (1.8%) falls below Portfolio (3.6%), exposing that unguided sampling hurts when correct lemma names must be recovered iteratively from verifier feedback. Artifacts are available on GitHub.
[AI-112] Execution-bound advisory automation for agentic AI: a reproducible AIBOM-driven CSAF-VEX framework
链接: https://arxiv.org/abs/2606.19390
作者: Petar Radanliev,Omar Santos,Carsten Maple,Kay Atefi
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:A protocol-driven framework is presented that binds SBOM and AIBOM artefacts to deterministic environment capture and structured runtime telemetry. Exploitability is computed from declared artefacts, observed activation conditions, and enforced execution policies. CSAF VEX advisories are generated from combined static and runtime evidence, cryptographically signed, and validated through deterministic replay. Evaluation uses approximately 10,000 component entries across synthetic agentic AI workloads of 50 to 5,000 components, incorporating OSV, GitHub Advisory, KEV, and EPSS datasets.
[AI-113] Interpretable and Verifiable Hardware Generation with LLM-Driven Stepwise Refinement
链接: https://arxiv.org/abs/2606.19387
作者: You Li,Samuel Mandell,David Z. Pan
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) have achieved remarkable success in software development. However, they are susceptible to hallucinations, meaning that they can introduce subtle semantic and logical errors. Due to the high stakes in chip design and manufacturing, hardware engineers are still reluctant to rely on LLMs for register-transfer level (RTL) generation. In this paper, we propose a hardware generation framework that combines the creativity and broad knowledge of LLMs with the explainability and mathematical rigor of formal methods. Specifically, we devise a set of transformation rules that cover various design decisions and hardware features. By iteratively applying these rules, an LLM agent can convert a design specification into an RTL program with guaranteed correctness. Experimental results demonstrate the effectiveness and efficiency of the framework.
[AI-114] Bistable by Construction: Wall-Clock-Calibrated State Monitors Have No Moment-Detection Regime at Agent Cadence
链接: https://arxiv.org/abs/2606.19386
作者: Manvendra Modgil
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 10 pages, 5 figures. Sequel to arXiv:2606.04296. Pre-registered; falsification clauses honored (H5 unsupported; H7 strict band 16/20). Repo: this https URL
Abstract:Runtime monitors for autonomous agents commonly threshold an accumulated internal state - a behavioural baseline, a drift statistic, or, in our prior work, a modelled affective state. We previously reported a State Saturation Trap: threshold-on-state triggers over a continuous affect engine become near-constant alarms on SWE-bench debugging agents (Modgil 2026). A post-release audit found the engine received dt=0 between actions, so its exponential decay never operated: the published trap is a pure-accumulator result. We correct the record (erratum, v2) and treat the flaw as an experiment. The key variable it exposes is whether a monitor’s dynamics are calibrated in sample time (per observation, as in CUSUM) or wall-clock time (half-lives in seconds, as in affect models and EMA baselines). On fixed-rate streams these coincide; on agent streams, where inter-action time varies by orders of magnitude, they do not. A pre-registered sweep over uniform intervals (dt in 0…600s) on 20 trajectories shows the wall-clock level trigger has two regimes: at dt=1s a constant alarm (20/20; median 18 firings); at dt=60s silent. Every critical dt lies in (1,30]s. Real agent runs measure latency at median 1.53s (p90 2.33s); real coding cadence sits inside the trap regime, vindicating the empirical finding under a corrected mechanism. The structure is a property of the calibration class, not the engine: a minimal wall-clock accumulator over the raw error stream reproduces the same cliff, while a sample-time CUSUM over the identical stream is exactly dt-invariant (20/20). A rising-edge trigger with hysteresis fires 0-3 times per trajectory in every condition. We conclude that wall-clock-calibrated leaky-integrator monitors admit no regime in which they act as moment detectors on agent streams; transition detection escapes the trap at every cadence, but does not recover human intervention timing.
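The distinction between sample-time and wall-clock calibration can be illustrated with a toy comparison. This is a sketch under stated assumptions, not the paper's engine: the half-life, drift, thresholds, and the constant error stream are all hypothetical values chosen for illustration.

```python
def leaky_level_alarms(errors, dt, half_life=10.0, threshold=3.0):
    """Wall-clock-calibrated leaky accumulator with a level trigger.

    The state halves every `half_life` seconds, so how much it decays
    between observations depends on the inter-action time `dt`.
    """
    decay = 0.5 ** (dt / half_life)
    state, alarms = 0.0, 0
    for error in errors:
        state = state * decay + error
        if state > threshold:
            alarms += 1  # level trigger: fires on every step above threshold
    return alarms


def cusum_alarms(errors, dt, drift=0.5, threshold=3.0):
    """Sample-time one-sided CUSUM.

    `dt` is accepted only to mirror the signature above; it is never
    used, which is what makes the detector dt-invariant.
    """
    stat, alarms = 0.0, 0
    for error in errors:
        stat = max(0.0, stat + error - drift)
        if stat > threshold:
            alarms += 1
            stat = 0.0  # restart after signalling
    return alarms


# Hypothetical stream: 20 consecutive actions, each with unit error.
errors = [1.0] * 20
for dt in (1.0, 60.0):
    print(dt, leaky_level_alarms(errors, dt), cusum_alarms(errors, dt))
```

With these assumed parameters the leaky accumulator fires on most steps at the short interval and never at the long one, while the CUSUM count is identical for both, which mirrors the two-regime behaviour the abstract describes.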
[AI-115] DynAMO: Dynamic Asset Management Orchestration via Topological Multi-Agent Scheduling
链接: https://arxiv.org/abs/2606.19382
作者: Kanishk Kushwaha,Vikrant Vinod Bansode,Harsh Vardhan,Dhaval C. Patel
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 11 pages, 2 figures, 7 tables, 4 algorithms. Evaluated on the AssetOpsBench industrial benchmark. Code: this https URL
Abstract:While LLM-powered agents offer end-to-end automation for industrial asset lifecycles, real-world Industry 4.0 deployment is hindered by latency, concurrency instability, and safety risks. We present DynAMO (Dynamic Asset Management Orchestration), a deployment-ready engine using a Plan-then-Execute architecture to generate verifiable workflow graphs. DynAMO supports both SequentialWorkflow (topological execution) and ParallelWorkflow (dependency-aware concurrency). By dynamically identifying independent tasks, DynAMO preserves structural correctness and safety while significantly improving efficiency through controlled reasoning overlap. Across six controlled experiments on the AssetOpsBench industrial benchmark, DynAMO demonstrates substantial performance and robustness gains. Parallel execution reduces end-to-end latency by a median of 1.6x over sequential orchestration, rising to 1.8x on highly parallelizable workflows. After instrumenting external tool calls with realistic latencies, a latency decomposition shows that LLM reasoning and orchestration still account for more than 90% of execution time, identifying model inference as the primary system bottleneck. Structured context pruning reduces inference latency by approximately 30%, and DynAMO maintains correct functional behaviour (task completion, agent sequencing, and output quality) while exhibiting graceful degradation under controlled fault injection. Reproducibility analysis further confirms stable execution under repeated runs, with parallel scheduling reducing latency variance. These findings establish DynAMO as a practical blueprint for scalable, safe, and latency-aware agent deployment in Industry 4.0 automation pipelines. Code is available at: this https URL
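The idea of dependency-aware concurrency over a workflow graph can be sketched with Python's standard `graphlib` module. This is not DynAMO's implementation; the task names and the `parallel_batches` helper are hypothetical, and the sketch only shows how independent tasks can be grouped into batches that respect a topological order.

```python
from graphlib import TopologicalSorter


def parallel_batches(dependencies):
    """Group tasks into batches that could run concurrently.

    `dependencies` maps each task to the set of tasks it must wait for.
    Every task in a batch has all prerequisites in earlier batches.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks with no pending prerequisites
        batches.append(ready)
        sorter.done(*ready)  # mark them finished to unlock their successors
    return batches


# Hypothetical asset-management workflow.
workflow = {
    "fetch_sensor_data": set(),
    "fetch_work_orders": set(),
    "detect_anomaly": {"fetch_sensor_data"},
    "plan_maintenance": {"detect_anomaly", "fetch_work_orders"},
}
for batch in parallel_batches(workflow):
    print(batch)
```

A sequential executor would flatten these batches into one ordered list, while a parallel executor could dispatch each batch together. In a real scheduler, tasks would be marked done as they finish rather than a whole batch at a time.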
[AI-116] Improving Code-Switching ASR with Code-Mixing Guided Synthetic Speech INTERSPEECH2026
链接: https://arxiv.org/abs/2606.19381
作者: Yue Heng Yeo,Haoyang Li,Yizhou Peng,Shreyas Gopal,Hexin Liu,Leibny Paola Garcia-Perera,Hardik B. Sailor,Jeremy H. M. Wong,Eng Siong Chng
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注: Accepted to Interspeech 2026
Abstract:Code-switch (CS) Automatic Speech Recognition (ASR) remains challenging due to limited availability of high quality CS text-speech pairs for training. Although synthetic data augmentation via Text-to-speech (TTS) has been explored, existing CS TTS approaches primarily optimise reconstruction fidelity and do not explicitly enforce language-boundary consistency, thereby limiting their effectiveness for CS ASR augmentation. This paper proposes a code-mixing guided preference-learning framework that steers synthetic speech generation toward improved code-switching fidelity using the Code Mixing Index (CMI). Experiments on the SEAME Mandarin-English conversational corpus demonstrate that the proposed method enhances the utility of synthetic data for ASR fine-tuning. Specifically, when fine-tuning Whisper Large, the proposed approach reduces Mixed Error Rate (MER) from 12.1%/17.8% to 8.9%/14.2% on the DevMAN and DevSGE sets, respectively.
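For readers unfamiliar with the Code Mixing Index, a minimal sketch of a widely used utterance-level form is given below: CMI = 100 × (1 − max_i(w_i) / (n − u)), where n is the number of tokens, u the number of language-independent tokens, and w_i the token count of language i. The paper may use a variant, and the token-level language tags and the `other` label here are assumptions for illustration.

```python
from collections import Counter


def code_mixing_index(lang_tags, neutral="other"):
    """Utterance-level Code Mixing Index in [0, 100).

    `lang_tags` holds one language tag per token; tokens tagged with
    `neutral` (e.g. punctuation, named entities) count as
    language-independent. Returns 0.0 for monolingual or empty input.
    """
    n = len(lang_tags)
    counts = Counter(tag for tag in lang_tags if tag != neutral)
    u = n - sum(counts.values())
    if n == u:  # no language-tagged tokens at all
        return 0.0
    return 100.0 * (1.0 - max(counts.values()) / (n - u))


# Hypothetical Mandarin-English utterance, tagged per token.
tags = ["zh", "zh", "en", "zh", "other", "en"]
print(code_mixing_index(tags))
```

A higher value means the dominant language accounts for a smaller share of the utterance, so a preference signal built on CMI would favour synthetic samples whose mixing matches the target text.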
[AI-117] Emyx: Fast and efficient all-atom protein generation
链接: https://arxiv.org/abs/2606.19377
作者: Nicholas J. Williams,Ward Haddadin,Matteo P. Ferla,Constantin Schneider,Nicholas B. Woodall,Ruby Sedgwick,Christian D. Madsen,Andrew L. Hopkins,Edward O. Pyzer-Knapp
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Computational enzyme design requires generating proteins that scaffold catalytic residues and ligands, a task that demands both geometric accuracy and structural diversity from the underlying generative model. Current all-atom generators inherit expensive architectures from structure prediction, leading to high training costs and limited sample diversity. We argue that much of this complexity is unnecessary for generators, which condition on sparse geometric constraints rather than rich co-evolutionary signals. Emyx is a 140M-parameter conditional flow matching model that concentrates capacity within standard transformer blocks, replacing heavy embedding stacks with lightweight conditional representations and sparse connectivity. We additionally derive an exact reparametrisation of the flow matching interpolant into the EDM noise-level framework, bridging flow matching training efficiency with state-of-the-art sampling methods designed for diffusion models without retraining. Despite being the smallest model, Emyx outperforms both Proteína-Complexa and RFdiffusion3 against the AME enzyme design benchmark across success rate under strict evaluation requiring both global fold recovery and catalytic geometry accuracy, structural novelty, scaffold diversity, and geometric validity, while training in just 682 GPU-hours, roughly 4× less than RFdiffusion3.
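The abstract does not state the reparametrisation itself. As a hedged illustration of how such a mapping can arise, assume the common linear interpolant with data $x_0$, Gaussian noise $\varepsilon$, and time $t \in [0, 1)$; the paper's own convention and derivation may differ.

```latex
% Assumed linear flow matching interpolant (convention not confirmed by the source):
x_t = (1 - t)\, x_0 + t\, \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I).

% Dividing by (1 - t) gives a sample in the EDM form "data plus scaled noise":
\frac{x_t}{1 - t} = x_0 + \sigma(t)\, \varepsilon, \qquad \sigma(t) = \frac{t}{1 - t}.

% Inverting the noise level recovers the flow matching time:
t = \frac{\sigma}{1 + \sigma}.
```

Under this assumption, a model trained on $x_t$ can be queried at any EDM noise level $\sigma$ by evaluating it at $t = \sigma / (1 + \sigma)$ on the rescaled input $(1 - t)\,\hat{x}_\sigma$, where $\hat{x}_\sigma = x_0 + \sigma \varepsilon$.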
[AI-118] Protein Representation Learning with Secondary-Structure and Energy-Filtered Hydrogen-Bond Graphs
链接: https://arxiv.org/abs/2606.19374
作者: Mohamed Mouhajir,Limei Wang,El Houcine Bergou,Hajar El Hammouti,Lamiae Azizi,Dongqi Fu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Graph-based representations are widely used in protein modeling, yet many existing approaches rely primarily on sequence adjacency or geometric proximity, which only partially reflect the principles governing protein folding. Proteins instead adopt complex three-dimensional conformations organized around secondary structure elements, such as α-helices and β-sheets, which encode recurring local motifs and stabilizing hydrogen-bond interactions. In this work, we introduce a secondary-structure-aware graph neural network for protein representation learning. Residue-level node representations are augmented with secondary structure assignments, and graph edges are constructed from hydrogen-bond interactions filtered by their energetic strength. This design enables the model to capture both local structural context and long-range couplings that are central to protein stability and function. We evaluate the proposed approach on commonly used protein benchmarks and observe consistent improvements over existing graph-based methods. In addition, the resulting graph representations offer enhanced biological interpretability, as the learned connectivity aligns with established structural motifs. These findings suggest that incorporating secondary structure and energy-filtered hydrogen-bond topology provides an effective inductive bias for protein representation learning. The code is released at this https URL
[AI-119] cAPM: Continual AI-Assisted Pace-Mapping with Active Learning
链接: https://arxiv.org/abs/2606.19373
作者: Dylan O’Hara,Pradeep Bajracharya,Casey Meisenzahl,Karli Gillette,Anton J. Prassl,Gernot Plank,Saman Nazarian,Roderick Tung,John L Sapp,Linwei Wang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Ventricular tachycardia (VT) is a life-threatening rhythm disorder and a major cause of sudden cardiac death. Pace-mapping is a clinical procedure for identifying the intervention target during catheter ablation of VT. It requires clinicians to pace different sites in the ventricles and rapidly interpret the resulting electrocardiograms to determine where to pace next or whether a target site has been identified. Active learning AI models have been proposed to guide clinicians to the next pacing site, showing promise in reducing the number of pacing sites and improving the efficiency of pace-mapping. Existing methods require retraining for each target, without the ability to transfer knowledge across multiple VTs within the same patient or across patients. We introduce cAPM for continuous AI-assisted pace-mapping to capture and transfer knowledge accumulated from past pace-mapping data to reduce the number of pace-mapping data needed for future target VTs. This is made possible by a task-agnostic surrogate neural network that learns the mapping from pacing sites to 12-lead ECG morphology, an active-learning strategy that refines this surrogate model by selecting the most informative pacing site for each target, and a continual learning strategy to do so sequentially while retaining knowledge from prior targets. Evaluated on an in-silico testbed consisting of sequentially-presented localization tasks across different physiological conditions and ventricular geometries, cAPM with and without replay of past data samples achieved an 81% probability of localizing within clinical tolerance (5 mm accuracy) using 4.5 pace-mapping sites, compared to the state-of-the-art active-learning method achieving 38% probability using 13.7 pacing sites. These results provide a strong basis for preparing cAPM towards in-vivo preclinical and clinical studies where it can be used to guide pace-mapping.
[AI-120] Zero-Inflated Gaussian Distributions Enable Parameter-Space Sparsity in Estimation-of-Distribution Algorithms
链接: https://arxiv.org/abs/2606.19369
作者: Andreas Faust,Sven Nitzsche,Juergen Becker
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Estimation-of-distribution algorithms (EDAs) are a powerful class of evolutionary methods for black-box optimization, especially when little is known about the structure of the objective. Whereas classical evolutionary algorithms rely on hand-designed mutation and crossover operators, hard to devise for unknown problem structures, and a source of bias, EDAs sidestep operator design entirely: they fit a probability distribution to the best individuals and sample the next generation from it. EDAs are well established on continuous parameter spaces, but they have not previously been generalized to sparse ones, in which most coefficients of a good solution are exactly zero. Existing sparse black-box optimizers therefore reintroduce exactly what EDAs were designed to avoid: hand-crafted sparsity operators, bi-level schemes alternating between support set and active values, zeroing thresholds, and other baked-in assumptions. We close this gap by proposing multivariate zero-inflated Gaussian (ZIG) distributions as EDA sampling laws. A latent Gaussian model with separate indicator and value dimensions represents sparsity patterns, correlations among active parameters, and the interactions between the two, so sparsity patterns and active values are optimized jointly, hierarchy-free. We show that the latent parameters of this model are identifiable from observed samples, unlike in the missing-data settings where related constructions originate, and introduce practical amortized inversion-based estimators for them. The estimators accurately recover latent correlation structures, and on the Lunar Lander benchmark the resulting ZIG-EDA converges faster and reaches higher final returns than a dense Gaussian EDA, a hand-crafted sparse evolutionary algorithm, and an ad-hoc sparse EDA, while finding controllers with only a small fraction of parameters active.
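The fit-and-sample loop of an EDA with a zero-inflated sampling law can be sketched in a heavily simplified form. The sketch below treats every dimension independently, whereas the paper uses a multivariate latent Gaussian model that also captures correlations; the function names, the fallback standard deviation, and the elite vectors are assumptions for illustration only.

```python
import random
import statistics


def fit_zig(elites):
    """Fit per-dimension (p_active, mean, std) from elite vectors.

    Simplification: dimensions are independent. p_active is the share
    of elites with a nonzero entry; mean and std use only those entries.
    """
    params = []
    for column in zip(*elites):
        active = [v for v in column if v != 0.0]
        p_active = len(active) / len(column)
        mean = statistics.fmean(active) if active else 0.0
        # Fall back to unit spread when there is too little data (assumed).
        std = statistics.pstdev(active) if len(active) > 1 else 1.0
        params.append((p_active, mean, std))
    return params


def sample_zig(params, rng):
    """Draw one candidate: exactly zero with probability 1 - p_active."""
    return [
        rng.gauss(mean, std) if rng.random() < p_active else 0.0
        for p_active, mean, std in params
    ]


# Hypothetical elite set from one generation (two parameters).
elites = [[0.0, 1.0], [0.0, 3.0], [2.0, 0.0], [0.0, 2.0]]
params = fit_zig(elites)
candidate = sample_zig(params, random.Random(0))
print(params, candidate)
```

An outer loop would evaluate sampled candidates, keep the best as the next elite set, and refit. Because zeros are sampled exactly, sparsity needs no thresholding step, which is the property the abstract emphasises.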
[AI-121] Information Lattice Learning as Probabilistic Graphical Model Structure Learning
链接: https://arxiv.org/abs/2606.19366
作者: Haizi Yu,Lav R. Varshney
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注:
Abstract:Information lattice learning (ILL) learns interpretable rules of a signal by alternately projecting the signal onto a partition lattice that encodes a hierarchy of abstractions and lifting selected rules back to the signal domain. When the signal is a probability mass function, we show the probabilistic rules learned by ILL admit a natural probabilistic graphical model (PGM) interpretation and develop this interpretation in detail. A partition in ILL induces a deterministic quotient variable, and a rule is the marginal law of that quotient variable. A rule set is therefore a collection of marginal constraints over interpretable abstractions. General lifting is the feasible family of all joint distributions satisfying those constraints, while special lifting chooses a maximum-ignorance reconstruction, implemented in ILL by an L2 uniformity principle closely related to maximum entropy. Under a Shannon-entropy lifting, the same constraints yield a log-linear factor graph whose factors are indexed by learned abstractions. The information lattice itself, however, is not a Bayesian network: its edges encode refinement and coarsening of abstractions, not conditional dependence. Thus ILL is best viewed as structure learning for interpretable constraint-based factor graphs over quotient variables. This view clarifies how ILL relates to graphical models and maximum entropy models, while suggesting new directions for inference, identifiability, and hybrid symbolic-probabilistic learning.
[AI-122] Computational Identifiability
链接: https://arxiv.org/abs/2606.19361
作者: Lucius E.J. Bynum,Rajesh Ranganath,Kyunghyun Cho
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA); Computation (stat.CO); Methodology (stat.ME); Machine Learning (stat.ML)
备注:
Abstract:Identification conditions describe the computability of a target query or parameter of interest as a function of the type and amount of information available. In causal identification, this information is often expressed in the form of a causal graph, and data are observed or collected for some subset of variables in the graph. Target queries may be for a single effect alone or for a class of effects in a given model. The derivation of an identification algorithm then defines mathematically the process by which the desired causal effect(s) can be uniquely determined, theoretically, in expectation. Identifiability in expectation, or ‘theoretical identifiability,’ generally assumes asymptotic properties, infinite data, or other mathematically idealized conditions. In this paper, we explore a fundamental distinction between this theoretical, idealized notion of identifiability and a proposed alternative that is computation-bound. The framework we propose - ‘computational identifiability’ - is to instead define a finite computational search procedure for an empirical estimator. If this process finds an estimator empirically, within a desired error tolerance, then identifiability is satisfied, conditional on the specified assumptions of the search (i.e., a prior distribution over the parameters) and conditional on the search procedure itself. Through several experiments, we demonstrate how this framework allows us to answer fine-grained, practical identification questions, such as identification with small finite samples, with ambiguous graphical criteria, with mixed observational-interventional data, and across counterfactual data and estimands. Code is available at this https URL.
[AI-123] Physical Atari: A Robust and Accessible Platform for Real-time Reinforcement Learning on Robots
链接: https://arxiv.org/abs/2606.19357
作者: Khurram Javed,Joseph Modayil,Gloria Kennickell,Richard S. Sutton,John Carmack
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: To appear at RLC 2026
Abstract:We built a robot called the Robotroller that actuates an Atari CX40+ controller and a device called the Atari Devbox that renders the game frame and the reward signal from the Arcade Learning Environment on a screen. The Robotroller and the Atari Devbox, together with an off-the-shelf camera and a desktop computer, constitute a system that can be used to study reinforcement learning algorithms in the physical world. We call the full system Physical Atari. In this paper, we detail the key decisions that make Physical Atari a robust and accessible platform. To make the system robust, we designed the Robotroller so that all movement is done through bearings, which reduces wear. Additionally, we wrote software that monitors the state of the servos at a high frequency and intervenes to limit stress. To make the system accessible, we used affordable off-the-shelf components and parts that can be manufactured using consumer 3D printers. Physical Atari can be built for under $1,000 and has been used for weeks of non-stop reinforcement learning experiments without any mechanical failures. We used it to validate that reinforcement learning algorithms can learn directly on robots and show that even small distribution shifts between learning and deployment can significantly degrade the performance of policies. Our results underscore the importance of on-device adaptation for strong performance on robots.
[AI-124] Optimal Order of Multi-Agent and General Many-Body Systems
链接: https://arxiv.org/abs/2606.20485
作者: Jake J. Xia
类目: Risk Management (q-fin.RM); Artificial Intelligence (cs.AI); Adaptation and Self-Organizing Systems (nlin.AO); Physics and Society (physics.soc-ph)
备注: Key Words: Many body systems, multi agent crowd interactions, feedback loops, agent power, response function, utility function, risk appetite, order, optimal order, fragility, mobility, synchronization, useful energy, entropy, concentration, correlation, task dependency, receiver dependency, collective intelligence, AI model scaling law
Abstract:This paper develops a general framework for analyzing multi-agent systems with feedback loops between agents' actions and collective observations. The framework is built on two fundamental agent-level variables: power, which measures agent influence on collective outcomes, and response functions, which determine how agents react to observations. We derive how macroscopic properties, including total power, useful power, entropy, order, fragility, and mobility, emerge from these two variables of heterogeneous agents. To study the trade-off between growth and resilience, we introduce a system-level utility function parameterized by a risk-appetite coefficient and derive an optimal degree of order that balances productivity, stability, and adaptability. The analysis suggests that stronger synchronization can increase collective output but may also increase systemic fragility and reduce mobility. We further argue that order, entropy, information, and useful energy are task-dependent and system-relative concepts whose meanings depend on the objectives of the system. By measuring and designing agent power distributions and response functions, it may be possible to better understand, predict, and optimize collective behavior and identify the conditions under which collective intelligence and optimal order emerge.
[AI-125] Repurposing a Speech Classifier for Guided Diffusion-Based Speech Generation INTERSPEECH2026
链接: https://arxiv.org/abs/2606.20457
作者: Rostislav Makarov,Timo Gerkmann
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted for publication in the Proceedings of Interspeech 2026
Abstract:Classifier guidance is a way to control diffusion generation by using a noise-conditioned classifier to steer the sampling process toward a target class. One drawback of classifier guidance is that it requires two separately trained models: a classifier and a diffusion model. We therefore study a more compact alternative in which a conventionally trained speech classifier is repurposed as the backbone for diffusion generation. Starting from a frozen noise-conditioned classifier in log-Mel space, we attach a lightweight subnetwork that reuses intermediate classifier representations and train only this subnetwork under a Denoising Score Matching objective. Our work shows that a pretrained classifier can be repurposed for conditional generation, providing an appealing bridge between discriminative modeling and conditional speech synthesis resulting in high speech quality within a single-backbone model, with reduced memory footprint and computational cost.
[AI-126] Robust Q-learning for mean-field control under Wasserstein uncertainty in common noise
链接: https://arxiv.org/abs/2606.20356
作者: Mathieu Laurière,Ariel Neufeld,Kyunghyun Park
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
备注:
Abstract:In this article, we present a robust Q-learning algorithm for discrete-time mean-field control problems under Wasserstein uncertainty in the common noise law. The algorithm combines a quantization-and-projection scheme with a Wasserstein dual reformulation on the common-noise space. We establish its convergence together with finite-time iteration bounds for both synchronous and asynchronous learning schemes. Numerical experiments on systemic risk and epidemic models compare the asynchronous implementation with an idealized Bellman iteration, illustrate the robustness-performance tradeoff under common-noise misspecification, and report the observed convergence behavior of the asynchronous Q-learning algorithm.
[AI-127] Evaluation of EEG Foundation Models for Event-Based Burst-Suppression Detection in ICU
Link: https://arxiv.org/abs/2606.20074
Authors: Elisa Vasta,Thorir Mar Ingolfsson,Andrea Cossettini,Luca Benini,Tilman Beck,Emanuela Keller,Una Pale
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 4 pages, 1 figure. Code available upon publication
Abstract:Burst suppression (BS) is a clinically relevant electroencephalographic (EEG) pattern used to monitor sedation depth and brain activity in critically ill patients, particularly during induced coma in Intensive Care Units (ICUs). Automatic burst detection remains challenging because BS patterns vary substantially between patients and annotated datasets are scarce. Recently, EEG Foundation Models (FMs) have shown promise across several downstream EEG applications, but their usefulness for BS detection remains unexplored. We present the first study to evaluate EEG FMs for burst detection in reduced-montage ICU EEG without patient-specific calibration. We compare REVE-base, LUNA-large and LuMamba-Tiny with an adaptive thresholding baseline and a task-specific EEGNet baseline. Additionally, we complement conventional EEG window-based classification with event-based burst detection evaluation. This helps to assess clinically whether burst episodes are correctly detected, reducing the impact of expected annotation variability. The best model, REVE-base, achieved the highest event-based F1-score ($0.868 \pm 0.167$) and reduced burst-per-minute error by 52.1% and 36.2% compared to EEGNet and adaptive thresholding, respectively, supporting FMs for scalable EEG monitoring in ICU. Ablation experiments showed that full fine-tuning was the most effective adaptation strategy compared with frozen-backbone training, two-step fine-tuning, and LoRA-based adaptation, improving event-based F1-score over frozen-backbone training by up to +0.102 for LUNA-large. With reduced labeled datasets, pretrained REVE-base outperformed random initialization by +0.723 event-based F1 points at 25% of the cohort, demonstrating the benefit of pretraining FM representations when adapted to burst detection with limited labeled data.
[AI-128] AI Economist Agent: An Agentic Framework for Model-Grounded Economic Analysis with RAG Knowledge Graphs and Large Language Models
Link: https://arxiv.org/abs/2606.20041
Authors: Masahiro Kato
Categories: General Economics (econ.GN); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); General Finance (q-fin.GN)
Comments:
Abstract:We propose a model-grounded RAG-based AI economist with an agentic framework for economic scenario analysis using large language models (LLMs) and knowledge graphs. While LLMs can generate fluent economic narratives, economists are often required to make economic claims grounded by economic theory and real-world data. Based on this motivation, this study proposes an RAG-based AI economist, which utilizes knowledge graphs including economic data and theory and LLM-based agents to plan the analysis, retrieve relevant evidence, select appropriate models, and generate reports. In our framework, we do not produce quantitative claims directly with the language model alone; instead, we generate narratives grounded in explicit model-based computations and linked to the retrieved evidence via AI agents. We refer to our framework as an AI economist agent. We evaluate the AI economist agent in two applications: economist report generation for U.S. inflation persistence and Federal Reserve policy, and bank stress-test narrative generation for U.S. commercial real estate refinancing stress. The results illustrate how grounding the generated reports improves their economic coherence and traceability.
[AI-129] SIMBA: A Bidirectional Retrieval-Forward Simulation Framework for Modeling FY-4A GIIRS Hyperspectral Infrared Radiances Toward NWP Applications
Link: https://arxiv.org/abs/2606.19943
Authors: Jingdong Shen,Fu Wang,Qifeng Lu,Hao Huang,Chunqiang Wu,Chi Yang,Xiaofang Liu
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Hyperspectral infrared observations are an important data source for numerical weather prediction (NWP) because they provide rich information on the vertical structure of atmospheric temperature and humidity. However, most existing deep learning methods mainly focus on one-way retrieval from radiances to atmospheric profiles, while the reverse radiance simulation process and the consistency between atmospheric state space and radiance observation space are insufficiently considered. In this study, we propose SIMBA, a unified bidirectional retrieval-forward simulation framework for FY-4A GIIRS hyperspectral infrared radiance modeling toward NWP applications. The framework jointly performs atmospheric profile retrieval and radiance reconstruction, introduces a cycle-consistency constraint to strengthen the coupling between the two processes, and employs a bidirectional Mamba state-space module to capture long-range dependencies along pressure levels. Using collocated FY-4A GIIRS observations and ERA5 reanalysis data, the proposed method is evaluated for temperature retrieval, specific humidity retrieval, long-wave radiance reconstruction, and medium-wave radiance reconstruction. Experimental results show that SIMBA outperforms several representative deep learning baselines across both retrieval and reconstruction tasks, while ablation experiments confirm the contribution of the bidirectional design and cycle-consistency mechanism. These results demonstrate that the proposed framework is effective for joint atmospheric profile retrieval and hyperspectral infrared radiance modeling, and suggest potential for future Jacobian-related analysis and NWP-oriented extensions.
[AI-130] Improving End-to-End Speech Recognition for Dysarthric Speech through In-Domain Data Augmentation
Link: https://arxiv.org/abs/2606.19797
Authors: Paban Sapkota,Hemant Kumar Kathania,Sudarsana Reddy Kadiri,Shrikanth Narayanan
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD); Signal Processing (eess.SP)
Comments:
Abstract:Dysarthric speech recognition is crucial for facilitating effective communication among individuals with dysarthria. However, accurately recognizing dysarthric speech poses significant challenges due to varying severity levels and limited data availability. In this paper, we explore data augmentation techniques for dysarthric automatic speech recognition (ASR) systems by fine-tuning the End-to-End pre-trained Wav2Vec2 model, with a specific focus on severity levels. To address the challenges of data scarcity and the need for extensive data in fine-tuning pre-trained ASR systems for dysarthric speech, we investigate four prominent data augmentation methods: Speaking-Rate Modification (SRM), Pitch Modification (PM), Formant Modification (FM), and Vocal Tract Length Perturbation (VTLP), tailored to different aspects of dysarthria. The study uses individually fine-tuned Wav2Vec2 models for each severity class as baseline systems. Additionally, we conducted severity-specific fine-tuning of the ASR model using augmented data. Results demonstrate distinct efficacy patterns for each augmentation technique across severity levels. The best WERs were achieved with SRM ($s = 0.8$) for low (9.02%) and medium (38.11%) severities, and with PM ($\tau = 0.8$) for high severity (55.15%), reflecting relative improvements of 30.02%, 16.64%, and 15.47%, respectively. These results confirm the effectiveness of the augmentation methods in improving dysarthric ASR performance.
[AI-131] Systematic Study of Dysarthric Speech Recognition: Spectral Features and Acoustic Models
Link: https://arxiv.org/abs/2606.19793
Authors: Paban Sapkota,Hemant Kumar Kathania,Mikko Kurimo,Sudarsana Reddy Kadiri,Shrikanth Narayanan
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Signal Processing (eess.SP)
Comments:
Abstract:The challenge associated with recognizing dysarthric speech primarily arises from pronounced acoustic variability attributed to impaired articulatory precision. Past research has demonstrated improved recognition through the use of hybrid DNN/HMM sequence discriminative training. This paper presents a comprehensive investigation of various combinations of acoustic features tailored to different Acoustic Models, offering suitable feature selections for each. The incorporation of Pitch features notably improved recognition performance, especially for sentence recognition tasks involving dysarthric speech. Through a systematic examination of the TORGO database, we have demonstrated the potential to enhance the performance of the state-of-the-art Factorized Time Delay Neural Network (F-TDNN) model for recognizing dysarthric speech. Our methods, implemented with the F-TDNN model, resulted in a 4.65% relative improvement in isolated word recognition and a 4.63% relative improvement in sentence recognition for dysarthric speech, compared to previous research. This improvement effectively compensates for speech variability, attributable to our deliberate selection of the number of overlapping frames between consecutive training example chunks.
[AI-132] Cross-Dataset Age and Gender Generalization: A Comprehensive Analysis of Fine-Tuning Strategies for Low-Resource Children's ASR
Link: https://arxiv.org/abs/2606.19791
Authors: Paban Sapkota,Hemant Kumar Kathania,Mikko Kurimo,Sudarsana Reddy Kadiri,Shrikanth Narayanan
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments:
Abstract:The challenge associated with recognizing dysarthric speech primarily arises from pronounced acoustic variability attributed to impaired articulatory precision. Past research has demonstrated improved recognition through the use of hybrid DNN/HMM sequence discriminative training. This paper presents a comprehensive investigation of various combinations of acoustic features tailored to different Acoustic Models, offering suitable feature selections for each. The incorporation of Pitch features notably improved recognition performance, especially for sentence recognition tasks involving dysarthric speech. Through a systematic examination of the TORGO database, we have demonstrated the potential to enhance the performance of the state-of-the-art Factorized Time Delay Neural Network (F-TDNN) model for recognizing dysarthric speech. Our methods, implemented with the F-TDNN model, resulted in a 4.65% relative improvement in isolated word recognition and a 4.63% relative improvement in sentence recognition for dysarthric speech, compared to previous research. This improvement effectively compensates for speech variability, attributable to our deliberate selection of the number of overlapping frames between consecutive training example chunks.
[AI-133] Towards Engineering Scaling Laws with Pretraining Data Composition
Link: https://arxiv.org/abs/2606.19781
Authors: Jan-Lucas Uslu,Kevin Greif,Daniel Whiteson,Benjamin Nachman
Categories: High Energy Physics - Experiment (hep-ex); Artificial Intelligence (cs.AI)
Comments:
Abstract:Neural scaling laws describe how model performance improves as a power law in compute, model size, and dataset size. While well-established for large language models, these relationships are emerging for large models in particle physics. As with language, empirical studies show that the performance scales as a power law. However, unlike natural language or image domains, fundamental physics has high-fidelity simulators that produce synthetic data cheaply. This favors scaling regimes where additional data is cheaper than additional parameters, and allows the pretraining dataset itself to be engineered to influence the scaling. For the task of classifying hadronic jets produced in collisions of high-energy particle beams, we show that the scaling behavior can be engineered towards requiring more data rather than larger models by inclusion of pretraining data which is more diverse and better aligned with the downstream classification task.
[AI-134] AURA: Adaptive Uncertainty-aware Refinement for LLM-as-a-Judge Auditing
Link: https://arxiv.org/abs/2606.19714
Authors: Zilong Zhang,Yi-Ting Hung,Weiyi He,Junxi Zhang,Lei Ding,Chi-Kuang Yeh
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Computation (stat.CO); Methodology (stat.ME)
Comments:
Abstract:Large language models (LLMs) are increasingly used as judges for open-ended generation, as large-scale human evaluation is often expensive and difficult to scale, yet their preferences remain imperfect proxies for human judgment. Existing auditing pipelines often assume that a reliable subset of examples or clean supervision signals are available beforehand, for example from human annotation, heuristic filtering, or the outputs of strong judges. In LLM evaluation, this assumption is fragile: the initial split may inherit judge bias, while human verification is typically too scarce to define stable groups at scale. We propose AURA, an adaptive uncertainty-aware refinement framework for auditing pairwise LLM-as-a-judge decisions under selected human verification. AURA iteratively learns a human-consistency signal, propagates reliable evidence, and prioritizes uncertain comparisons for human review. The key idea is to treat trust in a judge as a latent quantity that is progressively refined as evidence accumulates. We provide a compact formulation, a stable refinement procedure, and a comprehensive evaluation on both synthetic and real pairwise LLM-answer data.
[AI-135] Review of Machine Learning Models for Solar Energetic Particle Prediction
Link: https://arxiv.org/abs/2606.19539
Authors: Spiridon Kasapis,Pouya Hosseinzadeh,Kathryn Whitman,Ricky Egeland,Manolis Georgoulis,Angelos Vourlidas,Athanasios Papaioannou,Eleni Lavasa,Anastasios Anastasiadis,Giorgos Giannopoulos,Andres Munoz-Jaramillo,Bala Poduval,Irina N. Kitiashvili,Alexander G. Kosovichev,Viacheslav Sadykov,Soukaina Filali Boubrahimi,Tate T. Hutchins,Hameedullah A. Farooki,Manuel E. Cuesta,Leng Y. Khoo,Sungmin Pak,Robert Czarnota,Jamie S. Rankin,Jamey Szalay,Mitchell M. Shen,Georgios Livadiotis,Zigong Xu,David J. McComas,Nikolaos Sarlis,Dionissios Hristopulos,Arik Posner,Alec J. Engell,Mohammed AbuBakr Ali,Ali G. A. Abdelkawy,Abdelrazek M. K. Shaltout,M. M. Beheary,Christina O. Lee,Sigiava Aminalragia-Giamini,Constantinos Papadimitriou,Ingmar Sandberg,Savvas Raptis,Shah Muhammad Hamdi,Monica Laurenza,Mirko Stumpo,Sumanth A. Rotti,India Jackson,Aatiya Ali,Atilim Gunes Baydin,Nathan Schwadron,Subhamoy Chatterjee,Maher A. Dayeh,Gelu M. Nita,Patrick M. O’Keefe,Chun Jie Chong,Paul Kosovich,Russell D. Marroquin,Berkay Aydin,Petrus C. Martens,Lulu Zhao,Yang Chen,Yian Yu,Monica G. Bobra,Ward Manchester,Tamas Gombosi,Ming Zhang,Jesse Torres,Philip K. Chan,Mohamed Nedal,Kamen Kozarev,Peijin Zhang,Kimberly Moreland,Hazel M. Bain,Samuel Hart,Michael J. Starkey,Alan G. Ling,Simone Benella
Categories: Solar and Stellar Astrophysics (astro-ph.SR); Artificial Intelligence (cs.AI)
Comments: Review Paper, Main text: 23 pages, References: 5 pages, Appendix: 42 pages
Abstract:Solar energetic particle (SEP) events have attracted increasing attention due to their significant radiation hazards for aviation, spacecraft electronics, and human missions beyond Earth’s magnetosphere. From a scientific perspective, SEP events are intriguing because they arise from a set of physical processes extending from the solar surface and corona through the heliosphere, offering insight into particle acceleration and transport mechanisms that are widely applicable across astrophysics. Therefore, advancing our ability to understand and predict SEP events is essential both for deepening our knowledge of such mechanisms and for safeguarding space technologies and exploration. Traditionally, researchers have modeled SEPs using physics-based simulations and empirical methods. More recently, machine learning (ML) has emerged as a new tool for understanding and predicting SEP events. The purpose of this manuscript is to review the currently available ML models for SEP prediction, identify the datasets used for training, compare their architectures, inputs, and outputs, and, based on these insights, outline good practices and recommendations for future research.
Machine Learning
[LG-0] Optimal Deterministic Multicalibration and Omniprediction
Link: https://arxiv.org/abs/2606.20557
Authors: Georgy Noarov,Aaron Roth
Categories: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments:
Abstract:A model is multicalibrated on a collection of group weights $G$ if it is calibrated (i.e., unbiased even conditional on its prediction) not just overall, but also after reweighting contexts by each $g \in G$. It is a useful property for many downstream applications and is a basic desideratum of trustworthy machine learning. Before this work, all predictors known to attain the minimax-optimal $\widetilde{O}(\varepsilon^{-3})$ sample complexity rate for $\varepsilon$-multicalibration were randomized, while deterministic predictors were known only with substantially worse sample complexity. Whether randomization is necessary for optimal sample complexity in multicalibration was explicitly asked by [CLNR26] and implicitly in several prior works. We resolve this open problem by giving a minimax-optimal multicalibration algorithm that outputs a deterministic predictor. We then generalize the algorithm to produce optimal deterministic predictors that satisfy outcome indistinguishability (OI) with respect to finite or finitely covered collections of tests. As an application, this also gives deterministic omnipredictors and panpredictors with optimal sample complexity, resolving open problems posed by [OKK25] and [BHHLZ25].
[LG-1] Predictability as a Fine-Grained Measure for Privacy
Link: https://arxiv.org/abs/2606.20546
Authors: Linda Lu,Karthik Sridharan
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Differential privacy (DP) ensures rigorous individual-level privacy guarantees against even the most knowledgeable attackers, but its worst-case nature can impose a costly privacy-accuracy tradeoff. We introduce privacy via predictability, a fine-grained framework that explicitly incorporates the attacker’s core knowledge, a compromised portion of the dataset generated by a stochastic process, and a specified family of queries. Predictability measures privacy leakage as the incremental gain in an attacker’s ability to predict sensitive information about unknown individuals after observing the algorithm’s output, beyond what can already be inferred from the compromised data. We show that predictability and DP are generally incomparable: each can be small while the other is large. However, in the worst-case regime where all but one individual is compromised, and all binary queries are considered sensitive, predictability implies mutual-information DP. More generally, predictability provides a finer-grained privacy metric tailored to specific sensitive information and specific attacker models. We introduce a general framework, using the generalized method of moments (GMM), to analyze asymptotic predictability when the compromised data is generated by a stationary, ergodic, mixing process. Using this analysis, we derive a predictability-calibrated output perturbation scheme for ERM. Our approach is complementary to DP and can be used alongside DP to provide fine-grained privacy control.
[LG-2] Multi-Task Bayesian In-Context Learning ICML2026
Link: https://arxiv.org/abs/2606.20538
Authors: Qingyang Zhu,Eric Karl Oermann,Kyunghyun Cho
Categories: Machine Learning (cs.LG)
Comments: ICML 2026
Abstract:Bayesian predictive inference provides a principled framework for uncertainty quantification, data efficiency, and robust generalization. However, exact inference is often intractable, and scalable approximations may remain computationally expensive or require restrictive modeling assumptions that degrade predictive performance. Prior-Data Fitted and in-context models have recently emerged as an amortized alternative by learning to map datasets directly to predictive distributions, but existing approaches are tightly coupled to the support of the training prior and lack explicit mechanisms for adapting to new priors at test time, resulting in limited robustness under distribution shift. We introduce a multi-task in-context learning framework for amortized hierarchical Bayesian predictive inference that explicitly represents prior information as a prefix of in-context datasets. A transformer trained on sequences of prior and target tasks learns to adapt its predictions across families of priors. On a suite of evaluations with increasing difficulty, including out-of-meta-distribution priors and priors with high-dimensional latent structures, our method matches oracle Bayesian predictors while being orders of magnitude faster. We further demonstrate its practical relevance on a real-world spatiotemporal temperature prediction benchmark. Code is available at this https URL.
[LG-3] Execution-State Capsules: Graph-Bound Execution-State Checkpoint and Restore for Low-Latency Small-Batch On-Device Physical-AI Serving
Link: https://arxiv.org/abs/2606.20537
Authors: Liang Su
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 27 pages, 9 figures
Abstract:Mainstream LLM serving systems reuse prefix work mainly through paged or radix key-value (KV) caches. This is highly effective for high-throughput, high-concurrency serving, but it manages only one positional fragment of execution state: the KV cache. We study the opposite regime: low-latency, small-batch, on-device physical-AI serving, where interactive LLM agents, speech systems, and robot policies repeatedly branch, reset, interrupt, and re-enter under tight responsiveness budgets. We introduce execution-state capsules, a graph-bound checkpoint and restore mechanism for the complete restorable state at a committed boundary. FlashRT is a white-box, backend-facing kernel runtime whose evaluated NVIDIA CUDA backend runs captured graph plans over contiguous static buffers with no block-table indirection. Because the live state is a closed set of named buffers, a capsule can snapshot, restore, fork, or roll back the whole execution boundary, including KV, recurrent state, convolution state, MTP state, and metadata. This moves reuse from token-addressed KV fragments to graph-bound execution-state boundaries. On an RTX 5090, capsule restore is byte-exact at the stored-state level and token-identical under greedy decode. A KV-only ablation diverges, showing that recurrent state is load-bearing. GPU-resident snapshot and restore are sub-millisecond, and TTFT speedup over cold prefill grows from 3.9x at 2k tokens to 27x at 16k tokens. On Jetson AGX Thor and DGX Spark, the same correctness and structural properties hold. Capsules are not a replacement for high-throughput KV-cache serving; they define a complementary latency-first serving point for explicit execution-state reuse.
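The snapshot, roll-back, and restore idea over a closed set of named buffers can be illustrated with a toy sketch. This is not FlashRT or its API: the `ToyRuntime` class, its buffer names, and the update rule in `step` are hypothetical stand-ins chosen only to show why restoring every buffer (not just the KV fragment) returns the runtime to the committed boundary.

```python
class ToyRuntime:
    """Hypothetical runtime whose entire live state is a fixed set of named buffers."""

    def __init__(self):
        # Assumed buffer names, for illustration only.
        self.buffers = {
            "kv": bytearray(8),
            "recurrent": bytearray(4),
            "meta": bytearray(2),
        }

    def step(self, token):
        # Toy update that touches KV, recurrent state, and metadata together.
        pos = self.buffers["meta"][0]
        self.buffers["kv"][pos % 8] = token % 256
        self.buffers["recurrent"][0] = (self.buffers["recurrent"][0] * 31 + token) % 256
        self.buffers["meta"][0] = (pos + 1) % 256

    def snapshot(self):
        # A capsule is an immutable copy of every named buffer.
        return {name: bytes(buf) for name, buf in self.buffers.items()}

    def restore(self, capsule):
        # Copy back in place so buffer identities (addresses) stay fixed.
        for name, data in capsule.items():
            self.buffers[name][:] = data


rt = ToyRuntime()
for t in (3, 5, 7):
    rt.step(t)
capsule = rt.snapshot()   # committed boundary
rt.step(11)               # speculative branch
rt.restore(capsule)       # roll back
restored = rt.snapshot()
```

In this toy, restoring only the `kv` entry would leave `recurrent` and `meta` at their post-branch values, which mirrors the abstract's point that a KV-only restore is not sufficient.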
[LG-4] Probe-and-Refine Tuning of Repository Guidance for Coding Agents
Link: https://arxiv.org/abs/2606.20512
Authors: Asa Shepard,Jeannie Albrecht
Categories: Software Engineering (cs.SE); Machine Learning (cs.LG)
Comments:
Abstract:LLM-based coding agents need higher-level operational knowledge about a repository (which files house which subsystems, how to run the test suite, which workflows have historically led to wrong fixes) that does not exist in the code itself. Engineers typically maintain this http URL files to supply this context as instructions for coding agents, but whether they help is contested: recent studies disagree on whether LLM-generated guidance improves or harms agent performance. In this paper we show that how the guidance is produced is the decisive variable, and introduce probe-and-refine tuning: a procedure that uses synthetic bug-fix probes to iteratively diagnose and patch a repository’s guidance file through single-shot LLM calls, with no agent loop or tool use during tuning. On SWE-bench Verified across four independent trials with Qwen3.5-35B-A3B at 200 steps, probe-and-refine achieves 33.0% mean resolve rate vs. 28.3% for the static knowledge base used to initialize it and 25.5% for an unguided baseline ($p < 0.001$ for both probe-and-refine contrasts). The improvement comes from coverage rather than precision: refined guidance produces evaluable patches for 14.5 percentage points (pp) more instances while per-patch precision remains statistically constant ($\sim$59%, $p = 0.119$), showing that improved guidance helps agents reach the correct file rather than improving the quality of the changes they make. Further, a step-budget experiment shows that guidance is what lets the agent use a larger step budget productively, and a cross-model experiment with NVIDIA-Nemotron-3-Nano-30B-A3B finds that the tuning loop degrades when the model cannot generate sufficiently diagnostic output, though per-patch precision remains constant even then.
[LG-5] Marginal Advantage Accumulation for Memory-Driven Agent Self-Evolution
Link: https://arxiv.org/abs/2606.20475
Authors: Mingyu Yang,Keye Zheng,Congchao Cheng,Yujie Liu,Xingkang Lu,Fan Jiang,Yefei Zheng
Categories: Machine Learning (cs.LG)
Comments: 26 pages, 4 figures, 10 tables, 42 references
Abstract:In batch-style trace distillation, the same memory operation may receive contradictory feedback across different batches. Existing methods lack a cross-batch, operation-level evidence accumulation mechanism, making it impossible to distinguish stably effective operations from accidental hits. This paper formalizes the requirement as two structural conditions, alignability and comparability, and proposes Marginal Advantage Accumulation (MAA). MAA constructs differential signals to make them comparable across batches, accumulates signed evidence per operation via EMA, and ensures cross-batch traceability through semantic identity merging. As a post-processing architecture, MAA achieves the best results in 14 out of 16 settings across 4 benchmarks and 4 target models, consistently outperforming existing batch-level distillation baselines and matching or surpassing online alternatives in most settings, while reducing optimization-phase token consumption by approximately 75%.
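As a rough illustration of accumulating signed, per-operation evidence with an exponential moving average (EMA), consider the sketch below. It is not the paper's implementation: the `EvidenceAccumulator` class, the `decay` value, and the choice of "score with the operation minus score without it" as the differential signal are assumptions made for this example, and the string key stands in for whatever identity the semantic merging step would produce.

```python
from collections import defaultdict


class EvidenceAccumulator:
    """Toy per-operation EMA over signed differential signals (illustrative only)."""

    def __init__(self, decay=0.5):
        self.decay = decay                # assumed smoothing factor
        self.score = defaultdict(float)   # op_id -> accumulated signed evidence

    def update(self, op_id, score_with_op, score_without_op):
        # The differential signal is comparable across batches because each
        # batch contributes a difference rather than a raw score.
        signal = score_with_op - score_without_op
        self.score[op_id] = self.decay * self.score[op_id] + (1.0 - self.decay) * signal
        return self.score[op_id]


acc = EvidenceAccumulator(decay=0.5)
# Three batches give contradictory feedback on the same (hypothetical) operation.
for with_op, without_op in [(0.7, 0.5), (0.4, 0.5), (0.8, 0.5)]:
    acc.update("op_a", with_op, without_op)
final_score = acc.score["op_a"]
```

A single negative batch pulls the running score toward zero without erasing it, so an operation that helps in most batches ends with a positive score, while an accidental one-off hit decays away.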
[LG-6] Fisher-Geometric Sharpness and the Implicit Bias of SGD toward Flat Minima
Link: https://arxiv.org/abs/2606.20469
Authors: Md Sakir Ahmed,Kumaresh Sarmah,Hemen Dutta
Categories: Machine Learning (cs.LG); Computational Geometry (cs.CG)
Comments: 18 pages, 5 figures, preprint
Abstract:A widely held intuition in deep learning is that stochastic gradient descent (SGD) implicitly favors flat minima and that flat minima generalize better, but standard Euclidean measures of flatness such as the trace or maximum eigenvalue of the loss Hessian are not invariant under reparametrizations that preserve the network function, which undermines the theoretical foundations of this narrative. In this study we resolve this issue by grounding flatness in the Riemannian geometry of the statistical manifold induced by the Fisher Information Matrix (FIM). We define Riemannian sharpness mathematically and prove that it is invariant under smooth, function-preserving reparametrizations, which directly addresses the critique of Dinh et al. in the paper "Sharp minima can generalize for deep nets". We note that this invariance is a property of the true FIM; the diagonal empirical estimator used in practice (and in all experiments below) inherits invariance only approximately, and exact invariance under arbitrary reparametrizations would require structured estimators such as K-FAC. We formalize the gradient noise of mini-batch SGD as having a covariance structure proportional to the FIM, derive the stationary distribution of the resulting stochastic differential equation, and then show that the probability mass is exponentially concentrated at Riemannian-flat minima. A PAC-Bayes generalization bound controlled explicitly by SR formally links this geometric bias to test performance. Our experiments on MNIST and CIFAR-10 confirm that SR reliably tracks generalization in ways that Euclidean sharpness does not, and that its scaling with $\eta/B$ matches the theoretical predictions. Together these results provide a rigorous, reparametrization-invariant account of why flat minima generalize.
[LG-7] Agentic Symbolic Search: Characterizing PDEs Beyond Hand-crafted Expressions, Meshes, and Neural Networks
Link: https://arxiv.org/abs/2606.20467
Authors: Zongmin Yu,Liu Yang
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA); Computational Physics (physics.comp-ph)
Comments:
Abstract:Mathematicians understand a PDE solution through mathematical structures rather than tables of computed values. Historically, this has been the product of mathematical analysis, carried out by hand for each problem individually. Neither numerical simulation nor neural networks produce those structures directly. We propose Agentic Symbolic Search (ASYS), a prior-guided framework in which an agent translates PDE theory, public problem constraints, and accumulated search experience into testable differentiable symbolic programs. The mathematical forms are refined under evolutionary search, while their continuous parameters are fit by gradient-based optimization. This makes the search an automated form of inductive-bias injection rather than blind symbolic regression. For problems with known analytical forms, ASYS recovers these forms naturally; for other problems, ASYS constructs analytical approximations which can guide mathematicians toward further analysis. In our experiments, across five problems spanning bounded dynamics, finite-time blow-up, and free-boundary focusing, ASYS produces interpretable representations, including a geometric interface formula for Allen-Cahn 2D dynamics and a nine-parameter contraction law for Keller-Segel chemotactic blow-up, in settings where no closed-form description was previously available. ASYS shows the possibility of a new paradigm for characterizing PDE solutions, beyond handcrafted analytical solutions, mesh-based numerical solutions, and neural network approximations.
[LG-8] Data Bias Mitigation under Coverage Constraints: The Price of Fairness
Link: https://arxiv.org/abs/2606.20461
Authors: Bruno Scarone,Alfredo Viola,Renée J. Miller
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY); Databases (cs.DB)
Comments: Accepted to FAccT 2026
Abstract:Machine learning models have been shown to exhibit discriminatory outcomes or degraded performance for individuals at the intersection of multiple sensitive attributes, such as race and gender. This stems in part from two interrelated challenges: the lack of principled measures for quantifying bias (potentially intersectional), and insufficient representation of intersectional subgroups in training data. We extend a recent bias mitigation framework to incorporate coverage constraints that enforce sufficient representation across groups, including intersectional subgroups. Since achieving exactly zero bias for all groups may not be data efficient (meaning it may require large amounts of data), our solution trades small approximation errors in bias for greater data efficiency while satisfying coverage constraints. We also formulate bias mitigation as an integer linear program that optimizes over all mitigation strategies, and characterize the price of fairness, the minimum data modification cost, as a function of fairness tolerance. This is essential both for legal compliance, where regulations may mandate specific fairness thresholds, and for data governance, enabling practitioners to make informed trade-offs between bias reduction and data modification (particularly, data purchasing) costs. We evaluate our techniques on publicly available datasets, demonstrating that bias mitigation via our framework preserves predictive accuracy across multiple classifiers, and that coverage constraints, while motivated by statistical considerations, are essential for preserving downstream ML performance.
[LG-9] Topological Data Analysis for High-Dimensional Dynamic Process Monitoring
Link: https://arxiv.org/abs/2606.20443
Authors: Angan Mukherjee, Tyler A. Soderstrom, Michael J. Kurtz, Victor M. Zavala
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG); Algebraic Topology (math.AT)
Abstract:Real-time process monitoring requires methods that extract actionable information from high-dimensional time-series data. In this work, we present a new approach for process monitoring that combines tools of topological data analysis (TDA) and machine learning. In the proposed approach, we represent multivariate time-series data as manifolds and use topological descriptors to summarize the structure of such data; we then use a neural ordinary differential equation to learn the dynamic evolution of the topological structure of the system. Using real data from an industrial process, we show that this trajectory-based event detection approach is effective at detecting diverse types of events. We contrast this approach against reconstruction-based approaches such as principal component analysis and autoencoders and against a trajectory-based approach that uses Koopman autoencoders.
[LG-10] Evolutionary Two-Stage Hyperparameter Optimization Strategies for Physics-Informed Neural Networks ICLR2026
Link: https://arxiv.org/abs/2606.20442
Authors: Fedor Buzaev (1), Dmitry Efremenko (1), Egor Bugaev (1), Andrei Ermakov (1 and 2), Denis Derkach (1), Daria Pugacheva (1 and 2), Fedor Ratnikov (1) ((1) HSE University, (2) AXXX)
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Numerical Analysis (math.NA)
Comments: Equal advising: Daria Pugacheva and Fedor Ratnikov. Accepted to the ICLR 2026 Workshop on AI and PDEs
Abstract:Physics-Informed Neural Networks (PINNs) solve Partial Differential Equations (PDEs) by embedding physical laws into neural network training. However, their performance suffers from unstable convergence, training plateaus, and strong sensitivity to architectural and optimization hyperparameters due to the highly non-convex and multi-term structure of the physics-informed loss. In this setting, the outer-loop hyperparameter search is a noisy and black-box optimization problem over heterogeneous parameters, where classical local or gradient-based strategies are easily trapped in suboptimal regions. Evolutionary algorithms, with their population-based exploration and ability to handle mixed, non-differentiable search spaces, provide a more robust mechanism for discovering promising configurations. We propose and investigate a two-stage approach based on evolutionary algorithms that combines exploration and exploitation parts of PINNs training to improve solution accuracy and robustness under fixed computational budgets. In the first stage, we perform low-fidelity training runs with truncated epochs to rapidly screen candidate configurations, treating hyperparameter selection as a black-box outer-loop problem. In the second stage, only the most promising candidates are fully trained with standard gradient-based optimizers to refine the solution. Evaluated on three popular problems, namely Advection, Klein-Gordon and Helmholtz equations, our method consistently outperforms standard training and achieves significantly lower mean error within constrained computational resources.
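The two-stage idea in this abstract (cheap truncated runs to screen configurations, then full training for the survivors) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: `train_proxy` is a hypothetical stand-in for a PINN training run, and the search space, mutation operator, and truncation selection are not taken from the paper.

```python
import math
import random


def train_proxy(config, epochs, rng):
    """Hypothetical stand-in for a PINN training run; returns a final error.

    The toy optimum sits at log10(lr) = -3 and width = 64. Fewer epochs give
    a higher and noisier error, mimicking a low-fidelity run.
    """
    log_lr, width = config
    quality = (log_lr + 3.0) ** 2 + ((width - 64) / 32.0) ** 2
    noise = abs(rng.gauss(0.0, 1.0)) / math.sqrt(epochs)
    return quality + 1.0 / epochs + 0.1 * noise


def mutate(config, rng):
    """Perturb a configuration while keeping it inside the search space."""
    log_lr, width = config
    new_log_lr = min(-1.0, max(-5.0, log_lr + rng.gauss(0.0, 0.3)))
    new_width = min(256, max(8, width + rng.choice([-16, -8, 8, 16])))
    return (new_log_lr, new_width)


def two_stage_search(pop_size=12, generations=8, low_epochs=5,
                     full_epochs=200, top_k=3, seed=0):
    rng = random.Random(seed)
    population = [
        (rng.uniform(-5.0, -1.0), rng.choice([8, 16, 32, 64, 128, 256]))
        for _ in range(pop_size)
    ]
    # Stage 1: evolutionary screening with truncated (low-fidelity) runs.
    for _ in range(generations):
        ranked = sorted(population, key=lambda c: train_proxy(c, low_epochs, rng))
        parents = ranked[: pop_size // 2]
        children = [mutate(rng.choice(parents), rng)
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    # Stage 2: fully train only the most promising candidates.
    finalists = sorted(population,
                       key=lambda c: train_proxy(c, low_epochs, rng))[:top_k]
    results = [(train_proxy(c, full_epochs, rng), c) for c in finalists]
    return min(results)


best_error, best_config = two_stage_search()
print(best_error, best_config)
```

The budget split is the main design choice: most evaluations are spent at `low_epochs`, and only `top_k` configurations ever see `full_epochs`.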
[LG-11] Sparsity, Superposition, and Forgetting: A Mechanistic Study of Representation Retention in Continual Learning
Link: https://arxiv.org/abs/2606.20431
Authors: Jan Wasilewski, Jędrzej Kozal, Michał Woźniak, Bartosz Krawczyk
Categories: Machine Learning (cs.LG)
Abstract:Continual learning (CL) systems often forget previously acquired knowledge, yet the mechanisms driving forgetting remain hard to isolate in practice because real datasets entangle many factors. We present a controlled, toy-world framework that makes these mechanisms observable and testable. Using a synthetic generator-separator pipeline, we define ground-truth latent features, build tasks with tunable sparsity and overlap, and introduce measurable quantities for representation strength and superposition (directional overlap among features). We then study retention dynamics (the temporal change of representation strength) by fitting sparse dynamical relations (via SINDy) between retention, superposition, and exposure history. A complementary task-level analysis based on effective rank characterizes how representational capacity is allocated across tasks. Our controlled experiments yield three takeaways. (1) Superposition tends to increase over time with transient dips at task boundaries, suggesting boundary-specific interference rather than steady drift. (2) Higher feature sparsity induces more superposition yet does not inevitably cause forgetting; when representations remain strong, forgetting can be reduced despite overlap. (3) Task-level effective rank grows with sparsity, indicating broader capacity usage under sparse regimes. Together, these results nuance the common intuition that more superposition leads to more forgetting by showing that overlap interacts with representation strength and capacity allocation. Our toy analysis provides falsifiable hypotheses and diagnostic tools for CL.
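The abstract does not say which definition of effective rank is used. A common choice, assumed here, is the exponential of the Shannon entropy of the normalized singular values. The sketch below shows that definition only; the function name and the `eps` cutoff are illustrative.

```python
import numpy as np


def effective_rank(matrix, eps=1e-12):
    """Entropy-based effective rank (an assumed definition, not confirmed by the paper)."""
    singular_values = np.linalg.svd(matrix, compute_uv=False)
    p = singular_values / (singular_values.sum() + eps)
    p = p[p > eps]  # drop numerically zero directions before taking logs
    return float(np.exp(-(p * np.log(p)).sum()))


print(effective_rank(np.eye(4)))                        # all directions used equally
print(effective_rank(np.outer([1.0, 2.0], [3.0, 4.0])))  # a single direction
```

Under this definition the value ranges from 1 (one dominant direction) up to the matrix rank (all directions equally used), which is why it can serve as a measure of how broadly capacity is allocated.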
[LG-12] Neural network surrogates with uncertainty quantification for inverse problems in partial differential equations
Link: https://arxiv.org/abs/2606.20417
Authors: Christian Jimenez-Beltran, Aretha L. Teckentrup, Antonio Vergari, Konstantinos C. Zygalakis
Categories: Machine Learning (cs.LG)
Abstract:Inverse problems for differential equations arise throughout science and engineering, where one seeks to infer unknown model parameters from noisy or incomplete observations. Traditional numerical methods for these problems are often computationally expensive, particularly in Bayesian settings where evaluating the likelihood becomes costly for complex forward models and high-dimensional parameter spaces. To address this challenge, we introduce DeepGaLA, a neural-network surrogate for differential equation solvers that provides uncertainty-aware predictions, reducing overconfident inference when training data are limited. To evaluate the fidelity of the surrogate-induced posterior approximations in practice, we show that a short run of delayed-acceptance Markov chain Monte Carlo can serve as an effective diagnostic. Across a range of numerical experiments, DeepGaLA delivers forward-model approximations with accuracy comparable to established Gaussian-process surrogates, while better maintaining efficiency as parameter dimension grows. Moreover, it can incorporate differential-equation constraints, including in nonlinear settings. Overall, these results indicate that uncertainty-quantified neural surrogates can enable scalable and reliable Bayesian inference for inverse problems in complex systems.
[LG-13] Pseudo-Feature Padding: A Lightweight Defense Against False Data Injection in Power Grids
Link: https://arxiv.org/abs/2606.20415
Authors: Farhin Farhad Riya, Shahinul Hoque, Yingyuan Yang, Jinyuan Sun, Kevin Tomsovic
Categories: Machine Learning (cs.LG)
Abstract:Deep Neural Networks (DNNs) have achieved remarkable accuracy in various tasks, including their application in Cyber-Physical Systems (CPS) for detecting False Data Injection Attacks (FDIA) during critical operations. However, the unique infrastructure of CPS makes DNNs vulnerable to exploitation by attackers aiming to evade detection. Additionally, the distinct nature of CPS presents challenges for conventional defense mechanisms against FDIA. This paper proposes an innovative defense framework that strengthens DNNs against such attacks by introducing an additional input layer that performs padding in the input samples using pseudo-feature values derived from the input's statistical distribution. This padding increases the input dimensionality in a randomized and data-aware manner, making adversarial attacks computationally infeasible due to the non-transferable nature of crafted perturbations and the unpredictability of the padded structure. Our method is lightweight, model-agnostic, and requires no modifications to the core architecture, making it highly deployable in real-world CPS settings. We evaluated our framework on critical power grid applications such as state estimation using the IEEE 14-bus, 30-bus, 118-bus, and 300-bus systems. Experiments under adversarial settings demonstrate that our padding strategy significantly improves model robustness with negligible impact on performance and effectively mitigates attacks that would otherwise bypass conventional defenses.
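As a rough illustration of the padding step described above, the sketch below inserts pseudo-feature columns at random positions and fills them with values drawn from each sample's own mean and standard deviation. The abstract does not specify how positions or values are chosen, so the Gaussian sampling, the per-sample statistics, and the function name `pad_with_pseudo_features` are all assumptions.

```python
import numpy as np


def pad_with_pseudo_features(x, n_pad, rng):
    """Pad a batch x of shape (batch, d) to shape (batch, d + n_pad).

    Assumed scheme: padded column positions are drawn at random, and their
    values are sampled from a Gaussian fitted to each input row.
    Returns the padded batch and a boolean mask marking the padded columns.
    """
    batch, d = x.shape
    total = d + n_pad
    is_pad = np.zeros(total, dtype=bool)
    is_pad[rng.choice(total, size=n_pad, replace=False)] = True

    mean = x.mean(axis=1, keepdims=True)
    std = x.std(axis=1, keepdims=True)
    pseudo = mean + std * rng.standard_normal((batch, n_pad))

    padded = np.empty((batch, total), dtype=float)
    padded[:, ~is_pad] = x       # original features keep their relative order
    padded[:, is_pad] = pseudo   # data-aware pseudo-features fill the gaps
    return padded, is_pad


measurements = np.arange(12, dtype=float).reshape(3, 4)
padded, mask = pad_with_pseudo_features(measurements, n_pad=2,
                                        rng=np.random.default_rng(0))
print(padded.shape, mask)
```

In a deployment along these lines, the generator seed would presumably be kept secret so that an attacker cannot predict which input positions carry real measurements.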
[LG-14] Direct Advantage Estimation for Scalable and Sample-efficient Deep Reinforcement Learning
Link: https://arxiv.org/abs/2606.20411
Authors: Hsiao-Ru Pan, Bernhard Schölkopf
Categories: Machine Learning (cs.LG)
Comments: Accepted at RLC2026
Abstract:Direct Advantage Estimation (DAE) has been shown to improve the sample efficiency of deep reinforcement learning algorithms. However, its reliance on full environment observability limits its applicability in realistic settings, and its requirement to model transition probabilities incurs substantial computational overhead for high-dimensional observations. In the present work, we address both limitations. First, we extend the theoretical framework of DAE to partially observable domains with minimal modifications. Second, we reduce its computational complexity by introducing discrete latent dynamics models that efficiently approximate transition probabilities. We evaluate our approach on the Arcade Learning Environment and find that DAE scales effectively with function approximator capacity while retaining high sample efficiency.
[LG-15] The Significance of Style Diversity in Annotation-Free Synthetic Data Generation
Link: https://arxiv.org/abs/2606.20400
Authors: Zahra Abbasiantaeb, Zeno Belligoli, Omar Essam, Mohammad Aliannejadi
Categories: Machine Learning (cs.LG)
Abstract:Generating high-utility synthetic data for intent classification typically requires human-annotated seed data, which is often unavailable in fast-paced industrial settings. In this paper, we propose a framework for synthetic dialogue generation that works entirely without human-annotated data, relying solely on intent definitions. Our proposed dialogue generation framework utilizes two different types of topic and style attributes to improve data diversity. Also, we propose two novel post-hoc stylization models called Univ and Exam to transform synthetic LLM-generated utterances into more varied, human-like linguistic styles. To enhance data quality, we utilize an LLM-as-a-judge filtering process. Experimental results on both industrial and public datasets demonstrate that the proposed approach achieves up to 93.3% of the performance obtained using human-annotated training data. Crucially, the findings reveal that style diversity is more critical than topic diversity for synthetic data utility, as it prevents models from learning spurious stylistic correlations. Furthermore, the study shows that incorporating style attributes during the generation process is more effective than post-hoc style adaptation.
[LG-16] Towards Modality-imbalanced Federated Graph Learning: A Data Synthesis-based Approach
Link: https://arxiv.org/abs/2606.20382
Authors: Zhengyu Wu, Hongchao Qin, Xunkai Li, Zekai Chen, Rong-Hua Li, Guoren Wang
Categories: Machine Learning (cs.LG)
Abstract:MultiModal Federated Graph Learning (MM-FGL) offers a natural collaborative training paradigm, but its practical deployment is challenged by two granularities of modality imbalance. Client-level imbalance occurs when certain clients lack entire modalities, while node-level imbalance occurs when individual nodes exhibit missing visual or textual attributes. While several relevant studies exist, our investigation reveals that they predominantly target graph-agnostic or centralized scenarios, rendering them difficult to adapt directly. To address these challenges, we formalize modality-imbalanced MM-FGL as an implicit graph-aware latent semantic representation synthesis problem. This paradigm recovers missing modal semantics directly within the representation space, thereby maximizing alignment with the original data’s semantic distribution and mitigating the high variance induced by missing modalities. To this end, we propose FedMGS (Federated Modality-aware Graph Synthesis), which integrates three core components. The availability-aware graph encoder prevents missing modalities from contaminating local structural propagation. The prototype-guided latent semantic synthesizer establishes cross-client semantic anchors for unavailable modalities. The reliability-calibrated semantic fusion mechanism regulates the impact of recovered latent representations prior to predictive readout. Extensive experiments on four tasks show that FedMGS consistently outperforms competitive baselines, with gains of up to 17.41% and the best efficiency-performance tradeoff.
[LG-17] Judging to Improve: A De-biased VLM-as-3D-Judge Protocol for Single-Image 3D Generation
Link: https://arxiv.org/abs/2606.20364
Authors: Ali Asaria, Tony Salomone, Deep Gandhi
Categories: Machine Learning (cs.LG)
Abstract:A companion study established a de-biased, cross-model VLM-as-3D-judge that reliably ranks single-image-to-3D mesh quality where cheap geometry and CLIP proxies fall short. This paper asks: can that judge’s preferences specialize a strong open generator, TRELLIS, on one asset class (furniture), cheaply and without human labels? Taking the judge from ranking to optimization is where the work lives. Pushing a VLM judge into the training and evaluation loop exposes failure modes ranking never triggered, so our contribution is an optimization-grade hardening of the judge: a training judge (Qwen2.5-VL-7B) held distinct from an evaluation judge (InternVL3-8B) to break circularity; position-bias correction; and fixes for three failure modes (image overload, geometry-hiding splat renders, and reference-free judging that rewards clean-but-wrong outputs), with calibration evidence (clear-gap win-rate 0.83-1.0; base-vs-base ~0.5). Using this protocol as an independent evaluator, and working only from public models and data with lightweight parameter-efficient adaptation, we find our methods match the strong base rather than exceed it. Independent base samples carry essentially no learnable preference (0.94 order-flip rate), so signal must be engineered by quality-contrastive construction. Across six adaptation methods, two input regimes, and a severity sweep, the most targeted - conditioner repair under severe degradation - reaches parity (0.50) with the base, while no method clears the ≥65% win-rate target. The result is mechanistic: clean inputs saturate the judge, flow-DIT fine-tuning washes out through the sampler, and conditioning repair is the locus that moves geometry. Win-rates are directional at n=8 objects. Matching a strong public-data base with cheap adaptation is itself informative: exceeding it needs more than lightweight PEFT on public data, and the judge protocol is reusable.
[LG-18] Train, Retrieve, or Both? A Four-Arm Head-to-Head for Correct Statutory Citation on the Ontario Residential Tenancies Act
Link: https://arxiv.org/abs/2606.20359
Authors: Ali Asaria, Tony Salomone, Deep Gandhi
Categories: Machine Learning (cs.LG)
Abstract:Self-represented tenants, landlords, and help-desk staff need to be pointed at the provision of law that actually governs a question, with a correct statutory citation. We study this task on the Ontario Residential Tenancies Act, 2006 (RTA) and its core regulation, asking the operator’s question empirically: is fine-tuning enough, or is hybrid retrieval needed? We run a four-arm head-to-head on Qwen2.5-7B-Instruct (base zero-shot, LoRA SFT-only, RAG-only, and an SFT+RAG hybrid), scored on citation exact-match (section+subsection) over a small, human-verification-pending real eval set. The base model cannot cite the RTA and SFT-only mis-recalls sections; retrieval is essential and drives hallucination to zero by construction; and the SFT+RAG hybrid scores highest at 0.481 exact-match with zero hallucinated citations. Its edge comes from SFT making provision selection more robust to the higher-recall candidate sets that hurt zero-shot RAG. Notably, this cheap bge-small hybrid matches or beats a pipeline built on bigger, specialized retrieval models (a larger embedder and a cross-encoder reranker), and a larger/improved training set does not help either: strong statutory-citation performance here does not require specialized retrieval models or more data. The artifact zeroes hallucination and clears the lift-over-base bar but does not reach the aspirational 0.70 exact-match target. All results are on a small, human-verification-pending real eval set and are reported as preliminary.
[LG-19] On the Variance of Temporal Difference Learning and its Reduction Using Control Variates
Link: https://arxiv.org/abs/2606.20357
Authors: Hsiao-Ru Pan, Bernhard Schölkopf
Categories: Machine Learning (cs.LG)
Comments: Accepted at RLC2026
Abstract:We analyze the variance of temporal difference (TD) learning using the phased setting with tabular representation, and show that one of the mechanisms behind its ability to reduce variance is by effectively aggregating over a larger number of independent trajectories. Based on this insight, we demonstrate that (1) the variance of TD is asymptotically bounded from above by Monte Carlo (MC) estimators, and (2) shorter horizon updates incur less variance for a fixed number of samples. Beyond TD, we show that Direct Advantage Estimation (DAE), a method for estimating the advantage function, can be seen as a type of regression-adjusted control variate, which achieves a tighter bound on the variance compared to TD in the large-sample limit. Finally, we numerically illustrate the behaviors of these estimators with carefully designed environments.
[LG-20] Critical Percolation as a Synthetic Data Model for Interpretability ICML2026
Link: https://arxiv.org/abs/2606.20347
Authors: Aryeh Brill, Tom Ingebretsen Carlson
Categories: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn)
Comments: 21 pages, 10 figures, accepted to the Mechanistic Interpretability Workshop at ICML 2026
Abstract:Neural networks learn features that reflect the hierarchical, multi-scale structure of natural data. Synthetic datasets used to evaluate interpretability methods typically lack this structure, limiting their value as realistic toy models. To close this gap, we introduce a family of synthetic datasets consisting of hierarchical functions defined on critical mean-field percolation clusters embedded in a high-dimensional data space. The percolation data consists of sparse, low-dimensional fractal clusters with a power-law size distribution. Latent variables modeling a taxonomic hierarchy generate each data point’s target value. The data model is analytically tractable with known critical exponents that fix its properties without requiring hyperparameter tuning. We leverage a mapping between percolation clusters, random trees, and additive coalescence to propose an almost linear-time algorithm to jointly sample a random tree and its hierarchical latent decomposition, enabling data generation at arbitrary scale. Using probing experiments, we find that the model’s ground-truth latent variables can be linearly decoded from neural network activations. Together, sparsity, self-similarity, power-law statistics, and analytical tractability make critical percolation a principled testbed for interpretability research.
[LG-21] Constrained hybrid modelling to predict microbial dynamics and organic matter turnover in soil systems ICML’26
Link: https://arxiv.org/abs/2606.20329
Authors: Paul Collart, Juergen Gall, Andrea Schnepf, Holger Pagel, Lars Doorenbos
Categories: Machine Learning (cs.LG); Geophysics (physics.geo-ph)
Comments: Accepted at ICML '26
Abstract:Soil microorganisms control organic matter cycling and largely determine how soil systems can cope with and mitigate climate change and environmental threats. Representing microbial dynamics in process-based soil models is therefore critical to predict carbon cycling in soils, albeit highly challenging to inform from data. One promising approach to improve their parametrisation is the integration of genomic data, yet modelling the complex and unknown relationship between genomes and the processes the microbes are driving is an unsolved problem. In this work, we present the first hybrid modeling framework for deriving biokinetic parameter values of a process-based soil organic matter turnover model from metagenome-inferred functional traits based on DNA sequencing data. Our model predicts biokinetic parameters of the process-based model from genomic trait data with a neural network and integrates constraints from ecological theory and literature to ensure realistic behavior, even of non-observed state variables. We evaluate our method on synthetic genomic trait datasets of varying complexity and on real data, showing that our approach improves performance over multiple baselines and learns the dynamics of unmeasurable components of the process-based model effectively, even for small training datasets.
[LG-22] Quantum-classical physics-informed Kolmogorov-Arnold networks for PDEs
Link: https://arxiv.org/abs/2606.20326
Authors: Xiang Rao, Yuxuan Shen
Categories: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Abstract:We develop QCPIKAN, the first quantum-classical physics-informed Kolmogorov-Arnold network designed to solve partial differential equations (PDEs). Built upon Chebyshev-polynomial KAN layers and parameterized quantum circuits, this hybrid framework embeds physical constraints into the training loss to enforce physical consistency. Our theoretical investigations grounded in approximation theory prove that this design accelerates high-frequency error convergence to an exponential rate and effectively mitigates numerical dispersion. We validate the framework across three typical seepage scenarios in porous media, including single-phase flow, component transport and two-phase flow. Compared with existing quantum-classical physics-informed neural networks, QCPIKAN achieves superior performance in global prediction accuracy, local error control, dynamic evolution tracking and displacement front localization. This work provides a robust and efficient alternative for solving complex PDEs.
[LG-23] Recurrent neural networks approximate continuous functions
Link: https://arxiv.org/abs/2606.20325
Authors: Valentin Abadie, Clemens Hutter, Helmut Bölcskei
Categories: Machine Learning (cs.LG); Symbolic Computation (cs.SC); Dynamical Systems (math.DS)
Abstract:Classical approximation theorems ask for a new neural network whenever the target accuracy is improved. This paper studies the opposite possibility: can the network be chosen once and for all, and can accuracy be bought only by letting it run longer? We prove that this is possible for every continuous function on [-1,1]. More precisely, each such function is uniformly approximated by the time evolution of a single ReLU recurrent neural network with fixed weights and fixed hidden dimension. The mechanism behind the construction is a new intermediate model, the Turing machine with neural units (TMNU). This model retains the algorithmic freedom needed to implement polynomial approximation schemes, while remaining rigid enough to be simulated by RNNs with explicit bounds on hidden dimension and weight magnitude. The resulting convergence rates reflect the underlying polynomial approximation rates. We complement the construction with minimax lower bounds showing that runtime is not merely a proof artifact, but an unavoidable resource in this fixed-network approximation paradigm.
[LG-24] A Model-Driven Approach for Developing Families of Reinforcement Learning Environments
Link: https://arxiv.org/abs/2606.20324
Authors: Xiaoran Liu, Istvan David
Categories: Software Engineering (cs.SE); Machine Learning (cs.LG)
Abstract:Virtual training environments are software-intensive systems in which reinforcement learning (RL) agents learn, adapt, and demonstrate meaningful behavior. Virtual training environments offer a safe and cost-efficient alternative to training agents in real-world settings. However, to converge, most realistic RL problems require training in multiple, mostly similar but slightly different environments - i.e., families of environment variants. The typical development process of environment families is a labor-intensive and error-prone manual endeavor that does not scale well. To alleviate these issues, in this paper, we propose a model-driven approach for developing families of RL training environments. To obtain the family of environments, we develop an approach and prototype tool. In our approach, a hybrid genetic algorithm - a combination of population-based global search and heuristic local search - generates environment families. Mutations and constraints are expressed as model transformations and are operationalized into a search process by a state-of-the-art model transformation engine. We demonstrate the soundness of our approach in a wildfire mitigation scenario and curriculum learning - a particular learning paradigm that relies on environment families.
[LG-25] Shifting-based Optimizable Linear Relaxations for General Activation Functions
Link: https://arxiv.org/abs/2606.20292
Authors: Philipp Kern, László Antal, Erika Ábráham, Carsten Sinz
Categories: Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Comments: 21 pages, under review
Abstract:The use of neural networks (NNs) is rapidly increasing, including in safety- and security-critical domains. To provide formal guarantees about NN behavior, many verification methods rely on optimizable linear relaxations of activation functions. However, existing techniques depend on hand-crafted relaxations for each activation function. Extension to state-of-the-art activation functions therefore requires substantial manual effort. In contrast, our approach SLiR (Shifting-based Linear Relaxations) is broadly applicable, requiring only a Lipschitz constant or a set of critical points. SLiR parameterizes relaxations by their slope and computes the corresponding offset via a shifting procedure that ensures sound upper and lower bounds over the input domain, enabling efficient optimization while maintaining correctness. Our experiments show that SLiR produces tight relaxations across a wide range of practical activation functions and enables verification of up to 7.8x more properties compared to state-of-the-art methods.
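The abstract does not spell out SLiR's shifting procedure, so the following is only a simple sketch of the general idea: fix a slope, then shift the line up or down until it soundly bounds the activation on the interval. Soundness here comes from a grid evaluation plus a Lipschitz correction term, which is an assumed construction and not the paper's algorithm; `shifted_linear_bounds` is a hypothetical name.

```python
import numpy as np


def shifted_linear_bounds(f, lo, hi, slope, lipschitz, n_grid=1001):
    """Return offsets (b_low, b_up) with
        slope * x + b_low <= f(x) <= slope * x + b_up   for all x in [lo, hi].

    g(x) = f(x) - slope * x is (lipschitz + |slope|)-Lipschitz, and every x in
    the interval lies within h / 2 of a grid point, so the grid extrema of g
    are off by at most (lipschitz + |slope|) * h / 2.
    """
    xs = np.linspace(lo, hi, n_grid)
    residual = f(xs) - slope * xs
    h = (hi - lo) / (n_grid - 1)
    slack = (lipschitz + abs(slope)) * h / 2.0
    return residual.min() - slack, residual.max() + slack


# tanh is 1-Lipschitz; bound it on [-2, 2] with lines of slope 0.5.
b_low, b_up = shifted_linear_bounds(np.tanh, -2.0, 2.0, slope=0.5, lipschitz=1.0)
print(b_low, b_up)
```

Because the offsets are a function of the slope alone, the slope can be treated as the single optimizable parameter of the relaxation, which matches the parameterization described in the abstract.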
[LG-26] Effective Dimension Governs Generalization in Quantum Kernel Vision Models
Link: https://arxiv.org/abs/2606.20183
Authors: Jian Xu, Delu Zeng, John Paisley, Qibin Zhao
Categories: Machine Learning (cs.LG)
Abstract:Recent quantum vision models (quantum vision transformers and quantum convolutional networks) report two striking but unexplained empirical phenomena: (i) ansatze with more, or more uniformly distributed, entanglement generalize better, and (ii) injecting quantum noise can improve test accuracy rather than degrade it. These observations are currently treated as curiosities, discovered by grid search and explained, if at all, by hand. We show that both are manifestations of a single, measurable quantity: the effective dimension $d_{\rm eff}$ of the (noise-shaped) quantum feature kernel. Working primarily with quantum-kernel vision models (a quantum feature map read out by a kernel classifier), we give a spectral account in which entanglement structure and quantum noise are two knobs that move $d_{\rm eff}$; in an overfitting regime, contracting $d_{\rm eff}$ acts as ridge-like regularization. We analyze the mechanism: an exact decomposition of the depolarized kernel $K_p=(1-p)^2K+\tfrac{p(2-p)}{D}\mathbf{1}\mathbf{1}^\top$ with $d_{\rm eff}(K_p)\to 1$, a contraction result (and its boundary) for amplitude damping, a kernel-machine capacity bound, and a capacity/alignment risk decomposition; the monotone contraction operative in our entangled experiments is verified empirically, not proven in general. Along the one-parameter depolarizing family the collapse is instead exact by construction; we use it only to confirm the kernel decomposition to machine precision and at up to 12 qubits, not as evidence for $d_{\rm eff}$. Amplitude damping contracts $d_{\rm eff}$ and lifts test accuracy by up to +13% along an inverted-U sweet spot; the effect’s sign flips between the over- and under-fitting regimes; noise injection matches an explicit spectral-filtering frontier. Our results organize two reported anecdotes into a single measurable principle for designing quantum-vision models.
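The depolarized-kernel decomposition can be checked numerically on a small example. The sketch below assumes the kernel is the state-overlap kernel K(x, x') = Tr(ρ(x) ρ(x')) with D the Hilbert-space dimension, and it uses the participation ratio (tr K)^2 / tr(K^2) as the effective dimension. Both choices are assumptions, since the abstract does not state the definitions, and random pure states stand in for an actual feature map.

```python
import numpy as np


def random_pure_states(n, dim, rng):
    """Density matrices |psi><psi| for n random pure states of dimension dim."""
    psi = rng.standard_normal((n, dim)) + 1j * rng.standard_normal((n, dim))
    psi /= np.linalg.norm(psi, axis=1, keepdims=True)
    return np.einsum("ni,nj->nij", psi, psi.conj())


def depolarize(rhos, p):
    """Global depolarizing channel: rho -> (1 - p) * rho + p * I / D."""
    dim = rhos.shape[-1]
    return (1 - p) * rhos + p * np.eye(dim) / dim


def overlap_kernel(rhos):
    """Gram matrix K[i, j] = Tr(rho_i rho_j)."""
    return np.einsum("iab,jba->ij", rhos, rhos).real


def effective_dimension(K):
    """Participation ratio of the spectrum (an assumed definition of d_eff)."""
    eig = np.linalg.eigvalsh(K)
    return eig.sum() ** 2 / (eig ** 2).sum()


rng = np.random.default_rng(0)
n, D, p = 6, 4, 0.3
rhos = random_pure_states(n, D, rng)
K = overlap_kernel(rhos)
K_p = overlap_kernel(depolarize(rhos, p))
predicted = (1 - p) ** 2 * K + p * (2 - p) / D * np.ones((n, n))
print(np.max(np.abs(K_p - predicted)))  # agreement up to floating-point error
print(effective_dimension(K), effective_dimension(K_p))
```

The constant term follows from expanding the trace: the two cross terms contribute 2p(1-p)/D and the identity-identity term contributes p^2/D, which sum to p(2-p)/D. At p = 1 the kernel is a constant matrix of rank one, so the participation ratio collapses to 1.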
[LG-27] Computational Methods and Challenges in Cell-Free DNA Analysis for Multi-Cancer Early Detection
Link: https://arxiv.org/abs/2606.20174
Authors: Nicko Starkey, Marcin W. Wojewodzic, Krzysztof Rzecki
Categories: Machine Learning (cs.LG)
Abstract:Cell-free DNA (cfDNA) is a promising avenue for non-invasive multi-cancer early detection (MCED) because it can enable the simultaneous detection of multiple cancers from a single blood draw, with particular sensitivity to cancers that currently lack established screening programs. Here we review the computational methods developed between 2022 and 2025 for cfDNA-based MCED. We focus on how fragmentomics and epigenetic features are extracted and analyzed to detect cancer at early stages. We first briefly outline the biological basis of cfDNA signals, then review classical statistical and machine learning approaches alongside deep learning frameworks including autoencoder-based models. For each method we discuss biological interpretability, validation strategy, and readiness for clinical integration. Furthermore, we categorize the current challenges into technical, computational, and methodological while outlining open problems in the field. This review shows that multimodal ensemble approaches have the strongest promise for clinical integration and the highest readiness. However, for better assessment of future work and side-by-side comparison, standardization of evaluation protocols and reporting results will be crucial.
[LG-28] Predicting gestational age at birth in the context of preterm birth from multi-modal fetal MRI
Link: https://arxiv.org/abs/2606.20172
Authors: Diego Fajardo-Rojas, Megan Hall, Daniel Cromb, Mary A. Rutherford, Lisa Story, Emma C. Robinson, Jana Hutter
Categories: Machine Learning (cs.LG)
Comments: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA)
Abstract:Preterm birth is associated with significant mortality and a risk for lifelong morbidity. The complex multifactorial aetiology hampers accurate prediction and thus optimal care. A pipeline consisting of bespoke machine learning methods for data imputation, feature selection, and regression models to predict gestational age (GA) at birth was developed and evaluated from comprehensive multi-modal morphological and functional fetal MRI data from 333 control cases and 93 preterm birth cases. The GA at birth predictions were classified into term and preterm categories and their accuracy, sensitivity, and specificity were reported. An ablation study was performed to further validate the design of the pipeline. Performance was evaluated using stratified 10-fold cross-validation. The pipeline achieves an R2 score of 0.13 and a mean absolute error of 2.74 weeks. It also achieves a 0.77 accuracy, 0.59 sensitivity, and 0.82 specificity across folds. The predominant features selected by the pipeline include cervical length and statistics derived from placental T2* values. The confluence of fast, motion-robust and multi-modal fetal MRI techniques and machine learning prediction allowed the prediction of the gestation at birth. This information is essential for any pregnancy. To the best of our knowledge, preterm birth had only been addressed as a classification problem in the literature. Therefore, this work provides a proof of concept. Future work will increase the cohort size to allow for finer stratification within the preterm birth cohort. Our code is available at this https URL.
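The step of turning regression outputs into term and preterm labels can be sketched as follows. The 37-week cutoff is the standard clinical definition of preterm birth, but whether the paper uses exactly this threshold, and treats preterm as the positive class, is an assumption; the function name is illustrative.

```python
import numpy as np

PRETERM_THRESHOLD_WEEKS = 37.0  # assumed cutoff: birth before 37 completed weeks


def term_preterm_metrics(ga_true, ga_pred, threshold=PRETERM_THRESHOLD_WEEKS):
    """Classify gestational ages (in weeks) as preterm or term and score them.

    Preterm is treated as the positive class. Both classes must be present in
    ga_true, otherwise sensitivity or specificity is undefined.
    """
    true_preterm = np.asarray(ga_true) < threshold
    pred_preterm = np.asarray(ga_pred) < threshold
    tp = int(np.sum(true_preterm & pred_preterm))
    tn = int(np.sum(~true_preterm & ~pred_preterm))
    fp = int(np.sum(~true_preterm & pred_preterm))
    fn = int(np.sum(true_preterm & ~pred_preterm))
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


print(term_preterm_metrics([40, 39, 35, 30, 38], [39, 36, 34, 38, 40]))
```

A regression error of a few weeks can leave the classification unchanged when the true age is far from the cutoff, which is one reason the regression and classification metrics in the abstract can look quite different.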
[LG-29] Multi-Modal Contrastive Learning for Implicit Earth Embeddings via Location Tying
Link: https://arxiv.org/abs/2606.20167
Authors: Jonathan Hecht,Lukas Arzoumanidis,Ziyue Li,Youness Dehbi
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Spatial prediction tasks are often limited by a lack of high-quality labelled ground-truth observations. To overcome this challenge, self-supervised pre-training is a possible solution, with contrastive learning dominant for location encoders. Those approaches usually align geographic coordinates with just one additional modality. We propose two multimodal contrastive learning architectures: Multimodal Embedding via Location Tying (MELT) and Sequential Alternating Location Training (SALT). These architectures expand this framework beyond two modalities by utilising unpaired geospatial data. Both methods are technically viable and match the performance of the strongest two-modality baseline (SATCLIP) across four downstream tasks. However, increasing the number of modalities does not consistently improve performance, suggesting that the chosen location encoder is the main limitation - the contrastive objective reaches its peak early, regardless of modality diversity or pre-training volume. MELT provides more stable training than SALT and presents a stronger foundation for future scaling.
[LG-30] The Correctness Illusion in LLM-Generated GPU Kernels
Link: https://arxiv.org/abs/2606.20128
Authors: Dipankar Sarkar
Categories: Software Engineering (cs.SE); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: 10 pages, 2 figures, LNCS format. Companion papers to follow on arXiv next week; IDs will be added in a v2 replace
Abstract:Benchmarks for LLM-generated GPU kernels (KernelBench, TritonBench, GEAK) score correctness through fixed-shape, small-sample allclose-style checks. The number of inputs varies between benchmarks. The shape, dtype, and tolerance are fixed for each kernel. We test that oracle empirically. We construct a controlled corpus of 24 Triton and CPU stand-in kernels (15 correct controls and 9 LLM-style buggy variants seeded with documented transcription errors) and re-evaluate it under op-schema-aware seeded fuzzing with a high-precision (fp64) CPU reference and per-(op, dtype) absolute tolerances. The seeded oracle flags 9 of 9 buggy kernels and passes 15 of 15 correct controls, at zero precision cost on controls. We extend the corpus to 26 ops (adding a flash-attention pair) and re-run the same protocol on five GPU classes (RTX 3060, A10, L40S, A100 SXM4, H100 NVL). The verdicts are identical across all five GPUs: 10 of 10 illusions caught and 16 of 16 controls clean. The corpus result is about LLM-style transcription bugs that the allclose-on-one-shape oracle certifies as correct, not about the bug rate of any specific deployed LLM. Every flagged failure replays byte-for-byte from a stored seed.
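Illustrative sketch (not the paper's harness): the contrast between a fixed-shape allclose check and seeded fuzzing against an fp64 reference can be shown with a CPU stand-in. Everything here is hypothetical: the `BLOCK` width, the tolerance table, and the "forgotten tail tile" bug are assumptions chosen to mimic a masking error, not items from the paper's corpus.

```python
import numpy as np

BLOCK = 4  # hypothetical tile width of the stand-in "kernels" below
ATOL = {"float32": 1e-4}  # assumed per-dtype absolute tolerance


def reference_row_sum(x):
    """High-precision (fp64) CPU reference."""
    return x.astype(np.float64).sum(axis=-1)


def correct_row_sum(x):
    return x.sum(axis=-1, dtype=x.dtype)


def buggy_row_sum(x):
    """Drops the final partial tile, so it is only right when cols % BLOCK == 0."""
    n_full = (x.shape[-1] // BLOCK) * BLOCK
    return x[..., :n_full].sum(axis=-1, dtype=x.dtype)


def fixed_shape_check(kernel):
    """One fixed shape plus allclose: the style of oracle the abstract questions."""
    x = np.random.default_rng(0).standard_normal((4, 8)).astype(np.float32)
    return bool(np.allclose(kernel(x), reference_row_sum(x), atol=ATOL["float32"]))


def seeded_fuzz_check(kernel, seed=0, trials=50):
    """Random shapes from a stored seed; returns (passed, replay_info)."""
    rng = np.random.default_rng(seed)
    for trial in range(trials):
        rows, cols = int(rng.integers(1, 9)), int(rng.integers(1, 65))
        x = rng.standard_normal((rows, cols)).astype(np.float32)
        err = float(np.max(np.abs(kernel(x).astype(np.float64) - reference_row_sum(x))))
        if err > ATOL[x.dtype.name]:
            return False, (seed, trial, (rows, cols))
    return True, None


print("fixed-shape verdict on buggy kernel:", fixed_shape_check(buggy_row_sum))
print("fuzzed verdict on buggy kernel:", seeded_fuzz_check(buggy_row_sum))
print("fuzzed verdict on correct kernel:", seeded_fuzz_check(correct_row_sum))
```

Because the generator is seeded, a reported failure can be replayed by rerunning with the same `(seed, trial)` pair.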
[LG-31] Pose6DAug: Physically Plausible Multi-view Object Swapping for Robot Data Augmentation
Link: https://arxiv.org/abs/2606.20118
Authors: Jonghoon Lee,Seong Hyeon Park,Byungwoo Jeon,Minha Lee,Jinwoo Shin
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments:
Abstract:Vision-language-action (VLA) policies have shown strong potential for general-purpose manipulation, yet they often fail on novel, out-of-distribution objects whose appearance or geometry deviates from the training distribution. The standard remedy is to collect multi-view teleoperation data for every failure case, but this scales poorly in both cost and time. We introduce Pose6DAug, a failure-driven data augmentation framework that turns a policy’s own successful episodes into targeted demonstrations for its failure modes, without any new data collection. Our key insight is that each successful episode already encodes a physically valid action trajectory together with calibrated multi-view observations. By swapping only the manipulated object while preserving this trajectory, we obtain new and physically grounded demonstrations. However, naive 2D video editing breaks multi-view consistency and physical plausibility, particularly under heavy occlusion and egocentric viewpoints. Our method instead operates directly in 3D, anchoring the target object with an explicit mesh driven by a temporally coherent 6D pose trajectory, ensuring geometrically consistent renderings across all camera views. Fine-tuning a VLA on data augmented by our method improves success rates by 16.5% relative to the state-of-the-art baseline on novel objects, while preserving in-distribution performance. These results show that multi-view and physically consistent augmentation is a practical path to scalable VLA generalization.
[LG-32] Quantile of Means: A Bonus-Free Ensemble Method for Minimax Optimal Reinforcement Learning
Link: https://arxiv.org/abs/2606.20107
Authors: Asaf Cassel,Aviv Rosenberg
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Optimal Reinforcement Learning (RL) algorithms typically rely on carefully constructed count-based uncertainty estimates to drive exploration. Although theoretically sound, such estimates are hard to compute in practical settings and therefore offer limited insight for designing exploration heuristics. Meanwhile, ensembling has emerged as a practical approach, but remains without theoretical justification. Building on a recent ensemble-based method for Multi-Armed Bandits, we propose a quantile-based ensemble method for finite-horizon Markov Decision Processes (MDPs). Our simple count-free approach achieves optimal variance-dependent regret bounds, providing theoretical grounding for ensemble-based exploration in RL.
[LG-33] PaAno: Multiscale Encoding and Cross-Variable Attention for Time Series Anomaly Detection
Link: https://arxiv.org/abs/2606.20055
Authors: Youji Zhu,Hongbing Wang,Wenchao Liu,Xiaodong Liu,Xiangguang Xiong
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Time-series anomaly detection has significant practical value for industrial and medical monitoring, as well as other critical domains. Current Transformer- and large-model-based detection approaches incur excessive computational overhead, while existing lightweight alternatives are constrained by insufficient feature extraction and inadequate modeling of dependencies across multivariate variables. To mitigate the above drawbacks, this study develops a lightweight, efficient anomaly detection model, dubbed PaAno, within the patch-oriented representation learning paradigm. In the encoder module, a multiscale feature-extraction backbone is constructed using convolutional kernels with differentiated receptive fields to capture hierarchical temporal characteristics; subsequent cross-scale adaptive attention aggregation, combined with residual connection optimization, further stabilizes feature representation learning. A cross-variable fusion attention module is embedded to explicitly characterize inter-variable correlations, empowering the model to identify anomalous patterns amid intricate operational conditions. Moreover, a novel pretext task based on temporal patch-window sorting is customized to uncover intrinsic structural properties of time series, and triplet loss is leveraged to optimize the patch embedding space for enhanced feature discrimination. Extensive experiments on the TSB-AD benchmark demonstrate that the proposed PaAno achieves state-of-the-art detection accuracy on both univariate and multivariate tasks, yielding significant performance gains across evaluation metrics, including VUS-PR, relative to the original PaAno. Leveraging a compact network design, the presented model achieves favorable computational efficiency, enabling deployment on resource-limited terminals for real-time anomaly inference.
[LG-34] Comparative Study of Neural Surrogate Architectures for Autoregressive Prediction of Internal Battery States
Link: https://arxiv.org/abs/2606.20053
Authors: Gihyun Lee,Thorben Menne,Simon Olma,Jakob Hilgert,Sangyoung Park
Categories: Machine Learning (cs.LG)
Comments: 8 pages, 5 figures
Abstract:The Doyle-Fuller-Newman (DFN) model resolves internal electrochemical states in lithium-ion batteries with high fidelity. However, the numerical solution of its governing equations is computationally prohibitive for real-time deployment, limiting scalability from individual cells to pack and fleet-scale applications. While machine learning surrogates can substantially reduce inference latency through GPU acceleration, most existing approaches learn solution approximations tied to specific operating conditions rather than learning generalizable state-evolution dynamics. This work presents a systematic comparison of four neural network architectures (MLP, ResNet, U-Net, FNO) formulated as autoregressive state-transition operators that predict full DFN internal states across a wide range of operating conditions. To ensure a controlled architectural comparison, all models are trained under a unified framework using multi-step unrolling and current-conditioning, isolating the impact of spatial inductive bias. Results demonstrate that the U-Net’s multi-scale feature hierarchy achieves a mean final-step nRMSE of 3% averaged across all internal state variables after 300-step autoregressive rollouts, while providing a 5.38x speed-up over the numerical solver. These findings highlight spatial inductive bias as a critical determinant of surrogate performance, advancing the development of surrogates for internal state observability for next-generation battery management systems and digital twins.
[LG-35] Alzheimers Disease Diagnosis using a Multimodal Approach with 3D MRI and PET
Link: https://arxiv.org/abs/2606.20037
Authors: Loukas Ilias,Anthi-Maria Vozinaki,Christos Ntanos,Dimitris Askounis
Categories: Machine Learning (cs.LG)
Comments: 2025 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)
Abstract:Alzheimer’s disease (AD) is an irreversible neurodegenerative disorder and a leading cause of death worldwide. Early diagnosis plays an important part especially at the Mild Cognitive Impairment stage, where timely intervention can help slow its progression before it advances to AD. Neuroimaging data, like Magnetic Resonance Imaging (MRI) and Positron Emission Tomography (PET) scans, can help detect brain changes early by providing structural and functional brain changes related to the disease. Yet, many multimodal models still fuse MRI and PET with static concatenation and apply identical computation to all subjects, which limits robustness to patient/site heterogeneity and can waste computation. To address these limitations, we present the first study of combining 3D convolutional feature extractors with three fusion strategies - concatenation, Gated Multimodal Unit (GMU), and gated self-attention - and a sparsely gated Mixture-of-Experts (MoE) classifier that performs input-adaptive routing, activating only the most informative experts per case. Finally, we utilize Grad-CAM to visualize disease-related regions, ensuring model interpretability. Experiments are performed across three binary classification tasks (NC vs. MCI, MCI vs. AD, and NC vs. AD). Results show that GMU achieves accuracies of 80.46% (NC vs. MCI) and 95.47% (NC vs. AD), while gated self-attention attains 82.08% on MCI vs. AD. Ablations show that removing the MoE consistently degrades accuracy across all tasks. These findings underscore the value of input-adaptive, multimodal modeling for AD diagnosis by leveraging the complementary nature of MRI and PET.
[LG-36] Exploring the potential of AlphaEarth and TESSERA embeddings for Fine-scale Local Climate Zone Mapping: A case study across five cities in Switzerland
Link: https://arxiv.org/abs/2606.20034
Authors: Htet Yamin Ko Ko,Clement Atzberger
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Understanding urban spatial morphology is critical for climate modeling, risk assessment, and sustainable urban design, and Local Climate Zone (LCZ) mapping provides the basic framework for this. However, many cities still use coarse ~100-m resolution LCZ records, which are unsuitable for fine-scale urban research. In this study, precomputed embeddings from TESSERA (Feng et al., 2025) and AlphaEarth (Brown et al., 2025) are compared to traditional Sentinel-1/2 (S1S2) composites in five Swiss cities to see if they can upscale coarse LCZ maps to 10-m resolution using an attention-based U-Net. Three experiments assess multi-city transferability, the impact of higher-resolution reference data, and temporal robustness to year-to-year phenology changes. We find that all datasets achieve strong performance, with test-data Intersection-over-Union (IoU) ranging from 0.59-0.69 and 0.77-0.82 in the first two experiments, respectively. TESSERA consistently outperforms both S1S2 and AlphaEarth across both settings. As expected, we find that the transfer of embedding-based models from one year to another remains an open challenge. Overall, however, our results demonstrate the promising potential of embeddings derived from EO foundation models to reduce time-consuming preprocessing and manual feature engineering tasks and to guide a universal deep learning-based LCZ mapping workflow. When combined with a simple location-aware attention U-Net architecture, the embeddings enhance regional transferability and scalability, supporting the development of comprehensive and reproducible fine-scale LCZ maps for global urban climate applications. Improving reference data quality remains the strongest lever for further accuracy gains.
[LG-37] Adaptive Distance-Aware Trunk Deep Operator Learning for Long-Span Roadway Bridges
Link: https://arxiv.org/abs/2606.20015
Authors: Bilal Ahmed,Diab W. Abueidda,Waleed El-Sekelly,Tarek Abdoun,Mostafa E. Mobasher
Categories: Machine Learning (cs.LG)
Comments: 39 pages, 26 figures
Abstract:Long-span roadway bridges exhibit highly localized structural responses under vehicular loading, making repeated FE analysis computationally expensive for applications such as influence surface generation and structural digital twins. Existing SciML approaches struggle to accurately capture these localized responses. To address this challenge, this study proposes an adaptive-trunk DeepONet for localized structural response prediction in large-scale bridge systems. The framework dynamically constructs a load-dependent learning domain using a KNN strategy, allowing the network to focus on structural influence zones. The trunk network is further enhanced using distance-aware features that encode the geometric relationship between the load and structural nodes. A physics-based full-field reconstruction is incorporated through a stiffness-informed Schur complement formulation, enabling predictions at adaptive nodes to be extended to the entire structural domain. To enable scalable training, response data are generated using a reduced-order equivalent shell model that preserves the dominant global behavior while significantly reducing computational cost. The proposed framework is validated on both a benchmark bridge model and the real-world Mussafah Bridge. Results show that the method achieves FEM-level accuracy with relative errors below 5%, while reducing the total response evaluation time (including full-field reconstruction) by approximately 60x; excluding the post-processing reconstruction step, the AD-DeepONet inference is up to four orders of magnitude faster than FEM. In addition, the framework enables rapid generation of full-field responses, influence lines, and influence surfaces under arbitrary vehicular loading configurations, demonstrating strong potential for large-scale bridge analysis and digital twin applications.
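Illustrative sketch (not the paper's formulation): the abstract mentions extending predictions at adaptive nodes to the whole structure through a stiffness-informed Schur complement. One standard linear-algebra reading of that idea is static condensation: if displacements `u_a` are known on a subset of degrees of freedom, the remaining ones follow from the corresponding rows of `K u = f`. The partitioning, the function name, and the toy spring chain below are assumptions made for illustration.

```python
import numpy as np


def reconstruct_full_field(K, f, adaptive_idx, u_adaptive):
    """Recover all DOFs from known values on `adaptive_idx` using K_rr u_r = f_r - K_ra u_a."""
    n = K.shape[0]
    a = np.asarray(adaptive_idx)
    r = np.setdiff1d(np.arange(n), a)  # remaining (non-adaptive) DOFs
    K_rr = K[np.ix_(r, r)]
    K_ra = K[np.ix_(r, a)]
    u = np.empty(n)
    u[a] = u_adaptive
    u[r] = np.linalg.solve(K_rr, f[r] - K_ra @ u_adaptive)
    return u


# Toy problem: a chain of unit springs fixed at both ends, point load at node 5.
n = 12
K = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
f = np.zeros(n)
f[5] = 1.0
u_exact = np.linalg.solve(K, f)

# Pretend a network predicted the response only in the zone around the load.
adaptive_idx = [3, 4, 5, 6, 7]
u_rebuilt = reconstruct_full_field(K, f, adaptive_idx, u_exact[adaptive_idx])
print(np.max(np.abs(u_rebuilt - u_exact)))
```

In this toy case the adaptive values are exact, so the reconstruction matches the full solve; with a learned predictor the reconstruction error would depend on the prediction error at the adaptive nodes.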
[LG-38] Self-Adaptive Scale Handling for Forecasting Time Series with Scale Heterogeneity ICASSP
Link: https://arxiv.org/abs/2606.20010
Authors: Xu Zhang,Zhengang Huang,Yunzhi Wu,Xun Lu,Erpeng Qi,Yunkai Chen,Zhongya Xue,Peng Wang,Wei Wang
Categories: Machine Learning (cs.LG)
Comments: This is the full version of the paper accepted by the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). The code and dataset are available at this https URL
Abstract:Current time series forecasting (TSF) research predominantly focuses on scale-homogeneous data, where different time series share similar numerical magnitude ranges. However, in real-world industrial scenarios such as financial product sales, different time series often differ by orders of magnitude (scale heterogeneity). Since these series share similar temporal patterns, joint modeling is desirable for better data utilization, yet existing scaling methods either compress low-scale signals (global normalization) or destroy semantic discriminability and amplify inverse-scaling errors (window-based scaling). This paper proposes a self-Adaptive Scale-handling (AS) module that learns adaptive scale factors tailored to each input, preserving semantic discriminability while reducing inverse-scaling errors. AS consists of Scale Calibrating (SC), which calibrates prior mean scaling factors through neural networks, and Scaling Selection (SS), which decides whether to apply calibration or retain the original factor, avoiding over-calibration. Experiments on real-world fund sales datasets from Ant Fortune and Alipay show that AS seamlessly integrates into popular TSF models and consistently improves their performance. The code and dataset are available at the link this https URL.
[LG-39] VIMPO: Value-Implicit Policy Optimization for LLMs
Link: https://arxiv.org/abs/2606.20008
Authors: Zhewei Kang,Aosong Feng,Sergey Levine,Dawn Song,Xuandong Zhao
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Reinforcement learning with verifiable rewards has become a central tool for improving the reasoning ability of large language models, but current methods face a trade-off between simplicity and credit assignment. Group-relative methods such as GRPO avoid training a critic, but typically assign a trajectory-level advantage to every token. Actor-critic methods provide denser learning signals, but require a learned value function with its own training instability. We introduce VIMPO, a critic-free policy optimization method that derives a policy-implied value function from the optimality conditions of KL-regularized reinforcement learning. For autoregressive generation, the resulting value recurrence can be written in terms of policy-reference log-ratios and anchored by the terminal condition that no future reward remains at the end of a trajectory. This gives a simple value loss that incorporates outcome-level verifiable rewards without training a critic. The same derivation also yields a critic-free actor advantage, allowing VIMPO to separate reward incorporation through the value loss from policy improvement through a PPO-style actor update. On mathematical RLVR benchmarks, VIMPO improves over GRPO across MATH-500, AIME 2024, AIME 2025, and OlympiadBench, with especially larger gains on competition-style evaluations. Under noisy rewards, VIMPO retains a consistent advantage over GRPO, suggesting that policy-implied value optimization can provide finer credit assignment while preserving the practical simplicity of critic-free training.
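Illustrative derivation (a standard result for KL-regularized RL, not necessarily the paper's exact loss): the abstract refers to a value recurrence written in policy-reference log-ratios and anchored by a terminal condition. Under the usual assumptions the recurrence can be sketched as follows. The symbols are assumed notation, not taken from the abstract: $\beta$ is the KL coefficient, $\pi_{\mathrm{ref}}$ the reference policy, $r_t$ the per-token reward, and $s_{t+1} = (s_t, a_t)$ the deterministic next state of autoregressive generation.

```latex
% Assumed notation; the abstract does not give the paper's symbols.
\begin{align}
\pi^*(a_t \mid s_t) &= \pi_{\mathrm{ref}}(a_t \mid s_t)\,
   \exp\!\Big(\tfrac{1}{\beta}\big(Q^*(s_t, a_t) - V^*(s_t)\big)\Big), \\
Q^*(s_t, a_t) &= r_t + V^*(s_{t+1}), \\
V^*(s_t) &= r_t + V^*(s_{t+1})
   - \beta \log \frac{\pi^*(a_t \mid s_t)}{\pi_{\mathrm{ref}}(a_t \mid s_t)},
   \qquad V^*(s_T) = 0 .
\end{align}
% With a single outcome reward R at the end of the trajectory, telescoping gives
\begin{equation}
V^*(s_t) = R - \beta \sum_{k=t}^{T-1}
   \log \frac{\pi^*(a_k \mid s_k)}{\pi_{\mathrm{ref}}(a_k \mid s_k)} .
\end{equation}
```

The third line follows by taking logarithms of the first and substituting the second; the terminal condition expresses that no reward remains after the last token.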
[LG-40] Activation- and Influence-Aware Ranks (AIR): Function-Preserving SVD Compression for LLMs ICML2026
Link: https://arxiv.org/abs/2606.19993
Authors: Nico Harder,Daniel Becking,Karsten Mueller,Wojciech Samek
Categories: Machine Learning (cs.LG)
Comments: Accepted at the ICML 2026 Workshop on Resource-Adaptive Foundation Model Inference (AdaptFM), Seoul, South Korea (non-archival)
Abstract:We present Activation- and Influence-Aware Ranks (AIR), an SVD-based LLM compression framework that guides each weight matrix’s low-rank approximation with a backward-signal influence metric. Starting from the activation-aware optimum of SVD-LLM(W), AIR runs a single closed-form alternating least squares (ALS) sweep that integrates influence element-wise under a monotone-descent guarantee. AIR is layer-local and composes orthogonally with end-to-end methods: alone it exceeds ACIP, and AIR+LoRA outperforms it further. AIR improves perplexity over SVD-LLM(W) by 18% at =60% parameter retention, matches its quality with ~90% less calibration data, and turns parameter savings into FLOP, peak-memory, and per-token latency gains.
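Illustrative sketch (the activation-aware starting point only, not AIR itself): the abstract starts from an activation-aware SVD optimum. A common way to obtain one is to whiten with a Cholesky factor of the activation Gram matrix, truncate the SVD of the whitened weight, and map back. The code below compares that with plain truncated SVD on synthetic data. The shapes, the jitter `eps`, and all function names are assumptions; the influence-aware ALS sweep described in the abstract is not reproduced.

```python
import numpy as np


def truncated_svd(M, k):
    """Best rank-k approximation of M in Frobenius norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]


def plain_low_rank(W, k):
    return truncated_svd(W, k)


def activation_aware_low_rank(W, X, k, eps=1e-6):
    """Rank-k W_hat minimising ||(W - W_hat) X||_F via whitening (eps is an assumed jitter)."""
    gram = X @ X.T + eps * np.eye(X.shape[0])
    S = np.linalg.cholesky(gram)  # gram = S @ S.T
    M_k = truncated_svd(W @ S, k)  # truncate in the whitened space
    return np.linalg.solve(S.T, M_k.T).T  # equals M_k @ inv(S)


rng = np.random.default_rng(0)
d_out, d_in, n_samples, k = 16, 32, 256, 4
W = rng.standard_normal((d_out, d_in))
# Synthetic calibration activations with very uneven feature scales.
X = rng.standard_normal((d_in, n_samples)) * np.logspace(-1, 1, d_in)[:, None]


def output_error(W_hat):
    return float(np.linalg.norm((W - W_hat) @ X))


err_plain = output_error(plain_low_rank(W, k))
err_aware = output_error(activation_aware_low_rank(W, X, k))
print(err_plain, err_aware)
```

Storing the two rank-k factors takes `k * (d_out + d_in)` numbers instead of `d_out * d_in`, which is where the parameter retention ratio comes from.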
[LG-41] Online Dynamic Batching with Formal Guarantees for LLM Training
Link: https://arxiv.org/abs/2606.19989
Authors: Dian Li,Zekun Wang,Yaoru Wang,Jiahong Yan
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: 29 pages, 3 figures, 21 tables
Abstract:Modern LLM training breaks a core assumption behind offline batch samplers: the true training cost of a sample is only observable after preprocessing, augmentation, templating, tokenization, and multimodal visual-token expansion. Unless one pays for a preprocessing- and augmentation-dependent length cache, batch construction is therefore blind to the quantity that determines padding, memory use, and GPU saturation. We introduce Online Dynamic Batching (ODB), a DataLoader-side drop-in system that moves batch formation to this point of accurate observability while preserving DDP step alignment. We formalize this synchronization requirement as the Distributed Group Alignment Problem and prove deadlock-free bounded termination with default join-mode identity coverage and opt-in non-join sample-quota closure. ODB requires no model, optimizer, or attention-kernel changes and is released as online-dynamic-batching with lightweight trainer adapters. Across public 2B/8B Qwen3-VL runs on UltraChat/LLaVA/ShareGPT4o, ODB improves literal emitted-sample throughput vs. fixed-batch Standard by 1.58-2.51x on single-node Full FT/LoRA and 1.71-3.78x on two-node Full FT, with Standard-comparable quality; production MM-Mix reaches 4.43x. Against GMT/BMT offline token-budget oracles, ODB is within 15% on UltraChat/LLaVA and faster on high-CV ShareGPT4o: 2.24-2.39x single-node Full FT/LoRA and 3.06-3.69x two-node Full FT. Together, ODB occupies the online/drop-in regime for high-heterogeneity LLM fine-tuning: large throughput gains at Standard-comparable quality, formal DGAP guarantees, and no length-cache precompute or kernel rewrites.
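Illustrative sketch (not the ODB algorithm): the abstract's core observation is that the true sample length is only known after preprocessing, so batches are best formed at that point. A minimal single-process version of the idea is a greedy generator that closes a batch once its padded size would exceed a token budget. The function name and the budget are hypothetical, and the distributed step-alignment guarantees described in the abstract are not modelled.

```python
from typing import Iterable, Iterator, List


def token_budget_batches(lengths: Iterable[int], max_padded_tokens: int) -> Iterator[List[int]]:
    """Greedily group post-tokenization lengths so that max_len * batch_size stays in budget."""
    batch: List[int] = []
    longest = 0
    for n in lengths:
        if n > max_padded_tokens:
            raise ValueError(f"sample of {n} tokens exceeds the budget")
        new_longest = max(longest, n)
        # Padded cost if this sample joined the current batch.
        if batch and new_longest * (len(batch) + 1) > max_padded_tokens:
            yield batch
            batch, new_longest = [], n
        batch.append(n)
        longest = new_longest
    if batch:
        yield batch


# Lengths as they might be observed after templating and tokenization (made up).
observed = [120, 30, 35, 500, 40, 45, 50, 480]
batches = list(token_budget_batches(observed, max_padded_tokens=1024))
print(batches)  # [[120, 30, 35], [500, 40], [45, 50], [480]]
```

Batch sizes vary with content, which is why a multi-GPU version needs extra coordination so that every rank takes the same number of optimizer steps.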
[LG-42] Kolmogorov-Arnold Reservoir Computing
Link: https://arxiv.org/abs/2606.19984
Authors: Juntian Huang,Jurgen Kurths,Ying Tang
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Reservoir computing offers a lightweight framework for forecasting dynamical systems but may struggle to capture long-range dependencies due to limited representational capacity. Conventional reservoir computing recurrently uses trainable reservoirs with hyperparameter sensitivity, while the next-generation reservoir computing removes recurrence at the cost of rapidly growing feature dimensions. Here, we develop Kolmogorov-Arnold Reservoir Computing (KARC), which replaces reservoirs with explicit basis-function expansions inspired by the Kolmogorov-Arnold representation theorem. We rigorously show that KARC is a lightweight design of Kolmogorov-Arnold networks (KANs), preserving the potential expressive capacity of KANs while admitting efficient closed-form training of reservoir computing. At comparable cost, KARC outperforms existing reservoir computing methods on challenging benchmarks including partial differential equations. It can also be integrated with generative diffusion models for text-to-image generation. This work thus establishes a principled bridge between reservoir computing and KANs, enabling efficient and high-fidelity dynamical system forecasting.
[LG-43] Low-Energy Reduced RISC-V Instruction Subset Processor for Tsetlin Machine Inference at the Edge
Link: https://arxiv.org/abs/2606.19964
Authors: Chanda Gupta,Sanidhya Bhatia,Shaurya Priyadarshi,Himani Panwar,Rishad Shafik,Sudip Roy
Categories: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
Comments: 6 pages, 6 Figures, Accepted in IEEE ISVLSI Conference 2026
Abstract:Tsetlin Machine (TM) is a logic-based machine learning approach that relies on simple bitwise operations and finite-state automata, which makes it attractive for edge AI deployments. Recent work has focused on co-processor and accelerator designs based on Tsetlin Machines (TMs). Although these designs achieve high performance, they typically depend on tightly coupled interfaces, microcode-style programming, and external host processors, limiting flexibility and ease of programming. In this work, we present a domain-specific RISC-V microprocessor architecture and design flow tailored for TM inference. Leveraging the modular structure of RISC-V, we design a reduced instruction subset processor that retains programmability while targeting improved performance and lower energy consumption for TM workloads. Instruction profiling is employed to guide instruction reduction, followed by datapath and control path simplifications tailored to TM inference. Both the baseline RV32IM core and the proposed reduced core are evaluated across multiple datasets and compared with Binarized Neural Networks (BNNs), which serve as a hardware-efficient baseline due to their reliance on bitwise operations during inference. Results show that TM achieves comparable or higher accuracy (e.g., up to 88.18% on CIFAR-2 compared to 60.0% for BNN) while reducing execution time by up to 98% across multiple datasets. Furthermore, the proposed design achieves an average 29.7x reduction in energy consumption, demonstrating its effectiveness for programmable and efficient edge AI systems.
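Illustrative sketch (not the paper's firmware): Tsetlin Machine inference reduces to bitwise tests, which is what makes a reduced instruction subset plausible. Each clause is a conjunction of included literals (an input bit or its negation), and the class score is the number of firing positive clauses minus the number of firing negative ones. The bitmask encoding and the hand-written XOR clauses below are assumptions for illustration; treating an empty clause as not firing at inference is a common convention that is also assumed here.

```python
def clause_fires(x, include_pos, include_neg):
    """x and both masks are ints used as bit vectors; bit i stands for feature i."""
    if include_pos == 0 and include_neg == 0:
        return False  # assumed convention: empty clauses do not vote at inference
    all_required_ones_set = (x & include_pos) == include_pos
    all_required_zeros_clear = (x & include_neg) == 0
    return all_required_ones_set and all_required_zeros_clear


def class_score(x, clauses):
    """clauses: iterable of (polarity, include_pos, include_neg) with polarity +1 or -1."""
    return sum(pol for pol, pos, neg in clauses if clause_fires(x, pos, neg))


# Hand-written clauses for 2-bit XOR (bit 0 = x0, bit 1 = x1).
xor_clauses = [
    (+1, 0b01, 0b10),  # x0 AND NOT x1
    (+1, 0b10, 0b01),  # x1 AND NOT x0
    (-1, 0b11, 0b00),  # x0 AND x1
    (-1, 0b00, 0b11),  # NOT x0 AND NOT x1
]
predictions = [class_score(x, xor_clauses) > 0 for x in range(4)]
print(predictions)  # [False, True, True, False]
```

Only AND, compare, and add/subtract appear in the inner loop, which matches the abstract's point that multiply-heavy instructions can be dropped for this workload.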
[LG-44] Towards Graph-Based Deep Learning for Map Generalization: Insights from Building Footprints Simplification and Aggregation
Link: https://arxiv.org/abs/2606.19956
Authors: Yanning Wang,Zhiyong Zhou,Zhouyu Liu,Mengni Yu,Yu Feng
Categories: Machine Learning (cs.LG)
Comments: 15 pages, 20 figures, 10 tables
Abstract:Map generalization remains one of the fundamental tasks in cartography, especially for the simplification and aggregation of complex building footprints. This study presents the first exploratory application of graph-based deep learning to both tasks, reformulating simplification as node movement prediction and aggregation as link prediction within a unified graph learning framework. We evaluate representative graph neural network architectures (GCN, GAT, and GraphSAGE) on multi-scale building datasets, showing that GraphSAGE demonstrates relative strengths in link prediction accuracy, while also revealing persistent challenges in precise node movement prediction. Beyond quantitative performance, the results highlight that aggregation poses greater complexity and challenges than simplification, underscoring the difficulty of capturing higher-level spatial relationships in map generalization with current deep learning approaches. Although limitations such as data imbalance and the need for post-processing remain, the study provides valuable insights and methodological directions for advancing automated map generalization with deep learning approaches.
[LG-45] Compositionality Emerges in a Narrow Depth-Connectivity Regime: Architecture Constraints and Solution Manifolds
Link: https://arxiv.org/abs/2606.19941
Authors: Dat H. Do,Rushi Shah,Duc V. Le,Dianbo Liu
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Compositionality is believed to be the foundation for generalization, enabling models to reuse meaningful primitives in novel combinations. Yet, models trained with standard gradient-based optimization rarely, and often only weakly, exhibit compositional internal structure, and it remains unclear how or why such compositionality forms. In this work, we show that compositionality emerges in a narrow connectivity-depth sweet spot. Along the connectivity axis, compositionality only appears in some specifically sparse networks, heavily depends on which connections remain rather than on weights’ sparsity alone. Along the depth axis, compositionality emerges within a narrow, target-dependent regime, peaking at specific depths, while both shallower and deeper networks fail. When either the depth or connectivity condition is violated, gradient descent silently converges to fractured solutions rather than compositional ones. To discover and exploit this emergence, we introduce (i) similarity-based pruning (SP) to recover compositional connectivity and (ii) a heuristic depth predictor to estimate where compositionality is most likely to appear. Finally, we support these empirical findings with a theoretical framework based on compositional sparsity, volume-ratio arguments, and feature-interference bounds, explaining why compositional solutions are reachable only in a narrow depth-connectivity regime.
[LG-46] ADaPT: Token-Level Decoupling for Efficient Large Reasoning Models
Link: https://arxiv.org/abs/2606.19919
Authors: Tingyun Li,Zishang Jiang,Jinyi Han,Xinyi Wang,Sihang Jiang,Han Xia,Zhaoqian Dai,Shuguang Ma,Fei Yu,Jiaqing Liang,Yanghua Xiao
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Large reasoning models rely on long chain-of-thought to achieve strong performance, but applying such reasoning uniformly incurs high computational cost. Existing efficiency-oriented methods attempt to shorten or mix reasoning strategies, yet often degrade reasoning capability. We identify the root cause as sequence-level coupling between efficiency incentives and correctness optimization, which implicitly penalizes long but correct reasoning trajectories. To address this issue, we propose Adaptive Dual-Process Thinking (ADaPT), a token-level dual-process framework that explicitly decouples efficiency and correctness signals during training. ADaPT introduces a mode-selection token to control fast and slow reasoning, applying efficiency-related rewards exclusively to this token to avoid penalizing correct long reasoning while encouraging efficiency when appropriate. Moreover, ADaPT enables precise and continuous control over the efficiency-performance trade-off at inference time: by adjusting the generation probability of the mode-selection token, a single trained model can smoothly move along the efficiency-performance Pareto frontier. Extensive experiments demonstrate that ADaPT significantly reduces inference cost while maintaining strong reasoning performance across multiple benchmarks.
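Illustrative sketch (not the paper's mechanism): the abstract says the efficiency-performance trade-off can be tuned at inference time by adjusting the generation probability of the mode-selection token. One simple way to picture this is a bias added to the logit of a hypothetical slow-mode token before a two-way softmax. The two-token restriction, the bias parameter, and the function name are all assumptions.

```python
import math


def slow_mode_probability(logit_fast, logit_slow, slow_bias=0.0):
    """Two-way softmax over hypothetical <fast>/<slow> tokens, with a bias on <slow>."""
    # softmax over two logits equals a sigmoid of their difference
    return 1.0 / (1.0 + math.exp(logit_fast - (logit_slow + slow_bias)))


# Sweep the bias: larger values route more queries to long reasoning.
biases = (-2.0, 0.0, 1.0, 4.0)
probs = [slow_mode_probability(1.0, 0.0, b) for b in biases]
for b, p in zip(biases, probs):
    print(f"bias={b:+.1f} -> P(slow)={p:.3f}")
```

Because the probability is monotone in the bias, a single scalar moves the model continuously between mostly-fast and mostly-slow behaviour without retraining.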
[LG-47] Structure-Oriented Randomized Neural Networks for Poisson-Nernst-Planck and Poisson-Nernst-Planck-Navier-Stokes Systems
Link: https://arxiv.org/abs/2606.19912
Authors: Yunlong Li,Fei Wang
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments:
Abstract:We develop a structure-oriented randomized neural network framework, termed SO-RaNN, for the Poisson-Nernst-Planck (PNP) system and the Poisson-Nernst-Planck-Navier-Stokes (PNP-NS) system. The decoupled linearized subproblems are solved iteratively by randomized neural networks in a space-time framework. For the concentration variables, a pointwise cut-off is used to enforce positivity at the value level, and discrete mass-scaling factors are computed at selected correction instants and interpolated in time, so as to ensure exact mass matching at those instants and to promote approximate mass preservation between them. To introduce an auxiliary discrete dissipation mechanism, we further employ an SAV-type post-processing correction, which yields monotonicity of the SAV auxiliary variable under the ideal SAV update. For the PNP-NS system, a structure-preserving randomized neural network (SP-RaNN) is used for the velocity field, so that the velocity approximation satisfies the incompressibility constraint pointwise by construction. On the theoretical side, we derive residual-based estimates for the raw, uncorrected RaNN solvers of the linearized subproblems, formulate a conditional local-in-time convergence result for the raw outer Picard iteration of the PNP system, and analyze the value-level positivity correction together with the mass-correction and SAV post-processing steps. For the PNP-NS system, we establish an approximation result for the SP-RaNN space and provide a conditional error statement for the corresponding linearized Oseen-type problem. Numerical experiments demonstrate approximation accuracy in the source-driven manufactured tests and illustrate the intended value-level positivity correction, selected-time mass matching, computed free-energy curves based on the final gauge-fixed potential, and divergence-free approximation in benchmark tests.
[LG-48] A fast direct solver based neural network for solving PDEs
链接: https://arxiv.org/abs/2606.19895
作者: Jashwanth Reddy Kadaru,Vaishnavi Gujjula
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注: 26 pages, 7 Figures, 5 Tables
Abstract:The matrices arising from large scale N-body problems can be efficiently represented using hierarchical matrices, whose key idea is that the admissible off-diagonal sub-matrices can be well approximated by low-rank matrices across a hierarchy of matrix partitions. HODLR (Hierarchical Off-Diagonal Low-Rank) matrices are a subclass of hierarchical matrices in which all off-diagonal submatrices at every level of a recursive binary partition are low-rank. In this article, we present a neural network that learns the inverse operation of HODLR matrices based on the fast direct solver for HODLR matrices developed by Ambikasaran and Darve (2013). We further extend the architecture to learn nonlinear solution operators associated with PDEs by replacing some of the linear layers with deep sub-networks. We demonstrate the performance of the proposed architecture by performing a comprehensive set of experiments that include (i) solving a linear problem such as the Fredholm integral equation of the second kind, (ii) solving PDEs such as the nonlinear Schrödinger equation, Burgers’ equation, and the steady-state Darcy’s flow equation, (iii) generalization study across varying parameter values, (iv) comparing the inference time of the proposed network with the run time of a classical numerical solver, and (v) comparing the proposed network with some of the existing neural operator learning networks.
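To make the HODLR structure concrete, here is a small sketch of a one-level HODLR matrix-vector product with rank-1 off-diagonal blocks. It only illustrates the storage pattern (dense diagonal blocks plus low-rank off-diagonal factors). It is not the fast direct solver or the network from the paper, and every name and value below is a hypothetical choice for illustration.

```python
def dense_matvec(mat, vec):
    """Multiply a dense nested-list matrix by a vector."""
    return [sum(a * x for a, x in zip(row, vec)) for row in mat]


def rank1_matvec(u, v, x):
    """Apply the rank-1 block u v^T to x without forming the block."""
    scale = sum(vi * xi for vi, xi in zip(v, x))
    return [ui * scale for ui in u]


def hodlr_matvec(d1, d2, u12, v12, u21, v21, x):
    """One-level HODLR product for A = [[D1, u12 v12^T], [u21 v21^T, D2]]."""
    n1 = len(d1)
    x1, x2 = x[:n1], x[n1:]
    top = [a + b for a, b in zip(dense_matvec(d1, x1), rank1_matvec(u12, v12, x2))]
    bottom = [a + b for a, b in zip(dense_matvec(d2, x2), rank1_matvec(u21, v21, x1))]
    return top + bottom


# Hypothetical 4x4 example: two 2x2 diagonal blocks, rank-1 off-diagonal blocks.
d1 = [[2, 1], [1, 2]]
d2 = [[3, 0], [0, 3]]
u12, v12 = [1, 2], [1, 1]
u21, v21 = [1, 0], [0, 1]
x = [1, 2, 3, 4]
y = hodlr_matvec(d1, d2, u12, v12, u21, v21, x)
```

In a full HODLR matrix the diagonal blocks are themselves split recursively in the same way, which is what makes fast factorization and inversion possible.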
[LG-49] Score Approximation for Diffusion Models on Arbitrary Low-Dimensional Structures
链接: https://arxiv.org/abs/2606.19894
作者: Xinhe Mu,Zaijiu Shang,Zhaoqi Zhou,Chuan Zhou,Qi Meng,Guiying Yan,Zhiming Ma
类目: Machine Learning (cs.LG)
*备注:
Abstract:The remarkable success of score-based diffusion models has spurred significant efforts to establish their theoretical foundations. However, existing complexity bounds for score approximation rely heavily on restrictive assumptions like Lipschitz continuous densities or smooth manifold supports, which are routinely violated by the singularities, sharp boundaries, and disjoint clusters inherent to real-world perceptual data. This work establishes a universal score approximation theorem that works for any distribution supported on any compact set of upper Minkowski dimension d. Using a novel discrete-mixture formulation, we prove that the score function can be approximated with a ReLU network whose complexity grows exponentially only with d, thus breaking the exponential curse of ambient dimensionality. Combined with existing theories on accurately solving the backward diffusion SDE for arbitrary compact distributions, our work shows that diffusion models readily adapt to irregular, non-smooth data structures, explaining their competence in real-world generative tasks.
[LG-50] Adversarial Bandit Optimization with Globally Bounded Perturbations to Convex Losses
链接: https://arxiv.org/abs/2606.19891
作者: Zhuoyu Cheng,Kohei Hatano,Eiji Takimoto
类目: Machine Learning (cs.LG)
*备注:
Abstract:We study adversarial bandit optimization in which the loss functions may be non-convex and non-smooth. In each round, the learner selects an action and observes only the loss incurred at that action. The loss consists of an underlying convex and $\beta$-smooth component and an adversarial perturbation that may be chosen after observing the learner’s action. The perturbations are subject to a global budget controlling their cumulative magnitude over time. This framework extends the globally budgeted, post-action perturbation model from underlying linear losses to general convex and $\beta$-smooth losses. For this broader class, we establish expected regret guarantees that explicitly characterize the effect of the perturbation budget. To establish these guarantees, we modify a standard bandit optimization algorithm and develop an analysis that controls the additional regret caused by the perturbations. In the absence of perturbations, our results reduce to regret guarantees for the standard bandit convex optimization setting with $\beta$-smooth losses.
[LG-51] Matching Markets meet Cumulative Prospect Theory: Towards Optimal and Adversarially Robust Learning ECML-PKDD2026
链接: https://arxiv.org/abs/2606.19883
作者: Ananya Kunisetty,Avishek Ghosh
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted at ECML-PKDD 2026, Naples, Italy
Abstract:We study a multi-agent multi-armed bandit problem in the competitive setup with two-sided matching markets under a human-centric decision-making model. To capture human preferences, we use cumulative prospect theory (CPT), which weighs the actions of the agent in a nonlinear fashion using an ($\alpha$-Hölder continuous) weight function. CPT has been widely used in behavioral economics and risk-sensitive machine learning to emulate human preferences. We analyze the state-of-the-art learning algorithm with CPT weight-distorted rewards and obtain a player-optimal regret of $\mathcal{O}\left(K \log T \left(\frac{1}{\Delta}\right)^{2/\alpha}\right)$, where K denotes the number of arms, T is the learning horizon, and $\Delta$ represents the (suitably defined) players’ minimum preference gap. Noticing the dependence on $\Delta$ to be sub-optimal, we further improve this regret by judiciously selecting the active set of arms during exploration, which removes the dependence on K in the dominant term and achieves improved (optimal) regret guarantees in the setting where the number of arms K is significantly larger than the number of players N. In addition, we consider adversarial markets where the observed rewards of the agents may be corrupted. We propose and analyze algorithms for robust markets with CPT as the risk-sensitive measure, both when the total corruption budget is known and when it is unknown, and establish logarithmic player-optimal regret guarantees in both cases.
[LG-52] On the Oracle Complexity of Interpolation-Based Gradient Descent
链接: https://arxiv.org/abs/2606.19878
作者: Dongmin Lee,William Lu,Anuran Makur
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 16 pages, 2 figures
Abstract:Recent work on first-order optimizers for empirical risk minimization (ERM) has suggested that smoothness of ERM loss functions in the training data, rather than in the optimization parameters, can be leveraged to improve the oracle complexity of gradient descent (GD) methods. In this paper, we propose an inexact gradient method, piecewise polynomial interpolation-based gradient descent (PPI-GD), which approximates the full gradient in each iteration by querying the first-order oracle at equidistant points in the data domain to construct polynomial interpolants of the resulting gradient samples over appropriately sized patches of the data domain. We analyze the oracle complexity of PPI-GD for strongly convex and non-convex loss functions when the data space dimension is bounded by a polylogarithmic function of the number of training samples, and find it to outperform several GD variants in key regimes when the loss function is sufficiently smooth. Furthermore, our analysis extends several techniques from the error analysis of bicubic spline interpolants to the setting of d-variate tensor product polynomial interpolants, which may be of independent interest in interpolation analysis.
[LG-53] Global Convergence of Gradient Descent for Score Matching in Gaussian Mixtures via Reverse Fisher Divergence
链接: https://arxiv.org/abs/2606.19876
作者: Alexander Tyurin
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:The score matching problem is a central training objective in modern generative modeling, diffusion models, fitting unnormalized statistical models, and inverse problems. A standard approach is to minimize the forward Fisher divergence, where the expectation is taken with respect to the teacher distribution. However, recent results show that even in simple Gaussian mixture model settings, this objective can lead to undesirable and initialization-dependent convergence behavior. In this paper, we study an alternative objective: the reverse Fisher divergence, where the expectation is taken with respect to the student distribution. We analyze gradient descent (GD) for fitting Gaussian mixture models and show that this change in the objective leads to significantly better optimization properties. First, when the teacher distribution is a single Gaussian and the student is a Gaussian mixture model with fixed weights and identity covariances, we prove the global convergence of GD from arbitrary initializations. Second, we extend the analysis to the case where the teacher is also a Gaussian mixture model and prove global convergence guarantees under a global random initialization scheme and a $\widetilde{\Omega}(1)$-separation assumption on the target means. In particular, with high probability, each student component converges near its closest teacher component, and we provide conditions under which the student distribution converges in total variation distance. Our proofs rely on a new Lyapunov-based analysis of the gradient descent dynamics, showing that the reverse Fisher divergence has a much more favorable optimization landscape than the forward Fisher divergence.
[LG-54] Doeblin Curves
链接: https://arxiv.org/abs/2606.19859
作者: Dongmin Lee,William Lu,Anuran Makur,Japneet Singh
类目: Information Theory (cs.IT); Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST)
*备注: 42 pages, 2 figures
Abstract:Recent research on Doeblin coefficients has shed light on their usefulness as a multi-way generalization of the Dobrushin contraction coefficient for TV distance, in a separate vein from their classic role in the theory of Markov chain ergodicity. However, strong conditions, such as being bounded away from 0, are typically necessary for Doeblin coefficients to establish the existence of information contraction. Building on recently formulated concepts of nonlinear information contraction, we aim to propose a finer-grained Doeblin-based characterization of multi-way contraction behavior which yields non-vacuous contraction guarantees even for channels whose Doeblin coefficient is 0. To this end, we introduce the notion of a Doeblin curve – a nonlinear function which quantifies the contraction behavior of a Markov kernel on collections of input distributions at specific levels of divergence and power. Through the course of our analysis, we develop a new variational characterization of Doeblin coefficients, present several properties of Doeblin curves, define several versions of power-constrained Doeblin curves, and derive upper and lower bounds using our aforementioned variational characterization. We then utilize these results in diverse areas, including generalization bounds for noisy iterative optimization, error bounds for reliable computation with noisy circuits, and differential privacy guarantees for online iterative algorithms. In particular, we extend results in these areas to broader domains or group settings, leveraging Doeblin curves to reveal finer-grained contraction phenomena than Doeblin coefficients.
[LG-55] Physics-Informed Neural Network with Squeeze-Excitation-like Attention
链接: https://arxiv.org/abs/2606.19853
作者: Yun-Fei Song,Long-Gang Pang,Fu-Peng Li,Jun-Jie Zhang
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 15 pages, 6 figures
Abstract:We introduce SEA-PINN, a novel architecture that incorporates a Squeeze-Excitation-like attention mechanism into physics-informed neural networks to dynamically recalibrate the importance of neurons across layers. A key feature of SEA-PINN is its highly stable initialization. On 17 out of 20 benchmark problems, SEA-PINN exhibits nearly negligible variance and significantly reduced initial loss, establishing a quasi-deterministic and favorable starting point for optimization. Notably, without employing Fourier feature embeddings or periodic activation functions, SEA-PINN attained competitive accuracy (83% vs. 90% improvement relative to FNN-PINN on the high-frequency case 7) compared with TSA-PINN, a model specifically engineered for high-frequency problems via learnable frequencies in sinusoidal activations. Furthermore, integrating SEA-PINN into TSA-PINN boosted performance by 42.49%. These results underscore SEA-PINN as a lightweight plug-in module that enhances nonlinear representation power, promotes more robust and efficient convergence, and strengthens the overall reliability of physics-informed learning.
[LG-56] Enhancing Graph Neural Networks Using Proximity Graphs for Dust Source Emission Forecasting
链接: https://arxiv.org/abs/2606.19825
作者: Maryam Sanisales,Zahed Rahmati,Ali Darvishi Boloorani,Ali Vefghi
类目: Machine Learning (cs.LG)
*备注:
Abstract:Accurate prediction of dust source emissions is critical for mitigating the significant environmental and health hazards posed by dust storms. Traditional forecasting methods often struggle to capture the complex spatiotemporal dynamics of these phenomena. In this paper, we demonstrate that proximity graphs enable Graph Neural Networks (GNNs) to effectively model the intricate spatial and temporal relationships between data points. Specifically, we use proximity graphs (such as the Delaunay triangulation, Gabriel graph, k-Nearest Neighbor graph, and Yao graph) as the input for GNNs (including GraphSAGE, Graph Convolutional Networks, and Graph Attention Networks) to perform message passing. Our approach highlights the effectiveness of integrating proximity graphs with GNNs for robust and accurate dust source forecasting. To emphasize the importance of proximity graph representations, we compare our method against GNNs using random graphs for message passing. The results show that GNNs with proximity graphs significantly outperform those with random graphs and are also far superior to a Long Short-Term Memory (LSTM) model in dust source emission forecasting.
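As an illustration of one of the proximity graphs named in this abstract, the following minimal sketch builds an undirected k-Nearest Neighbor graph from 2D coordinates. The function name, the brute-force neighbor search, and the sample points are assumptions for illustration, and the paper's own graph construction and GNN pipeline may differ.

```python
import math


def knn_graph(points, k):
    """Return sorted undirected edges (i, j), i < j, linking each point to its k nearest neighbors."""
    edges = set()
    for i, p in enumerate(points):
        # Brute-force search: rank every other point by Euclidean distance to p.
        by_distance = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        for j in by_distance[:k]:
            edges.add((min(i, j), max(i, j)))
    return sorted(edges)


# Hypothetical station coordinates.
points = [(0, 0), (1, 0), (0, 2), (5, 5)]
edges = knn_graph(points, k=1)
```

The resulting edge list is the kind of structure a message-passing GNN consumes. A random graph with the same number of edges would be the baseline the abstract compares against.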
[LG-57] The Hidden Environmental Cost of Poor Coding Practices in TensorFlow and Keras Applications: A Study on Resource Leaks and Carbon Emissions
链接: https://arxiv.org/abs/2606.19799
作者: Bashar Abdallah,Gustavo Santos,Rola Al Bataineh,Alain Abran,Mohammad Hamdaqa
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注:
Abstract:Efficiency and sustainability are critical considerations in the development and deployment of machine learning (ML) applications. Among the factors influencing sustainability, resource leaks in ML code can introduce hidden inefficiencies that elevate energy consumption and CO2 emissions. Despite this, empirical evidence quantifying their environmental impact remains limited. This emerging results paper presents an initial empirical investigation of two common resource-leak smells, namely Improper Model Reuse (IMR) and Unreleased Tensor References (UTR), and their impact on energy consumption and CO2 emissions in TensorFlow and Keras workloads. Controlled experiments were conducted for each smell by executing identical training tasks while comparing against a smell-free baseline. Our preliminary results show that both smells consistently increase estimated electricity usage and carbon emissions. IMR and UTR increased electricity consumption by approximately 32% and 46%, respectively, with proportional increases in CO2 emissions. Paired statistical tests indicate that these differences are systematic and statistically significant, providing initial empirical evidence that resource-leak smells may degrade ML energy efficiency and environmental sustainability. These findings suggest that resource-leak smells pose measurable risks to both software quality and sustainability, emphasizing the importance of integrating resource-lifecycle management and energy-efficiency considerations into ML development.
[LG-58] An Information Theoretic Framework for Graph Novelty Generation via Latent Mixture Modeling
链接: https://arxiv.org/abs/2606.19770
作者: Itsuki Nakagawa,Kenji Yamanishi
类目: Machine Learning (cs.LG)
*备注:
Abstract:We propose an information-theoretic framework for graph novelty generation, which aims to generate data that are distinct from existing patterns while preserving global structural consistency. Our approach embeds data into a latent space, models the latent distribution using finite mixture models, and generates novel samples by imposing explicit novelty and reliability conditions formulated in terms of description length. Specifically, novelty is enforced by requiring generated samples to be poorly explained by all existing mixture components, while reliability constrains their impact on the overall mixture structure under the Minimum Description Length (MDL) principle. We provide a theoretical analysis showing that, with appropriate threshold choices, the probabilities of misclassifying non-novel or unreliable samples converge to zero with explicit rates. Experiments on synthetic and benchmark graph datasets demonstrate that the proposed method enables principled novelty generation with quantifiable risk.
[LG-59] Learning universal approximations for partial differential equations with Physics-Informed Broad Learning System
链接: https://arxiv.org/abs/2606.19754
作者: Zhiwen Yu,Derong Yang,Liujian Zhang,Kaixiang Yang,Peilin Zhan,Jianmin Lv,Jane You,C. L. Philip Chen
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
Abstract:Partial differential equations (PDEs) play a central role in modeling complex physical, biological, and engineering systems. While traditional numerical solvers are robust, they often incur prohibitive computational costs due to mesh dependencies, whereas recent Physics-Informed Neural Networks (PINNs) offer a mesh-free alternative but frequently suffer from slow convergence and optimization instability. To bridge this gap, this article proposes the Physics-Informed Broad Learning System (PIBLS), a novel backpropagation-free framework that reformulates PDE solving as a direct least-squares optimization. We improved an algorithm within this framework to handle nonlinear PDEs efficiently and provide a rigorous mathematical proof establishing the universal approximation property of PIBLS for these equations. Experiments on linear and nonlinear PDEs demonstrate that PIBLS is one to three orders of magnitude faster than conventional PINNs while achieving significantly higher solution accuracy. This framework provides a computationally efficient paradigm for scientific machine learning, offering a practical, high-speed alternative for real-time simulation and design optimization tasks.
[LG-60] Federated Bilevel Performative Prediction ICML2026
链接: https://arxiv.org/abs/2606.19734
作者: Liangxin Qian,Chang Liu,Xuanyu Cao,Jun Zhao,Kwok-Yan Lam
类目: Machine Learning (cs.LG)
*备注: Accepted by ICML 2026
Abstract:Federated bilevel optimization is widely used for nested learning problems across distributed clients, such as federated hyperparameter tuning and meta-learning under privacy and communication constraints. Most existing formulations assume fixed client data distributions, which can be violated by performativity, where deployed decisions reshape client behavior and data collection, inducing client-specific, decision-dependent distribution shift. We study federated bilevel performative prediction, where both upper-level (UL) and lower-level (LL) objectives are evaluated under client-dependent, decision-dependent distributions. We formalize the federated bilevel performatively stable (FBPS) point under a decoupled-risk perspective and provide sufficient conditions for its existence and uniqueness. We then develop two federated methods to compute the FBPS solution: FBi-RRM, which converges linearly under a contraction condition, and FBi-SGD, a communication-efficient stochastic method based on federated hypergradient estimation with convergence guarantees under diminishing step sizes when sensitivities are sufficiently small. Experiments on strategic regression and meta strategic classification validate the predicted stability thresholds and demonstrate improved meta-generalization over non-performative baselines, and CNN-based classification further demonstrates the practical effectiveness of the proposed methods in nonconvex neural network settings.
[LG-61] A Differentiable Composite Approximation Framework for Autonomous Underwater Vehicle Maneuvering Modeling from Sea-Trial Data
链接: https://arxiv.org/abs/2606.19711
作者: Aobo Wang,Aifei Xia,Zihao Wang,Lizhu Hao
类目: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
Abstract:Field-based modeling from onboard measurements can produce autonomous underwater vehicle (AUV) maneuvering models that reflect real operating characteristics. From an approximation perspective, conventional maneuvering models use predefined constraint polynomial bases, whereas data-driven models use data-adaptive bases. Motivated by this basis-function view, this paper presents a differentiable composite-approximation formulation, in which the polynomial-basis component and the data-adaptive basis component are treated as differentiable parts of a single predictor and calibrated jointly. A gradient-based co-calibration method is developed for full-scale AUV maneuvering prediction, where a sensitivity-aware mechanism regulates bounded polynomial updates while the neural residual captures remaining nonlinear discrepancies under a shared prediction objective. To account for ocean-current effects in field data, a turning-motion-based current estimation and compensation procedure is incorporated to construct current-compensated learning targets for training and rollout. The framework is evaluated using sea-trial data collected from a 7-meter AUV under multiple maneuvering conditions. Results show that the proposed method improves recursive trajectory and velocity prediction compared with polynomial-only, neural-only, and frozen-prior hybrid baselines, demonstrating its applicability to field-data-based AUV maneuvering modeling.
[LG-62] Comparative Study on Agility Efficiency and Impact Absorption of Bipedal Robots with Active Toes
链接: https://arxiv.org/abs/2606.19699
作者: Joong-Gil Kim,Wontae Ye,Geunwoo Cho,Seong-Ho Yun,Se-Hyoung Cho,Yong-Jae Kim
类目: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 6 pages, 7 figures
Abstract:Human legs exhibit high efficiency, agility, and impact absorption, with toes playing a crucial role in these capabilities. While many attempts have been made to implement human-like toes in robots, they have not fully replicated human characteristics nor rigorously validated their benefits. We propose a 14-DOF biped robot emulating human toes’ lightweight, high-torque, robust nature. To quantitatively analyze the effectiveness of the active toes in terms of agility, efficiency, and impact absorption, we developed a high-fidelity simulation training environment that reflects actual actuators with coupled transmissions and accurate power consumption. To ensure a fair comparison between configurations with and without active toes, we designed a minimal RL reward function and applied an identical training procedure to both. The simulation results indicate that, at 1.33 m/s walking, the toe-equipped robot reduced CoT by 17.5% and heel-strike GRF by 5.0% compared with the toe-ablation configuration. On the agility test, average and maximum path deviation decreased by 25.0% and 34.0%, respectively.
[LG-63] Multi-Granular Attention-Driven Reinforcement Learning Framework for Web Intelligent Enhancement Systems
链接: https://arxiv.org/abs/2606.19690
作者: Navin Chhibber,Deepak Singh,Anokh Kishore,Nikita Chawla,K. Anguraj
类目: Machine Learning (cs.LG)
*备注: 2026 3rd International Conference on Integrated Intelligence and Communication Systems (ICIICS), 6 Pages
Abstract:Over the past few years, web intelligent enhancement systems have increasingly relied on heterogeneous and dynamic web data to deliver personalized, context-aware services. However, traditional machine learning, deep learning, and reinforcement learning models often struggle with semantic understanding, adaptability, and scalability in continuously evolving web environments. In this research, a Multi-Granular Attention-based Reinforcement Web Intelligent Enhancement System (MGAR-WIES) is proposed to address these challenges by integrating semantic graph modeling, attention mechanisms, and adaptive reinforcement learning. Initially, heterogeneous web data comprising structured, semi-structured, and unstructured sources are collected and preprocessed to generate unified feature representations. These representations are transformed into a dynamic semantic graph, where entities and their relationships are modeled using graph embeddings enhanced by attention mechanisms to capture both local relevance and global contextual dependencies. Subsequently, an adaptive multi-agent reinforcement learning strategy leverages the attention-aware semantic states to optimize personalized web actions such as content recommendation, navigation optimization, and service adaptation. Finally, continuous online feedback is integrated to update graph representations and learning policies in real time, ensuring sustained adaptability and performance. The proposed MGAR-WIES achieved better results in terms of accuracy (80%) when compared with existing approaches.
[LG-64] DF-ExpEnse: Diffusion Filtered Exploration for Sample Efficient Finetuning ICML2026
链接: https://arxiv.org/abs/2606.19656
作者: Calvin Luo,Chen Sun,Shuran Song
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: ICML 2026
Abstract:A natural recipe for intelligent robotic decision-making is initializing from pretrained generative control policies, which have summarized offline experience, and adapting them to self-collected online experience. We present DF-ExpEnse, an exploration technique that improves the quality of online experience collection, thus increasing finetuning sample-efficiency. DF-ExpEnse leverages the multimodal modeling capabilities of the generative control policy to create an expressive and tractably evaluatable candidate set. It then utilizes an ensemble of critics to identify the action that best balances quality with high exploration interest. In fleet settings, DF-ExpEnse further enables cross-agent communication to facilitate collaborative exploration as a group. DF-ExpEnse can be seamlessly integrated with existing strategies that finetune pretrained generative control policies via reinforcement learning. We experimentally validate consistent sample-efficiency benefits through DF-ExpEnse across a variety of manipulation and locomotion tasks, compared to default finetuning and alternative action selection schemes. Project can be found at this https URL.
[LG-65] Convex training of Lipschitz-regularized shallow neural networks
链接: https://arxiv.org/abs/2606.19652
作者: Chao Yin,Antoine Lesage-Landry
类目: Machine Learning (cs.LG)
*备注:
Abstract:In this work, we introduce a training procedure for shallow neural networks that promotes robustness against adversarial attacks. We solve a non-convex Lipschitz-regularized training program by introducing a convex restriction that can be efficiently solved to global optimality. Our approach can be employed as a post-processing step by taking a pre-trained network as an initial solution and then solving the convex program, whose optimal network is guaranteed to be no worse than the initial one. We illustrate the improvements of our training procedure with experiments using real-world datasets for regression tasks under an adversarial setting. We show numerically that solving our proposed convex program yields networks with lower objective values on the Lipschitz-regularized program compared to existing methods. Additionally, we show that on certain datasets, networks obtained using our convex training program are both more accurate and more robust with respect to adversarial attacks.
[LG-66] MassSpecGym in the Wild: Uncovering and Correcting Evaluation Pitfalls in AI-Driven Molecule Discovery
链接: https://arxiv.org/abs/2606.19624
作者: Hongxuan Liu,Roman Bushuiev,Ivy Lightheart,Mrunali Manjrekar,Anton Bushuiev,Magdalena Lederbauer,Filip Jozefov,Yinkai Wang,Soha Hassoun,Josef Sivic,James Taylor,Runzhong Wang,David Healey,Tomáš Pluskal,Connor W. Coley
类目: Machine Learning (cs.LG)
*备注:
Abstract:Reliable benchmarking is critical for developing machine learning models for tandem mass spectrometry (MS/MS) based molecule discovery. Subtle issues in experimental design and model evaluation procedures can degrade the trustworthiness of such benchmarks and lead to erroneous conclusions. We conduct a thorough review of model evaluation issues in the recent MS/MS machine learning literature, using the standard MassSpecGym benchmark suite as a case study to illustrate the impact of these issues. We find evaluation issues in at least 17 of 26 papers reporting MassSpecGym benchmark results in the first year of its adoption. We isolate three classes of failures: (i) data leakage, (ii) shortcut learning, and (iii) implementation bugs and metric divergence. Through extensive experimentation and code replication, we quantify the impact of these issues and show how they corrupt the evaluation standards MassSpecGym was designed to enforce. We distill our findings into recommendations generalizable to MS/MS challenges, benchmarks, and custom evaluation setups. We also release MassSpecGym v1.5, an implementation of our recommendations in the MassSpecGym benchmarking suite which addresses the failure modes identified in this audit. MassSpecGym v1.5 is publicly available at this https URL.
[LG-67] SEAGAN: domain-Specific and Edge-Aware Graph Attention Network for Dynamic Plant Processes
链接: https://arxiv.org/abs/2606.19623
作者: Antriksh Srivastava,Soumyashree Kar
类目: Machine Learning (cs.LG)
*备注:
Abstract:Graph neural networks (GNNs) provide a flexible framework for learning from scientific data linked through physical, biological, or functional relationships. One promising domain is plant physiology, where measured responses often arise from multiple interacting processes whose exact separation remains difficult even with manual intervention. In plant physiology, a key example is the A-Ci curve, which relates net CO2 assimilation rate (Anet) to leaf intercellular CO2 concentration (Ci) and is used to estimate photosynthetic parameters in leaf and crop-canopy models. However, reliable estimation requires identifying the active biochemical limitation state at each curve point, which remains a major source of uncertainty. Here, we formulate limitation-state identification along A-Ci curves as a graph-based node classification problem, with curve points as nodes. Domain-specific graph representations are created using distance-based k-nearest-neighbor (kNN) and auxiliary-signal-guided (ASG) connectivity, with edge attributes encoding pairwise relations. The framework was evaluated against conventional learning baselines, graph-based architectures, and an automated fitting-based benchmark. Results on a large synthetic dataset with known ground-truth limitation states show that graph-based models improve classification, particularly near biochemical transition regions. The best-performing configuration, SEAGAN (domain-Specific and Edge-Aware Graph Attention Network for Dynamic Plant Processes), integrates process-aware node features, edge attributes, kNN connectivity, and graph attention with weighted cross-entropy loss, achieving an F1-score of 0.857 and an accuracy of 0.882. The results show that representing A-Ci curves as graphs improves biochemical limitation-state analysis, with edge-aware attention over local kNN neighborhoods providing the most effective strategy.
[LG-68] Comparing Linear Probes with Mahalanobis Cosine Similarity
Link: https://arxiv.org/abs/2606.19603
Authors: Zhuofan Josh Ying, Peter Hase, Nikolaus Kriegeskorte
Categories: Machine Learning (cs.LG)
Comments: 16 pages, 10 figures
Abstract:Linear probes are widely used in interpretability research and often compared by cosine similarity. The Mahalanobis cosine similarity (MCS) between two directions, which reweights the inner product by test data covariance, is a natural task-aware refinement. Ying et al. (2026) report that a probe’s MCS to a reference probe trained on the out-of-distribution (OOD) data near-perfectly linearly predicts the probe’s OOD AUROC (R^2 = 0.98). Here, we extend this empirical finding across models, layers, and concept domains, and prove this general phenomenon in closed form: For balanced classes whose projections are Gaussian, OOD AUROC and MCS to the reference probe are linear because both are sigmoid-shaped functions of the probe’s signal-to-noise ratio (SNR) on the test data. The theory also predicts when this linearity fails, which we verify empirically. MCS offers a theoretically grounded and empirically effective alternative to Euclidean cosine similarity for comparing linear probes.
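As a rough illustration of the definition in the abstract above (the inner product reweighted by the test-data covariance), here is a minimal pure-Python sketch. The formula uᵀΣv / sqrt(uᵀΣu · vᵀΣv) is the natural reading of that sentence rather than a quote from the paper, and the covariance and probe vectors are made-up toy values.

```python
import math

def bilinear(u, matrix, v):
    """Return u^T M v for plain Python lists."""
    return sum(u[i] * matrix[i][j] * v[j]
               for i in range(len(u)) for j in range(len(v)))

def mahalanobis_cosine(u, v, sigma):
    """Cosine similarity with the inner product reweighted by sigma."""
    return bilinear(u, sigma, v) / math.sqrt(
        bilinear(u, sigma, u) * bilinear(v, sigma, v))

def euclidean_cosine(u, v):
    """Ordinary cosine similarity, i.e. the special case sigma = identity."""
    size = len(u)
    identity = [[float(i == j) for j in range(size)] for i in range(size)]
    return mahalanobis_cosine(u, v, identity)

# Hypothetical test covariance: feature 0 varies a lot, feature 1 barely.
sigma_test = [[100.0, 0.0], [0.0, 0.01]]
probe_a = [1.0, 0.0]
probe_b = [1.0, 1.0]

euclid = euclidean_cosine(probe_a, probe_b)
mcs = mahalanobis_cosine(probe_a, probe_b, sigma_test)
print(f"Euclidean cosine: {euclid:.4f}, Mahalanobis cosine: {mcs:.4f}")
```

In this toy setup the two probes differ only along a direction in which the test data barely varies, so the covariance-weighted similarity comes out much closer to 1 than the Euclidean one.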
[LG-69] Unsupervised Causal Abstractions Discovery
Link: https://arxiv.org/abs/2606.19594
Authors: Théo Saulus, Simon Lacoste-Julien, Dhanya Sridhar
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Causal abstractions formalize when a high-level structural causal model (SCM) captures the interventional behavior of a lower-level SCM. Existing applications of this notion largely follow a hypothesis-testing paradigm: an expert proposes a candidate high-level model and then evaluates if the low-level system implements it. We study the complementary problem of learning a high-level model directly from low-level measurements. Our contributions leverage hypotheses from low-rank causal discovery, and can be summarized as follows: (1) we show that observations generated by a low-rank graph induce latents that form a causal abstraction, (2) we provide identifiability results about these latents, and (3) we propose a practical objective to learn this high-level SCM.
[LG-70] On the QUEST for Uncertainty Quantification via Highest Density Regions
Link: https://arxiv.org/abs/2606.19569
Authors: Sam Goring, Tom Kuipers, Nicola Paoletti, David S. Watson
Categories: Machine Learning (cs.LG)
Comments: 27 pages, of which 10 are main text. Contains 7 figures, 4 tables, 1 algorithm in total
Abstract:Uncertainty quantification (UQ) is essential for reliable decision-making in safety-critical applications in probabilistic machine learning. For regression problems, dominant scalar UQ approaches - notably, those based on proper scoring rules - measure uncertainty via pointwise predictive risk. This can lead to counterintuitive results when the target statistic is not the conditional expectation. We propose an alternative framework, in which uncertainty is characterised by the volume of the most probable subset of a distribution’s support. QUEST (Quantifying Uncertainty via highest dEnSiTy regions) is a novel approach to UQ based on the concentration of Lebesgue measure at a distribution’s peak(s), evaluated at one or more values of a robustness parameter α. We establish connections between our measures and classical statistics from information theory and economics. We show that, unlike popular alternatives based on proper scoring rules, QUEST measures of epistemic and aleatoric uncertainty satisfy a set of axioms adapted from the UQ literature, including monotonicity under distributional spread and invariance to location shifts. Selective prediction benchmarks confirm that QUEST performs favourably against standard measures such as variance and differential entropy.
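To make the "volume of the most probable subset" idea concrete, the sketch below approximates the Lebesgue measure of a highest density region for a one-dimensional density on a grid. It assumes the region is the smallest set holding probability mass 1 - α; the abstract does not spell out how α enters, and `hdr_volume` and `gaussian_pdf` are illustrative names, not the paper's code.

```python
import math

def gaussian_pdf(x, sigma):
    """Density of a zero-mean Gaussian with standard deviation sigma."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def hdr_volume(pdf, lo, hi, alpha, n_cells=4000):
    """Approximate measure of the smallest set holding 1 - alpha of the mass.

    The interval [lo, hi] is cut into equal cells; cells are added in order of
    decreasing density until the target mass is covered.
    """
    width = (hi - lo) / n_cells
    masses = sorted(
        (pdf(lo + (i + 0.5) * width) * width for i in range(n_cells)),
        reverse=True,
    )
    target = (1.0 - alpha) * sum(masses)
    covered, cells_used = 0.0, 0
    for mass in masses:
        if covered >= target:
            break
        covered += mass
        cells_used += 1
    return cells_used * width

narrow = hdr_volume(lambda x: gaussian_pdf(x, 1.0), -20.0, 20.0, alpha=0.05)
wide = hdr_volume(lambda x: gaussian_pdf(x, 2.0), -20.0, 20.0, alpha=0.05)
print(f"HDR volume, sigma=1: {narrow:.2f}; sigma=2: {wide:.2f}")
```

For a Gaussian the region is the central interval of half-width about 1.96σ, so the volume grows with the spread and is unchanged by shifting the mean, which is the kind of behaviour the axioms in the abstract ask for.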
[LG-71] Advances in Scientific Machine Learning for Coupled Fluid Flow and Transport
Link: https://arxiv.org/abs/2606.19562
Authors: Gabriel F. Barros, Rômulo M. Silva, Alvaro L. G. A. Coutinho
Categories: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
Comments:
Abstract:This chapter reviews recent advances in Scientific Machine Learning (SciML) for modeling coupled fluid flow and transport phenomena governed by the incompressible Navier-Stokes and scalar transport equations. Such systems, found in applications like turbidity currents and thermal convection, feature strong nonlinear coupling and multiscale behavior that make high-fidelity simulations computationally expensive. To address this, the chapter surveys state-of-the-art SciML methods for building efficient surrogate models, including linear reduced-order techniques based on Singular Value Decomposition (such as Dynamic Mode Decomposition) and nonlinear neural network approaches like Physics-Informed Neural Networks (PINNs) and β-Variational Autoencoders (β-VAEs). It first covers the authors’ work combining these models with High Performance Computing strategies, including Adaptive Mesh Refinement/Coarsening (AMR/C) and scientific floating-point data compression. It then presents two new contributions: surrogate modeling of turbidity currents via PINNs, and the extraction of disentangled nonlinear modes from thermal flows using β-VAEs. Governing equations and representative benchmarks, including lock-exchange flows and Rayleigh-Bénard convection, illustrate these methodologies. The chapter is intentionally long, covering both the mathematical and physical foundations of coupled fluid flow and the computational aspects of state-of-the-art modeling. Overall, it demonstrates how SciML enables fast, accurate approximations of complex coupled systems within the specific data regimes and modeling assumptions considered, while substantially reducing computational cost relative to full-order simulations. Broader capabilities such as real-time prediction and uncertainty quantification remain active research directions whose feasibility depends strongly on the problem at hand.
[LG-72] Understanding Key Features of Time Series Foundation Models from Epidemic Forecasting
Link: https://arxiv.org/abs/2606.19560
Authors: Alireza Jafari, Judy Fox, Geoffrey C. Fox, Madhav Marathe, Aniruddha Adiga
Categories: Machine Learning (cs.LG)
Comments: 15 pages, 2 figures, 9 tables
Abstract:Seasonal influenza infects millions of people and causes substantial morbidity and mortality in the United States each year, making accurate short-term forecasting a core public-health need. Reliable forecasts of epidemic time series can inform vaccination timing, hospital staffing, and resource allocation, yet the comparative behavior of modern forecasting architectures on infectious-disease surveillance data remains insufficiently characterized. We address this gap through a systematic evaluation of regional influenza forecasting using influenza-like illness surveillance and influenza-associated hospitalization time series under both temporal and spatial generalization settings for 1-4-week-ahead prediction. We compare classical neural network architectures, numerical transformer-based models, pretrained time series foundation models, and LLM-based forecasting approaches. Across tasks, we demonstrate that a mixture-of-experts model that fuses multiple pretrained forecasters achieves the strongest overall performance, indicating that heterogeneous pretrained representations provide complementary predictive information. Our results further show that numerical transformer-based models produce reliable forecasts, while pretraining provides the largest gains at longer horizons, particularly when the pretraining domain is mechanistically aligned with influenza dynamics. In contrast, LLM-based time series methods underperform relative to numerical forecasters in this setting. Finally, we examine hospitalization information as both an auxiliary covariate and a pretraining source. Hospitalization signals provide complementary improvements in selected settings and clarify when additional surveillance streams enhance the robustness of multi-horizon forecasting. These findings provide actionable guidance on model selection, pretraining strategy, and auxiliary-signal use for influenza preparedness.
[LG-73] Predicting Mergeability of Parameter-Efficient Fine-Tuning Updates
Link: https://arxiv.org/abs/2606.19549
Authors: Lin Tang, Wei Zhang, Jing Li, Hongyu Chen, Ming Zhao, Yuxuan Wang
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Low-rank adaptation (LoRA) makes it cheap to train many domain- and task-specific language model adapters, but whether two adapters can be merged is usually discovered only after both have been fully trained and evaluated. This late feedback is costly: adapters that are strong in isolation can interfere destructively once their updates are combined. We ask whether this outcome can be anticipated. We formalize adapter mergeability as the degree to which an adapter preserves its single-task utility after merging, and show that it can be forecast from signals measured in the first few percent of training – chiefly how the low-rank updates and their gradients align across tasks and how much they disturb shared representations. We package these signals into MergeProbe, a lightweight predictor that estimates pairwise and set-level retention and turns the estimate into a concrete decision: merge directly, reweight, prune, or route. On MERGE-PEFT, a five-domain benchmark spanning math, code, science, instruction following, and safety, MergeProbe attains the best average and worst-case retention among strong interference-aware merge baselines while adding far less deployment overhead than full task routing. This turns LoRA merging from a post-hoc engineering step into an anticipatory measurement problem.
[LG-74] Tracking Representation Dynamics in Large Language Models with Persistent Homology
Link: https://arxiv.org/abs/2606.19542
Authors: Naman Malhotra, Jay Ambadkar, Abhinav Gupta, Kushal Kasivel, Abbas Schwarz, Kamillo Ferry, Anthea Monod
Categories: Machine Learning (cs.LG)
Comments: 29 pages
Abstract:Large language models are commonly aligned through supervised fine-tuning, yet little is known about how their internal representations evolve during this process. We study alignment dynamics using persistent homology by tracking the topology of activation spaces throughout fine-tuning. Across four transformer language models ranging from 1B to 7B parameters and three alignment objectives corresponding to helpful, harmless, and mixed training data, we find that the majority of topological reorganization occurs during the earliest stages of training. A dense checkpoint analysis reveals a transient peak in topological activity followed by rapid stabilization. We further show that different alignment objectives induce distinguishable topological trajectories, while instruction-tuned and pretrained models exhibit qualitatively different patterns of evolution. Our results suggest that persistent homology provides a complementary perspective on alignment, revealing representation-level changes that are not apparent from behavioral metrics alone.
[LG-75] FloatDoor: Platform-Triggered Backdoors in LLMs
Link: https://arxiv.org/abs/2606.19535
Authors: Nils Loose, Jonas Sander, Felix Mächtle, Thomas Eisenbarth
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:
Abstract:Large language models (LLMs) are increasingly deployed in sensitive settings such as software engineering, where their outputs directly shape downstream artifacts. Recent work has shown that an identical model can produce measurably different outputs depending on the deployment platform, a consequence of non-associative floating-point arithmetic and divergent kernel implementations. We study the security implications of this platform-dependent variability and uncover a novel attack surface on LLM deployments. We introduce FloatDoor, the first input-independent, platform-triggered backdoor attack against generative LLMs. The compromised model exhibits adversary-chosen behavior when served on a target platform and is otherwise benign. FloatDoor is realized through two lightweight LoRA adapters, one that amplifies inter-platform numerical divergence and one that binds the resulting platform signature to a malicious downstream task, while leaving aggregate model utility largely intact. FloatDoor exploits a pronounced time-of-check, time-of-use gap between model auditing and serving. We demonstrate FloatDoor on Qwen3-4B across a broad range of deployment targets, including NVIDIA GPUs, Google TPUs, AWS Graviton, and Alibaba Yitian-710. As a final case study, we show that FloatDoor reliably induces exploitable code vulnerabilities on a chosen target platform. Our results establish a new class of attacks on LLM deployments and underscore the pressing need for trusted model supply chains in sensitive, LLM-powered applications.
[LG-76] Interactive Pareto navigation for deep multi-task learning
Link: https://arxiv.org/abs/2606.19521
Authors: Augustina C. Amakor, Konstantin Sonntag, Sebastian Peitz
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:
Abstract:In multi-task learning, handling an increasing number of objectives can quickly become challenging, both in terms of the computational resources and the decision maker’s capacity to choose appropriate trade-offs. A widely used approach is thus to aggregate the individual losses in a single loss function by a weighted sum. This often fails to capture either the decision maker’s preferences as a result of the shape of the Pareto front, or requires multiple adjustments and computations which becomes prohibitively expensive in deep learning applications. To address these issues, we introduce a novel framework, Preference Pareto Exploration (PPE), which enforces the decision maker’s preferences while accounting for the geometry of the Pareto set in an interactive exploration process. PPE is based on a predictor-corrector method that performs predictor steps tangential to the manifold of Pareto-optimal solutions, following the decision maker’s preference. The subsequent corrector step results in a new trade-off reflecting this preference. To avoid explicit Hessian computations when characterizing the tangent space of the manifold, we employ a Krylov subspace method that relies solely on matrix-vector products. These products can be efficiently obtained via automatic differentiation, ensuring both efficiency and robustness throughout the optimization process. The method’s functionality and performance are demonstrated using both toy problems and examples from deep learning.
[LG-77] Calibrating Generative Models to Feature Distributions with MMD Finetuning
Link: https://arxiv.org/abs/2606.19496
Authors: Nathaniel L. Diamant, Brian L. Trippe
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Generative models can produce individually plausible samples while deviating substantially from a target set in the distribution of key features. For example, a model pretrained on broad drug-like chemical space may generate molecules whose molecular features differ from those of a therapeutic class of interest, such as known antibiotics. Correcting such distributional miscalibration is challenging: direct finetuning on the target set can overfit and does not control which features are matched. To fill this gap, we introduce kernel Calibrating Generative Models (kCGM). kCGM minimizes a maximum mean discrepancy (MMD) between generated and target feature distributions using an unbiased score-function estimator, with KL regularization to remain close to the pretrained model. On a target set of 174 antibiotics, direct finetuning sacrifices chemical validity for feature-distribution matching, whereas kCGM improves target feature matching while increasing validity. We further demonstrate kCGM in protein and DNA generation tasks, showing it can adapt autoregressive, continuous-space diffusion, and discrete diffusion models using only feature-level supervision. Code is available at this https URL.
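The quantity being minimized above, a maximum mean discrepancy between generated and target feature distributions, can be estimated from two samples. The sketch below shows the standard unbiased MMD² estimator with an RBF kernel on scalar features. It does not cover the paper's score-function gradient estimator or its KL regularization, and the kernel choice, bandwidth, and toy Gaussian features are assumptions made for illustration.

```python
import math
import random

def rbf(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel on scalar features."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))

def mmd2_unbiased(xs, ys, kernel=rbf):
    """Unbiased estimate of squared MMD; diagonal (i == j) terms are excluded."""
    m, n = len(xs), len(ys)
    k_xx = sum(kernel(xs[i], xs[j])
               for i in range(m) for j in range(m) if i != j) / (m * (m - 1))
    k_yy = sum(kernel(ys[i], ys[j])
               for i in range(n) for j in range(n) if i != j) / (n * (n - 1))
    k_xy = sum(kernel(x, y) for x in xs for y in ys) / (m * n)
    return k_xx + k_yy - 2.0 * k_xy

rng = random.Random(0)
# Hypothetical one-dimensional feature values (e.g. a molecular descriptor).
target = [rng.gauss(0.0, 1.0) for _ in range(150)]
calibrated = [rng.gauss(0.0, 1.0) for _ in range(150)]     # same distribution
miscalibrated = [rng.gauss(2.0, 1.0) for _ in range(150)]  # shifted features

same = mmd2_unbiased(calibrated, target)
far = mmd2_unbiased(miscalibrated, target)
print(f"MMD^2 calibrated: {same:.4f}, miscalibrated: {far:.4f}")
```

Because the estimator is unbiased it can come out slightly negative when the two samples share a distribution, while a shift in the feature distribution pushes it clearly above zero.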
[LG-78] Algebraic Dead Directions in LayerNorm Transformers: A Forward-Pass-Only Diagnostic at LLM Scale
Link: https://arxiv.org/abs/2606.19491
Authors: Tejas Pradeep Shirodkar, P. J. Narayanan
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 34 pages, 7 figures, 6 tables. Empirical companion to arXiv:2606.05957
Abstract:Pretrained transformers sit near singular minima of the loss, where the Fisher information metric degenerates along dead directions: directions in parameter space along which the directional Fisher vanishes. Locating such a direction normally needs a forward pass and an eigendecomposition of activations, or a sampling-based complexity estimate; none returns a direction computable from the network’s parameters alone. We give one, for LayerNorm transformers. The inverse-scale direction γ^{-1}/‖γ^{-1}‖ of the LayerNorm affine is an exact algebraic kernel of the post-final-norm centred activation covariance, for any input distribution, and induces a corresponding dead direction in parameter space. It is read from the LN scale parameter alone, with no forward or backward pass and no eigensolve: the cheapest dead-direction read, specific to LayerNorm. We test it on 14 pretrained transformers (9 LayerNorm, 5 RMSNorm; 160M-35B; language and vision objectives). At random initialisation the predicted direction matches the measured bottom singular direction (one forward pass, direct SVD) to four decimal places on 9/9 LayerNorm models, and is correctly absent on 5/5 RMSNorm models, which lack the mean-subtraction projector that creates it. On the trained checkpoint the covariance eigenvalue along this direction deepens by ∼10^3× and further dead directions open; the random-init-to-trained gap is a one-forward-pass, per-checkpoint readout of singular structure along the predicted coordinate. Two consequences follow in closed form: the residual stream’s smallest singular value is preserved block-to-block on 13/14 transformers measured on their own input distribution, the one exception (Gemma 4-31B) a genuine dead direction the same read pinpoints; and the kernel direction’s presence classifies a transformer’s normalisation from the parameters alone.
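The algebraic claim can be checked numerically on a toy layer. For a LayerNorm output y = γ ⊙ x̂ + β, the normalized vector x̂ sums to zero, so Σ_i y_i/γ_i = Σ_i β_i/γ_i is the same for every input and the output covariance has no variance along γ^{-1}. The sketch below verifies this and contrasts it with RMSNorm, which has no mean subtraction. The dimension, parameter ranges, and Gaussian inputs are arbitrary choices for illustration, not the paper's experimental setup.

```python
import math
import random

def layer_norm(x, gamma, beta, eps=1e-5):
    """Standard LayerNorm: centre, scale to unit variance, then affine."""
    d = len(x)
    mu = sum(x) / d
    var = sum((v - mu) ** 2 for v in x) / d
    return [g * (v - mu) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]

def rms_norm(x, gamma, eps=1e-5):
    """RMSNorm: rescale by the root mean square, with no mean subtraction."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gamma)]

def variance_along(outputs, direction):
    """Variance of the outputs projected onto the normalized direction."""
    norm = math.sqrt(sum(c * c for c in direction))
    proj = [sum(y_i * c for y_i, c in zip(y, direction)) / norm
            for y in outputs]
    mean = sum(proj) / len(proj)
    return sum((p - mean) ** 2 for p in proj) / len(proj)

rng = random.Random(0)
d = 16
gamma = [rng.uniform(0.5, 2.0) for _ in range(d)]   # hypothetical LN scale
beta = [rng.gauss(0.0, 1.0) for _ in range(d)]      # hypothetical LN shift
inputs = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(500)]

inv_gamma = [1.0 / g for g in gamma]  # read from the parameters alone
ln_var = variance_along([layer_norm(x, gamma, beta) for x in inputs], inv_gamma)
rms_var = variance_along([rms_norm(x, gamma) for x in inputs], inv_gamma)
print(f"LayerNorm variance along 1/gamma: {ln_var:.3e}")
print(f"RMSNorm variance along 1/gamma:   {rms_var:.3e}")
```

The LayerNorm variance is zero up to floating-point rounding, while the RMSNorm variance is of order one, in line with the abstract's statement that the direction is absent in RMSNorm models.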
[LG-79] Insulin4RL: Real-Time Insulin Management in the Intensive Care Unit for Offline Reinforcement Learning
Link: https://arxiv.org/abs/2606.19481
Authors: Thomas Frost, Steve Harris
Categories: Machine Learning (cs.LG)
Comments: Under submission
Abstract:Offline reinforcement learning (ORL) offers the potential to improve the quality of clinical decision-making using historical electronic health record (EHR) data. Current training and evaluative practices in this field rely heavily on EHR datasets that have been temporally discretised into fixed, regular time intervals. Discretisation creates fictional representations of complex clinical scenarios and compromises the generalisability of retrospective model evaluations. In this paper, we introduce Insulin4RL, a healthcare ORL dataset featuring naturally irregular inputs and actions from real clinical trajectories. Derived from MIMIC-IV, Insulin4RL comprises over 375,000 labelled decisions across 12,209 patients requiring insulin infusion titration in the Intensive Care Unit. The dataset can thus be used for research into ORL model performance under realistic clinical sampling assumptions. We provide a description of the dataset’s structure and characteristics, baseline performance metrics using model-free offline reinforcement learning, and a standardised evaluation protocol using fitted Q-evaluation. We conclude with suggested areas for future research that could be addressed using this resource.
[LG-80] MortarBench: Evaluating Mortgage Loan Origination Agents
Link: https://arxiv.org/abs/2606.19416
Authors: Matthew Toles, Yunan Lu, Manav Munjal, Bojun Liu, Yuanhao Deng, Stephanie Selig, Derek Rindner, Cheng Li, Zhou Yu
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Loan origination is the process by which a lender creates a new loan, from application and underwriting through approval and funding. This process serves a critical role in evaluating the eligibility and level of risk posed by an applicant. Recently, firms have begun using mortgage loan agents to augment human loan officers, despite a lack of any public benchmark. To fill this gap, we present MortarBench, a loan origination agent benchmark. MortarBench uses a financial data synthesis and mutation pipeline to generate examples with broad edge case coverage that match real-world distributions and questions. We find that state-of-the-art large language models (LLMs) perform poorly, with closed-source models achieving at most 77.1% exact match accuracy. We also discover systematic biases in LLM perception of foreignness related to non-English names. Noting these weaknesses, we introduce CRIT, a confidence calibration framework. Our method increases accuracy to 80.5% while improving risk management steering and reducing bias.
[LG-81] Does Text Actually Help? Uncovering and Resolving Text Collapse in Multimodal Time Series Forecasting
Link: https://arxiv.org/abs/2606.19413
Authors: Huu Hiep Nguyen, Minh Hoang Nguyen, Dung Nguyen, Hung Le
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Multimodal time series forecasting, which pairs numerical sequences with domain-relevant textual reports, promises to inject world knowledge into forecasting pipelines. However, we uncover a critical failure mode in existing frameworks that we term text collapse: the text branch converges to a content-independent transformation, contributing negligible discriminative signal regardless of the input description. We argue that text collapse is a consequence of a fundamental asymmetry in time series forecasting: the numerical input is strongly autocorrelated with the output, making the numerical backbone inherently dominant, while the text branch, despite carrying complementary and often critical information, is insufficiently utilized, leading to its systematic underexploitation. To address this, we propose REST-TS (Residual-Exclusive Supervision for Text in Time Series), which turns the asymmetry into a design principle: the numerical backbone produces its own independent numerical forecast, and the text branch is exclusively supervised to predict the structured components of the residual, the prediction gap that numbers cannot explain. Because no numerical pathway can reduce these losses, the text branch must extract genuine content from the input description. Evaluated across diverse real-world domains and backbone architectures, REST-TS achieves state-of-the-art performance and consistently demonstrates greater text-branch utilization than existing frameworks, providing strong empirical evidence that supervising the text branch on the residual compels it to extract genuine content from the input.
[LG-82] Spectral Retrieval-Augmented Time-Series Forecasting
Link: https://arxiv.org/abs/2606.19412
Authors: Huu Hiep Nguyen, Minh Hoang Nguyen, Dung Nguyen, Hung Le
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Time series forecasting leverages historical patterns to predict future values, but traditional methods face challenges when dealing with complex, non-stationary patterns that are difficult to memorize during training. Retrieval-augmented approaches have emerged as promising solutions by retrieving similar historical patterns to enhance predictions. However, existing retrieval methods suffer from two fundamental limitations: spectral blindness, which overlooks critical frequency-domain characteristics that capture underlying periodic structures, and temporal recency, which treats all historical data equally without emphasizing recent, more relevant patterns. In this paper, we propose SpecReTF, a novel retrieval method that addresses these issues by converting time series into windowed frequency representations, measuring similarity with a combined metric that captures both amplitude and phase information. To balance recency and historical context, we apply an exponential moving average weighting scheme that emphasizes recent windows. Extensive experiments on benchmark datasets demonstrate that SpecReTF outperforms time-domain retrieval methods, achieving superior forecasting accuracy across diverse, non-stationary time series.
[LG-83] Spectral DPPs via NEPv: A Scalable Continuous Relaxation of Determinantal MAP for Diversity-Aware Data Selection
Link: https://arxiv.org/abs/2606.19411
Authors: Richard Yi Da Xu
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Selecting a small, diverse, high-quality subset from a massive pool of candidates is a recurring primitive in modern machine learning – data curation and coreset selection for training and fine-tuning large models, active-learning batch acquisition, prompt and exemplar selection for in-context learning, retrieval diversification, and experimental design. Determinantal Point Processes (DPPs) give a principled, well-calibrated notion of diversity for this task, but their MAP objective – pick a size-k subset S maximizing log det(L_S) – is NP-hard, and the standard greedy and sampling algorithms scale superlinearly in the ground-set size n. This cost is prohibitive precisely in the data-centric regime where diversity matters most, where n ranges over millions to billions of candidate examples, features, or embeddings. We recast DPP-MAP as a continuous optimization problem over the Stiefel manifold, and show that its first-order optimality conditions form a Nonlinear Eigenvalue Problem with eigenvector dependency (NEPv) of a previously unstudied form. This NEPv admits a self-consistent field (SCF) iteration with a spectral-gap-based local contraction guarantee, giving a principled iterative solver where the diversity objective drives an eigenvector-dependent operator. The resulting algorithm requires only matrix-vector products with the kernel and runs in time O((ndk + nk^2) t) for a small number of iterations t, scaling near-linearly in n and integrating directly with low-rank and feature-map kernels common in ML. This paper focuses on the relaxation, solver, and scaling analysis; full real-data benchmarking is left to a planned empirical study.
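For orientation, the discrete objective mentioned above (choose a size-k subset S maximizing log det(L_S)) and the standard greedy baseline can be sketched in a few lines. This is the naive greedy heuristic the abstract contrasts against, not the paper's Stiefel-manifold relaxation or its SCF solver, and the 3×3 kernel is a made-up example.

```python
import math

def logdet(matrix):
    """Log-determinant of a symmetric positive-definite matrix via Cholesky.

    Returns -inf if the matrix is not numerically positive definite.
    """
    n = len(matrix)
    chol = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = matrix[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return -math.inf
                chol[i][j] = math.sqrt(s)
            else:
                chol[i][j] = s / chol[j][j]
    return 2.0 * sum(math.log(chol[i][i]) for i in range(n))

def submatrix(kernel, indices):
    """Principal submatrix L_S for the index list S."""
    return [[kernel[i][j] for j in indices] for i in indices]

def greedy_dpp_map(kernel, k):
    """Greedily add the item that most increases log det(L_S)."""
    selected = []
    for _ in range(k):
        best, best_val = None, -math.inf
        for cand in range(len(kernel)):
            if cand in selected:
                continue
            val = logdet(submatrix(kernel, selected + [cand]))
            if val > best_val:
                best, best_val = cand, val
        if best is None:  # every remaining item makes L_S singular
            break
        selected.append(best)
    return selected

# Hypothetical kernel: items 0 and 1 are near-duplicates, item 2 is distinct.
L = [[1.0, 0.9, 0.0],
     [0.9, 1.0, 0.1],
     [0.0, 0.1, 1.0]]
print(greedy_dpp_map(L, 2))
```

Each greedy step here recomputes a log-determinant for every remaining candidate, which illustrates why this style of selection becomes expensive when n is large.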
[LG-84] FlexLAM: Resolving the Bottleneck Trade-off in Latent Action Learning
Link: https://arxiv.org/abs/2606.19408
Authors: Takanori Yoshimoto, Yang Hu, Naruya Kondo, Tatsuya Matsushima
Categories: Machine Learning (cs.LG); Robotics (cs.RO)
Comments:
Abstract:Latent actions provide a compact interface between action-free video and downstream decision-making, yet existing Latent Action Models (LAMs) force every transition through a fixed-capacity bottleneck. We identify a bottleneck trade-off: overly tight codes can discard transition cues needed for action alignment, while overly loose codes preserve additional transition variation that must be resolved when alignment labels are scarce or narrowly distributed. FlexLAM replaces this fixed capacity with variable-length latent actions trained by nested dropout, yielding prefix-valid codes that capture compact transition structure first and add detail only when needed, without new architectures or losses. A single FlexLAM matches or surpasses separately trained fixed-capacity LAMs at every evaluated token budget under standard scarce-label supervision and under a low-return single-task alignment stress test, indicating that FlexLAM is not merely adjustable at inference time but learns a better latent-action interface at the same token budgets. The same model supports inference-time token-budget adjustment without retraining, and FlexLAM improves Ego4D transition reconstruction. These results suggest that variable-length latent actions are an architecture-free, drop-in upgrade to the fixed-capacity bottleneck in latent action models, latent-action world models, and video-pretrained action interfaces.
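Nested dropout, the training trick named above, can be sketched as keeping a randomly chosen prefix of the latent tokens and masking the rest, so that every prefix must remain a usable code. The uniform choice of prefix length and the scalar "tokens" below are simplifying assumptions; the abstract does not say which sampling distribution or token shape FlexLAM uses.

```python
import random

def nested_dropout(tokens, rng):
    """Keep a random-length prefix of the latent tokens and zero out the rest.

    Returns the masked token list and the number of tokens kept.
    """
    keep = rng.randint(1, len(tokens))  # assumed: uniform over prefix lengths
    masked = [tok if i < keep else 0.0 for i, tok in enumerate(tokens)]
    return masked, keep

rng = random.Random(0)
latent = [0.7, -1.2, 0.3, 2.1]  # hypothetical latent action, one scalar per token
truncated, keep = nested_dropout(latent, rng)
print(keep, truncated)
```

At inference time the same idea corresponds to choosing a token budget and simply truncating the code to that prefix.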
[LG-85] Agent Armor: A Framework Evaluation Mitigation of Coding Agent Failures
Link: https://arxiv.org/abs/2606.19380
Authors: Kenneth Ge, Andre Assis
Categories: Software Engineering (cs.SE); Machine Learning (cs.LG)
Comments:
Abstract:Software engineering and deployment are increasingly being delegated to AI coding agents. The scale of their adoption is surfacing rare, but highly destructive, failure modes. In this paper, we study these failure modes as stemming from three distinct mechanisms: underspecification, where default model behavior is unsafe; capability errors, where the safe action is available but the model does not adhere to it due to bias or capability limitations; and agent harness errors, where the model fails to execute the safe action through the harness. We evaluate these across 8 different evaluations, each inspired by real-life deployment failures, totaling 20 coding environments and 59 synthetic transcript templates. Based on this evaluation, we propose AgentArmor, an agent harness modification, to mitigate these errors. By adding an extended system prompt, a separate command classifier, a “3 strikes” policy, deterministic guardrails, and tools for the agent to edit its own context, we show that AgentArmor is safer across a statistically significant number of samples. Thus, we suggest concrete mitigations for current coding agents and a design philosophy for future agent harness features.
[LG-86] A Hybrid GNN-FEM Framework for Phase-Field Fracture Simulation. Physics-Preserving Hybridization for Generalizable Surrogate Modeling
Link: https://arxiv.org/abs/2606.19378
Authors: Hyeonbin Moon, Yongjin Choi, Seunghwa Ryu
Categories: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
Comments: 46 pages
Abstract:Scientific machine learning (SciML) has emerged as a promising approach for accelerating simulations of complex physical systems, yet achieving physically consistent and generalizable predictions for nonlinear, history-dependent problems remains a central challenge. In this study, we propose a hybrid GNN–FEM framework for efficient and generalizable phase-field fracture modeling. While phase-field approaches provide a robust variational framework for simulating complex crack evolution, their high computational cost limits practical applications because they require solving coupled, nonlinear, and history-dependent systems within an incremental finite element procedure. To address this challenge, a graph neural network surrogate is integrated into the conventional staggered scheme, replacing the phase-field update at each load increment while retaining the FEM-based displacement solver to enforce mechanical equilibrium and boundary conditions. By preserving the incremental solution structure, the framework remains consistent with history-dependent fracture evolution without requiring the surrogate to approximate the full solution trajectory. This selective surrogate strategy emphasizes the identification of a physically meaningful and incrementally structured learning target, rather than relying on brute-force data generation to learn the full fracture process. The proposed framework achieves strong generalization across varying geometries, loading conditions, material properties, and discretizations through dimensionless feature design, a graph-based formulation on mesh-based domains, and a physics-informed loss derived from the governing phase-field equation. Numerical experiments demonstrate that the hybrid approach reduces computational cost while maintaining accuracy compared with conventional FEM, and exhibits robust predictive performance across diverse problem settings.
[LG-87] Physics-Informed Discovery of Yield Functions in Plasticity via Convex Neural Representations
Link: https://arxiv.org/abs/2606.19375
Authors: Hyeonbin Moon, Donghyuk Cho, Jecheon Yu, Jeong Whan Yoon, Seunghwa Ryu
Categories: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
Comments: 39 pages
Abstract:Identifying anisotropic yield functions remains challenging since yielding is not directly observed in full-field mechanical measurements, directional calibration can require many loading directions, and selecting an appropriate analytical form is nontrivial. This study proposes a physics-informed framework for discovering yield functions from full-field displacement data and reaction force data, without stress observations, plastic strain measurements, direct yield surface data, or a prescribed parametric yield function. The framework identifies the yield function as a mechanically constrained constitutive component inside elastoplastic stress integration, rather than through direct stress-space supervision. The yield function is represented by a convex neural network that enforces convexity and positive homogeneity of degree one while imposing the assumed tension-compression symmetry, and this neural yield function is trained with a differentiable stress update and a physics-informed force equilibrium loss across multiple loading cases. The proposed framework is validated using finite element (FE) benchmark studies with von Mises, Hill 1948, and Yld2000-2d yield functions, assessing yield contour agreement, displacement-noise sensitivity, identifiability through plastically active stress states, epistemic uncertainty, and polynomial-surrogate deployment. This study provides a mechanics-constrained pathway for discovering anisotropic yield functions from displacement and force data while keeping the identified component within the structure of elastoplastic stress integration.
[LG-88] Neural Architectures as Functional Priors in Physics-Informed Control Problems
Link: https://arxiv.org/abs/2606.19368
Authors: Sonia Rubio Herranz, Fernando Carlos López Hernández, Antonio López Montes
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 17 pages, 6 figures. Physics-informed neural networks, optimal control, spectral bias, Kolmogorov-Arnold Networks
Abstract:In this work we investigate the role of neural architectures as implicit functional priors in control problems governed by ordinary differential equations. Rather than focusing on highly complex problems, our objective is to investigate architecture-dependent effects in controlled dynamical systems within the simplest physically interpretable settings possible. In particular, we study a controlled linear RLC electrical circuit and a nonlinear Duffing-type dynamical system. Both systems are analyzed first through classical optimal-control formulations and later through PINN-based approaches. We compare different combinations of multilayer perceptrons (MLPs) and Fourier-based KAN-like architectures, and analyze their influence on the resulting controls. The numerical experiments suggest that different architectural choices systematically generate qualitatively distinct controls, even under identical governing equations, loss functionals, initial and target states, training parameters and physical constraints. Significant differences appear in the spectral structure, smoothness, energy distribution, and phase-space behavior of the learned solutions. A central observation of this work is the emergence of a functional specialization phenomenon when the neural architectures are allowed sufficient freedom to shape the structure of the learned controls. More specifically, in the systems considered here, Fourier-based architectures tend to produce trajectories with richer oscillatory content, whereas smoother low-frequency-biased architectures tend to generate more regular and energetically efficient controls. This suggests that different functional components of the control problem may be handled more efficiently by different neural architectures, leading to an implicit specialization between state representation and control generation.
[LG-89] Weibull Weight-Scale Parameter Evolution under AdamW Training Dynamics
Link: https://arxiv.org/abs/2606.19367
Authors: Tiexin Ding
Categories: Machine Learning (cs.LG)
Comments: 21 pages, 14 figures
Abstract:Building on a two-parameter Weibull framework for diagnosing transformer weight distributions, we study why the Weibull weight-scale parameter $\lambda$ grows, overshoots, and then relaxes during AdamW training. We derive a leading-order three-force decomposition of the squared weight norm from the AdamW update: an alignment force measuring the correlation between weights and the adaptive update direction, an injection force from adaptive step magnitude, and a decay force from decoupled weight decay. On self-trained Pythia-70M models with ground-truth optimizer moments, alignment dominates the rise phase, contributing 88-94% of the absolute force budget across four random seeds and remaining robust to super-weight removal. Near saturation, alignment and decay approach balance, explaining the transition from weight-scale growth to relaxation. These force dynamics directly govern the squared-norm component underlying $\lambda(t)$; the remaining RMS-to-Weibull reconstruction offset is measurable and decomposes into bridge and integration components, totaling approximately 5-6% in densely sampled regions. To extend the analysis to real models where optimizer moments are unavailable, we introduce a spline displacement method that recovers the alignment force from sparse checkpoints with approximately 92-94% accuracy, about twice the naive two-point baseline. We further observe that the peak value of $\lambda(t)$ varies with training-data coherence in our experiments, suggesting a data-dependent component of weight-scale growth that we leave to a controlled follow-up study. Code and data are available at this https URL.
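Editor's sketch (not the authors' code): the three-force picture can be illustrated on a single step with decoupled weight decay. The example assumes the update `w_next = w - lr*u - lr*wd*w`, where `u` stands in for the adaptive direction produced by the optimizer; the function name `force_terms` and the exact split into terms are assumptions made for illustration, and the paper's definitions may differ. Under this assumption the one-step change in the squared norm splits into an alignment term, an injection term, a decay term, and a small higher-order remainder.

```python
def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def force_terms(w, u, lr, wd):
    """Split the one-step change of ||w||^2 into three terms plus a remainder.

    Assumed update (hypothetical, for illustration): w_next = w - lr*u - lr*wd*w.
    """
    alignment = -2.0 * lr * dot(w, u)      # correlation between weights and update direction
    injection = lr ** 2 * dot(u, u)        # contribution of the step magnitude
    decay = -2.0 * lr * wd * dot(w, w)     # contribution of decoupled weight decay
    w_next = [wi - lr * ui - lr * wd * wi for wi, ui in zip(w, u)]
    exact = dot(w_next, w_next) - dot(w, w)
    remainder = exact - (alignment + injection + decay)  # terms of order lr^2 * wd
    return alignment, injection, decay, remainder


w = [0.5, -1.0, 2.0]
u = [-1.0, 1.0, -1.0]   # update direction anti-aligned with w, so the norm grows
lr, wd = 1e-3, 0.1
alignment, injection, decay, remainder = force_terms(w, u, lr, wd)
```

In this toy setting the remainder equals `lr**2 * wd**2 * ||w||^2 + 2 * lr**2 * wd * (w . u)`, which is why the three named terms are a leading-order description for small `lr * wd`.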
[LG-90] Performance Analysis and Optimization of 3D Generative Diffusion Models across GPU Architectures
Link: https://arxiv.org/abs/2606.19365
Authors: Jeeho Ryoo, Yongchan Jung, Muhammad Ali Khaliq, Weidong Zhang, Jiatong Han, Byeong Kil Lee
Categories: Machine Learning (cs.LG)
Abstract:Diffusion models have become essential for high-fidelity 3D MRI synthesis, yet their deployment remains constrained by substantial GPU resource demands arising from hundreds of U-Net evaluations per sample and highly heterogeneous kernel behavior. This paper performs a comprehensive performance analysis of the state-of-the-art medical diffusion model, Med-DDPM, across three generations of NVIDIA architectures to study kernel-level runtime breakdowns, instruction-mix characteristics, memory system utilization, warp-level activities, and profiler priority-score estimates. We show that training is overwhelmingly dominated by cuDNN convolution and implicit-GEMM kernels, with inefficiencies arising from memory-access patterns, tensor-layout conversions, and limited Tensor Core utilization. Guided by these insights, we evaluate two architecture-aware optimizations, TF32 Tensor Core activation and a 3D channels-last layout, and demonstrate that they reduce SM cycles by up to 100x, cut dynamic instructions by 100x, raise Tensor Core utilization from 1.45 to 9.98x, and increase IPC by 7% on A100, all without degrading synthesis quality.
[LG-91] Closing the Social-Semantic Gap: SPSD for Edge-Based Prompt Compression in Cloud LLM Inference
Link: https://arxiv.org/abs/2606.19364
Authors: Abhinit Sen, Ajeet Kumar, Manaranjan Pradhan
Categories: Machine Learning (cs.LG)
Comments: 19 pages, 7 tables, 1 figure, includes appendix
Abstract:The prefill stage of Large Language Model (LLM) inference is a growing contributor to cloud-scale energy cost. Many consumer-support and conversational prompts contain social scaffolding: politeness markers, apologetic preamble, repetition, and rapport-building language that is important for human communication but carries low marginal information for machine reasoning. We call this discrepancy the Social-Semantic Gap. We present SPSD (Sentiment Preserving Semantic Distillation), an edge-based pipeline that compresses user prompts using a 4-bit quantised Small Language Model before transmission to a cloud-deployed LLM. Evaluation on a 248-prompt corpus using Gemma-2-2B-Instruct (Q4_K_M) as the SLM and Llama-3.1-8B-Instruct as the cloud evaluation model yields a mean input token saving of 99.9 tokens per distilled call, with all 146 distilled calls yielding positive savings. Response quality, assessed by blind LLM-as-judge scoring across 121 pairs, is non-inferior to the raw path within a pre-specified 1-point margin on a 15-point rubric; the judge awarded 43 percent ties, 28 percent distilled wins, and 29 percent raw wins. Cosine similarity is mixed: mean 0.682, median 0.712, with 54.1 percent of pairs above the 0.70 reference threshold. Safety-critical domains are conservatively routed to passthrough via rule-based gates. Per-call net energy saving is estimated at 70-270 uWh under stated assumptions. SPSD shows that on-device prompt distillation can reduce cloud LLM input-token cost while preserving response quality within a practical non-inferiority margin.
[LG-92] When to Trust How to Distill: Multi-Foundation Model Guidance for Lightweight Robust Scientific Time Series Forecasting KDD2026
Link: https://arxiv.org/abs/2606.19363
Authors: Rupasree Dey, Abdul Matin, Nathan Orwick, Yao Zhang, Shrideep Pallickara, Sangmi Lee Pallickara
Categories: Machine Learning (cs.LG)
Comments: KDD 2026, paper decision: Accepted, track: AI for Science. 12 pages in total, including references and appendix
Abstract:The deployment of Time-Series Foundation Models (TSFMs) in physical sciences is hindered by a critical trade-off: while these models encode rich, universal temporal dynamics, they suffer from severe distributional misalignment when applied zero-shot to specific scientific domains, and their computational cost prohibits deployment in edge-computing sensor networks. We address a fundamental challenge: How can we extract latent structural knowledge from misaligned foundation models (FM) to train lightweight, specialized forecasters? We propose Gated Uncertainty-Aware Routing for Distillation (Guard), a novel framework that reframes multiteacher distillation as an instance-wise decision process with two adaptive mechanisms: (1) a Contextual Router that dynamically selects the most relevant teacher based on local input statistics, exploiting complementarity across diverse foundation models; and (2) an Uncertainty-Gated Temperature mechanism that acts as a “circuit-breaker,” automatically attenuating distillation strength when teacher confidence diverges from domain reality. We evaluate our proposed lightweight framework on four climate-critical domains: meteorology, ecosystem carbon flux, soil moisture, and energy grids. Our method significantly reduces RMSE relative to a fixed-weight multi-teacher distillation baseline, successfully distilling knowledge from pretrained FMs (teachers) even when they exhibit suboptimal zero-shot accuracy due to distribution shift between the original and target data domains. We demonstrate that these domain-misaligned teachers can still serve as critical correctives, outperforming the globally superior FMs on 28.5% of the hardest instances. Ultimately, this enables high-precision scientific forecasting suitable for resource-constrained edge deployment. Code is available at this https URL.
[LG-93] Entropy Estimation in Multi-Qutrit Systems via Variational and Classical Neural Networks
Link: https://arxiv.org/abs/2606.20504
Authors: Sai Sakunthala Guddanti, Anil Prabhakar, Ria Rushin Joseph
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Abstract:We present a systematic study of von Neumann entropy estimation in multi-qutrit quantum systems using two complementary approaches: variational quantum algorithms (VQAs) and classical convolutional neural networks (CNNs), evaluated using an ideal (noise-free) quantum simulator. For systems up to three qutrits, we construct and evaluate 11 hardware-efficient SU(3)-inspired ansatzes. A parameter sweep shows that estimation accuracy is primarily determined by the number of trainable parameters, provided sufficient entanglement is present. Based on this study, we fix the parameter count to approximately 120 for subsequent experiments, observing that increasing entangling-gate counts beyond a threshold yields only marginal improvements. For larger systems (two to five qutrits), we use a CNN trained on measurement outcomes from tensor-product mutually unbiased bases. The model achieves accurate and stable predictions and exhibits a systematic improvement in performance with system size, with the highest errors for two-qutrit systems and the lowest for five-qutrit systems. Notably, using only 12.5% of the measurements required for full state tomography is sufficient to reach 90th-percentile absolute errors of approximately 0.13-0.16 nats for both four- and five-qutrit systems. The CNN model is also robust to shot noise and generalizes well to out-of-distribution states. Overall, within the simulated settings studied here, our results indicate a transition in practical methods: VQAs are effective for small systems, while CNN-based estimators offer improved scalability and robustness for larger qutrit systems.
[LG-94] SSH-Net: A Deep Neural Network for Predicting Failure Time Distribution Functions under Competing Risks with Application to GPU Data
Link: https://arxiv.org/abs/2606.20451
Authors: Jie Min, Yueyao Wang, Mengkun Chen
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP); Computation (stat.CO)
Abstract:Competing risks are commonly observed in engineering fields and can bring challenges to time-to-event data modeling when the application scenarios are complicated. Recently, deep neural networks have received great attention for prediction with competing risks, due to their flexibility and high learning capability. However, the complexity of neural network structure brings extra difficulty in hyperparameter tuning based on different data inputs. Additionally, when an engineered system has complex physical structures with multiple hierarchical levels, treating all structural levels as a single group of inputs may fail to capture critical information. To address these issues, we propose a Structured Segmented Hazard Deep Neural Network (SSH-Net) for failure time prediction under a cause-specific competing risks framework. Our approach associates neural network structure with data structures, and allows different covariate groups to impact the failure prediction through separate sub-networks. The neural network is constructed based on a cause-specific competing risks model. The SSH-Net outputs cause-specific hazard functions, and utilizes the penalized log-likelihood as the loss function. The prediction accuracy of SSH-Net is validated through simulation studies by evaluating the Brier score, the area under receiver operating characteristic curves (AUC), and the root mean square error (RMSE) of the predicted cause-specific cumulative incidence function. We further demonstrate the model’s ability to predict failure time distribution functions using the Titan GPU failure time data.
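Editor's sketch (not the authors' code): the link between cause-specific hazards and the cumulative incidence function can be shown in a discrete-time simplification. The example assumes `hazards[t][k]` is the probability of failing from cause `k` in interval `t` given survival up to that interval; the function name `cumulative_incidence` and the toy numbers are hypothetical, and SSH-Net itself may work in continuous or segmented time.

```python
def cumulative_incidence(hazards):
    """Turn discrete cause-specific hazards into survival and cumulative incidence curves.

    hazards[t][k]: probability of failure from cause k in interval t, given survival so far.
    Returns a list of (survival, [cif per cause]) tuples, one per interval.
    """
    n_causes = len(hazards[0])
    survival = 1.0
    cif = [0.0] * n_causes
    curves = []
    for h_t in hazards:
        # Failure from cause k in this interval requires surviving all earlier intervals.
        for k in range(n_causes):
            cif[k] += survival * h_t[k]
        # Overall survival drops by the total hazard across all causes.
        survival *= 1.0 - sum(h_t)
        curves.append((survival, list(cif)))
    return curves


# Toy example with two competing causes over three intervals.
hazards = [[0.10, 0.05], [0.20, 0.05], [0.10, 0.10]]
curves = cumulative_incidence(hazards)
final_survival, final_cif = curves[-1]
```

At every interval the survival probability and the cumulative incidences of all causes sum to one, which is a convenient sanity check for any model that outputs cause-specific hazards.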
[LG-95] HEPTv2: End-to-End Efficient Point Transformer for Charged Particle Reconstruction
Link: https://arxiv.org/abs/2606.20437
Authors: Siqi Miao, Shitij Govil, Jack P. Rodgers, Mia Liu, Javier Duarte, Shih-Chieh Hsu, Yuan-Tang Chou, Pan Li
Categories: High Energy Physics - Experiment (hep-ex); Machine Learning (cs.LG)
Abstract:Charged-particle tracking, the reconstruction of trajectories from sparse detector measurements, is a fundamental high-energy-physics inference problem and a canonical example of learning under extreme combinatorial ambiguity. At the High-Luminosity Large Hadron Collider (HL-LHC), tracking must remain accurate and efficient despite unprecedented collision densities. Graph neural networks perform strongly, but incur substantial costs from graph construction and processing, while transformer-based approaches rely on auxiliary stages that prevent end-to-end optimization. To address this, we present HEPTv2, an end-to-end point-transformer architecture that reconstructs tracks from detector hits in one trainable pipeline. HEPTv2 combines a locality-aware point encoder with a track decoder that predicts complete trajectories without graph-building, clustering, or filtering. The encoder uses locality-sensitive hashing in detector coordinate space to preserve tracking-relevant geometry while enabling efficient local attention. The decoder resolves ambiguities through sectorized decoding and direct hit-to-track prediction under joint encoder-decoder supervision, allowing the full pipeline to be optimized end-to-end. On TrackML, HEPTv2 achieves 98.6% double-majority tracking efficiency at a 0.8% fake rate, while requiring only $\sim 15$ ms inference time and 0.4 GB peak memory per event on an NVIDIA A100 GPU. Latency and memory scale approximately linearly for events with up to $5\times10^5$ hits. HEPTv2 establishes a new state of the art in the accuracy-latency trade-off, improving efficiency by 4.5% over the strongest prior transformer and by 1.1–2.2% over optimized graph-based pipelines, while reducing latency by factors of 7 and 38–52, respectively. These results show end-to-end transformers can deliver the accuracy and efficiency required for real-time particle reconstruction at the HL-LHC.
[LG-96] Quantum ring all-reduce: communication and privacy advantages for distributed learning
Link: https://arxiv.org/abs/2606.20344
Authors: María Gragera Garcés, Lirandë Pira
Categories: Quantum Physics (quant-ph); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: 23 pages, 1 figure
Abstract:Machine learning models have scaled to unprecedented sizes, making training across distributed devices the de facto standard in the field. In this work, we explore how quantum communications can make distributed training both more communication-efficient and information-theoretically private, for both classical and quantum learning models. Ring all-reduce is the foundational communication primitive for large-scale distributed training. We present a quantum version that reduces per-link online communication by a provably optimal factor of two using pre-shared entanglement and superdense coding, without requiring the learning model or gradient computation to change. Beyond bandwidth, the primitive enables privacy guarantees that are information-theoretically impossible for any classical protocol, achieving composable $\epsilon$-secure aggregation, via verified entanglement, at a 2x overhead in GHZ copies. Our hybrid quantum-classical communication architecture yields simultaneous communication and security advantages for large-scale distributed training, regardless of whether the learning itself is quantum or classical. Finally, we characterise quantum advantages in gradient conflict detection for server-to-client communication under bandwidth constraints, a setting that arises after ring all-reduce is completed, when full gradient broadcast to external clients is infeasible. Two variants of the problem admit different separations. For margin-based alignment testing ($\textsc{GapIP}_\tau$), the quantum advantage is quadratic in the margin parameter: $\widetilde{O}(\tau^{-1}\log P)$ qubits versus $\widetilde{O}(\min(\tau^{-2},P))$ bits. For sign-consistency auditing against a private parameter matching ($\textsc{TieAudit}_\epsilon$), the advantage represents an exponential separation in communication complexity: $\Omega(\sqrt{P})$ bits whereas $O(\epsilon^{-2}\log P)$ qubits suffice.
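Editor's sketch (classical baseline only, not the paper's quantum protocol): for readers unfamiliar with the primitive, the following simulates a standard ring all-reduce with a reduce-scatter phase followed by an all-gather phase. The function name `ring_all_reduce` and the representation of each chunk as a single number are simplifications chosen for illustration.

```python
def ring_all_reduce(data):
    """Simulate classical ring all-reduce (sum) over n workers.

    data[i][c] is chunk c held by worker i; there are as many chunks as workers.
    Returns the final buffers and the number of chunk messages sent on each link.
    """
    n = len(data)
    buf = [list(row) for row in data]
    sent_per_link = 0

    # Reduce-scatter: after n-1 steps, worker i holds the full sum of chunk (i+1) % n.
    for step in range(n - 1):
        # Collect all messages first so that the sends are simultaneous.
        msgs = [((i - step) % n, buf[i][(i - step) % n]) for i in range(n)]
        for i, (c, value) in enumerate(msgs):
            buf[(i + 1) % n][c] += value
        sent_per_link += 1

    # All-gather: circulate the fully reduced chunks around the ring.
    for step in range(n - 1):
        msgs = [((i + 1 - step) % n, buf[i][(i + 1 - step) % n]) for i in range(n)]
        for i, (c, value) in enumerate(msgs):
            buf[(i + 1) % n][c] = value
        sent_per_link += 1

    return buf, sent_per_link


data = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
result, sent_per_link = ring_all_reduce(data)
```

Each link carries 2(n-1) chunk messages in this classical scheme; this per-link traffic is the quantity the abstract reports halving with pre-shared entanglement and superdense coding.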
[LG-97] Statistical Properties of Training Generalization
Link: https://arxiv.org/abs/2606.20299
Authors: Itay Lavie, Noam Levi, Yonatan Kahn
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); High Energy Physics - Phenomenology (hep-ph); Data Analysis, Statistics and Probability (physics.data-an)
Comments: 32 pages, 3 figures. Part of the VERaiPHY initiative
Abstract:Deep learning has managed to evade numerous intuitions from classical statistics to achieve unprecedented performance on a number of real-world tasks. In this article, we investigate the key features and surprises of deep learning from a physics-informed perspective, taking care to point out and justify where possible the many choices inherent in constructing a deep learning model. In particular, we review the phenomenon of neural scaling laws and discuss their interplay with the constraints and inductive biases which may be present when applying machine learning to problems in physics.
[LG-98] Off-Policy Evaluation for Missingness-Aware Policies in MDPs with Rewards Missing Not at Random ICML2026
Link: https://arxiv.org/abs/2606.20206
Authors: Ziheng Wei, Annie Qu, Rui Miao
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: Accepted at ICML 2026. 31 pages, 6 figures
Abstract:In offline reinforcement learning, immediate rewards in logged batch data are often unobserved due to sparse or irregular record-keeping, or censored beyond certain reward values. This issue arises in practical settings, including health care and marketing. We investigate off-policy evaluation (OPE) in finite-horizon Markov decision processes when rewards are missing not at random (MNAR), which breaks ignorability and induces selection bias even after conditioning on states and actions. To address this, we formalize a reward-dependent propensity model and use future states as shadow variables to identify the full-data conditional mean reward. We further introduce a bridge function that recovers the conditional mean reward without explicitly modeling the MNAR mechanism, and estimate it via a min-max procedure to avoid double sampling. Building upon these identification results, we propose a Fitted-Q-Evaluation-style estimator that propagates the recovered rewards while allowing target policies to depend on past missingness indicators. Finally, we establish consistency and finite-sample error bounds for our OPE estimator, and show through experiments the strong performance of our method compared to existing methods on simulated and MIMIC-III Sepsis data.
[LG-99] Beyond Averaging in John Ellipsoid Approximation: High-Accuracy Algorithms in the Leverage-Score Model
Link: https://arxiv.org/abs/2606.20082
Authors: Xiaoyu Li, Junwei Yu, Jiaojiao Jiang, Junbin Gao, Andi Han
Categories: Optimization and Control (math.OC); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Abstract:The John ellipsoid of a symmetric polytope $P=\{\mathbf{x}\in\mathbb{R}^d:\|\mathbf{A}\mathbf{x}\|_\infty\le 1\}$, $\mathbf{A}\in\mathbb{R}^{n\times d}$, is computed by a long line of leverage-score algorithms, from Cohen, Cousins, Lee and Yang (COLT 2019) to its successors [WY24, CLS+25], all reaching a $(1+\varepsilon)$-approximation in $\Theta(\varepsilon^{-1}\log(n/d))$ iterations. We separate this complexity into three costs the modern line conflates (certification, identification, and accuracy) and locate the historical $\varepsilon^{-1}$ in the first alone. In the equivalent D-optimal-design form $\min_{\mathbf{p}\in\Delta_n}-\log\det(\sum_i p_i\mathbf{a}_i\mathbf{a}_i^\top)$, the leverage-score oracle is exactly the first-order oracle and the $(1+\varepsilon)$-John guarantee the Frank-Wolfe gap $g(\mathbf{p})\le\varepsilon d$; through this dictionary the costs come apart. The $\varepsilon^{-1}$ is a certification artifact: the uniform average of the iterates, the certificate used throughout the line, has gap exactly $\Theta(1/T)$, however cheap each iteration is made. Pointed instead at the last iterate the same oracle is fast: a warm-started accelerated method reaches the guarantee in $C(\mathbf{A})+O(\sqrt{\kappa}\log(1/\varepsilon))$ queries after an $\varepsilon$-independent setup $C(\mathbf{A})$, and once the optimal face is identified the facial problem is an unconstrained self-concordant minimization whose Hessian the oracle recovers exactly, so damped Newton needs only $O(\log\log(1/\varepsilon))$ steps, for a total of $C(\mathbf{A})+O(d^2\log\log(1/\varepsilon))$ queries. The accuracy dependence is thus doubly logarithmic after an $\varepsilon$-independent, condition-dependent setup; the open problem is the remaining identification cost (a condition-free bound on reaching the optimal face) and lower bounds. Accuracy is not the obstruction.
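Editor's sketch (not the authors' algorithm): the dictionary between leverage scores and the Frank-Wolfe gap can be made concrete in two dimensions. The example runs the classical multiplicative leverage-score update for D-optimal design, `p_i <- p_i * sigma_i / d`, and reports the gap `max_i sigma_i - d`, where `sigma_i = a_i^T M(p)^{-1} a_i` and `M(p) = sum_i p_i a_i a_i^T`. The function names, the hard-coded `d = 2`, and the toy rows are assumptions for illustration; the accelerated and Newton-type methods described in the abstract are not implemented here.

```python
def leverage_scores(rows, p):
    """Return sigma_i = a_i^T M^{-1} a_i and det(M) for M = sum_i p_i a_i a_i^T (2x2 only)."""
    m00 = sum(pi * a[0] * a[0] for pi, a in zip(p, rows))
    m01 = sum(pi * a[0] * a[1] for pi, a in zip(p, rows))
    m11 = sum(pi * a[1] * a[1] for pi, a in zip(p, rows))
    det = m00 * m11 - m01 * m01
    # Closed-form inverse of a symmetric 2x2 matrix applied as a quadratic form.
    scores = [
        (m11 * a[0] ** 2 - 2.0 * m01 * a[0] * a[1] + m00 * a[1] ** 2) / det
        for a in rows
    ]
    return scores, det


def fixed_point(rows, iters):
    """Multiplicative leverage-score iteration for D-optimal design with d = 2."""
    d = 2
    p = [1.0 / len(rows)] * len(rows)
    dets = []
    for _ in range(iters):
        scores, det = leverage_scores(rows, p)
        dets.append(det)
        # The weighted scores sum to d, so p stays on the probability simplex.
        p = [pi * s / d for pi, s in zip(p, scores)]
    scores, det = leverage_scores(rows, p)
    dets.append(det)
    gap = max(scores) - d  # Frank-Wolfe gap of -log det M(p) at the last iterate
    return p, gap, dets


rows = [(1.0, 0.0), (0.0, 1.0), (0.6, 0.6), (0.9, 0.1)]
p, gap, dets = fixed_point(rows, 200)
```

In this toy instance the weight concentrates on the first two rows and the last-iterate gap shrinks toward zero, which mirrors the abstract's point that certifying with the last iterate behaves differently from certifying with the uniform average of the iterates.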
[LG-100] Optimal Coarse Correlated Equilibria in Mean Field Games: Linear Programming and No-Regret Learning
Link: https://arxiv.org/abs/2606.20062
Authors: Luciano Campi, Federico Cannerozzi, Ioannis Tzouanas
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Probability (math.PR)
Comments: 55 pages, 3 figures
Abstract:We introduce optimal coarse correlated equilibria for continuous-time mean field games. A coarse correlated equilibrium is a randomized recommendation scheme from which no player can gain by ignoring the recommendation and switching to an alternative strategy. The problem is as follows: a moderator selects, among all mean-field coarse correlated equilibria, one that optimizes a prescribed performance criterion, which may differ from the representative player’s objective. After formulating the problem, we develop a linear programming (LP) formulation, prove the existence of optimal LP coarse correlated equilibria, and relate the LP characterization to the original probabilistic setting. Building on this characterization, we design a no-regret primal-dual algorithm, based on an equivalent Lagrangian formulation of the external-regret constraint, for learning such equilibria. We provide explicit convergence rates for the learning algorithm, and numerical examples illustrate the method.
[LG-101] Stochastic Linear Contextual Bandits with Bounded Noise: A Set-Membership Approach
Link: https://arxiv.org/abs/2606.20022
Authors: Haonan Xu, Yingying Li
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 23 pages, 1 figure
Abstract:This paper considers stochastic linear contextual bandits (SLCB) with bounded reward noise. Existing works typically assume sub-Gaussian reward noise and bounded expected rewards, under which the optimal regret bound scales as $\tilde{O}(\sqrt{T})$ in terms of the horizon $T$. However, in many applications, realized/observed rewards are also naturally bounded, implying bounded reward noise. Bounded noise is more informative than the sub-Gaussian condition but has not been leveraged explicitly in the SLCB literature. In this paper, we propose a novel algorithm SME-OFU by utilizing an uncertainty quantification method called set-membership estimation (SME) and applying the principle of optimism in the face of uncertainty (OFU). Our algorithm enjoys an improved regret bound $O(\log T)$. Notice that this does not contradict the existing optimal bound $\tilde{O}(\sqrt{T})$ for sub-Gaussian noise because bounded noise is a stronger condition. Finally, simulations show empirical improvements of SME-OFU over a benchmark algorithm designed for sub-Gaussian noise when the reward noise is bounded.
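Editor's sketch (not the authors' SME-OFU implementation): the idea behind set-membership estimation with bounded noise can be shown for a scalar parameter. The example assumes rewards `y = theta * x + noise` with `|noise| <= w_max`; every observation then confines `theta` to an interval, and the feasible set is the intersection of those intervals. The function names `sme_interval` and `optimistic_arm` and all numbers are hypothetical.

```python
def sme_interval(xs, ys, w_max):
    """Intersect the constraints |y - theta * x| <= w_max over all observations."""
    lo, hi = float("-inf"), float("inf")
    for x, y in zip(xs, ys):
        if x == 0:
            continue  # a zero context carries no information about theta
        a, b = (y - w_max) / x, (y + w_max) / x
        lo, hi = max(lo, min(a, b)), min(hi, max(a, b))
    return lo, hi


def optimistic_arm(arms, lo, hi):
    """Pick the arm with the largest reward over the feasible interval (OFU principle)."""
    return max(arms, key=lambda x: max(lo * x, hi * x))


theta, w_max = 2.0, 0.5
xs = [1.0, 2.0, -4.0, 5.0]
noise = [0.4, -0.3, 0.1, -0.4]          # every value respects the bound w_max
ys = [theta * x + e for x, e in zip(xs, noise)]
lo, hi = sme_interval(xs, ys, w_max)
chosen = optimistic_arm([-1.0, 0.5, 1.0], lo, hi)
```

Unlike a confidence ellipsoid built for sub-Gaussian noise, the feasible set here always contains the true parameter and can shrink quickly when observations land near the noise bound, which is the intuition behind the improved regret bound reported in the abstract.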
[LG-102] QMaxCal: Path-Space Regularization for Open Quantum Control via Girsanov's Theorem ICML2026
Link: https://arxiv.org/abs/2606.19947
Authors: Merijn Moody, Zier Mensch, Miranda C. N. Cheng, Peter G. Bolhuis, Max Welling
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: 26 pages, 6 figures. ICML 2026 AI4Physics Workshop
Abstract:Reliable quantum control in the presence of decoherence requires policies that combat the effect of environmental noise on the controlled dynamics. Open quantum systems under continuous monitoring generate classical measurement records whose drift depends on the noise experienced by the system; the records of two evolutions sharing the same decoherence channels differ only in this drift, so Girsanov’s theorem yields a closed-form, differentiable estimator of the KL divergence between their trajectory distributions. We instantiate this estimator with two physically motivated reference measures, yielding two regularizers that both drive the system toward states where the effects of decoherence are minimal: the Wiener KL (KL_W), which is empirically more effective under certain conditions on the noise model, and the drift-variance regularizer (R_DV), which works for all noise models. Both are qualitatively distinct from existing penalties on control fluence or smoothness: they penalize the observable consequences of control on the decoherence channels rather than the control amplitude itself. The regularizers outperform unregularized gradient-based and reinforcement-learning baselines across a range of open quantum systems – including single- and multi-qubit benchmarks and a multi-qubit chain calibrated to a published snapshot of the IBM Kingston processor – along several axes of evaluation: final-state fidelity, robustness to mismatch in the assumed noise model (gains grow from +17 pp at training noise to +27 pp under 2.5x noise mismatch), and occupation of forbidden states. The regularizers reduce infidelity by up to 50%, with ~16% gains on the calibrated IBM Kingston chain.
[LG-103] Low-Burden Data Augmentation for Dysarthric ASR via Zero-Shot Voice Cloning INTERSPEECH2026
Link: https://arxiv.org/abs/2606.19823
Authors: Satwinder Singh, Qianli Wang, Zihan Zhong, Clarion Mendes, Hasegawa-Johnson, Waleed Abdulla, Seyed Reza Shahamiri
Categories: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
Comments: Accepted to Interspeech 2026, Sydney, Australia
Abstract:Automatic speech recognition remains unreliable for dysarthric speech due to data scarcity and high inter-speaker variability. While synthetic data can address these gaps, traditional methods often require extensive speaker-specific data, reintroducing the collection bottleneck. We investigate zero-shot voice cloning as a low-burden augmentation strategy, using Higgs Audio V2 to clone speakers in the TORGO dataset. We fine-tune (FT) Whisper-medium on cloned, real, and hybrid data and evaluate on held-out real speech. Compared to the zero-shot baseline (31.62%), Clone FT achieved a competitive 26.00% WER, nearly matching the 24.44% and 25.12% seen with Real and Hybrid FT, respectively. Notably, Clone and Hybrid FT outperform Real FT for moderate-severe speakers. Clone FT achieves the best results (11.45% relative) in cross-corpus evaluation on the SAP-1102. These results suggest that zero-shot cloning provides scalable training data that circumvents the costly data collection bottleneck.
[LG-104] Variational Consensus Monte Carlo for Bayesian Mixture
Link: https://arxiv.org/abs/2606.19643
Authors: Julie Fendler, Francesca L. Crowe, Tom Marshall, Sylvia Richardson, Paul D.W. Kirk
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Abstract:Motivated by the privacy, sensitivity and sharing limitations of health data, we present a comprehensive pipeline for inference of Bayesian mixture models within a federated learning setting, i.e. when data cannot be fully shared or pooled across compute nodes. We adopt a Consensus Monte Carlo (CMC) approach, in which an MCMC algorithm is run independently within each data silo to estimate local posterior distributions, which are then aggregated to approximate the posterior over the full data. The variational CMC approach of Rabinovich, Angelino and Jordan (2015) [1] frames the aggregation step as a variational inference problem, but their application to mixtures assumes the number of clusters and key mixture parameters to be known. Our main methodological contributions are: (i) an extension of variational CMC to over-fitted Bayesian mixture models that infer the number of clusters and all model parameters, without requiring conjugacy; (ii) novel cluster-matching algorithms suitable for cross-silo settings in which not every cluster appears in each local dataset; (iii) a number of inference strategies for the aggregation step, matched to different federated learning constraints; and (iv) guidelines for choosing among these in practice. A comprehensive simulation study validates the framework and allows us to compare to state-of-the-art federated learning alternatives. Notably, we show that when the composition of local datasets reflects the underlying clustering structure in the data, our approach can recover small clusters with greater accuracy than standard MCMC applied to the pooled data. We illustrate the framework on large-scale electronic health record data, identifying multi-morbidity patterns in a British geriatric population.
[LG-105] A Solver-Free Training Method for Predict-then-Optimize ICML2026
Link: https://arxiv.org/abs/2606.19587
Authors: Beichen Wan, Mo Liu
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: Accepted by ICML 2026
Abstract:We propose a scalable method for training prediction (machine learning) models in the predict-then-optimize paradigm, where model outputs serve as coefficients for a subsequent linear optimization task. Directly minimizing the empirical decision regret is intractable for linear programming and combinatorial optimization since the decision mapping is piecewise constant, and the gradients are zero almost everywhere. While existing methods address this by smoothing the differentiation process, they suffer from scalability issues, since a computationally expensive solver call is required for every gradient evaluation. To address this, we propose a decision-focused learning pipeline based on a measure transformation principle, which yields a new surrogate loss that is completely optimization-solver-free during training. We establish theoretical guarantees, including Fisher consistency and excess risk bounds. Empirically, our method achieves decision quality competitive with state-of-the-art methods while reducing training time by orders of magnitude.
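Editor's sketch (illustrating the problem, not the authors' surrogate loss): the claim that decision regret is piecewise constant in the predictions can be checked on the smallest possible linear program, choosing one item of minimum cost, which is a linear objective over the probability simplex. The function names `decide` and `regret` and the toy costs are hypothetical.

```python
def decide(pred_costs):
    """Solve min_z c^T z over the simplex: the optimum is the cheapest vertex."""
    return min(range(len(pred_costs)), key=lambda i: pred_costs[i])


def regret(pred_costs, true_costs):
    """True cost of the decision induced by the prediction, minus the best true cost."""
    return true_costs[decide(pred_costs)] - min(true_costs)


true_costs = [1.0, 2.0, 3.0]
pred_costs = [2.0, 1.5, 3.0]   # the prediction wrongly ranks item 1 as cheapest
base = regret(pred_costs, true_costs)

# Finite-difference "gradient" of the regret with respect to each predicted cost.
eps = 0.01
finite_diffs = []
for i in range(len(pred_costs)):
    bumped = list(pred_costs)
    bumped[i] += eps
    finite_diffs.append((regret(bumped, true_costs) - base) / eps)
```

The regret is positive, yet every finite difference is zero because small perturbations do not change the chosen vertex. This is the obstacle that smoothing-based methods and the solver-free surrogate described in the abstract are designed to work around.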
[LG-106] Optimal Ansatz-free Hamiltonian Learning In Situ
Link: https://arxiv.org/abs/2606.19486
Authors: Taiqi Zhou, Weiyuan Gong
Categories: Quantum Physics (quant-ph); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: 51 pages, 2 figures
Abstract:Characterizing the features of a Hamiltonian that governs a quantum system serves as a fundamental subroutine of quantum device calibration, signal sensing, and error correction. Recently proposed protocols achieve the optimal Heisenberg-limited scaling for learning ansatz-free Hamiltonians from their real-time evolutions without fully specifying interaction structures. However, these protocols rely on both deep circuits with interleaving probes and control, and extremely short time resolution, making them difficult to implement in near- and intermediate-term in situ quantum experiments. In this work, we propose a computationally efficient, control-free, and ancilla-free algorithm that uses only Pauli product state preparation and measurement, and learns an ansatz-free Hamiltonian H with \|H\| \leq \Lambda in total evolution time \Theta(\frac{\Lambda}{\epsilon^2}\log(\frac{\Lambda}{\epsilon})) . The evolution time cost of our algorithm is optimal for any control-free protocol, as we further prove a lower bound of \Omega(\frac{\Lambda}{\epsilon^2}\log(\frac{\Lambda}{\epsilon})) . Technically, our method introduces a randomized-sampling framework that combines band-limited kernel-based time sampling with a displacement sieve for Hamiltonian structure learning. The characteristic probe time resolution depends only on \Lambda instead of \epsilon , which makes our protocol especially appealing in the high-precision regime for sensing and calibration applications. We also show that the algorithm maintains the same asymptotic total evolution time in the presence of state-preparation-and-measurement (SPAM) noise when the Hamiltonian is local after calibration. Our results demonstrate the fundamental cost of experimentally friendly Hamiltonian learning and provide a practical route to rigorous in situ characterization of near-term quantum platforms.
[LG-107] The Representational Limit of Scalar Interactions: An Interventional Decomposition
Link: https://arxiv.org/abs/2606.19410
Authors: Potito Aghilar, Sabino Roccotelli, Stanislao Fidanza, Vito Walter Anelli, Sebastiano Stramaglia, Tommaso Di Noia
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:Signed pairwise interaction scores fundamentally conflate uniqueness (U), redundancy (R), and synergy (S). We prove this on a minimal 3-way XOR structural causal model: faithful indices such as Shapley-Taylor return zero per pair, whereas projective indices such as Shapley Interaction spread the third-order effect into pair scalars that conflate the three mechanisms. We introduce Stochastic Hi-Fi, a post-hoc, retraining-free predictability decomposition that estimates per-feature U/R/S profiles by interventional masked inference. The estimator provides exact interventional semantics, finite-sample Monte Carlo bounds, strict variance reduction from coupled diamond sampling, and uniform finite-vocabulary convergence. Across tabular SCMs, Stochastic Hi-Fi recovers structure missed by scalar baselines (up to 411x larger interaction-magnitude recovery ratios). It also separates redundant and synergistic heads in the GPT-2 IOI circuit. On NIH ChestX-ray14, Stochastic Hi-Fi matches GradCAM on Pointing Game and improves substantially on Deletion AUC.
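The 3-way XOR case above can be checked by brute force. The sketch below does not compute Shapley-Taylor or Shapley Interaction indices, and it is not the paper's Stochastic Hi-Fi estimator; it uses a simpler quantity chosen for illustration, the variance of the conditional mean E[y | x_S] under uniform independent bits, to show that no single feature or pair carries any predictive signal while the full triple carries all of it. The function name `explained_variance` is hypothetical.

```python
from itertools import combinations, product
from statistics import pvariance

# All 8 equally likely inputs of three independent fair bits.
INPUTS = list(product([0, 1], repeat=3))

def xor3(x):
    """Target of the minimal 3-way XOR model."""
    return x[0] ^ x[1] ^ x[2]

def explained_variance(subset):
    """Variance of E[y | x_subset]; groups are equal-sized, so a plain
    population variance over the group means is the correct weighting."""
    groups = {}
    for x in INPUTS:
        key = tuple(x[i] for i in subset)
        groups.setdefault(key, []).append(xor3(x))
    cond_means = [sum(v) / len(v) for v in groups.values()]
    return pvariance(cond_means)

scores = {
    s: explained_variance(s)
    for k in (1, 2, 3)
    for s in combinations(range(3), k)
}
for subset, score in scores.items():
    print(subset, score)
```

Every singleton and pair scores zero, and only the subset (0, 1, 2) recovers the full output variance of 0.25, which is the purely third-order, synergistic structure that a per-pair scalar cannot represent.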
[LG-108] The Morse Transform for Discrete Shape Analysis
Link: https://arxiv.org/abs/2503.04507
Authors: Alexander M. Tanaka, Aras T. Asaad, Richard Cooper, Vidit Nanda
Categories: Quantitative Methods (q-bio.QM); Computational Geometry (cs.CG); Machine Learning (cs.LG)
Comments: 37 pages, 3 main figures, 2 main tables, 12 appendix figures and 4 appendix tables
Abstract:The geometry of an object plays a vital role in modulating its interactions with the physical world. It nevertheless remains difficult to describe geometric information numerically for the purposes of statistical inference or classification tasks. Here, we introduce a new topological transform which leverages directional piecewise-linear Morse theory to quantify the geometry of an embedded object by cataloguing critical points across multiple height-functions. The output of this Morse transform records both the heights and the local topological type (peak, trough or saddle) of the critical points that characterise the underlying shape, retaining finer information than the Euler characteristic transform whilst naturally prioritising a shape’s outermost regions. Crucially, this output can be further compressed into a rich but compact feature vector. We benchmark the Morse feature vector as a descriptor for ligand-based virtual screening (LBVS), which intrinsically depends on the shape of molecules. Under a common gradient-boosted tree classification pipeline, Morse descriptors achieve the highest mean AUROC when compared to other topological transform descriptors and to standard shape-based LBVS descriptors.
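As a rough illustration of cataloguing critical points of a height function, the sketch below handles only the simplest setting: a closed polygon in the plane, where each vertex is compared with its two neighbours along a chosen direction. This is an assumed toy version, not the paper's Morse transform; on a curve there are no saddles, and the function name `critical_points` and the diamond example are hypothetical.

```python
import numpy as np

def critical_points(vertices, direction):
    """Label each vertex of a closed polygon as 'peak', 'trough', or 'regular'
    for the height function h(p) = p . direction, and return (label, height) pairs."""
    heights = np.asarray(vertices, dtype=float) @ np.asarray(direction, dtype=float)
    prev_h = np.roll(heights, 1)    # height of the preceding vertex
    next_h = np.roll(heights, -1)   # height of the following vertex
    labels = np.full(len(heights), "regular", dtype=object)
    labels[(heights > prev_h) & (heights > next_h)] = "peak"
    labels[(heights < prev_h) & (heights < next_h)] = "trough"
    return list(zip(labels.tolist(), heights.tolist()))

# A diamond traversed counter-clockwise, probed along two directions.
diamond = [(1, 0), (0, 1), (-1, 0), (0, -1)]
up = critical_points(diamond, (0, 1))
right = critical_points(diamond, (1, 0))
print(up)
print(right)
```

Repeating this over many directions and recording the heights with their types gives a direction-indexed catalogue in the spirit of the transform described above; vertices with tied neighbouring heights are treated as regular here, which a careful implementation would need to handle.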