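Today's arXiv Papers | 2026-05-25

This post lists the latest papers retrieved from arXiv.org on 2026-05-25. It is updated automatically and organized into six main areas: NLP, CV, ML, AI, IR, and MA.

Note: The daily paper data is retrieved from arXiv.org and refreshed automatically at around 12:30 each day.

Tip: If the list has not been updated on a given day, either arXiv published no new papers that day or the script failed. Failures are fixed the same day whenever possible.

Natural Language Processing: 86 papers (Computation and Language (cs.CL))
Artificial Intelligence: 164 papers (Artificial Intelligence (cs.AI))
Computer Vision: 109 papers (Computer Vision and Pattern Recognition (cs.CV))
Machine Learning: 203 papers (Machine Learning (cs.LG))
Multiagent Systems: 17 papers (Multiagent Systems (cs.MA))
Information Retrieval: 16 papers (Information Retrieval (cs.IR))
Human-Computer Interaction: 21 papers (Human-Computer Interaction (cs.HC))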

Multiagent Systems

[MA-0] CHRONOS: Temporally-Aware Multi-Agent Coordination for Evolving Data Marketplaces

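Quick read: This paper addresses three coupled failures of static designs in temporal knowledge-graph data marketplaces: stale hybrid index shortcuts reduce recall, stationary Shapley pricing misattributes value after distribution shifts, and uncoordinated agents over-consume a shared differential-privacy budget. The key to the solution is CHRONOS, a three-layer architecture that treats these challenges in a unified way with explicit public and private separation. Layer one applies neural-ODE temporal decay to shortcut edges, providing a Big-O bound on the per-query expected recall loss and a monotone-envelope guarantee that reduces bound looseness. Layer two conditions Shapley valuation on detected changepoints and provides finite-sample error guarantees under noise. Layer three uses EXP3-IX to achieve sublinear regret while enforcing (ε, δ) differential privacy via moments accounting. Each epoch, CHRONOS releases an affinity matrix privatized with the Gaussian mechanism; all retrieval and ranking are post-processing and incur no extra privacy cost. Across four benchmarks, CHRONOS reaches 0.937 recall at ten, 2.74 queries per second, 161 ms latency, and a total ε of 4.25 (δ = 1e-6), which the authors describe as a competitive operating point.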

Link: https://arxiv.org/abs/2605.23887
Authors: Joydeep Chandra
Affiliations: BNRIST; Tsinghua University
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:

Abstract:Temporal knowledge-graph data marketplaces face three coupled failures in static designs: stale hybrid index shortcuts reduce recall as edges evolve, stationary Shapley pricing misattributes value after distribution shifts, and uncoordinated agents over-consume a shared differential-privacy budget. We present CHRONOS, a three-layer architecture providing a unified treatment of these challenges with explicit public and private separation. Layer one applies neural-ODE temporal decay to shortcut edges, providing a per-query expected recall-loss bound of Big-O of Pq lambda delta t, with a monotone-envelope guarantee reducing bound looseness to 1.8 to 3.2 times observed loss. Layer two conditions Shapley valuation on detected changepoints and provides finite-sample error guarantees under noise. Layer three uses EXP3-IX to achieve Big-O of the square root of T log T regret while enforcing epsilon and delta differential privacy via moments accounting. CHRONOS releases a privatized affinity matrix per epoch using the Gaussian mechanism; all retrieval and ranking are post-processing, incurring no extra privacy cost. We provide multi-epoch settlement, scalability analysis for 500 sellers, and comparisons against accelerated baselines. Across four benchmarks, CHRONOS shows 0.937 recall at ten, 2.74 queries per second, 161 ms latency, and total epsilon of 4.25 at delta of 10 to the power of negative 6 under zCDP composition. These results indicate a competitive operating point. A limitation is that at this privacy level, released valuations remain noise-dominated; utility derives primarily from public index routing and adaptive scheduling driven by low-sensitivity statistics.
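
The abstract reports a total ε under zCDP composition. The following is a minimal sketch of how such accounting is commonly done, assuming the standard zCDP results: a Gaussian mechanism with L2 sensitivity Δ and noise scale σ satisfies ρ-zCDP with ρ = Δ²/(2σ²), ρ values add under composition, and ρ-zCDP implies (ρ + 2√(ρ ln(1/δ)), δ)-DP. The sensitivity, noise scale, and epoch count below are hypothetical and are not taken from the paper.

```python
import math


def gaussian_zcdp_rho(sensitivity: float, sigma: float) -> float:
    """zCDP parameter of one Gaussian-mechanism release."""
    return sensitivity ** 2 / (2.0 * sigma ** 2)


def compose_rho(rhos):
    """zCDP composes additively across releases."""
    return sum(rhos)


def zcdp_to_epsilon(rho: float, delta: float) -> float:
    """Convert rho-zCDP to an (epsilon, delta)-DP guarantee."""
    return rho + 2.0 * math.sqrt(rho * math.log(1.0 / delta))


# Hypothetical setting: one privatized matrix release per epoch.
epochs = 10
per_epoch_rho = gaussian_zcdp_rho(sensitivity=1.0, sigma=10.0)
total_rho = compose_rho([per_epoch_rho] * epochs)
epsilon = zcdp_to_epsilon(total_rho, delta=1e-6)
print(f"total rho = {total_rho:.3f}, epsilon = {epsilon:.3f}")
```

Because retrieval and ranking are described as post-processing of the released matrix, they would add nothing to `total_rho` in this kind of ledger.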

[MA-1] PhotoFlow: Agentic 3D Virtual Photography Missions

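Quick read: This paper addresses virtual photography: given a 3D scene with no preselected camera pose or reference image, produce a photograph that follows a language instruction and is both aesthetically pleasing and spatially sound. The core challenge is to combine complex 3D spatial understanding with abstract aesthetic judgment, two capabilities that have been hard to evaluate together. The key to the solution is PhotoFlow, a Director-Reviewer-Reflector agent for closed-loop camera search: the Director builds a soft photographic blueprint and proposes diverse candidate cameras; the Reviewer combines rule checks, visual critique, and pairwise incumbent selection; and the Reflector converts failures into region memory, dead-zone suppression, and high-explore relocation. On VPhotoBench, a benchmark of 47 open-license Blender scenes and 141 language-conditioned photography missions, PhotoFlow achieves the strongest external quality-alignment composite and success rate under a six-round rendering budget. The authors state that, to their knowledge, this is the first work to make language-conditioned virtual photography in arbitrary Blender scenes an executable agent task.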

Link: https://arxiv.org/abs/2605.23771
Authors: Jiarui Guo, Haojia Wei, Yiming Zhang, Yifei Liu, Yuning Gong, Hongjie Zhang, Xue Yang, Zhihang Zhong
Affiliations: Shanghai Jiao Tong University; Northeastern University; University of California, Los Angeles; Cornell University; Shanghai AI Laboratory; Sichuan University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

Abstract:Virtual photography asks an agent to enter a prepared 3D scene with no preselected camera pose or reference image, infer a suitable shot from scene information and a language intent, choose executable camera parameters, and render the final photograph. Recent progress in vision-language models makes this kind of spatial agent increasingly plausible, but the task stresses two capabilities that remain hard to evaluate together: complex 3D spatial understanding and abstract aesthetic judgment. We introduce PhotoFlow, a Director-Reviewer-Reflector agent for closed-loop camera search. The Director builds a soft photographic blueprint and proposes diverse candidate cameras; the Reviewer combines rule checks, visual critique, and pairwise incumbent selection; and the Reflector converts failures into region memory, dead-zone suppression, and high-explore relocation. We also introduce VPhotoBench, a benchmark of 47 open-license Blender scenes and 141 language-conditioned photography missions spanning subject placement, relational composition, and atmosphere/style. On held-out experiments, PhotoFlow achieves the strongest external quality-alignment composite and success rate among one-shot prediction, single-chain reflection, anchor-bank selection, and random search under a six-round rendering budget. To our knowledge, this is the first work to make language-conditioned virtual photography in arbitrary Blender scenes an executable agent task, and our results show that an LLM-centered spatial agent can already produce strong photographs in a setting designed to challenge both 3D reasoning and aesthetic choice.

[MA-2] The Communication Complexity of Instant-Runoff Voting

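Quick read: This paper studies the communication complexity of Instant-Runoff Voting (IRV), that is, the worst-case number of bits that $n$ voters must transmit to a central authority under the most efficient elicitation protocol. Conitzer and Sandholm (2005) gave an upper bound of $O(n (\log m)^2)$ but no matching lower bound, only the weaker $\Omega(n \log m)$. Using the fooling set technique, the paper raises the lower bound to $\Omega(n (\log m)^2)$, which shows that the communication complexity of IRV is $\Theta(n (\log m)^2)$ and resolves this open problem. The paper further shows that the complexity drops to $\Theta(n \log m)$ under the single-peakedness restriction, and that the IRV-Average variant and Single Transferable Vote (STV), the multiwinner extension of IRV, have the same asymptotic communication complexity as IRV. The key step is the combinatorial proof of the tight lower bound.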

Link: https://arxiv.org/abs/2605.23743
Authors: Élie de Panafieu, François Durand, Jérôme Lang (LAMSADE, CNRS)
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA)
Comments:

Abstract:The communication complexity of a voting rule is the worst-case number of bits that $n$ voters must transmit to a central authority under the most efficient elicitation protocol in an election with $m$ candidates. We study the communication complexity of Instant-Runoff Voting (IRV). Conitzer and Sandholm [2005] established an upper bound of $O(n (\log m)^2)$, but did not provide a matching lower bound beyond $\Omega(n \log m)$. We resolve this open problem by raising the lower bound to $\Omega(n (\log m)^2)$ using the fooling set technique, thereby showing that the communication complexity of IRV is $\Theta(n (\log m)^2)$. We further show that this complexity drops to $\Theta(n \log m)$ under the single-peakedness restriction, and that both the IRV-Average variant and Single Transferable Vote (STV), the multiwinner extension of IRV, have the same asymptotic communication complexity as IRV.
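
For readers unfamiliar with the rule itself, here is a minimal sketch of IRV winner determination. It reads full rankings from every voter, which is the naive elicitation protocol rather than the communication-efficient protocols the paper analyzes. The function name and the alphabetical tie-breaking rule are assumptions made for illustration.

```python
from collections import Counter


def irv_winner(ballots, candidates):
    """Return the IRV winner.

    ballots: list of full rankings (lists, most preferred first).
    Ties for elimination are broken alphabetically (an assumption).
    """
    remaining = set(candidates)
    while len(remaining) > 1:
        # Count each voter's most preferred remaining candidate.
        tally = Counter({c: 0 for c in remaining})
        for ranking in ballots:
            top = next(c for c in ranking if c in remaining)
            tally[top] += 1
        # Eliminate the candidate with the fewest first-place votes.
        loser = min(remaining, key=lambda c: (tally[c], c))
        remaining.remove(loser)
    return next(iter(remaining))


ballots = (
    [["A", "B", "C"]] * 4
    + [["B", "C", "A"]] * 3
    + [["C", "B", "A"]] * 2
)
print(irv_winner(ballots, ["A", "B", "C"]))  # C is eliminated first, then A
```

In each round a voter only needs to reveal their top remaining candidate, which is why protocols cheaper than sending a full ranking are possible.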

[MA-3] Safety, Liveness, and Fairness in Quantitative Argumentation Dialogues

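Quick read: This paper asks how to formalize and guarantee the stability and adequacy of argument strengths in quantitative (bipolar) argumentation dialogues, where argumentation graphs are updated over time and inferences are drawn repeatedly. The key is to carry the temporal-reasoning notions of safety, liveness, and fairness over to argumentation graphs with weighted nodes. Strong and weak safety capture, respectively, whether an argument's final strength always remains above a threshold of justification or eventually reaches it. Liveness requires that argument strengths fluctuate across the threshold. Fairness notions assess how safe arguments are spread within a sequence of argumentation graphs. The paper formally relates these notions and discusses the analytical challenges of providing general guarantees for them.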

Link: https://arxiv.org/abs/2605.23578
Authors: Arunavo Ganguly, Julian Alfredo Mendez, Timotheus Kampik
Affiliations: Umeå University
Categories: Multiagent Systems (cs.MA)
Comments:

Abstract:We introduce notions of safety, liveness, and fairness, as commonly used in temporal reasoning, to quantitative (bipolar) argumentation dialogues where repeated inferences are drawn from argumentation graphs with weighted nodes. Between inferences, these graphs undergo updates. Strong and weak safety capture that arguments’ (final) strengths remain above a specific threshold of justification and always reach the threshold eventually, respectively. Liveness requires that arguments’ strengths fluctuate across the threshold of justification. Fairness notions assess how safe arguments are spread within a sequence of argumentation graphs. We formally show how these notions are related, and discuss some analytical challenges with respect to providing general guarantees for our properties.

[MA-4] ARMS: Automatic Reward Shaping for Sparse-Reward Multi-Agent Reinforcement Learning

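Quick read: This paper addresses the learning inefficiency caused by sparse rewards in multi-agent reinforcement learning (MARL), where simultaneous learning induces non-stationarity and makes policy optimization difficult. The key to the solution is ARMS (Automatic Reward-shaping in Multi-agent Systems), a self-supervised reward shaping framework that learns dense shaping signals from sparse environmental rewards through trajectory ranking. Its central idea is to reformulate policy invariance through conditional best-response reasoning and to show that, under certain conditions, shaping rewards preserve each agent's best-response set under fixed opponent policies and therefore preserve the set of Nash equilibria. ARMS alternates between policy learning and reward learning and shares shaping parameters across agents for efficiency. Experiments in a partially observable multi-agent pathfinding domain show that ARMS improves sampling efficiency under increasing reward sparsity and agent count and generalizes to unseen environments. They also reveal a MARL-specific failure mode in which limited exploration and coupled policy-reward dynamics induce oscillatory behavior; increasing exploration stabilizes learning. The authors state that, to their knowledge, ARMS is the first automatic reward shaping framework for MARL whose design is motivated by a game-theoretic equilibrium-preservation result.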

Link: https://arxiv.org/abs/2605.23562
Authors: Elie Abboud, Oren Gal
Affiliations: University of Haifa
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments:

Abstract:Sparse rewards are a major bottleneck in multi-agent reinforcement learning (MARL), where simultaneous learning induces non-stationarity and makes reward design especially delicate. Reward shaping can accelerate learning, but in the multi-agent setting it must preserve the strategic structure of the problem rather than merely improve short-term optimization. We propose Automatic Reward-shaping in Multi-agent Systems (ARMS), a self-supervised reward shaping framework for MARL that learns dense shaping signals from sparse environmental rewards through trajectory ranking. Since single-agent trajectory-ranking guarantees do not directly transfer to MARL, we reformulate policy invariance through conditional best-response reasoning, and show that if certain conditions hold, then using shaping rewards preserves each agent’s best-response set under fixed opponent policies, and consequently preserves the set of Nash equilibria. Guided by this perspective, ARMS alternates between policy learning and reward learning while sharing shaping parameters across agents for efficiency. Experiments in a partially observable multi-agent pathfinding domain show that ARMS improves sampling efficiency under increasing reward sparsity and agent count, generalizes to unseen environments, and reveals a MARL-specific failure mode in which limited exploration and coupled policy–reward dynamics induce oscillatory behavior. Increasing exploration mitigates this effect and stabilizes learning. To the best of our knowledge, ARMS is the first automatic reward shaping framework for MARL whose design is motivated by a game-theoretic equilibrium-preservation result.
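
The abstract does not spell out the conditions under which ARMS preserves best responses. As background only, the sketch below uses classical potential-based shaping, a well-known technique swapped in here for illustration and not the ARMS method. It adds γΦ(s') − Φ(s) to each reward and checks that the return gap between two trajectories from the same start state is unchanged. The states, potentials, and rewards are hypothetical.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t over a trajectory."""
    return sum(r * gamma ** t for t, r in enumerate(rewards))


def shaped_rewards(states, rewards, potential, gamma):
    """Add the potential-based term gamma * Phi(s') - Phi(s) to each reward.

    states has one more entry than rewards (it includes the final state).
    """
    return [
        r + gamma * potential[states[t + 1]] - potential[states[t]]
        for t, r in enumerate(rewards)
    ]


gamma = 0.9
# Hypothetical potentials; the terminal state has potential 0.
potential = {"s0": 0.5, "a": 1.0, "b": 2.0, "goal": 0.0}

traj_1 = (["s0", "a", "goal"], [0.0, 1.0])
traj_2 = (["s0", "b", "goal"], [0.0, 0.0])

plain = [discounted_return(r, gamma) for _, r in (traj_1, traj_2)]
shaped = [
    discounted_return(shaped_rewards(s, r, potential, gamma), gamma)
    for s, r in (traj_1, traj_2)
]
# Both returns shift by the same constant, -Phi(s0), so their gap is preserved.
print(plain[0] - plain[1], shaped[0] - shaped[1])
```

Because the shift depends only on the start state, it cannot change which trajectory an agent prefers, which is the intuition behind invariance results of this kind.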

[MA-5] Arrow-Type Impossibility for Genuinely Modal Judgments

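Quick read: This paper asks whether the classical impossibility results of judgment aggregation still hold when the objects of aggregation are required to be genuinely modal judgments rather than plain factual propositions. Classical results show that sufficiently strong logical interconnections force dictatorship under natural aggregation axioms, but the question is meaningful in the modal setting only if fact-based aggregation in modal disguise is excluded. The solution has two layers. First, a semantic reduction theorem shows that certain iterated modal patterns can be collapsed by shifting the evaluation point. Second, building on this reduction, the paper identifies a local-to-global frame mechanism by which frame geometry yields minimally inconsistent modal judgment sets and the strong path-connectivity required for impossibility. Semantic structure alone can therefore generate the logical interconnections needed for dictatorship, even on a simple cyclic frame with an agenda generated from a single propositional variable. The same reduction also turns consistency checking into a small combinatorial covering problem, which yields efficient implementations of non-dictatorial aggregation procedures.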

Link: https://arxiv.org/abs/2605.23321
Authors: Yutaka Nagai, Hirotaka Ono
Affiliations: Nagoya University
Categories: Logic in Computer Science (cs.LO); Multiagent Systems (cs.MA)
Comments: 24 pages

Abstract:Judgment aggregation studies how to combine individual judgments on logically related propositions into a collective judgment. Classical impossibility results show that sufficiently strong logical interconnections force dictatorship under natural aggregation axioms. In this paper, we ask whether such impossibility can still arise when the objects of aggregation are required to be genuinely modal judgments rather than plain factual propositions. Since modal logic contains propositional logic, this question is meaningful only if one excludes fact-based aggregation in disguise. We show that Arrow-type impossibility already re-emerges in a strikingly sparse modal setting. We prove an impossibility theorem on a simple cyclic frame for an agenda generated from a single propositional variable by repeated applications of a single modal operator, and we further demonstrate this phenomenon for an alternative family of frames satisfying a natural symmetry condition. Thus, even under a modal-operator requirement, semantic structure alone can generate the logical interconnections needed for dictatorship. Technically, our analysis has two layers. First, we prove a semantic reduction theorem showing that certain iterated modal patterns can be collapsed by shifting the evaluation point. Second, building on this reduction, we identify a local-to-global frame mechanism by which frame geometry yields minimally inconsistent modal judgment sets and the strong path-connectivity required for impossibility. The same reduction also turns consistency checking into a small combinatorial covering problem, which yields efficient implementations of non-dictatorial aggregation procedures.
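
To see why logical interconnections cause trouble for aggregation, consider the classical propositional example known as the discursive dilemma. This is the plain factual setting, not the paper's modal construction, and the agenda and profile below are the usual textbook illustration. Every individual judgment set is consistent, yet propositionwise majority voting returns an inconsistent collective set.

```python
# Agenda: p, q, and their conjunction. Each judge is individually consistent.
profile = [
    {"p": True, "q": True, "p_and_q": True},
    {"p": True, "q": False, "p_and_q": False},
    {"p": False, "q": True, "p_and_q": False},
]


def is_consistent(judgment):
    """A judgment set is consistent if p_and_q agrees with p and q."""
    return judgment["p_and_q"] == (judgment["p"] and judgment["q"])


def majority(judgments):
    """Accept each proposition that a strict majority of judges accepts."""
    n = len(judgments)
    return {
        prop: 2 * sum(j[prop] for j in judgments) > n
        for prop in judgments[0]
    }


collective = majority(profile)
print(collective, is_consistent(collective))
```

The collective set accepts p and q but rejects their conjunction, so a rule that must always return consistent outcomes cannot simply follow the majority on each proposition.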

[MA-6] Self-Refining Topology Optimization via an LLM-Based Multi-Agent Framework

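Quick read: This paper addresses the heavy reliance on human decisions in topology optimization workflows: engineers repeatedly adjust numerical parameters and judge whether a converged design meets considerations beyond the explicit constraints, such as physical feasibility, which stands in the way of autonomous design. The key to the solution is TopOptAgents, a multi-agent system of six LLM-based agents that collaborate through iterative self-refinement cycles spanning problem formulation, validation, code generation and execution, and quality assessment of the optimized structure. This process corrects errors and progressively improves both the optimization setup and the resulting design. The benefits are most pronounced for problem classes to which the pretrained language model has had limited exposure, such as formulations whose literature and open-source implementations are comparatively sparse, which broadens the range of topology optimization problems that LLM-based automation can reliably address.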

Link: https://arxiv.org/abs/2605.23273
Authors: Hyunjee Park, Hayoung Chung
Affiliations: Ulsan National Institute of Science and Technology (UNIST)
Categories: Multiagent Systems (cs.MA)
Comments: 28 pages, 17 figures

Abstract:Topology optimization is a widely used design method that produces optimized material distributions for prescribed objectives and constraints through well-established numerical algorithms. Throughout the workflow, engineers make a series of decisions ranging from setting and adjusting numerical parameters to assessing whether the converged design meets considerations beyond those explicitly included in the optimization problem, such as physical feasibility. These decisions, which draw on domain expertise, interfere with the autonomous design process. To address this difficulty, this study presents TopOptAgents, a multi-agent system for automating not only the design process but also decision-making during the key stages of the topology optimization process. TopOptAgents consists of six LLM-based agents collaborating through iterative self-refinement cycles spanning problem formulation, validation, code generation and execution, and quality assessment of the optimized structure. This process enables error correction and progressive improvement of both the optimization setup and resulting design. The framework is demonstrated on optimization problems selected to cover a range of settings that differ in their literature coverage and numerical characteristics. The benefits of iterative self-refinement are found to be particularly pronounced for problem classes where the pretrained language model has limited prior exposure, such as formulations whose literature and open-source implementations are comparatively sparse. In such cases, the proposed framework reliably produces converged designs where a single state-of-the-art LLM struggles, suggesting that self-refinement broadens the range of topology optimization problems that LLM-based automation can reliably address.

[MA-7] GENSTRAT: Toward a Science of Strategic Reasoning in Large Language Models

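Quick read: This paper addresses the difficulty of anticipating how large language models (LLMs) behave as economic agents in marketplaces, auctions, and bidding settings. Existing strategic-reasoning benchmarks rely on fixed canonical games, which may saturate and do not generalize with confidence to the varied and messy strategic environments of real deployments. The key to the solution is GENSTRAT, which has two main parts. First, it procedurally generates a distribution of two-player zero-sum imperfect-information card games, so fresh games can be drawn on demand for evergreen evaluation and resistance to contamination. Second, it pairs the game distribution with a capability-profile methodology that decomposes model competence across six axes (state space, temporal depth, information sensitivity, opponent modeling, risk, and brittleness), together with a jaggedness measure that detects when a model's advantage jumps unpredictably between strategically similar games. The experiments show that models with near-identical overall strength can have qualitatively different capability profiles and local volatility, differences that the overall ranking alone cannot capture.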

Link: https://arxiv.org/abs/2605.23238
Authors: Vartan Shadarevian, Kia Ghods, Alex Kenich, Anany Kotawala
Affiliations: Princeton University; Google
Categories: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 33 pages, 8 figures, 9 tables (4 figures, 2 tables in main paper)

Abstract:Large language models (LLMs) are increasingly deployed as economic agents in marketplaces, auctions, and bidding settings. Anticipating their behavior in any specific deployment is hard. Existing strategic-reasoning benchmarks evaluate models on fixed canonical games. These benchmarks may saturate as the frontier improves, and they do not allow evaluators to generalize with confidence from benchmark performance to the varied and messy strategic environments that actual deployments involve. We introduce GENSTRAT, which uses procedurally generated strategic environments to address these challenges. Concretely, we generate a distribution of two-player zero-sum imperfect-information card games. The generator can draw fresh games on demand, allowing for evergreen evaluation and resistance to contamination. We pair the game distribution with a capability-profile methodology that decomposes model competence across six axes (state space, temporal depth, information sensitivity, opponent modeling, risk, and brittleness). We also introduce a jaggedness measure of within-distribution smoothness that detects when a model’s advantage jumps unpredictably between strategically similar games. We sample 50 benchmark games from a 2,000-game generated pool and evaluate nine frontier and open-weight LLMs in a head-to-head tournament with over 36,000 matches. Newer frontier-tier models score higher on average. Beyond that average, models with near-identical overall strength show qualitatively different capability profiles, and two of the top three leaderboard models (gpt-5 and claude) are noticeably more locally volatile than the third (gemini-3.1-pro), despite being close in overall strength. Together, the capability profile and the jaggedness measure give a deployment-relevant diagnostic that the overall ranking alone cannot provide.

[MA-8] CultivAgents: Cultivating Relationship-Centered Multi-Agent Systems for Personalized Gardening

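Quick read: This paper addresses the problem that existing digital gardening tools give generic advice that overlooks gardeners' skills, local ecologies, seasons, and cultural contexts. The key to the solution is CultivAgents, a relationship-centered multi-agent system that coordinates three specialized agents: an Experience Agent that adapts guidance to the user's skill level, an Environmental Agent that grounds advice in local and seasonal conditions, and an Ethnobotanical Agent that connects plants to cultural knowledge and histories. In a three-phase mixed-methods study, community gardeners reported increased confidence (3.00 to 3.60), motivation (4.00 to 4.40), and trust in acting on AI advice (3.20 to 4.00). Participants valued hyperlocal ecological guidance and complementary agent perspectives, and also identified limits in cultural specificity, ecological grounding, and agent coordination. The work advances relationship-centered AI and offers design implications for multi-agent systems that support food sovereignty, community resilience, and cultural preservation.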

Link: https://arxiv.org/abs/2605.23193
Authors: Yiyang Wang, Moeiini Reilly, Britney Johnson, Kefei Yan, Alex Cabral, Josiah Hester
Affiliations: Georgia Institute of Technology; Massachusetts Institute of Technology
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Computers and Society (cs.CY); Multiagent Systems (cs.MA)
Comments: Preprint, 9 pages. Website: this https URL

Abstract:Gardening is critical to support well-being, cultural continuity, and food autonomy, yet existing digital tools often provide generic advice that overlooks gardeners’ skills, local ecologies, seasons, and cultural contexts. We introduce CultivAgents, a relationship-centered multi-agent system for personalized, socio-culturally grounded gardening support. Grounded in ethics of care, CultivAgents coordinates multiple specialized agents: an Experience Agent that adapts guidance to users’ skill levels, an Environmental Agent that grounds advice in local and seasonal conditions, and an Ethnobotanical Agent that connects plants to cultural knowledge and histories. We evaluated CultivAgents through a three-phase mixed-methods study with domain experts (n=3), HCI researchers (n=7), and community gardeners (n=5), analyzing expert feedback, pre/post surveys, and participatory design activities. Results suggest that CultivAgents helped gardeners translate interest into situated action: community gardeners reported increased confidence (3.00 to 3.60), motivation (4.00 to 4.40), and trust in acting on AI advice (3.20 to 4.00). Participants valued hyperlocal ecological guidance and complementary agent perspectives, while also identifying limits in cultural specificity, ecological grounding, and agent coordination. The work advances relationship-centered AI, offering design implications for multi-agent systems that support food sovereignty, community resilience, and cultural preservation.

[MA-9] SVR-MAD: A Bayesian-Inspired Framework for Posterior-Guided Multi-Agent Debate

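Quick read: This paper addresses the scalability problem of Multi-Agent Debate (MAD), whose context grows rapidly in larger multi-agent settings. Existing methods prune low-utility communications using prior signals such as token-level log-likelihoods or LLM self-reported confidence, but these signals become unreliable under hallucination and degrade the accuracy of the MAD methods that rely on them. The key to the solution is SVR-MAD, a Bayesian-inspired MAD framework that treats pre-debate signals as priors and debate outcomes as posterior-style evidence for estimating agent correctness. It uses this evidence to construct the communication graph incrementally, prioritizing agents whose answers survive peer challenges. Experiments across multiple LLMs and benchmarks show that SVR-MAD reduces token cost by up to 61% while matching or improving accuracy relative to the most accurate competing MAD baseline.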

Link: https://arxiv.org/abs/2605.23099
Authors: Weifan Jiang, Rana Shahout, Minghao Li, Zhenting Qi, Yilun Du, Michael Mitzenmacher, Minlan Yu
Affiliations: Harvard University
Categories: Multiagent Systems (cs.MA)
Comments:

Abstract:Multi-Agent Debate (MAD) improves LLM-agent accuracy but suffers from rapid context growth, limiting scalability in larger multi-agent settings. Existing methods prune low-utility communications using prior signals, such as token-level log-likelihoods or LLM self-reported confidence. However, these signals become unreliable under hallucination, degrading the accuracy of MAD methods that rely on them. We propose SVR-MAD, a Bayesian-inspired MAD framework that treats pre-debate signals as priors and debate outcomes as posterior-style evidence for estimating agent correctness. SVR-MAD uses this evidence to incrementally construct the communication graph, prioritizing agents whose answers survive peer challenges. Experiments across multiple LLMs and benchmarks show that SVR-MAD reduces token cost by up to 61% while matching or improving accuracy relative to the most accurate competing MAD baseline.
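
One simple way to read "priors" and "posterior-style evidence" is a Beta-Bernoulli update, sketched below. This is an illustrative assumption and not the SVR-MAD algorithm: the function name, the `prior_strength` parameter, and the counts of survived and failed peer challenges are all hypothetical.

```python
def posterior_correctness(prior_confidence, survived, failed, prior_strength=2.0):
    """Posterior mean of a Beta-Bernoulli model of agent correctness.

    prior_confidence: pre-debate signal in [0, 1], e.g. self-reported confidence.
    survived / failed: number of peer challenges the answer survived or lost.
    prior_strength: pseudo-count weight given to the prior (hypothetical).
    """
    alpha = prior_confidence * prior_strength + survived
    beta = (1.0 - prior_confidence) * prior_strength + failed
    return alpha / (alpha + beta)


agents = {
    # High self-reported confidence, but the answer lost every challenge.
    "overconfident": posterior_correctness(0.9, survived=0, failed=3),
    # Moderate confidence, and the answer survived every challenge.
    "modest": posterior_correctness(0.6, survived=3, failed=0),
}
# Agents with higher posterior estimates would be prioritized for communication.
ranked = sorted(agents, key=agents.get, reverse=True)
print(ranked, agents)
```

The point of the sketch is that debate outcomes can override a misleading prior, which matches the abstract's concern that self-reported confidence is unreliable under hallucination.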

[MA-10] How to Steer Your Multi-Agent System: Human-LLM Collaborative Planning

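Quick read: This paper addresses the difficulty humans have in managing plans in orchestrated multi-agent systems, owing to plan complexity and limited transparency. Existing approaches rely on outcome-level supervision, where users verify only final outputs without visibility into intermediate reasoning. The key to the solution is a design space for human-LLM co-planning along three axes: mode (semantic vs. structural), scope (global vs. targeted), and level (low vs. high-level edits). The design space is realized in AMBIPOM, a prototype that supports process-level supervision through both semantic and structural interactions. A user study characterizes how users navigate this space, revealing hybrid workflows and effort-control-risk trade-offs, and a controlled benchmark analyzes how LLMs revise plans under varying scope and revision strategies. The findings yield design insights for more transparent, controllable, and effective human-AI co-planning.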

Link: https://arxiv.org/abs/2605.23023
Authors: Zeyu He, Hannah Kim, Dan Zhang, Estevam Hruschka
Affiliations: Penn State University; Megagon Labs
Categories: Multiagent Systems (cs.MA); Human-Computer Interaction (cs.HC)
Comments: ACM Conference on AI and Agentic Systems (CAIS) 2026

Abstract:In orchestrated multi-agent systems, humans often struggle to manage plans due to their complexity and limited transparency. Existing approaches rely on outcome-level supervision, where users verify only final outputs without visibility into intermediate reasoning. We formalize a design space for human-LLM co-planning interactions along three axes: mode (semantic vs. structural), scope (global vs. targeted), and level (low vs. high-level edits). We realize it in AMBIPOM, a prototype supporting process-level supervision through both semantic and structural interactions. Through a user study, we characterize how users navigate this space, revealing hybrid workflows and effort-control-risk trade-offs; through a controlled benchmark, we analyze how LLMs revise plans under varying scope and revision strategies. Our findings yield design insights for more transparent, controllable, and effective human-AI co-planning. We release code and data at this https URL.

[MA-11] MARGIN: Runtime Confidence Calibration for Multi-Agent Foundation Model Coordination

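Quick read: This paper addresses the difficulty a coordinator faces in deciding which agent's response to trust in a multi-agent deployment. The self-reported confidence of foundation models is systematically miscalibrated and, on hard tasks, inversely correlated with accuracy. Design-time calibration methods such as temperature scaling and Platt scaling cannot solve this because they fit a fixed correction to held-out data and degrade under distribution shift. The key to the solution is MARGIN (Multi Agent Runtime Grading via Incremental Normalization), an online calibration method that requires no model access, no held-out data, and no retraining. It learns per-agent, per-confidence-band calibration factors from the task stream itself, using symmetric exponentially weighted moving averages with Bayesian shrinkage blending, and has three hyperparameters with robust defaults. Across 19 foundation models, 8 benchmarks, and over 50,000 observations, MARGIN achieves 3-6x lower calibration error than the best design-time baseline under distribution shift. In multi-agent selection it raises pairwise resolution from 45-56% with raw verbalized confidence to 70-89%, and it surpasses the always-best-model oracle on three of four benchmarks.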

Link: https://arxiv.org/abs/2605.22949
Authors: Joss Armstrong
Affiliations: Ericsson
Categories: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:

Abstract:Foundation model agents increasingly operate in multi-agent deployments where a coordinator must decide which agent’s response to trust. The standard approach weights agents by their self-reported confidence, but recent evidence shows that foundation model confidence is systematically mis-calibrated and, on hard tasks, inversely correlated with accuracy. Design-time calibration methods (temperature scaling, Platt scaling, histogram binning) cannot address this problem because they fit a fixed correction to held-out data and degrade under distribution shift. We present MARGIN (Multi Agent Runtime Grading via Incremental Normalization), an online calibration method that learns per-agent, per-confidence-band calibration factors from the task stream itself, requiring no model access, no held-out data, and no retraining. MARGIN uses symmetric exponentially weighted moving averages with Bayesian shrinkage blending, and has three hyperparameters with robust defaults. Across 19 foundation models, 8 benchmarks, and over 50,000 observations, MARGIN achieves 3-6x lower calibration error than the best design-time baseline under distribution shift. In multi-agent selection, raw verbalized confidence produces pairwise resolution worse than random (45-56%) on hard benchmarks. MARGIN corrects this completely, raising pairwise resolution to 70-89% and surpassing the always-best-model oracle on three of four benchmarks. Six formal propositions characterize convergence, tracking speed, and the optimality of symmetric updates for non-strategic agents, with all predictions illustrated empirically.
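
The general idea of per-agent, per-band online calibration can be sketched as follows. This is a minimal illustration under stated assumptions and not the MARGIN algorithm: the class name, the update rule, the form of the shrinkage weight, and the three parameters (`alpha`, `shrinkage`, `n_bands`) are hypothetical choices that merely echo the ingredients named in the abstract.

```python
class OnlineCalibrator:
    """Tracks observed accuracy per (agent, confidence band) with an EWMA."""

    def __init__(self, alpha=0.1, shrinkage=5.0, n_bands=5):
        self.alpha = alpha          # EWMA learning rate
        self.shrinkage = shrinkage  # pseudo-count pulling toward raw confidence
        self.n_bands = n_bands      # number of equal-width confidence bands
        self.accuracy = {}          # (agent, band) -> EWMA of correctness
        self.count = {}             # (agent, band) -> number of observations

    def _band(self, confidence):
        return min(int(confidence * self.n_bands), self.n_bands - 1)

    def update(self, agent, confidence, correct):
        """Fold one graded task into the running estimate.

        The same rate is used for correct and incorrect outcomes.
        """
        key = (agent, self._band(confidence))
        outcome = 1.0 if correct else 0.0
        previous = self.accuracy.get(key, confidence)
        self.accuracy[key] = (1 - self.alpha) * previous + self.alpha * outcome
        self.count[key] = self.count.get(key, 0) + 1

    def calibrated(self, agent, confidence):
        """Blend observed accuracy with raw confidence; trust data more as it accrues."""
        key = (agent, self._band(confidence))
        n = self.count.get(key, 0)
        if n == 0:
            return confidence
        weight = n / (n + self.shrinkage)
        return weight * self.accuracy[key] + (1 - weight) * confidence


calibrator = OnlineCalibrator()
for _ in range(50):
    # An agent that claims 95% confidence but is wrong every time.
    calibrator.update("agent_a", 0.95, correct=False)
print(calibrator.calibrated("agent_a", 0.95))
```

With few observations the estimate stays close to the raw confidence, and with many it follows the observed accuracy, which is the usual motivation for shrinkage in this setting.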

[MA-12] Multi-Dimensional Matching in Market Design

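Quick read: This paper addresses the computational and cognitive burden in multi-dimensional matching markets that arises when agents must give complete utility assessments over all objects. Instead, agents report preferences over object features. The key to the solution is to use Singular Value Decomposition (SVD) to identify the principal direction of variation in feature space and to match agents to objects along this single dimension, which reduces a complex multi-dimensional problem to an effectively one-dimensional problem solvable in $O(N \log N)$ time. When the data exhibit low effective dimensionality, the mechanism approximately maximizes Nash Social Welfare, satisfies distributional truthfulness, and achieves symmetry. A novel connection between Nash Social Welfare and Geometric Distributionally Robust Optimization provides robustness guarantees. Numerical experiments show that the approach achieves 99% of optimal welfare while running three orders of magnitude faster than direct optimization. The framework applies to school choice, labor markets, and course allocation.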

Link: https://arxiv.org/abs/2605.22865
Authors: Irene Aldridge
Affiliations: Unknown
Categories: Computer Science and Game Theory (cs.GT); Emerging Technologies (cs.ET); Multiagent Systems (cs.MA); Econometrics (econ.EM); Theoretical Economics (econ.TH)
Comments: 27 pages

Abstract:This paper proposes a computationally efficient mechanism for multi-dimensional matching markets where agents report preferences over object features rather than complete utility assessments. We use Singular Value Decomposition (SVD) to identify the principal direction of variation in feature space and match agents to objects along this dimension, reducing a complex multi-dimensional problem to an effectively one-dimensional problem solvable in $O(N \log N)$ time. We show that when data exhibit low effective dimensionality, our mechanism approximately maximizes Nash Social Welfare, satisfies distributional truthfulness, and achieves symmetry. We establish a novel connection between Nash Social Welfare and Geometric Distributionally Robust Optimization, providing robustness guarantees. Numerical experiments demonstrate that our approach achieves 99% optimal welfare while running three orders of magnitude faster than direct optimization. The framework applies naturally to school choice, labor markets, and course allocation, where feature-based elicitation reduces the cognitive burden on agents.
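
The reduction to one dimension can be sketched as follows. This is a minimal illustration, assuming that the principal direction is the top right singular vector of the object feature matrix and that agents and objects are matched rank to rank after projection. The abstract does not state which matrix is decomposed or how ties are handled, so these choices and all function names are hypothetical. The singular vector is approximated with power iteration to keep the example dependency-free.

```python
import math


def principal_direction(rows, iterations=200):
    """Approximate the top right singular vector of `rows` by power iteration on X^T X."""
    dim = len(rows[0])
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iterations):
        xv = [sum(r[k] * v[k] for k in range(dim)) for r in rows]
        w = [sum(rows[i][k] * xv[i] for i in range(len(rows))) for k in range(dim)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            return v  # degenerate input; keep the current direction
        v = [c / norm for c in w]
    return v


def match_along_direction(agent_prefs, object_features):
    """Project both sides onto one direction, sort, and pair rank to rank."""
    v = principal_direction(object_features)

    def score(vec):
        return sum(a * b for a, b in zip(vec, v))

    # Sorting dominates the cost once the direction is known.
    agents = sorted(range(len(agent_prefs)), key=lambda i: score(agent_prefs[i]))
    objects = sorted(range(len(object_features)), key=lambda j: score(object_features[j]))
    return dict(zip(agents, objects))


# Hypothetical data: three objects and three agents in a 2-D feature space.
object_features = [[1.0, 0.9], [3.0, 3.1], [2.0, 2.0]]
agent_prefs = [[0.9, 0.8], [0.1, 0.2], [0.5, 0.5]]
matching = match_along_direction(agent_prefs, object_features)
print(matching)  # maps agent index to object index
```

Flipping the sign of the direction reverses both orderings at once, so the rank-to-rank pairing does not depend on which sign the iteration converges to.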

[MA-13] Evaluating Large Language Models in a Complex Hidden Role Game

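Quick read: This paper addresses a key AI safety question: how to quantify the deceptive potential of Large Language Models (LLMs), which is difficult to measure in uncontrolled environments. The core of the solution is an open-source framework with three novel metrics, Role Identification Accuracy, Deception Retention Rate, and Game State Impact Rate, used to evaluate the reasoning, persuasion, and deceptive capabilities of LLMs in the social deduction game Secret Hitler. The study finds a gap between conversational ability and strategic depth. Models playing as Fascists consistently yield negative impact scores and fail to sustain deception, resulting in roughly 40% shorter games compared to humans. Neither Chain-of-Thought prompting nor internal memory improves performance, with win rates for fascist roles up to 23.2% worse. These findings suggest that current architectures remain ineffective at complex, multi-turn manipulation, and they underline the importance of detecting when models begin to master such deceptive behaviors.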

Link: https://arxiv.org/abs/2605.22826
Authors: Niklas Bauer
Affiliations: University of Göttingen
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Comments: Master’s thesis, University of Göttingen

Abstract:Quantifying the deceptive potential of Large Language Models (LLMs) is critical for AI safety, yet difficult to achieve in uncontrolled environments. This work investigates the reasoning, persuasion, and deceptive capabilities of LLMs within the social deduction game Secret Hitler. I introduce an open-source framework and novel metrics to measure performance: Role Identification Accuracy, Deception Retention Rate, and Game State Impact Rate. By benchmarking models against rule-based algorithms and human games, I identify a gap between conversational ability and strategic depth. The study also analyzes the impact of reasoning-enhancement techniques on win rates and strategic reasoning. Neither Chain-of-Thought prompting nor internal memory bring improvements in performance, with up to 23.2% worse win rates for fascist roles. While rule-based agents align with expert human voting decisions 86.7% of the time, models like Llama 3.1 70B achieve only a 59.7% accuracy. Models playing as Fascists consistently yield negative impact scores and fail to sustain deception, resulting in roughly 40% shorter games compared to humans. These findings suggest that current architectures remain ineffective at complex, multi-turn manipulation. As capabilities advance, detecting when models begin to master these deceptive behaviors is crucial. The developed framework serves as a reproducible testbed for future alignment research.

[MA-14] Optimal Design Framework for Distributed Array Using Magnetically-Actuated Satellite Swarm

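Quick read: This paper addresses the limits on the feasible aperture of distributed space antennas built on electromagnetic formation flight (EMFF), which is governed by coupled constraints on antenna performance, satellite mass, power generation, coil geometry, and formation-keeping power. The key to the solution is a system-level design framework that links phased-array requirements with satellite-level sizing constraints and incorporates the formation-maintenance requirement through a control index derived from distributed-control simulations. This index is integrated into an antenna-aperture maximization problem with sizing, power, coil, and sidelobe-envelope constraints. The framework evaluates feasible static grid-based EMFF antenna configurations under a fixed system mass, and parametric case studies on margin magnetic moment, prescribed transmit power, and inter-satellite spacing show which constraints dominate feasibility in each setting.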

Link: https://arxiv.org/abs/2605.23481
Authors: Seang Shim, Yuta Takahashi, Naoto Usami, Shin-ichiro Sakai
Affiliations: The Graduate University for Advanced Studies; Institute of Science Tokyo; Interstellar Technologies Inc.; Japan Aerospace Exploration Agency
Categories: Signal Processing (eess.SP); Multiagent Systems (cs.MA)
Comments: Submitted to IEEE Access and currently under review

点击查看摘要

Abstract:Distributed space antennas using electromagnetic formation flight (EMFF) are a promising architecture for large-aperture, long-life space communication systems. Their feasible aperture, however, is governed by coupled constraints on antenna performance, satellite mass, power generation, coil geometry, and formation-keeping power. This paper proposes a system-level design framework for EMFF-based distributed space antennas. It links phased-array requirements with satellite-level sizing constraints and provides a static grid-based reference for designing feasible apertures under a fixed system mass. Unlike our previous bucket-brigade disturbance-compensation model, the formation-maintenance requirement is incorporated through a control index derived from distributed-control simulations. This index is integrated into an antenna-aperture maximization problem with sizing, power, coil, and sidelobe-envelope constraints. Parametric case studies examine margin magnetic moment, prescribed transmit power, and large inter-satellite spacing. Results show that increasing system mass improves footprint reduction or effective isotropic radiated power only while satellite-level design headroom remains. In direct-to-device cases with 0.15 m spacing, generated-power and coil-geometry constraints dominate the feasible aperture. In the 0.60 m large-spacing case, the required coil burden can exceed satellite-level mass, size, and power capacities, making the design infeasible despite favorable communication performance. The proposed framework enables the design and evaluation of feasible static grid-based EMFF distributed antennas under coupled antenna, satellite, and control constraints.

[MA-15] Strategic Coercion Within Alliances: The Greenland Sovereignty Game as an AI Stress Test

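[Quick read]: This paper asks how collective-action dilemmas inside an alliance evolve when its strongest member pressures a weaker member over territory and strategic control, and whether the alliance can still enforce its norms against the dominant member. The key to the approach is to build three games (asymmetric coercion; a NATO assurance game with a critical-mass tipping point; a triadic extensive-form game with social preferences) and test them in a multi-agent simulation in which 8 frontier large language models (LLMs) play 6 geopolitical roles across 3,604 completed games. Inverse game theory is then used to estimate each model's structural utility parameters (α, β, γ, δ, η), which quantify material self-interest, reciprocity, inequality aversion, norm respect, and commitment consistency. The study finds that coercion framing makes models more escalatory (four-action escalation rises from 10.7% to 28.6%), that Chinese-origin models show different power-weight profiles from Western-origin models when playing the U.S. role, and that peaceful acquisition of Greenland is rare (1.9% of clean games), achieved most prominently by DeepSeek V3.2 through a stable five-round playbook. Prompts emphasizing jus cogens and self-determination reduce escalation back to near baseline. The work is positioned as a structural benchmark for LLM geopolitical behavior that complements action-frequency benchmarks.

Link: https://arxiv.org/abs/2605.22841
Authors: Rommin Adl,Peyton Williams
Affiliations: Unknown
Categories: Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA); General Economics (econ.GN)
Comments: 78 pages, 17 figures, 18 tables. Multi-agent LLM simulation recovering structural utility parameters across 8 frontier models in the Greenland sovereignty crisis. v3: typo pass, fixes phantom action names (REQUEST_MULTILATERAL, INDEPENDENT) and a Blunden date mismatch. v2 added Section V safety findings (legitimacy-laundered escalation, signal decoupling) and Appendix H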

Abstract:What happens when the strongest alliance member pressures a weaker member over territory and strategic control? We examine the Greenland sovereignty crisis as a stress test for LLM geopolitics, centered on the 2019-2026 U.S. push to acquire Greenland from the Kingdom of Denmark. The crisis nests two collective-action problems: Arctic strategic control and whether NATO can enforce alliance norms against the dominant member. We develop three games (asymmetric coercion; a NATO assurance game with a critical-mass tipping point; a triadic extensive-form game with social preferences) and test them with a multi-agent simulation in which eight frontier LLMs play six geopolitical roles (United States, Denmark, Greenland, NATO, Russia, Canada) across 3,604 completed games and 108,120 action observations. Using inverse game theory, we recover each model’s structural utility parameters (alpha, beta, gamma, delta, eta) for material self-interest, reciprocity, inequality aversion, norm respect, and commitment consistency. Three findings stand out. First, all eight models become more escalatory under coercion framing (four-action escalation rises from 10.7% to 28.6%). Second, Chinese-origin models show systematically different power-weight profiles from Western-origin models when playing the U.S. role. Third, peaceful US acquisition emerges in only 1.9% of clean games and only 3 of 8 frontier models ever achieve it, most prominently DeepSeek V3.2, which executes a stable five-round playbook through the metropole. Prompts emphasizing jus cogens and self-determination reduce escalation back near baseline in the English-only confirmatory sample; multilingual contrasts are reported as exploratory sensitivity checks. We position this as a structural benchmark for LLM geopolitical behavior, complementing action-frequency benchmarks.

[MA-16] Computable Fairness: Boltzmann-Softmax Control for AI Resource Allocation

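[Quick read]: This paper addresses resource allocation (such as GPU compute time and bandwidth) in large-scale AI systems, where conventional efficiency-driven policies can allocate resources unevenly, produce dominance concentration, and undermine system diversity and stability. The key to the solution is a framework called Computable Fair Division (CFD), which reinterprets the Boltzmann-Softmax function as a probabilistic resource allocation mechanism and treats the inverse temperature parameter $\beta$ as a computable control variable that governs the balance between efficiency and fairness. The paper further designs AHC++ (Adaptive Hard-Cap Controller++), which adjusts $\beta$ in real time using the error between observed dominance and a policy-specified target as feedback. Under exogenous shocks it suppresses extreme dominance concentration while tracking fairness targets without substantial throughput degradation.

Link: https://arxiv.org/abs/2605.22827
Authors: Ji-Won Park,Chae Un Kim
Affiliations: Unknown
Categories: Applied Physics (physics.app-ph); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Performance (cs.PF)
Comments: 40 pages, 12 figures, 5 tables. Code: this https URL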

Abstract:In large-scale AI systems, allocating scarce resources such as GPU compute time and bandwidth among multiple agents is a critical challenge. Conventional policies focus on efficiency metrics, potentially leading to dominance concentration that undermines system diversity and stability. We propose Computable Fair Division (CFD), a framework that reinterprets the Boltzmann-Softmax function not as a selection tool but as a probabilistic resource allocation mechanism, redefining the inverse temperature parameter $\beta$ as a computable control variable governing the efficiency-fairness balance. Static analysis reveals a Pareto frontier with a near-optimal Stability Corridor where total loss remains approximately constant across policy weights. In the dynamic setting, AHC++ (Adaptive Hard-Cap Controller++) updates $\beta$ in real time using the error between observed dominance and a policy-specified target as feedback. Simulations show that AHC++ suppresses extreme dominance concentration under exogenous shocks while tracking fairness targets without substantial throughput degradation. Scalability analysis confirms that a 100x increase in agents yields only approximately 5.5x increase in execution time. Code: this https URL
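
To make the role of the inverse temperature concrete, here is a minimal Python sketch of a Boltzmann-Softmax allocation with a feedback loop on beta. It is not the paper's AHC++ controller: the proportional update rule, the `gain` parameter, the use of the largest share as the dominance measure, and the toy utilities are all assumptions chosen for illustration.

```python
import math

def softmax_allocation(utilities, beta):
    """Split one unit of resource with shares proportional to exp(beta * u)."""
    m = max(utilities)  # subtract the max for numerical stability
    weights = [math.exp(beta * (u - m)) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]

def update_beta(beta, shares, target_dominance, gain=1.0):
    """Hypothetical proportional update: lower beta when the largest share is above target."""
    error = max(shares) - target_dominance
    return max(0.0, beta - gain * error)

utilities = [1.0, 2.0, 4.0]  # illustrative per-agent efficiency scores (assumed)
beta = 3.0                   # start in an efficiency-heavy regime
for _ in range(200):
    shares = softmax_allocation(utilities, beta)
    beta = update_beta(beta, shares, target_dominance=0.5)
final_shares = softmax_allocation(utilities, beta)
```

With beta at 0 the allocation is uniform, and as beta grows the most efficient agent takes nearly everything. The loop settles on the beta at which the largest share matches the chosen target.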

Natural Language Processing

[NLP-0] SkillOpt: Executive Strategy for Self-Evolving Agent Skills

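[Quick read]: This paper addresses the lack of systematic, controllable training for agent skills: hand-crafted skills, one-shot generated skills, and loosely controlled self-revision do not behave like a deep-learning optimizer and do not reliably improve under feedback. The key to the solution is SkillOpt, described as the first systematic text-space optimizer for agent skills. It treats the skill as the external state of a frozen agent. A separate optimizer model turns scored rollouts into bounded add/delete/replace edits on a single skill document, and an edit is accepted only when it strictly improves a held-out validation score. A textual learning-rate budget, a rejected-edit buffer, and an epoch-wise slow update keep training stable without adding inference-time model calls. Across six benchmarks, seven target models, and three execution harnesses (direct chat, Codex, Claude Code), SkillOpt is best or tied for best, and it raises GPT-5.5 accuracy by up to +24.8 points. The optimized skills also transfer across model scales, execution environments, and tasks.

Link: https://arxiv.org/abs/2605.23904
Authors: Yifan Yang,Ziyang Gong,Weiquan Huang,Qihao Yang,Ziwei Zhou,Zisu Huang,Yan Li,Xuemei Gao,Qi Dai,Bei Liu,Kai Qiu,Yuqing Yang,Dongdong Chen,Xue Yang,Chong Luo
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 27 pages, 4 figures, 6 tables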

Abstract:Agent skills today are hand-crafted, generated one-shot, or evolved through loosely controlled self-revision, none of which behaves like a deep-learning optimizer for the skill, and none of which reliably improves over its starting point under feedback. We argue the skill should instead be trained as the external state of a frozen agent, with the same discipline that makes weight-space optimization reproducible. SkillOpt is, to our knowledge, the first systematic controllable text-space optimizer for agent skills: a separate optimizer model turns scored rollouts into bounded add/delete/replace edits on a single skill document, and an edit is accepted only when it strictly improves a held-out validation score. A textual learning-rate budget, rejected-edit buffer, and epoch-wise slow/meta update make skill training stable while adding zero inference-time model calls at deployment. Across six benchmarks, seven target models, and three execution harnesses (direct chat, Codex, Claude Code), SkillOpt is best or tied on all 52 evaluated (model, benchmark, harness) cells and beats every per-cell competitor among human, one-shot LLM, Trace2Skill, TextGrad, GEPA, and EvoSkill skills. On GPT-5.5 it lifts the average no-skill accuracy by +23.5 points in direct chat, by +24.8 inside the Codex agentic loop, and by +19.1 inside Claude Code. Transfer experiments further show that optimized skill artifacts retain value when moved across model scales, between Codex and Claude Code execution environments, and to a nearby math benchmark without further optimization.
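
The acceptance rule described in the abstract (bounded add/delete/replace edits, kept only on a strict held-out improvement) can be sketched as a small greedy loop. This is only an illustration: the edit encoding, the `toy_score` function standing in for a held-out evaluation, and the fixed list of candidate edits standing in for the optimizer model are all hypothetical, and the learning-rate budget and slow/meta update are not modeled.

```python
def apply_edit(skill_lines, edit):
    """Apply one edit: ("add", text), ("delete", index) or ("replace", index, text)."""
    lines = list(skill_lines)
    op = edit[0]
    if op == "add":
        lines.append(edit[1])
    elif op == "delete":
        del lines[edit[1]]
    elif op == "replace":
        lines[edit[1]] = edit[2]
    else:
        raise ValueError(f"unknown edit op: {op}")
    return lines

def optimize_skill(skill_lines, candidate_edits, validation_score):
    """Greedy loop: keep an edit only if the held-out score strictly improves."""
    best, best_score = list(skill_lines), validation_score(skill_lines)
    rejected = []  # rejected edits are kept so a proposer could avoid repeating them
    for edit in candidate_edits:
        trial = apply_edit(best, edit)
        trial_score = validation_score(trial)
        if trial_score > best_score:
            best, best_score = trial, trial_score
        else:
            rejected.append(edit)
    return best, best_score, rejected

def toy_score(lines):
    """Toy stand-in for a held-out evaluation: reward 'verify', penalize length."""
    return sum("verify" in line for line in lines) - 0.1 * len(lines)

skill = ["Read the task carefully.", "Answer quickly."]
edits = [
    ("add", "Always verify the final answer."),
    ("replace", 1, "Answer quickly and skip checks."),
    ("delete", 1),
]
final_skill, final_score, rejected = optimize_skill(skill, edits, toy_score)
```

In this toy run the second edit leaves the score unchanged, so the strict inequality sends it to the rejected buffer instead of accepting a neutral change.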

[NLP-1] ETCHR: Editing To Clarify and Harness Reasoning

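[Quick read]: This paper addresses a limitation of Multimodal Large Language Models (MLLMs) in visual reasoning: a purely textual chain of thought struggles with questions that need fine-grained focus or view transformations. Existing "think with images" methods are either constrained by a fixed, predefined toolkit or produce noisy intermediate images from unified multimodal models. The key to the solution is ETCHR (Editing To Clarify and Harness Reasoning), a decoupled, question-conditioned, reasoning-aware image editor trained in two stages. The first stage uses supervised fine-tuning on edit trajectories to close the language-side gap (the inability to map an abstract question to an appropriate visual transformation). The second stage uses rewards derived from a vision-language model (VLM) to improve edit correctness and downstream reasoning accuracy. Because the editor is fully decoupled from the understanding model, ETCHR plugs into both open- and closed-source MLLMs without further training, and it raises average Pass@1 across five task families (fine-grained perception, chart understanding, logic reasoning, jigsaw restoration, and 3D understanding) by up to +5.47.

Link: https://arxiv.org/abs/2605.23897
Authors: Beichen Zhang,Yuhong Liu,Jinsong Li,Yuhang Zang,Jiaqi Wang,Dahua Lin
Affiliations: The Chinese University of Hong Kong; Shanghai AI Laboratory; Shanghai Jiao Tong University; Shanghai Innovation Institute; CPII under InnoHK
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Code, model and data are open-sourced at this https URL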

Abstract:Multimodal Large Language Models have advanced visual reasoning, yet a purely textual chain of thought remains a bottleneck for questions that require fine-grained focus or view transformations. The "think with images" paradigm narrows this gap, but existing approaches are either constrained by fixed predefined toolkits or produce noisy intermediate images from unified multimodal methods. We pursue a third option: using a dedicated image editing model and decoupling it from the understanding model. However, off-the-shelf image editors fail as reasoning assistants because of two complementary gaps: a language-side gap, where editors trained as passive instruction-followers cannot map an abstract question to an appropriate visual transformation, and a generation-side gap, where edit correctness degrades as reasoning depth grows. Guided by this analysis, we introduce ETCHR (Editing To Clarify and Harness Reasoning), a question-conditioned, reasoning-aware image editor decoupled from the downstream understanding model and trained with a two-stage recipe targeted at the two gaps: Reasoning Imitation via supervised fine-tuning on edit trajectories, followed by Reasoning Enhancement with VLM-derived rewards for edit correctness and downstream reasoning accuracy. Since the editor is decoupled, ETCHR plugs into different open- and closed-source MLLMs in a training-free manner. Across five task families (fine-grained perception, chart understanding, logic reasoning, jigsaw restoration, and 3D understanding), ETCHR raises average Pass@1 from 55.95 to 60.77 (+4.82) with Qwen3-VL-8B, from 65.08 to 70.55 (+5.47) with Gemini-3.1-Flash-Lite, and from 76.55 to 81.16 (+4.61) with the 1T-parameter MoE model Kimi K2.5.

[NLP-2] Multilingual Knowledge Transfer under Data Constraints via Lexical Interventions

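[Quick read]: This paper addresses the weak performance of multilingual language models on low-resource languages, in particular how to transfer capabilities such as scientific reasoning, commonsense reasoning, and world knowledge from a high-resource language (English) when target-language training data is scarce. Existing methods usually depend on large parallel corpora, translation systems, or additional training stages, which are unavailable for most low-resource languages. The key to the solution is LINK, a data-level intervention: during pretraining, randomly selected words in the high-resource (English) training data are replaced with word-level translations from a bilingual vocabulary, which strengthens cross-lingual knowledge transfer. The method needs no additional model training, only a bilingual vocabulary that is cheap to obtain. Experiments on 8 languages and 5 model sizes show clear gains on target-language downstream tasks, and the training time needed to reach equivalent performance drops by up to half.

Link: https://arxiv.org/abs/2605.23885
Authors: Anastasiia Sedova,Natalie Schluter,Skyler Seto,Maartje ter Hoeve
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: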

Abstract:Cross-lingual knowledge transfer is critical for building high-performing multilingual language models for languages with insufficient training data. When target language data is scarce, the knowledge required for many downstream tasks involving scientific reasoning, commonsense inference, and world knowledge must be acquired primarily from the high-resource language, making effective knowledge transfer essential. Existing methods for improving such cross-lingual knowledge transfer require large amounts of parallel data, translation systems, auxiliary models, or additional training stages that are largely unavailable for many languages. We propose LINK, a data-level intervention method that improves knowledge transfer during model pretraining through lexical substitutions in the high-resource part of the pretraining data using bilingual vocabularies. For a given replacement ratio, randomly selected words in a portion of the high-resource (English) training corpus are swapped with their word-level translations, requiring no additional model training and only a bilingual vocabulary, which can be obtained at near-zero cost for virtually any language. Evaluation on eight languages across five model sizes shows notable improvements on downstream tasks in the target language, with up to a 2x speedup in training to reach equivalent performance.
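
A word-level substitution of this kind can be sketched in a few lines of Python. This is an illustration and not the authors' pipeline: whitespace tokenization, applying the ratio to translatable words only, and the three-entry English-to-German vocabulary are all assumptions.

```python
import random

def substitute_words(text, bilingual_vocab, ratio, rng):
    """Swap a random share of the translatable words for their word-level translations."""
    tokens = text.split()
    # Positions whose lowercased token has an entry in the bilingual vocabulary.
    candidates = [i for i, tok in enumerate(tokens) if tok.lower() in bilingual_vocab]
    n_swap = round(ratio * len(candidates))
    for i in rng.sample(candidates, n_swap):
        tokens[i] = bilingual_vocab[tokens[i].lower()]
    return " ".join(tokens)

# Toy English-to-German vocabulary, for illustration only.
vocab = {"river": "Fluss", "water": "Wasser", "cold": "kalt"}
sentence = "the river water is cold and the water is clear"
mixed = substitute_words(sentence, vocab, ratio=0.5, rng=random.Random(0))
```

Passing the random generator in explicitly keeps the substitution reproducible, which matters when the same corpus has to be regenerated for different replacement ratios.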

[NLP-3] Strong Teacher Not Needed? On Distillation in LLM Pretraining

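[Quick read]: This paper asks whether knowledge distillation in Large Language Model (LLM) pretraining always depends on the "strong teacher, weak student" assumption, that is, whether only a stronger teacher can improve the student. The conventional view holds that the stronger the teacher (more parameters or more training), the better the distillation; the paper tests the limits of that assumption with systematic experiments. The key finding is that, with a proper mix of the language modeling loss and the knowledge distillation loss, even small and undertrained teachers can improve larger students. Conversely, pushing the teacher further (more parameters or more training tokens) can saturate or even reverse the distillation gains. The study also finds that distillation improves generalization (out-of-distribution and downstream performance) more readily than in-domain fitting. These results challenge the common belief that the teacher must be strong.

Link: https://arxiv.org/abs/2605.23857
Authors: Taiming Lu,Zhuang Liu
Affiliations: Princeton University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: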

Abstract:Knowledge distillation generally assumes a strong-to-weak relationship where stronger teachers yield better students. In this work, we examine this assumption about distillation in large language model pretraining. By varying architecture sizes and training token budgets, we create strong-to-weak, same-level, and weak-to-strong teacher-student relationships, and study distillation’s effectiveness under each. We find that the teacher need not be strong: with proper mixing of the language modeling and knowledge distillation losses, even small and undertrained teachers improve larger students. At the same time, a stronger teacher is not always better: pushing the teacher further, through more parameters or more training tokens, can saturate or even reverse the distillation gains. We further observe that distillation improves generalization (out-of-distribution and downstream performance) more readily than in-domain fitting. Together, these results challenge the common belief that distillation pretraining always requires a strong teacher.
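
The "mixing of the language modeling and knowledge distillation losses" can be illustrated for a single token position. The convex combination below, with cross-entropy as the language modeling term and KL(teacher || student) as the distillation term, is a common formulation assumed here for illustration. The abstract does not state the paper's exact loss or weighting, and `kd_weight` is a hypothetical name.

```python
import math

def softmax(logits):
    """Convert logits to probabilities, subtracting the max for stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mixed_loss(student_logits, teacher_logits, target_index, kd_weight):
    """(1 - kd_weight) * cross-entropy on the true token + kd_weight * KL(teacher || student)."""
    p_student = softmax(student_logits)
    p_teacher = softmax(teacher_logits)
    lm_loss = -math.log(p_student[target_index])
    kd_loss = sum(
        t * (math.log(t) - math.log(s))
        for t, s in zip(p_teacher, p_student)
        if t > 0
    )
    return (1 - kd_weight) * lm_loss + kd_weight * kd_loss

student = [2.0, 0.5, -1.0]   # toy next-token logits over a 3-word vocabulary
teacher = [1.5, 1.0, -0.5]
loss = mixed_loss(student, teacher, target_index=0, kd_weight=0.5)
```

Setting `kd_weight` to 0 recovers plain language modeling, and setting it to 1 trains on the teacher distribution alone. The abstract's claim is that a suitable intermediate mix lets even a weak teacher help.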

[NLP-4] Decomposing Queries into Tool Calls for Long-Video Keyframe Retrieval

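[Quick read]: This paper addresses how to select keyframes that provide verifiable visual evidence for long-video question answering (QA). Existing methods either score every frame against the whole query or decompose the query into a fixed schema evaluated by a single visual tool, which does not adapt well to diverse queries. The key to the solution is ToolMerge: a Large Language Model (LLM) acts as a planner that decomposes the query into multiple tool calls and specifies how the per-tool rankings are merged with boolean operators, which gives flexible, semantics-aware keyframe retrieval. On the newly constructed Molmo-2 Moments (M2M) benchmark, ToolMerge is competitive with prior keyframe selectors across several tasks and outperforms other methods by 5% on caption retrieval.

Link: https://arxiv.org/abs/2605.23826
Authors: Michal Shlapentokh-Rothman,Prachi Garg,Yu-Xiong Wang,Derek Hoiem
Affiliations: University of Illinois at Urbana-Champaign
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: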

Abstract:Keyframe selection is a direct way to provide verifiable visual evidence for long-video question answering (QA). Queries differ in what they require, and finding the right frames depends on knowing what to look for. Existing keyframe selectors either score every frame against a single query, or decompose the query into a fixed schema evaluated by a single visual tool. We propose ToolMerge, a keyframe retrieval method based on decomposition and merging: a Large Language Model (LLM)-based planner decomposes the query into tool calls and specifies how their per-tool rankings are merged using boolean operators. To evaluate retrieval directly, we construct Molmo-2 Moments (M2M), a benchmark in which every question is anchored to a specific time interval by construction. Across QA, question retrieval, and caption retrieval, ToolMerge is competitive with prior keyframe selectors, most notably on caption retrieval, outperforming other methods by 5%. Code and data can be found at this https URL .
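
One way to picture "merging per-tool rankings with boolean operators" is a small plan tree evaluated over per-frame scores. The sketch below is an assumption-heavy illustration, not ToolMerge itself: it merges scores rather than ranks, uses fuzzy-logic semantics (AND as min, OR as max, NOT as 1 - s), and the tool names and score values are made up.

```python
def evaluate(plan, tool_scores):
    """Evaluate a nested plan such as ("and", ("tool", "a"), ("not", ("tool", "b")))."""
    op = plan[0]
    if op == "tool":
        return tool_scores[plan[1]]
    if op == "not":
        return [1.0 - s for s in evaluate(plan[1], tool_scores)]
    children = [evaluate(child, tool_scores) for child in plan[1:]]
    if op == "and":
        return [min(vals) for vals in zip(*children)]  # every condition must hold
    if op == "or":
        return [max(vals) for vals in zip(*children)]  # any condition may hold
    raise ValueError(f"unknown operator: {op}")

def top_k_frames(plan, tool_scores, k):
    """Return the indices of the k frames with the highest merged score."""
    merged = evaluate(plan, tool_scores)
    return sorted(range(len(merged)), key=lambda i: merged[i], reverse=True)[:k]

# Hypothetical per-frame scores in [0, 1] from two visual tools over four frames.
tool_scores = {
    "object:dog": [0.9, 0.2, 0.8, 0.1],
    "ocr:exit": [0.1, 0.9, 0.7, 0.2],
}
plan = ("and", ("tool", "object:dog"), ("tool", "ocr:exit"))
keyframes = top_k_frames(plan, tool_scores, k=2)
```

Here frame 2 ranks first because it is the only frame where both tools respond strongly, which a single averaged query score would not necessarily surface.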

[NLP-5] Hierarchical Concept Geometry in Language Models Emerges from Word Co-occurrence

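[Quick read]: This paper asks how semantic hierarchy (hypernymy, the "is-a" relation) is encoded geometrically in the word representations of language models. The key to the approach is a distributional theory that analyzes the relation between word co-occurrence statistics and the WordNet hypernym graph, and shows how the spectral structure of the embedding Gram matrix naturally organizes into a coarse-to-fine hierarchical geometry. The authors prove that, under mild positivity and decay conditions on the co-occurrence kernel, the leading eigenvectors first separate broad taxonomic branches and then progressively finer sub-branches, producing a hierarchical splitting geometry that mirrors the WordNet tree. The predictions are confirmed in word2vec embeddings and extend well to the unembeddings of Gemma 2B, which suggests that hierarchical semantic structure can arise from the spectral properties of pairwise word statistics and need not reflect a hierarchy-specific functional mechanism.

Link: https://arxiv.org/abs/2605.23821
Authors: Andres Nava,Matthieu Wyart
Affiliations: Johns Hopkins University; EPFL
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 34 pages, 12 figures, including appendices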

Abstract:We propose a distributional theory of how hypernymy – the "is-a" relation between general and specific concepts – is encoded geometrically in language representations. Starting from the empirically verified assumption that words closer on the WordNet hypernym graph co-occur more often, we characterize theoretically the spectrum of the resulting embedding Gram matrix of word2vec embeddings. Under mild positivity and decay conditions on the co-occurrence kernel, we prove that the leading eigenvectors first separate broad taxonomic branches and then progressively finer sub-branches, producing a hierarchical splitting geometry with a coarse-to-fine spectral organization that mirrors the tree. We confirm these predictions in word2vec embeddings across many sampled WordNet subtrees, and show that the same signature extends strikingly well to Gemma 2B unembeddings. Our results indicate that hierarchical concept geometry in LLMs need not reflect a hierarchy-specific functional mechanism, but emerges from the spectral structure of pairwise word statistics.

[NLP-6] Is a Document Educational or Just Wikipedia-Style? – Pitfalls of Classifier-Based Quality Filtering ACL2026

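[Quick read]: This paper examines the fragility of Classifier-based Quality Filtering (CQF) for building pretraining corpora, asking whether replacing or supplementing traditional heuristics with a single model is actually reliable. The key contribution is to expose a simple but significant vulnerability: a Wikipedia-style reformatting operation can substantially change the model's quality assessment, so that low-quality content that should have been filtered out passes the threshold and enters the training corpus. Specifically, the FineWeb-Edu CQF model reverses its filtering decision for about 7% of documents, which shows how unstable this kind of filtering is under structural text transformations.

Link: https://arxiv.org/abs/2605.23721
Authors: Mateusz Klimaszewski,Piotr Andruszkiewicz
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted to ACL 2026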

Abstract:Classifier-based Quality Filtering has recently emerged as a fundamental technique in constructing pre-training corpora. The ability to deploy a single model that can replace or supplement a set of heuristics has proven effective across numerous Large Language Models. In this work, we expose a critical vulnerability in this approach by demonstrating how a straightforward Wikipedia-style reformatting operation can substantially alter a model’s quality assessment and enable low-quality content to surpass filtering thresholds. Our analysis reveals that the FineWeb-Edu CQF model would reverse its filtering decision for approximately 7% of evaluated documents, thereby admitting content into the pre-training corpus that would otherwise have been excluded.

[NLP-7] NLG Evaluation: Past Present Future

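[Quick read]: This paper discusses how evaluation methods for Natural Language Generation (NLG) have evolved and where they are heading. As NLG moved from close ties with linguistics to deep integration with machine learning, formal experimental evaluation, largely absent in the early days, became standard and fundamental to research, with newer techniques such as LLM-as-Judge appearing most recently. The paper argues that future NLG evaluation will need to give more weight to impact, qualitative, and safety evaluation, in response to the new challenges and needs that arise as large numbers of people routinely use NLG technology.

Link: https://arxiv.org/abs/2605.23715
Authors: Ehud Reiter
Affiliations: University of Aberdeen
Categories: Computation and Language (cs.CL)
Comments: Will appear in Proceedings of RetroEval 2026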

Abstract:Natural Language Generation (NLG) evaluation has changed dramatically since 1990, and will continue to evolve in the future. In 1990, when NLG had close ties to linguistics, there was very little formal experimental evaluation in the modern sense. In 2026, when NLG is closely linked to machine learning, experimental evaluation is expected and indeed fundamental to research. Many evaluation techniques were developed over this period, including most recently LLM-as-Judge. I expect NLG evaluation will continue to evolve in the future. In particular, impact, qualitative, and safety evaluation will become more important as large numbers of people routinely use NLG technology.

[NLP-8] A graph-based analysis of semantic types and coercion in contextualized word embeddings

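[Quick read]: This paper addresses how to model coercion, which arises from a semantic type mismatch between a word and its context. The key to the solution is a graph-based method that builds word neighborhood graphs from BERT and sense-enhanced embeddings and introduces two metrics, Neighbor Type Probability (NTP) and Neighbor Type Entropy (NTE), to quantify the distribution of semantic types in an embedding's neighborhood. Experiments show that graphs built from sense-enhanced embeddings reflect semantic type information better, and that NTP and NTE can distinguish type-matching sentences from mismatching ones (including coercion).

Link: https://arxiv.org/abs/2605.23710
Authors: Long Chen,Deniz Ekin Yavas
Affiliations: Heinrich Heine University Düsseldorf
Categories: Computation and Language (cs.CL)
Comments: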

Abstract:Semantic type mismatch between a noun and its context is central to coercion phenomena. This paper introduces a graph-based method to examine how lexical and contextual type information is reflected in word embeddings. We select nouns from ten semantic types, annotate corpus instances for type matching (matching vs. coercion vs. other mismatch vs. unrestricted), and construct graphs using BERT and sense-enhanced embeddings. Two metrics – Neighbor Type Probability (NTP) and Neighbor Type Entropy (NTE) – are proposed to analyze neighborhood type distributions. Results show that graphs constructed with sense-enhanced embeddings reflect semantic type information better, and matching and mismatch sentences can be distinguished through the proposed metrics.
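
The abstract does not give formulas for the two metrics, so the following Python sketch rests on a plausible reading that is an assumption here: NTP as the fraction of a node's graph neighbors that carry a given semantic type, and NTE as the Shannon entropy (in bits) of the neighbors' type distribution. The type labels are invented.

```python
import math
from collections import Counter

def neighbor_type_distribution(neighbor_types):
    """Relative frequency of each semantic type among a node's neighbors."""
    counts = Counter(neighbor_types)
    total = len(neighbor_types)
    return {t: c / total for t, c in counts.items()}

def neighbor_type_probability(neighbor_types, target_type):
    """Assumed NTP: share of neighbors whose type equals target_type."""
    return neighbor_type_distribution(neighbor_types).get(target_type, 0.0)

def neighbor_type_entropy(neighbor_types):
    """Assumed NTE: Shannon entropy (bits) of the neighbor type distribution."""
    dist = neighbor_type_distribution(neighbor_types)
    return -sum(p * math.log2(p) for p in dist.values())

# Hypothetical types of the nearest neighbors of one contextualized noun embedding.
neighbors = ["event", "event", "artifact", "artifact"]
ntp_event = neighbor_type_probability(neighbors, "event")
nte = neighbor_type_entropy(neighbors)
```

Under this reading, a noun used in its lexical type would show a high NTP for that type and low NTE, while a coerced use would pull neighbors of a second type into the neighborhood and raise the entropy.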

[NLP-9] Metadata Predictability Is Not Evidence Dependence: An Intervention-Based Audit for Weak-Label Benchmarks ICML2026

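[Quick read]: This paper addresses shortcut learning in weak-label benchmarks, where a model may predict from metadata alone instead of from the provided evidence, making evaluation results unreliable. Existing checks rely only on metadata statistics (such as MPDS) and cannot tell whether a model is actually sensitive to the input evidence. The key to the solution is a protocol-level test that combines two statistics: the Metadata Prior Dominance Score (MPDS), which measures how far outputs are predictable from metadata priors, and the evidence-intervention sensitivity (ΔEvi), which measures how sensitive outputs are to evidence identity when evidence is shuffled across items. Experiments show that MPDS alone misses important cases (in a synthetic HotpotQA example MPDS is 0.643 while ΔEvi is 0), and that reader-strength calibration reveals more about reliability (a calibration reversal on SNLI, and FEVER as a strongly evidence-sensitive positive control). The paper therefore recommends that benchmark audits report metadata-only screening, evidence intervention, and reader-strength calibration together.

Link: https://arxiv.org/abs/2605.23701
Authors: Kan Shao
Affiliations: Jinglue Technology Development (Nanjing) Co., Ltd.
Categories: Computation and Language (cs.CL)
Comments: 5 pages, 1 figure, 1 table. Accepted at ICML 2026 Workshop on Hypothesis Testing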

Abstract:We study a protocol-level test for weak-label benchmarks: whether benchmark outputs change when the provided evidence is intervened on. Metadata-only shortcut checks answer a different question, namely whether outputs are predictable from metadata priors. We therefore combine a metadata statistic, the Metadata Prior Dominance Score (MPDS), with an evidence-intervention statistic, ΔEvi, measuring sensitivity to evidence identity under cross-item shuffling. Synthetic HotpotQA gives a constructed counterexample to metadata-only screening: MPDS is only moderate (0.643), yet ΔEvi is zero. Stronger-reader reruns show why calibration belongs in the test procedure: SNLI shows a calibration reversal, reconstructed HotpotQA occupies a question-dominant warning region, and FEVER is a strongly evidence-sensitive positive control across four transformers. The practical lesson is simple: benchmark audits should report metadata-only screening, evidence intervention, and reader-strength calibration together.
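
The idea of an evidence-intervention statistic can be shown with a toy audit. The definition used below (accuracy with the original evidence minus accuracy after every item receives another item's evidence) is an assumption for illustration, since the abstract does not give the paper's exact formula. The items and both predictors are made up, and a rotation replaces random shuffling so the result is deterministic.

```python
def accuracy(predict, items, evidences):
    """Share of items answered correctly when paired with the given evidence list."""
    correct = sum(
        predict(question, ev) == label
        for (question, _, label), ev in zip(items, evidences)
    )
    return correct / len(items)

def delta_evi(predict, items):
    """Assumed statistic: accuracy drop when each item gets another item's evidence."""
    original = [ev for _, ev, _ in items]
    shuffled = original[1:] + original[:1]  # rotate so no item keeps its own evidence
    return accuracy(predict, items, original) - accuracy(predict, items, shuffled)

# Toy items: (question, evidence, gold label).
items = [
    ("Is the sky blue?", "sky: yes", "yes"),
    ("Is fire cold?", "fire: no", "no"),
    ("Is ice cold?", "ice: yes", "yes"),
    ("Is coal white?", "coal: no", "no"),
]

def evidence_reader(question, evidence):
    """Reads the answer out of the evidence string."""
    return evidence.split(": ")[1]

def question_only(question, evidence):
    """Ignores the evidence and answers from a question-side shortcut."""
    return "yes" if ("sky" in question or "ice" in question) else "no"

reader_delta = delta_evi(evidence_reader, items)
shortcut_delta = delta_evi(question_only, items)
```

Both toy predictors are perfectly accurate on the original data, yet only the evidence reader changes its outputs under the intervention. That is the distinction the abstract draws between predictability and evidence dependence.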

[NLP-10] ChartFI: Benchmarking Faithfulness and Insightfulness of Chart Descriptions from Multimodal Large Language Models

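[Quick read]: This paper addresses two core problems in evaluating chart descriptions generated by Multimodal Large Language Models (MLLMs): existing datasets pair simple, homogeneous charts with shallow, fact-enumerating descriptions, which does not reflect what describing complex charts requires, and existing metrics do not capture the multi-faceted nature of description quality. The key to the solution is ChartFI-Bench, a high-quality benchmark of 896 pairs of visually complex charts and semantically rich descriptions, together with four aligned evaluation metrics (Faithfulness, Coverage, Informativeness, and Acuity). These correspond to four dimensions of a high-quality chart description: factual accuracy, salient feature emphasis, domain-informed guidance, and chart-text complementarity. The framework makes evaluation more systematic and reveals common weaknesses in mainstream MLLMs.

Link: https://arxiv.org/abs/2605.23694
Authors: Fen Wang,Zekai Shao,Qiman Kang,Chunran Hu,Zhixuan Zhang,Lexu Xie,Chao Liu,Siming Chen
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: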

Abstract:Chart descriptions are essential for accessibility, cross-modal retrieval, and assisting readers in extracting insights from complex visualizations. As multimodal large language models (MLLMs) are increasingly adopted for automated chart description generation, a critical question arises: how faithfully and insightfully do these models actually describe charts? Current benchmarks fall short on two fronts: existing datasets consist of simple, homogeneous charts paired with shallow, fact-enumerating descriptions; and prevailing metrics fail to capture the multi-faceted nature of description quality. To address these gaps, we present the Chart Faithfulness and Insightfulness Benchmark (ChartFI-Bench). We first summarize four dimensions that characterize high-quality chart descriptions: factual accuracy, salient feature emphasis, domain-informed guidance, and chart-text complementarity. Guided by these dimensions, we construct a high-quality benchmark comprising 896 chart-description pairs, which feature visually complex charts and semantically rich descriptions. Furthermore, we design four aligned evaluation metrics – Faithfulness, Coverage, Informativeness, and Acuity – to systematically assess the quality of descriptions across these dimensions. Experiments conducted on mainstream MLLMs demonstrate the effectiveness of the proposed framework and reveal common weaknesses among existing models.

[NLP-11] OnePred: Next-Query Prediction via Recursive Intent Memory in Multi-Turn Conversations

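[Quick read]: This paper addresses the lack of proactivity in Large Language Model (LLM) conversational systems, which only respond after the user types and cannot anticipate the next query. To enable proactive interaction, the work focuses on next-query prediction, where the core challenge is keeping predictions accurate without the linearly growing token cost of feeding the full dialogue history. The key to the solution is OnePred, a framework that maintains a recursively updated memory as its only cross-turn context, which decouples the per-turn cost from conversation length. A two-stage reinforcement learning pipeline first teaches the model what to predict and then what to compress, shaping the memory into a prediction-oriented intent chain. Experiments show that OnePred reduces per-turn token consumption by up to 22× compared with full-history inputs while outperforming all baselines in prediction quality, with larger gains on longer conversations.

Link: https://arxiv.org/abs/2605.23668
Authors: Jiangwang Chen,Bowen Zhang,Zixin Song,Jiazheng Kang,Xiao Yang,Da Zhu,Guanjun Jiang
Affiliations: Tsinghua University; Qwen Applications Business Group of Alibaba
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: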

Abstract:Although large language model (LLM) conversational systems process millions of multi-turn dialogues daily, they remain fundamentally reactive: they respond only after the user types a query. A key step toward proactive interaction is next-query prediction, which anticipates the user’s subsequent query based solely on the preceding dialogue. Progress on this task is hindered by the lack of dedicated benchmarks and a fundamental efficiency–quality trade-off: naively concatenating full dialogue history incurs linearly growing token consumption, while truncating to the latest turn discards crucial cross-turn context. Our key insight is that accurate prediction does not require re-reading raw history; it suffices to track the user’s evolving intent trajectory across topics, unresolved needs, and interest shifts. We propose OnePred, which maintains a recursively updated memory as its sole cross-turn context, bounding the per-turn cost independently of conversation length. We train the model via a two-stage reinforcement learning pipeline that first teaches what to predict, then what to compress, shaping the memory into a prediction-oriented intent chain. To establish a rigorous testbed, we introduce NQP-Bench, spanning three diverse subsets. Experiments demonstrate that OnePred reduces per-turn token consumption by up to 22× compared to full-history inputs while consistently exceeding all baselines in prediction quality, with larger gains on longer conversations. Our code is publicly available at this https URL.

[NLP-12] OpenSkillEval: Automatically Auditing the Open Skill Ecosystem for LLM Agents

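[Quick read]: As the open-source skill ecosystem expands rapidly, it remains unclear how different large language models (LLMs) and agent frameworks interact with skills, how skill quality should be evaluated, and how users should choose skills under practical cost-performance trade-offs. The key to the solution is OpenSkillEval, an automatic evaluation framework that replaces static benchmarks with realistic task instances built dynamically from five categories of downstream applications (presentation generation, front-end web design, poster generation, data visualization, and report generation), and that collects and organizes community-contributed skills for controlled comparison under unified task settings. Using more than 600 dynamically generated task instances and 30 open-source skills, the authors evaluate mainstream models and frameworks. They find that skill availability does not guarantee effective skill usage, that the benefit of skill augmentation depends strongly on the underlying model and agent framework, and that many popular skills do not consistently outperform base agents without skills. This underlines the need for dynamic, task-grounded evaluation and offers practical guidance for designing, selecting, and deploying skills.

Link: https://arxiv.org/abs/2605.23657
Authors: Jiahao Ying,Boxian Ai,Wei Tang,Siyuan Liu,Yixin Cao
Affiliations: Singapore Management University; Institute of Trustworthy Embodied AI, Fudan University; Joy Future Academy, JD
Categories: Computation and Language (cs.CL)
Comments: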

Abstract:Skills, i.e., structured workflow instructions distilled for large language models (LLMs), are becoming an increasingly important mechanism for improving agent performance on real-world downstream tasks. However, as the open-source skill ecosystem rapidly expands, it remains unclear how different models and agent frameworks interact with skills, how to evaluate skill quality, and how users should select skills under practical cost-performance trade-offs. In this paper, we present OpenSkillEval, an automatic evaluation framework for both skill-augmented agent systems and the skills themselves. Instead of relying on static benchmarks, OpenSkillEval automatically constructs realistic task instances from evolving real-world artifacts across five categories of downstream applications: presentation generation, front-end web design, poster generation, data visualization, and report generation. It further collects and organizes community-contributed skills for controlled comparison under unified task settings. Using more than 600 dynamically generated task instances and 30 open-source skills, we conduct a systematic evaluation of state-of-the-art models and agent frameworks. Our results show that skill availability does not guarantee effective skill usage, that the benefit of skill augmentation depends strongly on both the underlying model and the agent framework, and that many publicly popular skills do not consistently outperform base agents without skills. These findings highlight the need for dynamic, task-grounded evaluation and provide practical insights into the design, selection, and deployment of skills for LLM agents. Additional cases and benchmark resources are available on the project website: this https URL.

[NLP-13] How Human-Like Are Large Language Models? A Register-Aware Linguistic Evaluation Framework

【速读】：该论文试图解决的问题是：当前大型语言模型（LLM）研究多关注事实正确性和任务性能，但对生成文本在语言层面上的人类相似性（human-likeness）缺乏系统评估。从语料库语言学角度看，语言产出具有高度情境依赖性，不同交际场景下语言特征的频率和共现模式存在差异，若生成文本偏离这些模式，即使内容正确也可能影响人类读者的接受度。解决方案的关键在于提出一种情境感知的评估框架，通过双样本问题比较特定语域（register）下人类参考语料库与LLM生成语料库在67个Biber提出的词汇语法特征上的分布差异，并使用最大均值差异（Maximum Mean Discrepancy, MMD）进行量化。实验表明，尽管所有测试模型均偏离人类基准，但最接近人类语言的模型因语域而异，且不随模型规模单调变化。

链接: https://arxiv.org/abs/2605.23651
作者: Björn Nieth(1 and 4),Marianna Gracheva(2),Michaela Mahlberg(2 and 3),Bjoern Eskofier(1 and 3 and 5 and 6),Emmanuelle Salin(1) ((1) Department Artificial Intelligence in Biomedical Engineering (AIBE) FAU Erlangen-Nürnberg Germany, (2) Department of Digital Humanities and Social Studies (DHSS) FAU Erlangen-Nürnberg Germany, (3) University of Birmingham United Kingdom, (4) Chair of AI-supported Therapy Decisions LMU München Munich Germany, (5) Munich Center for Machine Learning (MCML) Munich Germany, (6) Institute of AI for Health Helmholtz Zentrum München Neuherberg Germany)
机构: FAU Erlangen-Nürnberg (埃尔朗根-纽伦堡大学); University of Birmingham (伯明翰大学); LMU München (慕尼黑大学); Munich Center for Machine Learning (慕尼黑机器学习中心); Helmholtz Zentrum München (亥姆霍兹慕尼黑研究中心)
类目: Computation and Language (cs.CL)
备注: 8.5 pages (main) + 31 pages appendix, 29 figures, 10 tables. Code and data: this https URL

点击查看摘要

Abstract:While factual correctness and task-performance have been in focus of Large Language Model (LLM) research for a long time, the fundamental question of how human-like generated texts are on a linguistic level has been underexplored. From a corpus-linguistic perspective, language production is inherently context-dependent, with distinct communicative contexts giving rise to differences in frequencies and co-occurrence patterns of linguistic features. A text failing to adhere to these patterns can be content-wise correct, but still be unfavorable to human readers. In this work, we propose a context-aware evaluation framework in which human-likeness is assessed using a two-sample problem between the linguistic feature distribution of a human reference corpus for a given register and a corresponding LLM-generated corpus. We implement this framework using the Maximum Mean Discrepancy (MMD) and the 67 lexico-grammatical features introduced by Biber, which are commonly applied in corpus linguistics. In our experiments, we compare seven instruction-tuned, open-source models across five English-language datasets spanning distinct registers against a human baseline. While across all tested setups, LLMs deviate from the human baseline, which models are closest to human language depends on the register and is not dictated by model size.
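
The two-sample comparison described in the abstract can be illustrated with a minimal sketch of a squared-MMD estimate between two sets of feature vectors. The RBF kernel, the bandwidth value, and the toy two-dimensional vectors are assumptions made for illustration; the abstract does not state which kernel or estimator the authors use, and the real feature vectors would have 67 dimensions.

```python
import math


def rbf_kernel(x, y, bandwidth):
    """Gaussian (RBF) kernel between two equal-length feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(sample_p, sample_q, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two samples.

    Each sample is a list of feature vectors, e.g. one vector of linguistic
    feature frequencies per text (hypothetical layout).
    """
    def mean_kernel(xs, ys):
        total = sum(rbf_kernel(x, y, bandwidth) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (mean_kernel(sample_p, sample_p)
            + mean_kernel(sample_q, sample_q)
            - 2.0 * mean_kernel(sample_p, sample_q))


# Toy data: two "human" texts and two "LLM" texts, two features each.
human = [[0.1, 0.2], [0.3, 0.1]]
llm = [[2.0, 2.1], [1.9, 2.2]]
print(mmd_squared(human, llm))    # clearly positive: the samples differ
print(mmd_squared(human, human))  # 0.0: identical samples
```

A larger value indicates that the two feature distributions are further apart; in practice the bandwidth is often chosen from the data (for example with a median heuristic), which this sketch does not do.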

[NLP-14] Benchmarking Google Embeddings 2 against Open-Source Models for Multilingual Dense Retrieval and RAG Systems

【速读】：该论文旨在解决多语言稠密检索与检索增强生成（RAG）系统中嵌入模型（Embedding Model）选型时检索质量与推理延迟之间的权衡问题。其解决方案的关键在于系统性地对比托管于 Vertex AI 的双编码器模型 Google Embeddings 2（GE2）与五种开源替代方案（BGE-M3、E5-large、mE5-L、LaBSE 和 mMPNet），通过多维度基准测试（涵盖四个 BEIR 子集、合成意大利语 RAG 语料库、五种分块大小与三种分块策略的消融实验，以及普通 CPU 硬件上的单次查询延迟）揭示各模型在精度与推理速度上的差异。研究发现 GE2 在所有任务上均排名第一，但中位延迟较高（231.6 ms），约为最快本地模型的 14 倍；mE5-L 在意大利语任务上与 GE2 的 nDCG 差距不超过 0.003，而延迟仅 31 ms，是要求延迟低于 100 ms 的场景的优选。同时，LaBSE 虽广泛用于多语言部署，但在 BEIR 上的平均 nDCG@10 仅为 0.188，低于包括 mMPNet 在内的所有专用检索模型，凸显了任务适配性的重要性。此外，分块实验表明，在该语料库上所有六个模型在 32-token 分块时达到性能饱和，语义分块仅在 16-token 粒度下带来可测量的提升，为实际应用中的文本预处理提供了量化依据。

链接: https://arxiv.org/abs/2605.23618
作者: Stefano Cirillo,Domenico Desiato,Giuseppe Polese,Giandomenico Solimando
机构: 未知
类目: Computation and Language (cs.CL)
备注: 9 pages, 2 figures, 5 tables. Text and evaluation code available at this https URL

点击查看摘要

Abstract:We benchmark Google Embeddings 2 (GE2), a Vertex-AI-hosted bi-encoder with 2,048-token context and explicit task-type conditioning, against five open-source alternatives: BGE-M3, E5-large, Multilingual-E5-large (mE5-L), LaBSE, and Paraphrase-Multilingual-MPNet (mMPNet). Evaluation covers four BEIR subsets, a synthetic Italian RAG corpus, a chunking ablation over five chunk sizes (in tokens) with three strategies, and per-query latency on commodity CPU hardware. GE2 ranks first on every task, achieving BEIR average nDCG@10 = 0.638 and IT-RAG-Bench nDCG@10 = 0.282, but at 231.6 ms median latency, it is roughly 14x slower than the fastest local models. mE5-L reaches within 0.003 nDCG of GE2 on Italian at 31 ms, making it the preferred option when sub-100 ms SLAs matter. A more striking finding concerns LaBSE, which, despite widespread multilingual deployment, scores 0.188 average nDCG@10 on BEIR, below every dedicated retrieval model including mMPNet. Chunking experiments show that all six models saturate at 32-token chunks on our corpus, with semantic chunking providing measurable gains only at 16 tokens.
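
Since the headline numbers above are nDCG@10 scores, a small sketch of how that metric is typically computed for one query may help. It assumes the linear-gain variant of DCG (some toolkits use an exponential gain instead); the function names and the toy relevance labels are illustrative and not taken from the paper's evaluation code.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k results (linear gain)."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """nDCG@k for a single query.

    ranked_relevances: relevance labels of the retrieved documents, in the
        order the system returned them.
    all_relevances: relevance labels of every judged document for the query,
        used to build the ideal ranking.
    """
    ideal_dcg = dcg_at_k(sorted(all_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant document exists for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# One relevant document, returned at rank 2 instead of rank 1.
print(ndcg_at_k([0, 1, 0], all_relevances=[1]))  # about 0.63
```

A benchmark score such as the per-dataset values above is then the mean of this quantity over all queries.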

[NLP-15] DiLaDiff: Distilled Latent-Augmented Diffusion for Language Modeling

【速读】：该论文试图解决扩散语言模型（Diffusion Language Models）在生成过程中无法有效捕捉解码 token 之间相关性的问题，这一缺陷导致采样质量与吞吐量之间存在严重权衡。解决方案的关键在于提出 DiLaDiff，其核心创新包括：(1) 通过微调现有掩码扩散语言模型获得具备语义能力的连续潜在空间（continuous latent space）；(2) 构建一个学习编码器分布先验的潜在扩散模型；(3) 利用一致性模型（consistency model）将学习到的先验蒸馏为少步数的潜在生成模型。实验表明，即使不进行蒸馏，该潜在引导的扩散模型也优于原始掩码扩散基线且显著加速推理；进一步通过一致性蒸馏可大幅降低连续扩散的计算开销，使得潜在变量生成时间可忽略不计。

链接: https://arxiv.org/abs/2605.23605
作者: Jean-Marie Lemercier,Tomas Geffner,Karsten Kreis,Morteza Mardani,Arash Vahdat,Ante Jukić
机构: NVIDIA
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Diffusion language models intrinsically fail to capture correlations between decoded tokens, which leads to a harsh trade-off between sampling quality and throughput. To solve this issue, we propose DiLaDiff, a variant of masked diffusion language models with three components: (1) a continuous latent space with semantic capabilities, learned by an auto-encoder fine-tuned from an existing masked diffusion language model; (2) a latent diffusion model learning the prior over the encoder distribution; (3) a consistency model distilling the learned prior into a few-step latent generative model. We show that, even without distillation, our latent-guided diffusion model outperforms the masked diffusion baseline while significantly accelerating inference. Consistency distillation further lowers the computational overhead of continuous diffusion, such that the latent is generated in negligible time compared to discrete decoding.

[NLP-16] Structure-Guided Entity Resolution: Fine-Tuning LLMs for Robust Name Matching in Complex Linguistic Contexts ACL2026

【速读】：该论文旨在解决在语言和文化复杂环境中，跨异构记录匹配个人姓名的难题，这是实体解析（Entity Resolution）的核心挑战之一。由于命名习惯差异、不同文字系统间的转写不一致以及频繁的数据录入错误，统一用户身份变得极为困难，而这对合规性要求严格的“了解你的客户”（Know Your Customer, KYC）流程至关重要。解决方案的关键在于提出一种结构引导的实体解析框架（Structure-Guided Entity Resolution, SGER），通过两阶段课程学习（curriculum learning）对大型语言模型（LLM）进行微调：第一阶段训练模型解析姓名的语法与语义结构，第二阶段针对二元实体匹配这一下游任务进行优化。实验表明，在全球语言多样性最高、噪声最严重的环境之一（印度身份数据）上，SGER在5万条真实配对样本组成的留出集上达到99.02%准确率和0.994 F1分数，优于GPT-4o少样本提示和单阶段微调基线，并已完整部署于全球最大的梦幻体育平台Dream11（服务超2.5亿用户），验证了该方法在多语言生产环境中的高精度与可扩展性。

链接: https://arxiv.org/abs/2605.23597
作者: Shivam Chourasia,Hitesh Kapoor,Nilesh Patil
机构: Dream Sports
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted to ACL 2026. 8 pages, 1 figure, 2 tables

点击查看摘要

Abstract:Matching person names across heterogeneous records is a core challenge in entity resolution, especially within linguistically and culturally complex environments. Variations in naming conventions, inconsistent transliteration across scripts, and frequent data entry errors make it difficult to unify user identities, an essential requirement for Know Your Customer (KYC) compliance. While Large Language Models have shown promise in understanding natural language, they often struggle with the structured ambiguity present in such domain-specific settings. This paper introduces Structure-Guided Entity Resolution (SGER), a novel framework that fine-tunes an LLM through a two-phase curriculum. The model is first trained to parse the grammatical and semantic structure of personal names, then optimized for the downstream task of binary entity matching. We evaluate SGER in the challenging context of Indian identity data, one of the most linguistically diverse and noisy environments globally. SGER achieves 99.02% accuracy and an F1 of 0.994 on a held-out set of 50,000 real-world pairs, outperforming GPT-4o few-shot prompting and single-stage fine-tuning baselines. The system is fully deployed in production at Dream11, the world’s largest fantasy sports platform, serving 250M+ users. Our results demonstrate that curriculum-guided training enables robust, high-precision entity resolution in real-world multilingual systems at scale.

[NLP-17] Asking For An Old Friend: Diagnosing and Mitigating Temporal Failure Modes in LLM-based Statutory Question Answering

【速读】：该论文试图解决大语言模型（Large Language Models, LLMs）在法律研究中因训练数据时间固定而导致的时效性问题，具体包括两类时间失效模式：一是“截止后僵化”（post-cutoff staleness），即模型在立法修订后仍应用已被废止的规则；二是“近期偏差”（recency bias），即模型偏好较新的法律条文，即使历史版本才适用于当前案情。解决方案的关键在于将时间有效性作为硬约束，并通过两种检索增强生成（Retrieval-Augmented Generation, RAG）变体实现：二者均借助事实日期提取和版本过滤，确保检索到的法律文本在案情发生时有效。为此，论文构建了包含312个经专家验证、具有时间敏感性的德国成文法问答对的基准，并在四种推理设置下评估了五个LLM。实验结果表明，仅使用原始LLM（Vanilla）在截止后设置下性能严重下降，而两种RAG方案显著提升各类问题的表现；相比之下，网络搜索带来的收益不稳定，并在以历史时点为准的任务上表现出明显的近期偏差。因此，论文强调可靠的法律问答必须把时间有效性作为硬约束来处理。

链接: https://arxiv.org/abs/2605.23497
作者: Max Prior,Andreas Schultz,Matthias Grabmair
机构: Technical University of Munich (慕尼黑工业大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models are increasingly used for legal research, yet their fixed training cutoffs and reliance on static parametric knowledge are at odds with the evolving nature of statutory law. We study two temporal failure modes: post-cutoff staleness, where models apply superseded rules after legislative amendments, and recency bias, where models prefer newer provisions even when a historical version governs the fact pattern. To this end, we present a benchmark of 312 expert-validated, time-sensitive German statutory QA pairs spanning three categories: Post-Cutoff Amendment Questions, Pre-Amendment Questions, and Multi-Provision Pre-Amendment Questions. We evaluate five LLMs by OpenAI, Anthropic and DeepSeek under four inference settings: Vanilla, Web-search, and two retrieval-augmented variants that enforce temporal validity via fact-date extraction and version filtering. Using an LLM-as-a-judge validated against human expert ratings, we find severe degradation in the Vanilla post-cutoff setting. Both RAG approaches substantially improve performance across all question types, while web search yields unstable gains and exhibits a marked recency bias on historically anchored tasks. Our results indicate that reliable legal QA requires treating temporal validity as a hard constraint.
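
To make the idea of enforcing temporal validity as a hard constraint concrete, the following is a minimal sketch of version filtering by fact date. The `ProvisionVersion` record, its field names, and the sample provision are hypothetical; the abstract does not describe the authors' data model, and the fact-date extraction step (which would come first) is not shown.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class ProvisionVersion:
    """One historical version of a statutory provision (hypothetical schema)."""
    provision_id: str
    text: str
    valid_from: date
    valid_to: Optional[date]  # None means the version is still in force


def versions_in_force(versions: List[ProvisionVersion],
                      fact_date: date) -> List[ProvisionVersion]:
    """Keep only the versions that were in force on the date of the facts."""
    return [
        v for v in versions
        if v.valid_from <= fact_date
        and (v.valid_to is None or fact_date <= v.valid_to)
    ]


history = [
    ProvisionVersion("Example Act s. 1", "old wording",
                     date(2015, 1, 1), date(2022, 12, 31)),
    ProvisionVersion("Example Act s. 1", "amended wording",
                     date(2023, 1, 1), None),
]

# A fact pattern dated 2020 must be answered with the pre-amendment text.
print([v.text for v in versions_in_force(history, date(2020, 6, 1))])
```

Applying the filter before retrieval results reach the model means a newer version can never be selected for a historical fact pattern, which is the failure the abstract calls recency bias.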

[NLP-18] CoSPlay: Cooperative Self-Play at Test-Time with Self-Generated Code and Unit Test

【速读】：该论文试图解决的问题是：当前基于可验证奖励的强化学习（Reinforcement Learning with Verifiable Rewards, RLVR）和测试时扩展（Test-Time Scaling, TTS）方法在大语言模型（LLM）代码生成中高度依赖真实单元测试（Ground-Truth Unit Tests, GT UTs），而GT UTs的获取成本高昂，且现有无GT UT的TTS方法因自动生成的单元测试质量低、存在噪声或与错误代码伪相关，导致性能受限。解决方案的关键在于提出一种无需GT UT、无需训练的协同自我博弈框架CoSPlay，通过双向通过计数信号（bidirectional pass-count signals）从代码-单元测试执行矩阵中迭代地修剪弱代码、刷新不可靠UT，并利用输出一致性聚类选择最优代码，从而实现代码与单元测试的联合进化与质量提升。实验表明，CoSPlay在多个基准上显著优于现有无GT UT方法，且不依赖任何GT数据即可达到甚至超越RLVR模型的性能。

链接: https://arxiv.org/abs/2605.23491
作者: Zhangyi Hu,Chenhui Liu,Tian Huang,Jindong Li,Yang Yang,Jiemin Wu,Zining Zhong,Menglin Yang,Yutao Yue
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Code is available at: this https URL | Data log is available at: this https URL

点击查看摘要

Abstract:Recently, Reinforcement Learning with Verifiable Rewards (RLVR) and Test-Time Scaling (TTS) have advanced LLM code generation through executable verification. Yet Ground-Truth Unit Tests (GT UTs) remain a bottleneck: SOTA RLVR methods require them for costly training, while existing TTS methods lose competitiveness without them. This motivates GT-free TTS, where existing methods directly use self-generated UTs to refine and select code candidates. Yet such UTs are often noisy or spuriously coupled with wrong code, and UT quality in turn cannot be validated without reliable code. The key challenge is therefore to jointly improve both. To this end, we present CoSPlay, a GT-free, training-free framework that jointly improves codes and UTs through cooperative self-play. It first explores diverse solution ideas and identifies their potential failure modes to produce discriminative UT ideas. It then uses bidirectional pass-count signals from the Code-UT execution matrix to iteratively prune or fix weak codes and refresh or replace unreliable UTs, letting the two pools co-evolve. Finally, when multiple codes remain tied at the highest pass count, it picks the final code from the largest output-consensus cluster, since correct codes agree on the same inputs while wrong codes diverge. Experiments on four challenging benchmarks show that CoSPlay on Qwen2.5-7B-Instruct improves average BoN from 22.1% to 33.2% and UT accuracy from 14.6% to 78.3%, matching or surpassing the RLVR model CURE-7B. When applied to CURE-7B, it further improves BoN by 5.7%. CoSPlay also generalizes across diverse backbones and outperforms GT-free TTS baselines under comparable token budgets, with continued gains as the budget scales up. These results suggest a scalable inference strategy for competitive code generation without any GT data.
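
The two signals named in the abstract, bidirectional pass counts from the Code-UT execution matrix and output-consensus selection among tied candidates, can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the matrix layout, the function names, and the toy data are hypothetical, and the rules for pruning codes or refreshing unit tests from these counts are not shown.

```python
from collections import Counter


def pass_counts(matrix):
    """matrix[i][j] is 1 if code i passes unit test j, else 0.

    Returns (passes per code, passes per unit test): the row sums and
    column sums of the execution matrix.
    """
    code_counts = [sum(row) for row in matrix]
    ut_counts = [sum(col) for col in zip(*matrix)]
    return code_counts, ut_counts


def pick_by_consensus(tied_code_ids, outputs):
    """Among tied codes, return one from the largest group of codes that
    produce identical outputs on the shared inputs.

    outputs maps a code id to a tuple of its outputs (hypothetical layout).
    """
    clusters = Counter(outputs[c] for c in tied_code_ids)
    best_signature, _ = clusters.most_common(1)[0]
    return next(c for c in tied_code_ids if outputs[c] == best_signature)


matrix = [
    [1, 1, 0],  # code 0
    [1, 0, 0],  # code 1
    [1, 1, 0],  # code 2
]
print(pass_counts(matrix))  # ([2, 1, 2], [3, 2, 0])

outputs = {0: (1, 2), 2: (1, 2), 5: (9, 9)}
print(pick_by_consensus([5, 0, 2], outputs))  # 0
```

In the toy matrix, the third unit test is passed by no candidate, which is the kind of column a refresh step might target, while code 1 has the lowest row sum and would be a pruning candidate.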

[NLP-19] ARES: Automated Rubric Synthesis for Scalable LLM Reinforcement Learning

【速读】：该论文试图解决的问题是：如何在大规模场景下有效扩展基于评分标准（rubric-based）的强化学习（RL），以支持那些无法通过自动验证答案的任务，尤其是针对大语言模型（LLM）的开放性生成任务。现有方法受限于专家编写评分标准和人工构建问题集，且固定的任务级评分标准难以适配具体问题的评价需求。解决方案的关键在于提出 ARES（Automated Rubric synthEsis for Scalable RL）框架，该框架能够从原始预训练文本中自动构建包含问题-答案对及其对应加权评分标准的高质量数据集，并通过条件生成（如领域标签和人物设定）与验证过滤机制（确保问题自洽性、答案忠实性和评分标准有效性）提升多样性与可靠性，从而实现实例级别的奖励监督，显著优于持续预训练、监督微调和二元奖励RL方法，在多维开放任务（如医疗和指令遵循）上表现最优。

链接: https://arxiv.org/abs/2605.23454
作者: Xiaoyuan Li,Keqin Bao,Moxin Li,Yubo Ma,Yichang Zhang,Wenjie Wang,Fuli Feng,Dayiheng Liu
机构: University of Science and Technology of China (中国科学技术大学); Alibaba Group (阿里巴巴集团); National University of Singapore (新加坡国立大学)
类目: Computation and Language (cs.CL)
备注: Under Review

点击查看摘要

Abstract:Rubric-based rewards offer a promising way to extend reinforcement learning (RL) for large language models beyond tasks with automatically verifiable answers. However, scaling rubric-based RL remains challenging: existing approaches often rely on expert-written rubrics and manually constructed question sets, while fixed task-level rubrics may fail to capture the evaluation requirements of individual questions. We propose ARES (Automated Rubric synthEsis for Scalable RL), a framework for automatically constructing rubric-based RL data at scale. Starting from raw pretraining documents, ARES converts source knowledge into self-contained question-answer pairs and co-generates question-specific weighted rubrics, enabling instance-level reward supervision for open-ended responses. To improve diversity and quality, ARES conditions generation on domain labels and persona information, and applies validation filters for question self-containment, answer faithfulness, and rubric validity. Using ARES, we construct 100K rubric-annotated instances across ten domains. Experiments on seven benchmarks show that rubric-based RL trained with ARES outperforms continual pretraining, supervised fine-tuning, and binary-reward RL, with the largest gains on multi-dimensional open-ended tasks such as healthcare and instruction following.
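
One simple way to turn a question-specific weighted rubric into a scalar reward is a normalized weighted sum of satisfied criteria, sketched below. The aggregation rule, the function name, and the example weights are assumptions for illustration; the abstract does not specify how ARES combines rubric items, and the judging step that decides whether each criterion is met is not shown.

```python
def rubric_reward(weights, satisfied):
    """Weighted fraction of rubric criteria that a response satisfies.

    weights: positive importance weight per criterion.
    satisfied: one boolean per criterion, e.g. produced by a judge model.
    Returns a reward in [0, 1].
    """
    if len(weights) != len(satisfied):
        raise ValueError("weights and satisfied must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    earned = sum(w for w, ok in zip(weights, satisfied) if ok)
    return earned / total


# Three criteria with weights 3, 1, 1; the response misses the second one.
print(rubric_reward([3, 1, 1], [True, False, True]))  # 0.8
```

Compared with a binary reward, this kind of score gives partial credit, which is what allows instance-level supervision on open-ended responses.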

[NLP-20] SSDAU: Structured Semantic Data Augmentation for Joint Entity and Relation Extraction

【速读】：该论文旨在解决联合实体与关系抽取（Joint Entity and Relation Extraction, JERE）模型在低质量训练数据下泛化能力弱的问题。现有数据增强方法常忽视文本语义相关性，破坏句子结构和依赖关系，导致生成的数据无法有效提升模型的泛化能力。其解决方案的关键在于提出一种结构化语义数据增强方法（Structured Semantic Data Augmentation, SSDAU），通过基于实体标签的文本分段、上下文感知的编码器捕捉实体语义特征，并进行实体语义重构以生成增强数据；同时融合上下文化嵌入与传统相似度评分以区分语义相近的实体，并利用BERTopic模型过滤无关主题，确保增强数据的主题一致性。实验在五个代表性JERE模型上与七种主流数据增强基线进行对比，结果表明SSDAU在面对歧义时表现出更强的稳定性（F1下降仅8.26%，而基线下降31.91%），并在各项指标上优于现有方法。

链接: https://arxiv.org/abs/2605.23440
作者: Jiawei He,Mengyu Shi,Chunrong Fang
机构: Nanjing University (南京大学); Amap (高德地图), Alibaba Group (阿里巴巴集团)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 12 pages, 3 figures

点击查看摘要

Abstract:Joint Entity and Relation Extraction (JERE) is highly susceptible to weak generalization due to low-quality training data. Data augmentation is a common strategy to enhance model generalization across different domains. However, existing data augmentation methods often overlook text relevance and may disrupt semantic structures and dependencies, making it difficult to generate effective augmented data for improving model generalization. In this paper, we propose Structured Semantic Data Augmentation (SSDAU), a novel method designed to preserve the semantic structure of text during augmentation. SSDAU segments text based on entity labels and employs an encoder to capture semantic features of entities through context awareness. It then performs entity semantic restructuring to generate augmented data. To distinguish semantically similar entities, SSDAU fuses contextualized embeddings with traditional similarity scores. To mitigate potential topic ambiguity and information loss, we apply the BERTopic model to filter out irrelevant topics, ensuring topic consistency. We evaluate SSDAU on datasets with different annotation types and compare its performance on five representative JERE models against seven popular data augmentation baselines. Experiments demonstrate that SSDAU generates semantically consistent data with superior robustness against ambiguity (8.26% F1 decrease vs. 31.91% for baselines), significantly outperforming all existing methods across all metrics.

[NLP-21] Naturalistic measure of social norms alignment

【速读】：该论文试图解决的问题是如何在自然语境下的开放式对话中有效衡量社会规范一致性（social norms alignment），即评估不同个体或模型对社会困境解决方案的共识程度。传统方法依赖于人工设计的封闭式测评（如多选题问卷或预设陈述的一致性判断），难以捕捉真实互动中的复杂社会认知。其解决方案的关键在于提出了一种基于“解法匹配”（solution matching）的框架，通过构建参考解法库（由三位文化背景明确的评委提供）和定义两种新指标——“陈述一致准确率”（stated agreement accuracy）与“显式一致准确率”（explicit agreement accuracy），实现了对人类与大语言模型（LLM）之间社会规范对齐程度的量化测量。该框架支持任意两个回应之间的比较（如人-人、人-模型、模型-模型），并在包含3000个非平凡丹麦社会困境的语料库上验证了其有效性，揭示了不同议题类型下的一致性差异，为研究文化嵌入的社会推理提供了可扩展的评估工具。

链接: https://arxiv.org/abs/2605.23420
作者: Yevhen Kostiuk,Kenneth Enevoldsen,Peter Bjerregaard Vahlstrup,Márton Kardos,Kristoffer Nielbo
机构: Aarhus University
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Social norms reflect shared expectations on acceptable behavior. Measuring social norms alignment remains challenging, with existing approaches typically relying on artificial closed-form evaluations such as multiple-choice questionnaires or measuring agreement with predefined statements. In the context of this work, social norms alignment refers to measuring agreement between solutions to a social problem or dilemma. We propose a framework for measuring social norm alignment in naturalistic, free-form settings through solution matching. The framework enables us to measure alignment between any two dilemma responses, e.g., LLMs to a human, LLMs to LLMs, or human to human. We introduce two metrics, stated and explicit agreement accuracy, and construct a dataset of 3k non-trivial social dilemmas in Danish. All dilemmas are assigned reference solutions derived from three panelists, who serve as culturally grounded judges. We evaluate the agreement of several LLMs and human responses in an interaction setup that resembles natural user-model conversations. Our results show that the proposed metrics produce consistent model rankings and reveal variation in agreement across different types of dilemmas, with higher agreement observed for topics such as neighbor conflicts and shared living situations. Overall, our work introduces a dataset and evaluation framework for studying culturally grounded social reasoning in naturalistic open-ended conversations.

[NLP-22] Articulatory strategy as a source of variation in acoustic vowel dynamics

【速读】：该论文试图解决的问题是：不同发音者的特定发音策略（articulatory strategies）是否系统性地影响元音的声学动态特征（如共振峰过渡），从而解释语音个体差异的成因。解决方案的关键在于，利用超声舌部成像技术（ultrasound tongue imaging）采集36名英格兰北部英语（Northern-Anglo English）使用者的数据，识别其发硬腭元音 /i/ 时的不同发音策略，并发现 /i/ 的舌形能够显著预测带硬腭滑音收尾的双元音的共振峰动态。这一关系可以由受声道形状制约的发音运动特性来解释：舌根和/或舌背的位移越大，硬腭元音的舌形偏离平均舌形的程度越大，同时所需的发音器官运动速度越高，从而使共振峰过渡相对更早、更陡峭。该结果揭示了发音补偿中兼具规律性与个体性的两个方面，有助于从概念上理解语音的个体性。

链接: https://arxiv.org/abs/2605.23416
作者: Patrycja Strycharczuk,Justin J. H. Lo,Sam Kirkham
机构: University of Manchester (曼彻斯特大学); Lancaster University (兰卡斯特大学)
类目: Computation and Language (cs.CL); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Acoustic vowel dynamics have some speaker-identifying characteristics, which have been ascribed to individual properties of articulatory strategies: formant transitions have a particular shape because speakers move their articulators, using specific and practised movements. However, there is little existing evidence that different articulatory strategies systematically affect formant dynamics. The present study corroborates the link between the two. Ultrasound tongue imaging data from 36 speakers of Northern-Anglo English are used to identify distinct articulatory strategies for the production of palatal vowel /i/. Tongue shape in /i/ is found to be a significant predictor of formant dynamics in diphthongs with a palatal offglide. The observed relationships can be explained by the characteristics of articulatory movement conditioned by vocal tract shape. Greater articulatory displacement of tongue root and/or dorsum produces greater distortion from the mean tongue shape in palatal vowels, and it also requires higher articulatory velocities, resulting in relatively earlier and steeper formant transitions. The results contribute to the conceptual understanding of individuality in speech, by illuminating the regularising and individual aspects of articulatory compensation.

[NLP-23] EquiSumm: A Gender Bias-Aware Framework for Inclusive Tweet Summarization

【速读】：该论文试图解决的问题是：当前自动摘要技术在处理社交媒体（如Twitter）上的海量观点时，虽然能够生成简洁的信息摘要，但未显式考虑人口统计学公平性（如性别差异），导致生成的摘要可能存在偏见，无法公正反映不同群体的观点。解决方案的关键在于提出EquiSumm方法，该方法在生成摘要时显式纳入性别维度，确保不同性别的意见得到均衡代表，从而提升摘要的公平性和代表性；实验结果表明，该方法在两个主流数据集上相较于现有研究具有更好的性能表现。

链接: https://arxiv.org/abs/2605.23412
作者: Chaitanya Wanjari,Jessica Kamal,Riddhi Jain,Samruddhi Kurhe,Roshni Chakraborty
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted at AI for Social Good Workshop, Pattern Recognition and Machine Intelligence (PReMI 2025), IIT Delhi. 6 pages, 2 figures

点击查看摘要

Abstract:While social media platforms, such as Twitter, provide a medium for large-scale opinion sharing during news events, it is manually impossible for individuals or media agencies to process the vast volume of content to identify key viewpoints. In order to resolve this, several automatic summarization techniques have been proposed to condense large collections of tweets into concise and informative summaries. However, these algorithms do not explicitly consider demographic fairness. Several existing research works have developed automated summarization approaches that can provide a holistic overview of the key aspects and major opinions shared on social media platforms related to a news event. However, these approaches do not explicitly consider different forms of demographic representation, such as gender, which can lead to biased summary representation. In this paper, we propose EquiSumm, which considers the gender aspect of the shared opinion to generate a summary, and our experimental analysis on two major datasets indicates the performance effectiveness with respect to existing research works.

[NLP-24] Metacognition as Reward: Reinforcing LLM Reasoning via Knowledge and Regulation Signals

【速读】：该论文试图解决当前基于强化学习（Reinforcement Learning, RL）提升大型语言模型（Large Language Models, LLMs）推理能力时，两类奖励设计范式各自存在的问题：一是基于可验证奖励的强化学习（Reinforcement Learning with Verifiable Rewards, RLVR）仅依赖可执行检查或标准答案给出结果信号，难以指导中间推理行为；二是评分规则作为奖励（Rubrics-as-Reward, RaR）虽能评估推理质量与任务合规性，但往往需要为每个实例设计专门的评分标准，设计成本高。解决方案的关键在于提出一种受元认知（Metacognition）启发的强化学习框架，即元认知即奖励（Metacognition-as-Reward, MaR），其通过两个通用过程维度引导LLM推理：(1) 元认知知识（metacognitive knowledge），识别与任务相关的信息而无需人工定制的实例级评分规则；(2) 元认知调控（metacognitive regulation），规划和调整推理流程，以提供超越最终答案的奖励引导。MaR将模型生成轨迹显式组织为元认知组件，并基于任务知识覆盖度、调控一致性及最终答案正确性的轨迹级奖励进行优化，从而将奖励反馈扩展至整个推理轨迹，同时使奖励信号扎根于通用的元认知维度。实验表明，MaR在22个基准测试中稳定提升性能，相较基础模型最高提升7.7%，相较原始DAPO方法最高提升11.0%；Qwen3.5-9B + MaR在整体平均表现上超越GPT-OSS-120B，并在若干单项基准上优于更强的模型。过程层面的分析进一步显示推理过程质量显著改善，且该方法能够泛化到域外数据集。

链接: https://arxiv.org/abs/2605.23384
作者: Sirui Chen,Lei Xu,Yuying Zhao,Yutian Chen,Yu Wang,Beier Zhu,Hanwang Zhang,Shengjie Zhao,Chaochao Lu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent RL methods have substantially improved the reasoning abilities of LLMs. Existing reward designs mainly follow two paradigms: (1) Reinforcement learning with verifiable rewards (RLVR) derives outcome signals from executable checks or ground-truth answers, but provides limited guidance for intermediate reasoning behaviors. (2) Rubrics-as-reward (RaR) goes beyond final-answer checking by using natural-language rubrics to assess reasoning quality and task compliance, but often requires instance-specific rubrics and substantial design effort. To address these issues, we introduce Metacognition-as-Reward (MaR), a metacognition-inspired RL framework that guides LLM reasoning through two general process dimensions: i) metacognitive knowledge, which identifies task-relevant information without hand-crafted instance-specific rubrics, and ii) metacognitive regulation, which plans and adjusts the reasoning process to provide reward guidance beyond final-answer outcomes. MaR scaffolds model rollouts into explicit metacognitive components and optimizes them with a trajectory-level reward over task knowledge coverage, regulation fidelity, and final-answer correctness. In this way, MaR extends reward feedback to reasoning trajectories while grounding the reward signals in general metacognitive dimensions. Experiments on 22 benchmarks show that MaR consistently improves model performance, achieving up to a 7.7% gain over the base model and up to an 11.0% gain over vanilla DAPO. Notably, Qwen3.5-9B + MaR narrows the gap to frontier models, surpassing GPT-OSS-120B on overall average and outperforming stronger models on several individual benchmarks. Process-level analysis further shows substantial improvements in reasoning process quality. MaR also generalizes to out-of-domain datasets, where MaR-trained models improve over their corresponding base models on average.

[NLP-25] From Correctness to Preference: A Framework for Personalized Agentic Reinforcement Learning

【速读】：该论文旨在解决现实世界中智能体（agent）应用面临的个性化行为建模问题，即同一任务对不同用户可能需要不同的规划策略和工具使用决策，而传统强化学习方法难以捕捉这种异质性用户偏好。其核心挑战在于：通用奖励信号无法反映用户差异、观察到的行为受从众效应干扰、以及扁平的记忆结构难以支持个性化技能的检索。解决方案的关键是提出一个统一的个性化增强式强化学习框架——PARPO（Personalized Anchor Reward-Decoupled Policy Optimization），通过将通用任务质量奖励与个性化偏好奖励解耦，并引入用户特定锚点（anchor）来稳定不同奖励尺度下的学习过程；同时结合两阶段偏好解耦奖励模型和偏好对齐技能演化图记忆（Preference-Aligned Skill Evolution Graph Memory, PSGM），实现偏好识别、策略优化与结构化技能积累的闭环。实验证明该框架在ETAPP、ETAPP-Hard和SJAgent数据集上显著优于现有基线方法。

链接: https://arxiv.org/abs/2605.23382
作者: Ranxu zhang,zeyang li,Jiacheng Huang,Rui Zhang,Xiaozhou Xu,sun zhe,Yanyong Zhang,Chao Wang
机构: University of Science and Technology of China (中国科学技术大学); Alibaba Group (阿里巴巴集团)
类目: Computation and Language (cs.CL)
备注: 34 pages, 7 figures, Under Review

点击查看摘要

Abstract:Agentic reinforcement learning (Agentic RL) has achieved strong progress in tasks with clear success signals. However, many real-world agent applications require user-conditioned behavior: the same query may call for different planning strategies and tool-use decisions across users. This setting raises key challenges: generic rewards cannot capture heterogeneous user preferences, observed behaviors are entangled with conformity effects, and flat memories cannot support personalized skill retrieval. To this end, we propose a unified personalized Agentic RL framework that embeds personalization into training-time optimization. At its core is Personalized Anchor Reward-Decoupled Policy Optimization (PARPO), which decouples generic task-quality rewards from personalized preference rewards and uses user-specific anchors to stabilize learning under heterogeneous reward scales. We further introduce a two-stage preference-disentangled reward model and Preference-Aligned Skill Evolution Graph Memory (PSGM) for personalized supervision and preference-aligned skill retrieval. Together, they form a closed loop of preference identification, policy optimization, and structured skill accumulation. Experiments on ETAPP, ETAPP-Hard, and SJAgent show that our framework consistently outperforms strong memory and RL baselines. Code and data are included in the supplementary materials.

[NLP-26] Cultural Adaptation in Large Language Models for Political Discourse

【速读】：该论文试图解决的问题是：将大语言模型（Large Language Models, LLMs）应用于政治传播分析时，因文化适应性不足而导致的系统性错误与民主问责风险。当前模型普遍依赖英语主导的数据集、多语言覆盖不均，以及基于有限政治制度和话语惯例的假设，在跨文化场景中表现出显著偏差。解决方案的关键在于建立“文化适应性”（cultural adaptation）的三层次框架——翻译层、话语层与本体层，并提出一个以文化忠实度（cultural fidelity）、校准性（calibration）和民主安全性（democratic safety）为核心的可操作评估矩阵；同时通过参与式数据集构建、文化敏感的迁移学习方法及文化适应性的基准测试设计，使文化适应性成为可测量、可验证的研究实践，从而在治理约束与适用条件下增强政治自然语言处理（Political NLP）对民主合法性的支持能力。

链接: https://arxiv.org/abs/2605.23332
作者: Wajdi Zaghouani
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The integration of large language models into political discourse analysis creates new opportunities for comparative research, policy analysis, and civic technology, while introducing material risks for democratic accountability. This paper argues that cultural adaptation is a prerequisite for trustworthy deployment of large language models in political communication across diverse linguistic and institutional contexts. Current systems remain shaped by English dominant data, uneven multilingual coverage, and assumptions grounded in a narrow range of political institutions and discourse conventions, producing systematic errors when applied across cultures. We formalize cultural adaptation across translation, discourse, and ontology levels, identify recurring cultural failure modes in political NLP, and propose an operational evaluation matrix grounded in cultural fidelity, calibration, and democratic safety. Building on political text analysis, sociotechnical auditing, and cross cultural pragmatics, we outline methodological pathways including participatory dataset development, culturally aware transfer learning, and benchmark design that makes cultural adaptation empirically measurable. We conclude by clarifying governance constraints and scope conditions under which culturally adaptive political NLP can support democratic legitimacy.

[NLP-27] Emotion Recognition in Sign Language Conversation

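Quick read: This paper addresses the lack of conversational data for sign language emotion recognition: existing sign language emotion datasets focus on isolated sentences, so models trained on them degrade in real-world scenarios because they cannot use the preceding dialogue flow. The key to the solution is to bring the Emotion Recognition in Conversation (ERC) task to sign language video analysis and to build eJSL Dialog, the first emotion recognition dataset for sign language conversation. Built from the scripts of the STUDIES corpus, it contains 1,920 video samples organized into 480 unique dialogues. Experiments show a domain gap when generic multimodal conversational emotion recognition models are applied to sign language, which points to the need for context-aware visual feature extractors specific to sign language and for larger conversational datasets that can support large-scale pre-training.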

Link: https://arxiv.org/abs/2605.23328
Authors: Yusong Wang, Keyu Mao, Takao Obi, Minghao Shao, Kotaro Funakoshi
Affiliations: Institute of Science Tokyo; New York University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Emotion Recognition in Conversation is a core component of affective computing, while current resources of sign language emotion datasets primarily focus on isolated sentences and lack conversational context. Models trained exclusively on these isolated utterances demonstrate degraded performance in real world scenarios because they cannot utilize historical dialogue flow. To address this structural limitation, we introduce the ERC task to sign language video analysis and propose the eJSL Dialog dataset. Constructed using the scripts from the STUDIES corpus, the dataset contains 1,920 video samples organized into 480 unique dialogues. We conduct systematic benchmarking on this dataset using models ranging from isolated visual networks to multimodal conversational architectures. The results reveal a domain gap when applying generic multimodal conversational emotion recognition models to sign language. These findings demonstrate the explicit need for context aware visual extractors specific to sign language and indicate that expanding the scale of conversational datasets to support large scale pre-training is a necessary next step for future research.

[NLP-28] ClimateChat-300K: A Multi-Modal Facebook Dataset for Understanding Diverse Perspectives in Climate Communication

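Quick read: This paper addresses how to systematically analyze the dynamics of public discourse about climate change on social media, in particular its topic distribution, sentiment, and the engagement patterns shaped by post format, emotional tone, and page identity. The key to the solution is the construction and public release of ClimateChat-300K, a large-scale Facebook dataset of 299,329 climate-related posts from more than 26,000 pages worldwide, with 41 metadata features (such as language, timestamp, interaction counts, and page category). Topic modeling and sentiment analysis identify ten main themes across five domains, show that visually rich and emotionally charged posts receive the highest engagement, and trace how online discussion evolved around major events such as international climate summits and the COVID-19 pandemic. The dataset provides a reproducible, interdisciplinary basis for research on polarization, misinformation, and the dynamics of digital climate discourse.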

Link: https://arxiv.org/abs/2605.23326
Authors: Wajdi Zaghouani, Md. Rafiul Biswas, Mabrouka Bessghaier, Shimaa Ibrahim, George Mikros
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:We present ClimateChat-300K, a large-scale dataset of 299,329 public Facebook posts about climate change collected between May 2020 and May 2024 through the CrowdTangle platform. The dataset contains 41 metadata features including post content, engagement metrics, and page attributes, covering material from more than 26,000 global pages. Each post includes rich contextual information such as language, timestamp, page category, and interaction counts, enabling comprehensive analyses of public discourse around climate communication. Using topic modeling and sentiment analysis, we identify ten main themes grouped into five domains: policy, activism, cooperation, science, and conservation. The results reveal that emotional tone, post format, and page identity strongly influence audience engagement, with visually rich and emotionally charged content receiving the highest levels of interaction. The dataset also demonstrates how online discussions evolved in response to major events such as international climate summits and the COVID-19 pandemic period. ClimateChat-300K provides an open resource for reproducible and interdisciplinary research on polarization, misinformation, and the dynamics of digital climate discourse. By releasing this dataset, we aim to support transparent, data-driven research and contribute to a deeper understanding of how public engagement with climate issues develops across time, geography, and institutional contexts.

[NLP-29] AraHopeCorpus: Annotation Guidelines and Dataset for Hope Speech in Arabic Social Media Crisis Discourse

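Quick read: This paper addresses the long-standing neglect of positive, constructive discourse (such as hope speech) in Arabic social media during armed conflict, where systematic data and research foundations are lacking. The key to the solution is AraHopeCorpus, the first annotated dataset of Arabic hope speech, containing 10,000 annotated YouTube comments related to the war on Gaza between 2023 and 2024. A fine-grained annotation framework sorts comments into three categories: hope speech, no hope speech, and neutral or unclear content. More than 64% of comments are hope speech, expressed mainly as religious encouragement, collective solidarity, and optimism about justice and endurance. Inter-annotator agreement is substantial (Cohen's Kappa = 0.71), although dialectal variation, sarcasm, and implicit meaning made annotation difficult. The study also finds that large language models such as ChatGPT can assist annotation but remain limited on dialectal and culturally embedded expressions. The resource supports future work on hope speech detection, crisis communication, and resilience in Arabic social media.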

Link: https://arxiv.org/abs/2605.23325
Authors: Esra’a Sharqawi, Wajdi Zaghouani
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Social media has become a crucial arena for shaping public narratives during armed conflicts, providing space for both harmful and constructive communication. While hate speech and misinformation have been widely studied, expressions that promote resilience, solidarity, and optimism remain underexplored, particularly in Arabic contexts. This paper introduces AraHopeCorpus, the first annotated dataset of Arabic hope speech collected from ten thousand YouTube comments related to the war on Gaza between 2023 and 2024. Using a detailed annotation framework, comments were classified into three categories: hope speech, no hope speech, and neutral or unclear discourse. The dataset shows that hopeful language dominates, accounting for more than sixty four percent of all comments. These expressions of hope appear mainly as religious encouragement, collective solidarity, and optimism for endurance and justice. No hope speech, representing about thirteen percent, reflects despair and disillusionment, while the rest of the comments contain neutral or mixed content. Inter-Annotator Agreement reached substantial levels (Cohen’s Kappa equals 0.71), though dialectal variation, sarcasm, and implicit meaning posed annotation challenges. A comparative analysis between human annotators and ChatGPT revealed that large language models can support annotation but remain limited in handling dialectal and culturally embedded expressions. AraHopeCorpus will be released for research purposes under an open and non commercial license. It provides a valuable resource for studying constructive digital discourse, enabling further research on hope speech detection, crisis communication, and resilience in Arabic social media.
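
The agreement figure above is a Cohen's Kappa, which corrects raw agreement for the agreement two annotators would reach by chance. A minimal sketch of that computation follows; the function name and the toy labels are illustrative assumptions and are not taken from AraHopeCorpus.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # degenerate case: both used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy labels, for illustration only.
annotator_1 = ["hope", "hope", "no_hope", "neutral", "hope", "neutral"]
annotator_2 = ["hope", "hope", "no_hope", "hope", "hope", "neutral"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

On these toy labels the annotators agree on five of six items, yet kappa is lower than the raw agreement because part of that agreement is expected by chance.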

[NLP-30] Convergence Without Understanding: When Language Models Agree on Representations but Disagree on Reasoning

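Quick read: Large language models (LLMs) trained with different objectives and architectures develop increasingly similar internal representations, but it is unclear whether this convergence extends to the reasoning that operates over those shared representations. The paper evaluates representational similarity across 16 language models from 8 families (1.5B to 72B parameters) on 800 reasoning problems spanning mathematics, science, commonsense, and truthfulness, stratified by problem difficulty, computational stage, and causal relevance. It reports three dissociations: a difficulty inversion (representations agree more on problems the models collectively fail, CKA = 0.897, than on those they solve, CKA = 0.830); a generation gap (pre-decision representations align, CKA = 0.875, while post-decision representations diverge, CKA = 0.274); and epiphenomenal correctness (shared information can be decoded and transferred across models but has little causal effect on predictions, with flip rates of only 1.5% to 5.5% under ablation). These results indicate that representational convergence in LLMs reflects shared input-processing constraints rather than shared reasoning strategies, with direct implications for ensemble design, interpretability transfer, and evaluations of model similarity.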

Link: https://arxiv.org/abs/2605.23315
Authors: Muhammad Usama, Dong Eui Chang
Affiliations: Korea Advanced Institute of Science and Technology (KAIST)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models trained under diverse objectives and architectures have been shown to develop increasingly similar internal representations, an observation formalized as the Platonic Representation Hypothesis. Whether this representational convergence extends to the reasoning processes that operate over shared representations remains untested. We evaluate representational similarity across 16 language models from 8 families (1.5B to 72B parameters) on 800 reasoning problems spanning mathematics, science, commonsense, and truthfulness, stratifying by problem difficulty, computational stage, and causal relevance. Our analysis reveals three dissociations: a difficulty inversion, where models converge more on problems they collectively fail (Centered Kernel Alignment [CKA] = 0.897) than on those they solve (CKA = 0.830); a generation gap, where pre-decision representations align (CKA = 0.875) while post-decision representations diverge (CKA = 0.274); and epiphenomenal correctness, where shared information is decodable across models (66% transfer accuracy) but exerts minimal causal influence on predictions (1.5% to 5.5% flip rate across ablation protocols). These results indicate that representational convergence in language models reflects shared input processing constraints rather than shared reasoning strategies, with direct implications for ensemble design, interpretability transfer, and evaluations of model similarity. Code is available at this https URL.
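
The similarity scores above are Centered Kernel Alignment (CKA) values. The abstract does not say which CKA variant is used, so the sketch below assumes the common linear form, computed on two activation matrices whose rows are the same inputs; the function names and toy matrices are illustrative only.

```python
def center_columns(m):
    """Subtract each column's mean (rows are examples, columns are features)."""
    n = len(m)
    means = [sum(row[j] for row in m) / n for j in range(len(m[0]))]
    return [[row[j] - means[j] for j in range(len(row))] for row in m]


def cross_frobenius_sq(a, b):
    """Squared Frobenius norm of a^T b for row-major a (n x p) and b (n x q)."""
    total = 0.0
    for i in range(len(a[0])):
        for j in range(len(b[0])):
            dot = sum(a[k][i] * b[k][j] for k in range(len(a)))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered data."""
    x, y = center_columns(x), center_columns(y)
    denom = (cross_frobenius_sq(x, x) * cross_frobenius_sq(y, y)) ** 0.5
    if denom == 0:
        raise ValueError("CKA is undefined for constant representations")
    return cross_frobenius_sq(x, y) / denom


# Toy activations: y is x with its two features swapped and rescaled by 3.
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 7.0], [2.0, 0.0]]
y = [[3 * r[1], 3 * r[0]] for r in x]
same = linear_cka(x, y)
other = linear_cka(x, [[0.0, 1.0], [1.0, 0.0], [0.0, 0.0], [1.0, 1.0]])
```

Linear CKA is invariant to orthogonal transforms and isotropic scaling, so the swapped and rescaled copy scores 1, while unrelated activations fall somewhere between 0 and 1.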

[NLP-31] When Is Next-Token Prediction Useful? Marginalization, Ergodicity, Mixture Identifiability, Local Sufficiency, RAG, Tools, and Programming

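Quick read: Language model training is often described as learning the conditional distribution of the next token given the preceding tokens, but the paper argues that this description is only conditionally correct. A model learns from finite sampled continuations of observed text trajectories, while real language generation depends not only on textual context but also on non-textual circumstances (facts, intentions, social context, and so on). The key is to distinguish three objects that are often conflated: (1) the full language process conditioned on latent circumstances; (2) the marginal text-only process obtained by integrating those circumstances out; and (3) the model-induced distribution learned from a finite corpus. Interpreting training as estimating the marginal text distribution requires strong assumptions (stationarity, representativeness, ergodicity) that are problematic for heterogeneous corpora. Even when they hold, the marginal distribution is useful only if the observed prefix is an approximately sufficient statistic for the latent circumstances, that is, if the residual conditional mutual information between the next token and the omitted circumstances, given the observed text, is small. The paper further interprets retrieval-augmented generation (RAG) and tool use as devices for achieving conditional sufficiency.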

Link: https://arxiv.org/abs/2605.23278
Authors: Francesco Corielli
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments:

Abstract:Language models trained on observed sequences are often described as learning the conditional distribution of the next token given previous tokens. This description is only conditionally correct. A model trained on realized token trajectories does not observe full conditional laws; it receives sampled continuations. Moreover, real language generation is conditioned not only on previous words but also on non-textual circumstances: facts, events, intentions, goals, beliefs, social context, and task-specific constraints. This paper distinguishes three objects that are often conflated: the full conditional language process conditioned on latent circumstances, the marginal text-only process obtained by integrating those circumstances out, and the model-induced distribution learned from finite observed corpora. The paper argues that interpreting model training as estimating the marginal text-only law requires strong assumptions of stationarity, representativeness, and ergodicity, assumptions that are standard in statistical estimation but problematic when applied to heterogeneous language corpora. Even if these assumptions hold, the marginal text-only law is useful only when the observed prefix is an approximately sufficient statistic for the latent circumstances relevant to continuation. In information-theoretic terms, usefulness requires that the residual conditional mutual information between the next token and the omitted circumstances, given the observed text, be small. The paper then extends this argument to heterogeneous training corpora. Finally, the paper interprets Retrieval Augmented Generation (RAG) and tool use as conditional sufficiency devices. 
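
The information-theoretic condition in the abstract can be written out explicitly. The notation below is an assumption made for illustration (X_{1:t} for the observed prefix, X_{t+1} for the next token, Z for the latent circumstances, R for retrieved or tool-provided context); the paper may use different symbols. The last line is one possible reading of "conditional sufficiency device" and holds only under the stated independence assumption.

```latex
% Requires amsmath and amssymb. Notation is illustrative, not the paper's.
\begin{align*}
P(x_{t+1} \mid x_{1:t})
  &= \sum_{z} P(x_{t+1} \mid x_{1:t}, z)\, P(z \mid x_{1:t})
  && \text{(marginal text-only law)} \\
I(X_{t+1}; Z \mid X_{1:t})
  &= \mathbb{E}_{X_{1:t},\,Z}\!\left[ D_{\mathrm{KL}}\!\big( P(\cdot \mid X_{1:t}, Z) \,\big\|\, P(\cdot \mid X_{1:t}) \big) \right]
  && \text{(residual information)} \\
I(X_{t+1}; Z \mid X_{1:t}) = 0
  &\iff X_{t+1} \perp Z \mid X_{1:t}
  && \text{(prefix exactly sufficient)} \\
I(X_{t+1}; Z \mid X_{1:t}, R)
  &= I(X_{t+1}; Z \mid X_{1:t}) - I(X_{t+1}; R \mid X_{1:t})
  && \text{if } X_{t+1} \perp R \mid (X_{1:t}, Z)
\end{align*}
```

The final identity follows from expanding I(X_{t+1}; Z, R | X_{1:t}) with the chain rule in both orders: when the added context R affects the continuation only through Z, conditioning on it can only shrink the residual information.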
Cite as: arXiv:2605.23278 [cs.CL] (v1 submitted Fri, 22 May 2026 06:34:17 UTC). DOI: https://doi.org/10.48550/arXiv.2605.23278 (arXiv-issued via DataCite, pending registration).

[NLP-32] Multi-Gate Residuals

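Quick read: This paper addresses the unbounded growth of activations across deep residual layers, which can cause training instability and degraded performance. Attention Residuals mitigates the issue but incurs significant communication overhead, which limits its use in large-scale training. The proposed solution, Multi-Gate Residuals (MGR), uses a lightweight scoring and gating mechanism to maintain multi-stream context without extra communication burden, combined with Attention Pooling to extract hidden states from the stream states, thereby stabilizing activation scales. Experiments indicate that MGR is practical for large-scale training and deployment and improves performance over existing architectures.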

Link: https://arxiv.org/abs/2605.23259
Authors: Zhizhan Zheng, Feiyun Zhang, Shuchun Liu, Tian Xia, Xi Liu, Dasheng Hu, Hongquan Zhou
Affiliations: Shanghai Yichuang Information Technology Co., Ltd.; Fudan University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:While Attention Residuals has shown some effectiveness in addressing the widespread issue of unbounded activation growth across deep residual layers, it inevitably incurs significant communication overhead. To circumvent this bottleneck, we propose Multi-Gate Residuals (MGR), which stabilizes activation scales without additional communication burden. It utilizes a straightforward scoring and gating mechanism to maintain multi-stream context, coupled with Attention Pooling to extract hidden states from the stream states. Empirical experiments demonstrate that MGR is practical for large-scale training and deployment, offering tangible performance improvements over existing architectures.

[NLP-33] FastKernels: Benchmarking GPU Kernel Generation in Production

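Quick read: This paper addresses the poor real-world performance of LLM-based agents for GPU kernel generation, which it attributes to a serious misalignment between existing benchmarks and production inference frameworks. Existing benchmarks evaluate kernels on a single GPU with synthetic inputs, ignore the surrounding compilation stack, and reward replicating known optimizations rather than discovering new ones. Agents therefore learn kernels that score well in sandboxes but introduce interface incompatibilities, compilation-stack conflicts, and silent correctness degradation in real systems. The key to the solution is FastKernels, a kernel benchmark built around a minimal set of 46 representative architectures whose kernels cover those of 96.2% of HuggingFace Transformers architectures. It doubles as a lightweight, production-grade inference framework whose task interfaces mirror the corresponding modules in mainstream libraries, so optimized kernels can be deployed directly. Even the strongest kernel agent achieves only a 0.94× aggregate speedup over production baselines on FastKernels, which supports the claim that benchmark-production misalignment is a critical bottleneck for the field.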

Link: https://arxiv.org/abs/2605.23215
Authors: Gabriele Oliaro, Yichao Fu, May Jiang, Owen Lu, Junli Wang, Zhihao Jia, Hao Zhang, Samyam Rajbhandari
Affiliations: Snowflake AI Research; CMU; UCSD; Independent Researcher
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:LLM-based agents for GPU kernel generation are advancing rapidly, yet their progress is fundamentally constrained by the benchmarks they optimize against. Existing benchmarks are poorly aligned with production inference frameworks: they evaluate kernels on a single GPU with synthetic inputs, ignore the surrounding compilation stack, and reward replicating known optimizations rather than discovering new ones. The resulting reward signals are misleading: agents learn to generate kernels that score well in sandboxes but introduce interface incompatibilities, compilation-stack conflicts, and silent correctness degradation when integrated into real systems. We introduce FastKernels, a kernel benchmark built around a minimal set of 46 representative architectures spanning 8 categories, whose kernels collectively subsume those of 96.2% (409/425) of HuggingFace Transformers architectures. FastKernels doubles as a minimalistic, production-grade inference framework that runs at parity with hardened systems such as vLLM and SGLang on mainstream LLM serving and substantially exceeds upstream references on under-served architectures; each task’s interface mirrors the corresponding module in the state-of-the-art library for its architecture family, enabling direct deployment of optimized kernels into production codebases. Evaluating state-of-the-art kernel agents on FastKernels, we find that even the strongest agent achieves only 0.94× aggregate speedup over production baselines, with weaker agents at 0.78× and 0.53×, confirming that benchmark-production misalignment is a critical bottleneck for the field. We release FastKernels as a stepping stone toward kernel agents whose benchmark gains translate directly into production throughput improvements. Code is available at this https URL

[NLP-34] Hidden Human-Like Nature of Machine-Generated Texts: Theory and Detection Enhancement

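Quick read: Existing paragraph-level detectors of machine-generated text (MGT) tend to treat MGT as entirely machine-like, overlooking hidden human-like spans, that is, segments consistent with human writing, which make detection considerably harder. The key to the solution is a model-agnostic stacked enhancement framework that models span-level retention decisions as a latent-variable problem and optimizes them with a hard-EM-inspired procedure: the detector iteratively filters out subsequences it confidently judges human-like and refines itself on the remaining text, reducing the influence of hidden human-like spans. Experiments show that the framework consistently improves existing detectors and can also run in a training-free manner, which makes it flexible and scalable to deploy.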

Link: https://arxiv.org/abs/2605.23190
Authors: Chenwang Wu, Yiu-ming Cheung, Bo Han, Defu Lian
Affiliations: Hong Kong Baptist University; University of Science and Technology of China
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Machine-generated texts (MGTs) produced by large language models (LLMs) are increasingly prevalent across various applications, while their potential misuse in fake news propagation and phishing has raised serious concerns, highlighting the need for MGT detection. Existing paragraph-level detection methods commonly treat MGTs as entirely machine-like, overlooking the hidden human-like nature of machine-generated texts: even fully machine-generated texts may contain spans that are highly consistent with human writing. To this end, we first reveal the existence of such hidden human-like spans, and then theoretically analyze their impact on detection. Our analysis shows that these spans increase the sentence complexity for detection, thereby making MGT detection intrinsically harder. Based on this finding, we propose a model-agnostic stacked enhancement framework that improves existing detectors by reducing the influence of hidden human-like spans. Specifically, we model span-level retention decisions as a latent-variable problem and instantiate the optimization with a hard-EM-inspired procedure, where the detector iteratively filters confidently human-like subsequences and refines itself on the remaining text. Extensive experiments across various LLMs and practical scenarios demonstrate that the proposed framework consistently enhances existing detectors. Notably, the framework can also work in a training-free manner, offering flexibility and scalability for practical deployment.

[NLP-35] Self-Improving In-Context Learning

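Quick read: This paper aims to improve in-context learning (ICL) by optimizing the continuous embeddings of a fixed few-shot prompt at test time, so that the model better infers the task from its demonstrations. The key idea is to use the log-probabilities the model assigns to the demonstrated outputs, available from a single forward pass, as a self-supervised confidence proxy, and to maximize this proxy via zeroth-order optimization. The result is a test-time calibration procedure that needs no finetuning, no token generation, no predefined label set, and no external data. The method applies to both classification and free-form generation; across a suite of ICL tasks it matches or improves on the base model, and the paper reports a statistically significant correlation between proxy improvement and downstream accuracy gain.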

Link: https://arxiv.org/abs/2605.23180
Authors: Baturay Saglam, Dionysis Kalogerias
Affiliations: Yale University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:We propose to improve in-context learning (ICL) by optimizing the continuous embeddings of a fixed few-shot prompt at test time. The key observation is that the log-probabilities a model assigns to its demonstrated outputs (available from a single forward pass without generating any tokens) provide a meaningful signal for how well the model has inferred the task from its demonstrations. We formalize this signal as a bounded, self-supervised confidence proxy and maximize it via zeroth-order optimization over the prompt embeddings, yielding a test-time calibration procedure. The approach requires no finetuning, no token generation, no predefined label set, and no external data, making it equally applicable to both classification and free-form generation tasks. Across a comprehensive suite of ICL tasks, the proposed calibration consistently matches or improves upon the base model and outperforms classification-specific baselines on most tasks. The statistically significant correlation between proxy improvement and downstream accuracy gain confirms that the proposed proxy encodes a reliable optimization signal for in-context learning.
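
Zeroth-order optimization improves an objective using only function evaluations, with no gradients. The sketch below shows a generic two-point estimator with random unit directions; it is not the paper's procedure. The `toy_proxy` function is a hypothetical stand-in for the bounded confidence proxy, and the vector `theta` stands in for prompt embeddings.

```python
import math
import random


def zeroth_order_maximize(score, theta, steps=500, mu=1e-3, lr=0.5, seed=0):
    """Two-point zeroth-order ascent: calls score() only, never a gradient."""
    rng = random.Random(seed)
    theta = list(theta)
    for _ in range(steps):
        # Draw a random unit direction in parameter space.
        raw = [rng.gauss(0.0, 1.0) for _ in theta]
        norm = math.sqrt(sum(r * r for r in raw))
        u = [r / norm for r in raw]
        # Central finite difference estimates the directional derivative.
        plus = score([t + mu * d for t, d in zip(theta, u)])
        minus = score([t - mu * d for t, d in zip(theta, u)])
        slope = (plus - minus) / (2 * mu)
        # Move along the direction in proportion to the estimated slope.
        theta = [t + lr * slope * d for t, d in zip(theta, u)]
    return theta


# Hypothetical bounded proxy in (0, 1], peaking at a fixed target vector.
TARGET = [0.5, -1.0, 2.0]


def toy_proxy(theta):
    return 1.0 / (1.0 + sum((t - g) ** 2 for t, g in zip(theta, TARGET)))


start = [0.0, 0.0, 0.0]
tuned = zeroth_order_maximize(toy_proxy, start)
```

Each step costs two evaluations of the objective. In the paper's setting each evaluation would correspond to a forward pass, which is why no backpropagation through the model is needed.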

[NLP-36] Robust LLM Watermarking with Minimal Semantic Distortion for IP Protection

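Quick read: This paper addresses the intellectual property (IP) risk facing proprietary large language models (LLMs): an adversary can collect input-output pairs to train a surrogate model that replicates the original, causing financial loss. Existing watermarking methods struggle with semantic distortion, factual inconsistency, and adversarial attacks, and key-conditioned watermarks for provider-specific detection, especially in cross-provider and multi-user scenarios, remain underexplored. The key to the solution is the SAFESEAL framework. A key-conditioned Tournament sampling mechanism preserves named entities while substituting linguistic terms with context-aware synonyms, minimizing the impact on model utility. A key-conditioned contrastive detector jointly encodes the text and the key for provider-specific, robust watermark verification. The authors also derive theoretical bounds on the utility-detectability trade-off and reduce latency through lightweight models, batching, and parallelism. In experiments SAFESEAL outperforms baselines in utility, detectability, and robustness, reaching a BERTScore of 0.983, entity similarity of 0.963, a 98.2% detection rate, and the highest human ratings, with latency comparable to the fastest baseline. The authors release the first public watermark leaderboard and an interactive demo.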

Link: https://arxiv.org/abs/2605.23175
Authors: Kieu Dang, Phung Lai, NhatHai Phan, Yelong Shen, Ruoming Jin
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments:

Abstract:Proprietary large language models (LLMs) face risks of intellectual property (IP) violation, as adversaries can replicate an LLM by collecting input-output pairs to train a surrogate model, causing financial setbacks. Watermarks offer a promising defense to verify ownership, but existing methods often struggle with semantic distortion, factual inconsistency, and adversarial attacks. In addition, key-conditioned watermarks for provider-specific detection, especially in cross-provider and multi-user scenarios, remain largely underexplored. To address these challenges, we propose SAFESEAL, a novel key-conditioned watermarking framework that achieves strong detectability with minimal impact on model utility, effectively balancing detectability, utility, and robustness. SAFESEAL preserves named entities while substituting linguistic terms with context-aware synonyms through a key-conditioned Tournament sampling mechanism, maintaining semantic fidelity and factual consistency. For detection, we introduce a key-conditioned contrastive detector that jointly encodes the text and key, enabling provider-specific and robust watermark verification. We derive theoretical bounds on the utility-detectability trade-off and significantly reduce latency through lightweight models, batching, and parallelism. Extensive experiments show that SAFESEAL outperforms baselines in utility, detectability, and robustness, achieving a BERTScore of 0.983, entity similarity of 0.963, a 98.2% detection rate, and the highest human ratings for text quality and content preservation, with latency comparable to the fastest baseline. To promote transparency and community-driven progress, we release the first public watermark leaderboard and an interactive demo.

[NLP-37] Positional Failures in Long-Context LLMs: A Blind Spot in Reasoning Benchmarks

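Quick read: This paper addresses the absence of joint control over task position, filler content, and context length in current long-context reasoning benchmarks, which leaves a structural gap in how model performance is evaluated. Unlike position-controlled retrieval evaluations such as Needle-in-a-Haystack (NIAH) and RULER, mainstream reasoning benchmarks do not control where the target task sits, so position-dependent performance differences cannot be measured. The key to the solution is the Context Rot Evaluation (CRE) framework, which systematically varies three factors: target-task position (beginning, middle, end), filler type (such as with_solutions or questions_only_v2), and context length (8K to 64K). Experiments show that models can drop sharply when the task moves from the end to the middle (by up to 88 percentage points), and that the drop grows with context length for vulnerable models. A diagnostic probe that appends a copy of the target task at the end brings middle-position accuracy close to the end-position baseline, consistent with a positional explanation rather than a lack of capability. The findings expose a blind spot in reasoning benchmark design and vendor evaluation practice: positional vulnerabilities that grow with context length cannot be measured unless task position is controlled.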

Link: https://arxiv.org/abs/2605.23170
Authors: Chuyifei Zhang, Hongyu Cui, Xiaowen Huang, Jitao Sang
Affiliations: Beijing Jiaotong University; Central South University of Forestry and Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 20 pages, 1 figure, 23 tables

Abstract:Position-controlled evaluation is standard for retrieval tasks such as Needle-in-a-Haystack and RULER, but mainstream reasoning benchmarks do not control positional placement of target tasks in long contexts. We audit 11 long-context benchmarks and find none jointly controls task position, filler content, and context length for reasoning. An audit of four flagship long-context releases finds no main result-table entry for NIAH, RULER, or LongBench-family benchmarks, while agentic and coding benchmarks appear in main result-tables across all four. We propose Context Rot Evaluation (CRE), a controlled framework varying all three factors, and evaluate nine LLMs on GSM8K and ARC-Challenge across two rounds: an initial five-model set and four newer vendor releases. Models can drop sharply when the target task moves from end to middle, and the drop grows worse with context length for vulnerable models. MiMo-v2-Flash drops 88pp at 64K under with_solutions filler (middle accuracy 8%). Newer releases show smaller drops: at 64K, three of four stay within +/-6pp of end-position accuracy; MiMo-V2.5-Pro narrows the MiMo-v2-Flash 88pp drop to 32pp. Under questions_only_v2 filler, middle-position drops persist across all four (range -16pp to -56pp across 8K, 32K, 64K). At 8K, a diagnostic probe adding a target-task copy at the end brings middle accuracy within +/-4pp of end baseline across all nine models, consistent with a positional explanation. In the initial five-model set, 76% of middle-position errors match surrounding filler text versus 22% at the end position, consistent with filler-answer interference as a dominant error mode. These results expose a structural evaluation gap in current reasoning benchmark design and vendor evaluation practice: positional vulnerabilities that grow with context length cannot be measured when task position is not controlled.
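
Position-controlled evaluation, as described above, amounts to keeping the target task and filler fixed while varying only where the target sits. The sketch below is a hypothetical illustration of that idea and of the end-copy diagnostic probe; the function names, separator, and toy strings are assumptions, not the CRE implementation.

```python
def build_context(target, fillers, position):
    """Place the target task at the start, middle, or end of the filler items."""
    slots = {"start": 0, "middle": len(fillers) // 2, "end": len(fillers)}
    if position not in slots:
        raise ValueError(f"unknown position: {position}")
    items = list(fillers)
    items.insert(slots[position], target)
    return "\n\n".join(items)


def add_end_copy(context, target):
    """Diagnostic probe: append a second copy of the target task at the end."""
    return context + "\n\n" + target


# Toy filler and target strings, for illustration only.
fillers = [f"Filler problem {i}" for i in range(4)]
target = "TARGET: What is 17 + 25?"
start = build_context(target, fillers, "start")
middle = build_context(target, fillers, "middle")
end = build_context(target, fillers, "end")
probe = add_end_copy(middle, target)
```

Because the three contexts contain exactly the same items, any accuracy difference between them can be attributed to position rather than to content or length.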

[NLP-38] Autonomous Frontier-Based Exploration with VLM Guidance CVPR2026

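Quick read: This paper addresses autonomous robotic exploration of unknown and hazardous environments, where conventional methods rely on simple geometric heuristics that make little use of semantic and spatial context. The key to the solution is an exploration pipeline in which a Vision-Language Model (VLM) makes high-level strategic decisions that guide a conventional low-level robot control stack. At each decision point, the robot combines its current map with visual imagery of candidate paths into a multimodal prompt; the VLM analyzes the prompt and selects the most promising frontier, replacing geometric heuristics with contextual spatial reasoning. The pipeline is training-free, lightweight, and transferable, and in simulation across six indoor environments it improves map coverage by up to 24%.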

Link: https://arxiv.org/abs/2605.23165
Authors: Aarush Aitha, Avideh Zakhor
Affiliations: University of California, Berkeley
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 8 pages, 10 figures, CVPR 2026: 2nd Workshop on 3D-LLM/VLA: Bridging Language, Vision and Action in 3D Environments

Abstract:Autonomous robotic exploration of unknown and hazardous environments, a long-standing challenge, can be significantly improved by leveraging the advanced reasoning of Vision-Language Models (VLMs). We introduce a novel exploration pipeline where a VLM performs high-level strategic decision-making, guiding a conventional low-level robotics control stack. At decision points, the robot generates a multimodal prompt with its current map and visual imagery of potential paths, or frontiers. The VLM analyzes this prompt to select the most promising frontier, replacing simple geometric heuristics with contextual spatial reasoning. This approach, validated in simulation across six indoor environments, improves map coverage by up to 24% over existing methods. Our pipeline is lightweight, training-free, and easily transferable to any robot with standard sensors and an internet connection.

[NLP-39] Fast-dDrive: Efficient Block-Diffusion VLM for Autonomous Driving

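Quick read: This paper addresses the trade-off between high-fidelity trajectory planning and efficient inference in Vision-Language-Action (VLA) models for end-to-end autonomous driving. Existing approaches face two bottlenecks: autoregressive (AR) VLAs are memory-bandwidth-bound on edge hardware and prone to exposure-bias drift, while full-sequence diffusion models cannot reuse the key-value cache (KV cache) and suffer from “logical leakage” that violates perceive-then-plan causality. The proposed solution is Fast-dDrive, a block-diffusion VLA that refines bidirectionally within semantic units while enforcing strict causal order across them. It freezes structural tokens into a scaffold and uses a section-aware training recipe that prioritizes safety-critical planning; Scaffold Speculative Decoding then delivers AR-equivalent quality at much higher throughput. A lightweight test-time scaling scheme forks several stochastic trajectory rollouts from a shared-prefix KV cache and averages them, suppressing prediction variance at low computational cost. On WOD-E2E, Fast-dDrive reports state-of-the-art ADE@3s and ADE@5s and the highest RFS among diffusion-based VLAs, and on nuScenes it reduces average L2 error to 0.32 m; integrated with SGLang, it achieves a 12× throughput speedup over the AR baseline, narrowing the gap between high-capacity VLAs and real-time on-vehicle deployment.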

Link: https://arxiv.org/abs/2605.23163
Authors: Kewei Zhang, Jin Wang, Sensen Gao, Chengyue Wu, Yulong Cao, Songyang Han, Boris Ivanovic, Langechuan Liu, Marco Pavone, Song Han, Daquan Zhou, Enze Xie
Affiliations: Peking University; NVIDIA; The University of Hong Kong; MIT
Categories: Computation and Language (cs.CL)
Comments:

Abstract:End-to-end autonomous driving via Vision-Language-Action (VLA) models demands a precarious balance between high-fidelity trajectory planning and efficient inference. Existing paradigms typically fall short: autoregressive (AR) VLAs are memory-bandwidth-bound on edge hardware and prone to exposure-bias drift, while full-sequence diffusion models preclude KV-cache reuse and suffer from “logical leakage” that violates the fundamental perceive-then-plan causality. We present Fast-dDrive, a block-diffusion VLA that performs bidirectional refinement within semantic units while enforcing strict causal ordering across them. Leveraging the observation that driving VLAs often emit structured JSON-like outputs, Fast-dDrive freezes structural tokens into a section scaffold and employs a section-aware training recipe that prioritizes safety-critical planning. We further introduce Scaffold Speculative Decoding to achieve AR-equivalent quality at significantly higher throughput. Finally, we propose a low-overhead test-time scaling scheme: by forking N stochastic trajectory rollouts from a single shared-prefix KV cache and averaging them, we effectively suppress prediction variance at a fractional computational cost. Empirical results demonstrate that Fast-dDrive redefines the speed-accuracy frontier for driving agents. On the WOD-E2E test set, Fast-dDrive achieves SOTA ADE@3s and ADE@5s, alongside the highest RFS among diffusion-based VLAs; on nuScenes, it reduces average L2 error to 0.32 m (a 22% improvement). When integrated with SGLang, our framework delivers 12× throughput speedup over the AR baseline, narrowing the gap between high-capacity VLAs and the efficiency demands of real-time on-vehicle deployment.
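
The test-time scaling step rests on a basic statistical fact: averaging N independent noisy rollouts cuts variance by roughly a factor of N. The toy simulation below illustrates only that effect. The Gaussian noise model, the waypoint value, and the function names are assumptions; it does not model the shared-prefix KV cache or the actual Fast-dDrive decoder.

```python
import random
import statistics


def rollout(rng, true_value=10.0, noise=1.0):
    """One stochastic prediction of a single waypoint coordinate (toy model)."""
    return true_value + rng.gauss(0.0, noise)


def averaged_rollout(rng, n):
    """Average n independent rollouts, as in the averaging step described above."""
    return sum(rollout(rng) for _ in range(n)) / n


rng = random.Random(0)
single = [rollout(rng) for _ in range(2000)]
averaged = [averaged_rollout(rng, 8) for _ in range(2000)]
var_single = statistics.pvariance(single)
var_averaged = statistics.pvariance(averaged)
```

With eight rollouts the variance of the averaged prediction is close to one eighth of the single-rollout variance, provided the rollouts are independent and unbiased; correlated rollouts would gain less.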

[NLP-40] What Does the Server See? Understanding Privacy Leakage from Large Language Models in Split Inference CCS’26

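Quick read: This paper addresses privacy leakage when large language models (LLMs) are deployed on resource-constrained devices via split inference, where intermediate activations sent to the server may be inverted to recover the client's input. It proposes ActInv, which reconstructs the client's input with high fidelity by solving an intermediate activation matching problem. It also introduces the Perturbation Amplification Factor (PAF), a metric that quantifies each layer's inherent resistance to reconstruction and shows that privacy vulnerability is distributed unevenly across layers. Building on these insights, the authors design the PriPert defense, which calibrates perturbation directions to maximize reconstruction error during backpropagation, and evaluate it for privacy, utility, and computational overhead.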

Link: https://arxiv.org/abs/2605.23158
Authors: Mingyuan Fan, Yu Liu, Fuyi Wang, Cen Chen
Affiliations: East China Normal University; RMIT University
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted to ACM CCS’26

Abstract:The deployment of large language models (LLMs) on resource-constrained devices remains challenging, spurring interest in split inference, where models are partitioned between client and server to reduce computational burden and enhance privacy by transmitting only intermediate activations. However, the privacy-preserving capabilities of split inference, particularly in the context of LLMs, have not been exhaustively investigated. To fill this gap, we introduce ActInv, which solves an intermediate activation matching problem to reconstruct the client’s input. Extensive evaluations demonstrate that ActInv achieves high-fidelity reconstructions, even in the presence of common perturbation-based defenses such as Gaussian noise injection and activation sparsification. To systematically understand this vulnerability, we develop Perturbation Amplification Factor (PAF), a metric for quantifying a layer’s inherent resistance to reconstruction. Our analysis reveals that privacy vulnerability is not uniform across layers, with some layers being highly susceptible to leakage while others offer natural resistance. Furthermore, we demonstrate that defense effectiveness can be significantly improved by calibrating perturbation directions to maximize reconstruction error during backpropagation. Building on these insights, we design PriPert and conduct comprehensive evaluations, covering privacy, utility, and computational overhead, to demonstrate its effectiveness.
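
Activation matching can be illustrated on a toy problem: given the activations a server observes and the client-side weights, search for an input whose activations match. The sketch below uses a single linear layer and plain gradient descent. It is not ActInv, which targets real LLM layers and discrete token inputs; the weights and the secret input are made-up values.

```python
def matvec(w, x):
    """Multiply a row-major matrix by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def invert_activation(w, target, steps=200, lr=0.3):
    """Gradient descent on 0.5 * ||W x - target||^2, starting from x = 0."""
    x = [0.0] * len(w[0])
    for _ in range(steps):
        residual = [p - t for p, t in zip(matvec(w, x), target)]
        # The gradient of the matching loss with respect to x is W^T residual.
        grad = [sum(w[i][j] * residual[i] for i in range(len(w)))
                for j in range(len(x))]
        x = [xj - lr * g for xj, g in zip(x, grad)]
    return x


# Toy client-side layer and private input (hypothetical values).
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
secret = [1.0, -2.0]
observed = matvec(W, secret)  # the intermediate activation the server receives
recovered = invert_activation(W, observed)
```

In this toy case the layer has full column rank, so matching the activations recovers the input exactly. Adding noise to `observed` before inversion is the kind of perturbation-based defense the abstract says ActInv can still overcome on real models.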

[NLP-41] Same Model, Different Weakness: How Language and Modality Reshape the Jailbreak Attack Surface in Frontier MLLMs

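Quick read: This paper asks whether the safety vulnerabilities of multimodal large language models (MLLMs) differ across languages, and whether such differences reveal the mechanistic structure of alignment failures. The approach is a systematic cross-lingual, multimodal red-teaming study comparing jailbreak success in US English (en-US) and Mexican Spanish (es-MX) across four frontier MLLMs (Claude Sonnet 4.5, GPT-5, Pixtral Large, and Qwen Omni), analyzed with Bayesian mixed-effects models. Language does not scale vulnerability uniformly: linguistic framing attacks such as role-play become substantially less effective in Spanish, while visually explicit multimodal attacks become more effective, indicating that linguistic and visual alignment failures operate through distinct, separable mechanisms. As a result, safety rankings are not preserved across languages (for example, Qwen Omni overtakes Pixtral Large as the most vulnerable model among es-MX participants), and no scalar correction of English scores can recover them. The paper concludes that safety evaluation frameworks treating language and modality as independent dimensions misspecify the attack surface and need to be redesigned.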

Link: https://arxiv.org/abs/2605.23157
Authors: Casey Ford, Madison Van Doren, Sicheng Jin, Emily Dix
Affiliations: Appen
Categories: Computation and Language (cs.CL)
Comments:

Abstract:The attack surface of a multimodal large language model (MLLM) is language-dependent in ways that reveal the mechanistic structure of alignment failures. We present the first systematic cross-lingual, multimodal red-teaming study comparing jailbreak vulnerability in US English (en-US) and Mexican Spanish (es-MX) across four frontier MLLMs: Claude Sonnet 4.5, GPT-5, Pixtral Large, and Qwen Omni. Using a fixed adversarial benchmark of 363 diverse prompt scenarios administered in text-only and multimodal conditions, we collected 52,272 harm ratings and binary attack success judgements from matched panels of nine native-speaker annotators per language group. Our central finding is that language does not scale vulnerability uniformly. Bayesian mixed-effects analyses reveal that linguistic framing attacks such as role-play become substantially less effective under Spanish prompting, while visually explicit multimodal attacks become more effective, which directly implicates the prompt-language interface rather than global annotator leniency. This dissociation indicates that linguistic and visual alignment failures operate through distinct mechanisms, and that switching language is sufficient to expose that separation. The practical consequence is that safety rankings are not preserved across languages. Qwen Omni overtakes Pixtral Large as the most vulnerable model among es-MX participants, a rank reversal no scalar correction of English-condition scores could recover, and absolute attack success rates have declined across model generations without closing the gaps between them. These findings demonstrate that safety evaluation frameworks treating language and modality as independent dimensions fundamentally misspecify the attack surface of globally deployed MLLMs, and must be redesigned accordingly.

[NLP-42] When Symptoms Are Not Enough: Evidence-Weighting Patterns in Large Language Model Psychiatric Screening

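[Quick read]: This paper asks how to assess the reliability of large language models (LLMs) in mental health screening, particularly how performance varies across diagnostic categories, demographic subgroups, and evidence-use patterns. The key to the approach is a benchmark anchored to the SCID (Structured Clinical Interview for DSM Disorders), comprising 555 semi-structured experiential interviews with corresponding diagnostic labels, on which five state-of-the-art LLMs are evaluated with zero-shot task-specific prompting. Although some models (such as GPT-4.1 Mini and GPT-5 Mini) show relatively consistent disorder-specific accuracy, misclassifications often stem from the models over-weighting protective-context evidence (such as social support or coping ability) relative to explicit symptom evidence, which produces false negatives when a patient has symptoms but preserved functioning. LLMs therefore show potential for scalable screening, but the way they weigh symptoms against functional and protective factors needs rigorous validation before clinical deployment.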

Link: https://arxiv.org/abs/2605.23148
Authors: Jianfeng Zhu, Megan Korhummel, Ruoming Jin, Karin G. Coifman
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 25 pages, 7 figures

Click to view abstract

Abstract:As demand for mental health care outpaces clinician-delivered assessment, scalable screening tools are increasingly needed. Large language models (LLMs) may identify psychiatric risk from patient narratives, but their reliability across diagnoses, demographic subgroups, and evidence-use patterns remains uncertain. We introduce a SCID-anchored benchmark of 555 semi-structured experiential interviews paired with diagnostic reference labels for anxiety disorder, major depressive disorder, post-traumatic stress disorder, and any current mental health disorder. Using zero-shot task-specific prompting, we evaluated five state-of-the-art LLMs and examined whether false-negative errors reflected missed psychiatric evidence or differential weighting of symptom, functional-impairment, and protective-context cues. Performance varied across tasks and models, with accuracy ranging from 0.49 to 0.86 and Matthews correlation coefficients from 0.16 to 0.38. GPT-4.1 Mini and GPT-5 Mini showed the most consistent disorder-specific accuracy. Subgroup analyses found higher depression-classification accuracy among male than female participants, no consistent age-related pattern, and modest non-uniform variation across race strata. Evidence-integration analyses showed that false-negative anxiety and PTSD classifications often contained explicit symptom evidence but were accompanied by preserved functioning, coping ability, or social support. Functional-impairment evidence shifted model outputs toward positive classifications, whereas protective-context evidence shifted outputs away. These findings suggest that LLMs may support scalable psychiatric screening, but their tendency to discount symptom evidence in the presence of preserved functioning or protective context requires careful validation before clinical deployment.
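
The abstract reports both accuracy and Matthews correlation coefficients (MCC). As a minimal sketch of how the two metrics are computed from a binary confusion matrix, the snippet below uses hypothetical counts (not data from the paper) for an imbalanced screening task.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all cases that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom


# Hypothetical counts for an imbalanced screening task (illustrative only).
counts = dict(tp=10, tn=80, fp=5, fn=5)
print(f"accuracy = {accuracy(**counts):.2f}, MCC = {mcc(**counts):.2f}")
```

MCC uses all four cells of the confusion matrix, so it stays low when a model does well mainly on the majority class. That is one reason the two metrics can diverge on imbalanced diagnostic labels.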

[NLP-43] As X Do Y: How Persona and Task Combine in Instruction-Tuned LLMs

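[Quick read]: This paper asks whether a role prompt can be compressed through local additive structure in the residual stream, so that a single cached residual vector can replace the original prompt and yield equivalent output. The key finding is that role prompts do have a decomposable additive structure at one specific site, the prompt-to-answer transition: persona and task effects add along partially orthogonal directions, giving a pure persona effect $\Delta_X$ and a pure task effect $\Delta_Y$, and $h_{BB} + \Delta_X + \Delta_Y$ approximates the clean residual. On several models (Gemma-2-2B-IT, Qwen-2.5-1.5B/3B-Instruct) this keeps downstream output within a small KL divergence of the clean result while preserving persona-specific behavioral markers. The study then shows that this local additivity does not make the prompt compressible: injecting the cached additive prediction, or even the oracle clean residual $h_{XY}$, into a baseline prompt with the persona text removed does not approach the output of the original long-persona prompt, whether at a single site or across many layers. Persona-conditioned multi-token generation depends on attention repeatedly returning to the persona-text positions in the prompt, and this distributed prompt/KV mechanism cannot be replaced by residual activation arithmetic at one site. Local additivity is not global compressibility.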

Link: https://arxiv.org/abs/2605.23147
Authors: Eric Xu
Affiliations: Google; Qwen
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 12 pages, 1 figure. Code: this https URL

Click to view abstract

Abstract:Role prompts of the form As X, do Y admit a clean linear decomposition at one specific site in the residual stream: the prompt-to-answer transition (the last prompt token together with the first two generated tokens) in an early/mid layer band. There, persona and task contribute through partially orthogonal additive directions. Forming a pure persona effect $\Delta_X$, a pure task effect $\Delta_Y$, and substituting $h_{BB} + \Delta_X + \Delta_Y$ for the clean residual yields downstream output within a small KL of clean on Gemma-2-2B-IT and Qwen-2.5-{1.5B, 3B}-Instruct, across a 12-cell short grid and a 48-cell long-persona grid, with persona-specific behavioral markers preserved. The natural inference from this additive structure is that the role prompt can be compressed into a single cached residual vector. We show it cannot. Injecting the cached additive prediction, or even the oracle clean residual $h_{XY}$, into a baseline host prompt with the persona text removed does not approach the clean long-persona target, at one site or at many layers. Persona-conditioned multi-token generation flows through attention back to the persona-text positions throughout the prompt, which no residual at one site reproduces. Local additivity in the residual stream does not imply prompt compressibility. The additive structure at the prompt-to-answer transition supports interpretability and fine-grained steering of persona or task contributions; persona-conditioned behavior across the full continuation depends on a distributed prompt/KV mechanism that local activation arithmetic does not displace.
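
The additive substitution described in the abstract can be sketched with toy vectors. The abstract does not define the effects explicitly, so this sketch assumes that $\Delta_X$ and $\Delta_Y$ are differences from a baseline residual `h_bb`; the names `h_xb` (persona only) and `h_by` (task only) and the three-dimensional values are hypothetical.

```python
def sub(a, b):
    """Element-wise a - b."""
    return [x - y for x, y in zip(a, b)]


def add(*vectors):
    """Element-wise sum of any number of equal-length vectors."""
    return [sum(components) for components in zip(*vectors)]


def additive_prediction(h_bb, h_xb, h_by):
    """Approximate the clean residual h_XY as h_BB + Delta_X + Delta_Y.

    Assumed definitions (not confirmed by the abstract):
      Delta_X = h_XB - h_BB   (persona present, baseline task)
      Delta_Y = h_BY - h_BB   (baseline persona, task present)
    """
    delta_x = sub(h_xb, h_bb)
    delta_y = sub(h_by, h_bb)
    return add(h_bb, delta_x, delta_y)


# Toy residuals at the prompt-to-answer site (hypothetical values).
h_bb = [1.0, 0.0, 2.0]
h_xb = [1.5, 0.0, 2.0]   # persona shifts dimension 0
h_by = [1.0, -1.0, 2.0]  # task shifts dimension 1
prediction = additive_prediction(h_bb, h_xb, h_by)
print(prediction)
```

In this toy case the persona and task shifts fall on different dimensions, mirroring the partially orthogonal directions the abstract mentions. The paper's negative result concerns what happens when such a vector is injected into a prompt that lacks the persona text, which this arithmetic alone does not model.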

[NLP-44] A Fine-Tuned BERT Classifier for Personal-Letter Titles in Late-Ming and Early-Qing Collected Works

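[Quick read]: This paper addresses how to accurately distinguish personal letters from semantically close prefaces (especially farewell prefaces) among the titles in tables of contents of Classical Chinese collected works (wenji). The two kinds of title are highly similar in form, and manual labeling is costly and error-prone. The key to the approach is Lepton (Letter Prediction), a BERT-based classifier fine-tuned on 5,438 hand-labeled titles from late-Ming and early-Qing wenji, which uses a pretrained language model to capture subtle semantic differences in Chinese and automate the classification. The model is deployed on Hugging Face and has been applied at the China Biographical Database (CBDB), where it identified approximately fifty-five thousand letters, supplying data for the Ming Letter Platform.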

Link: https://arxiv.org/abs/2605.23103
Authors: Queenie Luo
Affiliations: Harvard University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Databases (cs.DB)
Comments:

Click to view abstract

Abstract:I present Lepton (Letter Prediction), a fine-tuned BERT classifier that predicts whether a title in a Classical Chinese wenji table of contents is a personal letter or a closely confusable preface (particularly the farewell-preface). Lepton fine-tunes bert-base-chinese on 5438 hand-labeled wenji titles from thirty-three late-Ming and early-Qing literati. I have deployed the model on Hugging Face, and it has been used at the China Biographical Database (CBDB) to identify approximately fifty-five thousand letters across mid-Ming through early-Qing wenji, populating the Ming Letter Platform.

[NLP-45] A Comparative Evaluation of Structural Topic Models and BERTopic for Short Open-Ended Survey Responses

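[Quick read]: This paper addresses how to analyze short, open-ended survey responses in applied psychology research, and in particular how to compare and choose between traditional probabilistic bag-of-words models such as the Structural Topic Model (STM) and newer embedding-based methods such as BERTopic. The key to the approach is a systematic evaluation of STM and BERTopic under different configurations, including a new strategy called contextual augmentation that adds semantic context to very short responses. BERTopic yields higher topic coherence than STM, with the best results under contextual augmentation, whereas STM is stronger for inferential covariate analysis. The two methods are complementary, and the results offer empirical evidence and practical guidance for topic modeling in social science research.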

Link: https://arxiv.org/abs/2605.23093
Authors: Yan Jiang, Sihong Liu, Philip A. Fisher
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:

Click to view abstract

Abstract:Topic modeling in applied psychology increasingly spans two methodological traditions: probabilistic bag-of-words models and newer embedding-based approaches. Yet many evaluations of these methods rely on longer and cleaner benchmark corpora, leaving less guidance for short, open-ended survey responses. This paper compares Structural Topic Models (STM), a probabilistic topic model, and BERTopic, an embedding-based model, for analyzing open-ended survey responses. We evaluated three STM conditions and five BERTopic conditions, varying typographical correction, stemming, embedding choice, and contextual augmentation, a strategy we introduced to provide additional semantic context for very short responses. Results indicate that BERTopic consistently produced higher topic coherence than STM, with contextual augmentation yielding the strongest performance gains. In contrast, higher-dimensional embeddings alone did not improve coherence and were associated with greater data loss. Qualitative evaluation showed that BERTopic generated more interpretable and stable topics, while STM topics were often broader and more mixed. However, STM provides stronger support for inferential covariate analysis, whereas BERTopic covariate comparisons are primarily descriptive. These findings suggest that STM and BERTopic offer complementary strengths. We conclude with practical guidance for selecting and combining topic modeling approaches in applied social science research.

[NLP-46] GEMQ: Global Expert-Level Mixed-Precision Quantization for MoE LLMs ICML2026

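[Quick read]: This paper targets the substantial memory overhead that the massive expert parameters of Mixture-of-Experts large language models (MoE-LLMs) incur, and the limitations of existing mixed-precision quantization methods in allocating bit-widths per expert. Existing methods rely on layer-wise importance estimation and overlook the router shifts that quantization induces, which leads to suboptimal bit allocation and routing. The key to the approach, Global Expert-level Mixed-precision Quantization (GEMQ), has two parts: (1) a global linear-programming formulation, based on quantization error analysis, that captures expert importance across the whole model; and (2) efficient router fine-tuning that adapts routing to the quantized experts. Both are integrated into a progressive quantization framework that iteratively refines importance estimation and bit allocation, reducing memory and accelerating inference with minimal accuracy degradation.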

Link: https://arxiv.org/abs/2605.23078
Authors: Jianing Deng, Song Wang, Dongwei Wang, Zijie Liu, Tianlong Chen, Huanrui Yang, Jingtong Hu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: ICML 2026

Click to view abstract

Abstract:Mixture-of-Experts Large Language Models (MoE-LLMs) achieve strong performance but incur substantial memory overhead due to massive expert parameters. Mixed-precision quantization mitigates this cost by allocating expert-wise bit-widths based on their importance, approaching the accuracy-memory Pareto frontier and enabling extreme low-bit quantization. However, existing methods rely on layer-wise importance estimation and overlook router shifts induced by quantization, resulting in suboptimal allocation and routing. In this work, we propose Global Expert-level Mixed-precision Quantization (GEMQ) to overcome these limitations via (1) a global linear-programming formulation that captures model-wide expert importance based on quantization error analysis, and (2) efficient router fine-tuning to adapt routing to quantized experts. These components are integrated into a progressive quantization framework that iteratively refines importance estimation and allocation. Experiments demonstrate that GEMQ significantly reduces memory and accelerates inference with minimal accuracy degradation. Source code is available at this https URL .

[NLP-47] The Efficiency Frontier: A Unified Framework for Cost-Performance Optimization in LLM Context Management

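[Quick read]: This paper addresses the high computational and financial cost of long-context processing in large language models (LLMs), and the fact that existing context reduction methods (such as retrieval and memory compression) are evaluated on performance and efficiency separately, which limits systematic comparison and deployment decisions. The key to the approach is the Efficiency Frontier framework, a unified cost-performance optimization paradigm that models context strategy selection as a deployment-aware optimization problem over task performance, token cost, and preprocessing reuse, and uses amortized cost modeling to analyze when each strategy becomes preferable under different operating conditions. Experiments on 5,000 HotpotQA instances reveal distinct operating regimes and transition boundaries between retrieval-based and preprocessing-based strategies. The method reduces effective token usage by about 25% at comparable performance (F1 ≈ 0.78), and in higher-performance settings amortized memory compression cuts token cost by more than 50% relative to full-context prompting.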

Link: https://arxiv.org/abs/2605.23071
Authors: Binqi Shen, Lier Jin, Hanyu Cai, Lan Hu, Yuting Xin
Affiliations: Northwestern University; Duke University; Carnegie Mellon University; University of Minnesota
Categories: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Large language models (LLMs) increasingly rely on long-context processing, but expanding context windows introduces substantial computational and financial costs. Existing context reduction approaches, including retrieval and memory compression methods, are typically evaluated using performance and efficiency metrics independently, limiting systematic comparison and deployment-aware decision-making. This paper introduces The Efficiency Frontier, a unified framework for cost-performance optimization in LLM context management. The framework models context strategy selection as a deployment-aware optimization problem that jointly accounts for task performance, token cost, and preprocessing reuse through amortized cost modeling. Unlike existing evaluations that compare methods in isolation, the proposed framework enables decision-oriented analysis of when different context management strategies become preferable under varying operational conditions. Evaluated on 5,000 HotpotQA instances, the framework reveals distinct operational regimes and transition boundaries between retrieval-based and preprocessing-based strategies. Results show that deployment-aware optimization reduces effective token usage by approximately 25% at comparable performance (F1 ≈ 0.78), while amortized memory compression achieves over 50% lower token cost relative to full-context prompting in higher-performance settings. Overall, the proposed framework provides a principled and practical foundation for evaluating and deploying scalable, efficient, and sustainable LLM systems.
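
The idea of amortizing preprocessing cost over reuse can be illustrated with a small calculation. This is a minimal sketch under an assumed cost model (per-query tokens plus a one-off preprocessing cost spread evenly over the queries that reuse it); the function names and token counts are hypothetical and are not taken from the paper.

```python
import math


def amortized_tokens(per_query_tokens: float, preprocessing_tokens: float,
                     reuse_count: int) -> float:
    """Effective tokens per query when a one-off preprocessing cost is shared."""
    if reuse_count < 1:
        raise ValueError("reuse_count must be at least 1")
    return per_query_tokens + preprocessing_tokens / reuse_count


def break_even_reuse(full_tokens: float, compressed_tokens: float,
                     preprocessing_tokens: float) -> int:
    """Smallest reuse count at which the compressed strategy is strictly cheaper."""
    saving = full_tokens - compressed_tokens
    if saving <= 0:
        raise ValueError("compression must reduce per-query tokens")
    return math.floor(preprocessing_tokens / saving) + 1


# Hypothetical numbers: full-context prompt vs. a compressed memory built once.
FULL, COMPRESSED, PREP = 4000, 1500, 10000
n_star = break_even_reuse(FULL, COMPRESSED, PREP)
print(n_star, amortized_tokens(COMPRESSED, PREP, n_star))
```

Below the break-even reuse count the full-context prompt is cheaper per query; above it the preprocessed strategy wins. This is the kind of transition boundary between operating regimes that the abstract describes.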

[NLP-48] DFKI-MLT at SemEval-2026 TASK 7: Steering Multilingual Models Towards Cultural Knowledge ACL2026

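[Quick read]: This paper targets the uneven distribution of cultural knowledge in large language models (LLMs) across languages and cultures: a model's cultural understanding differs markedly between languages and regions. The key to the approach is activation steering. Language vectors are extracted from the parallel FLORES corpus, and at inference time a language-specific steering vector is added to the residual stream at a selected transformer layer, adapting the model without any parameter updates. The method reaches 86.96% accuracy on the multiple-choice (MCQ) track, ranking 7th. Post-hoc analysis shows that the effect is strongly layer-sensitive and heterogeneous across language-region pairs, that some configurations even degrade performance, and that the effect interacts with prompt design, so prompt engineering and activation steering should be optimized jointly for multilingual cultural reasoning.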

Link: https://arxiv.org/abs/2605.23069
Authors: Yusser Al Ghussin, Daniil Gurgurov, Yasser Hamidullah, Josef van Genabith, Cristina España-Bonet, Simon Ostermann
Affiliations: German Research Center for Artificial Intelligence (DFKI GmbH); Saarland Informatics Campus; Barcelona Supercomputing Center (BSC-CNS)
Categories: Computation and Language (cs.CL)
Comments: Accepted to The 20th International Workshop on Semantic Evaluation at ACL 2026

Click to view abstract

Abstract:Large language models (LLMs) are increasingly used across diverse linguistic and cultural contexts, yet their cultural knowledge remains uneven across regions and languages. We present the DFKI-MLT system for SemEval-2026 Task 7 on cultural awareness, where we apply activation steering to multilingual LLMs using language vectors extracted from parallel FLORES data. Our method performs inference-time adaptation by adding language-specific steering vectors to the residual stream at a selected transformer layer, without any parameter updates. We participated in both the short-answer (SAQ) and multiple-choice (MCQ) tracks; however, only our MCQ submission received an official score. In the official MCQ track, we achieved 86.96% accuracy, ranking 7th out of 17 teams. To better understand system behavior, we conduct post-hoc analyses on the shared-task MCQ and SAQ settings. These analyses show that activation steering yields modest and heterogeneous improvements on cultural reasoning: gains are strongly layer-sensitive, vary substantially across language-region pairs, with some configurations even degrading performance, and interact with prompt formulation, comparing generic and culturally conditioned prompts. Our findings suggest that prompt design and activation steering should be jointly optimized for culturally aware multilingual inference.
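
The inference-time step the abstract describes, adding a language-specific vector to the residual stream at one layer, can be sketched with plain lists. The abstract does not say how the language vectors are extracted, so the difference-of-means construction, the `strength` parameter, and the toy activations below are all assumptions for illustration.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def language_vector(target_acts, source_acts):
    """Assumed extraction: mean target-language activation minus mean source."""
    target_mean = mean_vector(target_acts)
    source_mean = mean_vector(source_acts)
    return [t - s for t, s in zip(target_mean, source_mean)]


def steer(hidden, vector, strength=1.0):
    """Add the scaled steering vector to one residual-stream activation."""
    return [h + strength * v for h, v in zip(hidden, vector)]


# Toy activations for parallel sentences at one layer (hypothetical values).
target_acts = [[1.0, 2.0], [3.0, 2.0]]
source_acts = [[0.0, 1.0], [2.0, 1.0]]
vec = language_vector(target_acts, source_acts)
steered = steer([0.5, -0.5], vec, strength=2.0)
print(vec, steered)
```

No model weights change in this procedure. Only the activation at the chosen layer is shifted, which fits the abstract's finding that results are sensitive to which layer is selected.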

[NLP-49] What Training Data Teaches RL Memory Agents: An Empirical Study of Curriculum Effects in Memory-Augmented QA

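[Quick read]: This paper asks how the composition of training data affects the ability of reinforcement learning (RL) based large language model (LLM) agents to reason over external memory banks in multi-session dialogue; in particular, how different training curricula shape skill specialization has been unclear. The key to the approach is a controlled experiment that fixes the model architecture, RL algorithm, and hyperparameters and varies only the training curriculum across three conditions: in-domain (LoCoMo), mixed-benchmark (LoCoMo + LongMemEval), and out-of-domain (LongMemEval only). Curriculum composition turns out to be a fine-grained lever on skill specialization rather than a uniform boost to overall performance. The mixed curriculum achieves the best F1 on both evaluation sets, while out-of-domain-only training is weak overall but transfers a targeted skill (temporal reasoning). Per-question-type differences far exceed differences in aggregate metrics, so single-number benchmark comparisons can seriously underreport curriculum effects. Two practical lessons come from adapting GRPO to a single GPU: format-specific noise must be filtered from memory banks to preserve training signal, and a binary exact-match reward yields no learning signal at small group sizes (G = 4), which motivates continuous reward functions.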

Link: https://arxiv.org/abs/2605.23067
Authors: Xinjie He, Zhiyuan Lin, Su Liu, Jialun Wu, Qiyang Xie, Weikai Zhou, Shuai Xiao
Affiliations: Columbia University; Johns Hopkins University; Northeastern University
Categories: Computation and Language (cs.CL)
Comments: 14 pages, 2 figures, 11 tables. Code, checkpoints, and evaluation artifacts available at this https URL

Click to view abstract

Abstract:Reinforcement learning (RL) has emerged as a viable recipe for training LLM agents to reason over external memory banks in multi-session dialogue. Existing work trains exclusively on a single benchmark, leaving open how the composition of training data shapes the skills a memory agent acquires. We present a controlled empirical study that holds architecture, RL algorithm, and all hyperparameters fixed and varies only the training curriculum across three conditions: in-domain (LoCoMo), mixed-benchmark (LoCoMo + LongMemEval), and out-of-domain (LongMemEval only). Across two benchmarks and ten question types, curriculum composition acts as a fine-grained lever on specialization rather than a uniform scaling factor on performance. The mixed curriculum yields the strongest overall F1 on both evaluation sets. Training on a narrow out-of-domain set transfers a targeted skill - temporal reasoning - despite weak aggregate performance. Per-type differences substantially exceed aggregate differences, indicating that single-number benchmark comparisons systematically underreport curriculum effects. We further report two practical lessons from adapting GRPO to a single-GPU regime: cross-benchmark mixing requires filtering format-specific noise from memory banks to preserve training signal, and binary exact-match reward produces no learning signal at the small group sizes (G = 4) required on one GPU, motivating continuous reward functions in this regime.
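
The last lesson in the abstract, that a binary exact-match reward gives no learning signal at G = 4, follows from how group-relative advantages are computed. The sketch below assumes the commonly used GRPO normalization (reward minus group mean, divided by group standard deviation); the token-level F1 reward and the sample answers are hypothetical stand-ins for a continuous reward, not the paper's setup.

```python
from collections import Counter
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantage: standardize each reward within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def exact_match(pred, gold):
    """Binary reward: 1.0 only if the normalized strings are identical."""
    return float(pred.strip().lower() == gold.strip().lower())


def token_f1(pred, gold):
    """Continuous reward: harmonic mean of token precision and recall."""
    pred_tokens, gold_tokens = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


gold = "in march 2023"
group = ["march 2023", "sometime in 2023", "last spring", "in march"]  # G = 4

em_adv = group_advantages([exact_match(p, gold) for p in group])
f1_adv = group_advantages([token_f1(p, gold) for p in group])
print(em_adv)  # every sample missed exact match, so all advantages are zero
print(f1_adv)  # partial credit separates the samples
```

With only four samples per group, it is common for none (or all) of them to match exactly. The rewards are then identical, the advantages vanish, and the policy gradient for that prompt is zero. A graded reward still ranks near-misses above unrelated answers.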

[NLP-50] ModeSwitch-LLM: A Lightweight Phase-Aware Controller for Cross-Mode LLM Inference on a Single GPU

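[Quick read]: This paper addresses inefficient large language model (LLM) inference on a single GPU, where requests have heterogeneous workload characteristics and a static configuration struggles to balance performance, energy efficiency, and quality. The key to the approach is ModeSwitch-LLM, a lightweight request-boundary controller that uses cheap workload-level features to route each request among FP16, quantized modes, speculative decoding, and hybrid modes (such as GPTQ plus prefix caching and INT8 plus continuous batching). On a single NVIDIA A100 GPU it achieves a 2.10x mean latency speedup and 51.7% lower energy per token, with accuracy close to FP16 (mean delta of +0.17 percentage points). Simple rule-based, request-aware routing can thus recover substantial efficiency from existing inference modes without changing the model architecture or retraining.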

Link: https://arxiv.org/abs/2605.23057
Authors: Aman Sunesh, Ali Alshehhi, Hivansh Dhakne
Affiliations: NYU Abu Dhabi; New York University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Performance (cs.PF)
Comments: 10 pages main text, 11 pages including references, 5 figures, 3 tables. Preprint

Click to view abstract

Abstract:ModeSwitch-LLM is a lightweight request-boundary controller for improving single-GPU large language model inference efficiency by routing each request to an appropriate fixed inference mode. Instead of relying on one static serving configuration, the system selects among FP16, quantized modes, speculative decoding, and hybrid modes such as GPTQ plus prefix caching and INT8 plus continuous batching using cheap workload-level features. We evaluate ModeSwitch-LLM on Meta-Llama-3.1-8B-Instruct served on a single NVIDIA A100 GPU. On deployment-style synthetic workloads, the online controller achieves a 2.10x mean latency speedup over FP16 and a 0.48x mean energy ratio, corresponding to 51.7% lower energy per token. On automatic benchmarks used as a quality gate, accuracy remains close to FP16 with a mean delta of +0.17 percentage points. We also evaluate lightweight learned routers, but find that they do not clearly outperform the rule-based controller because they add routing overhead and more often select modes that violate quality, energy, or memory constraints. These results show that simple request-aware routing can recover substantial efficiency from existing inference modes without retraining the model or changing its architecture.
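
To make the idea of a rule-based request-boundary controller concrete, here is a minimal sketch. The mode names follow the abstract, but the request features, thresholds, and rule order are hypothetical and are not the paper's actual controller.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Cheap workload-level features (hypothetical feature set)."""
    prompt_tokens: int
    expected_output_tokens: int
    shared_prefix_tokens: int
    concurrent_requests: int
    quality_critical: bool


def choose_mode(req: Request) -> str:
    """Pick one fixed inference mode per request; all thresholds are assumptions."""
    if req.quality_critical:
        return "fp16"  # keep the full-precision reference path
    if req.concurrent_requests >= 8:
        return "int8+continuous_batching"  # many requests in flight
    if req.prompt_tokens > 0 and req.shared_prefix_tokens >= 0.5 * req.prompt_tokens:
        return "gptq+prefix_caching"  # most of the prompt can be reused
    if req.expected_output_tokens >= 256:
        return "speculative_decoding"  # long decode phase
    return "int8"  # default quantized mode


chat = Request(prompt_tokens=1000, expected_output_tokens=50,
               shared_prefix_tokens=800, concurrent_requests=1,
               quality_critical=False)
print(choose_mode(chat))
```

The decision is made once, at the request boundary, and costs a handful of comparisons. That is consistent with the abstract's observation that learned routers added overhead without clearly outperforming fixed rules.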

[NLP-51] Decomposing and Measuring Evaluation Awareness

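[Quick read]: This paper addresses the problem that frontier language models may recognize that they are being evaluated and adjust their behavior, which undermines the validity of benchmark results. Existing research lacks a shared theoretical foundation and conflates properties of the evaluation environment with properties of the model, and detection of an evaluation with the behavioral response to it. The key to the approach is to decompose evaluation awareness from a social-psychology perspective into an environment component (how recognizable the task is) and a model component (recognition versus propensity to act on it), and to quantify both empirically. The authors propose EvalAwareBench, a benchmark of 100 paired safety-capability tasks in which eight trigger factors (such as placeholder entities and grading-style output formats) can be toggled independently, so evaluative signals are varied while the underlying request stays fixed. Experiments show that recognition rates depend on the specific pairing of model and benchmark, and that recognition does not necessarily lead to behavioral change; when behavior does change, the direction depends on the type of evaluation the model perceives. Models are also more sensitive to safety evaluations, which puts safety benchmarks at greater risk. The framework provides tools to measure, attribute, and mitigate evaluation awareness, and points to behavioral consistency under recognition as a direction for improvement.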

Link: https://arxiv.org/abs/2605.23055
Authors: Changling Li, Terry Jingchen Zhang, Jie Zhang, Zhijing Jin, Sahar Abdelnabi, Maksym Andriushchenko
Affiliations: ETH Zürich; ELLIS Institute Tübingen; Max Planck Institute for Intelligent Systems; Tübingen AI Center; University of Toronto Vector Institute; EuroSafeAI
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Frontier language models sometimes recognize that they are being evaluated and adjust their behavior, undermining validity of benchmark results. Yet the field studies it without a shared foundation, conflating properties of the evaluation with properties of the model, and detection with behavioral response. We ground evaluation awareness in social psychology, decomposing it into an environment component (how recognizable the task is) and a model component that separates recognition from propensity to act on it. We operationalize the environment component through eight categorized trigger factors, such as placeholder entities and grading-style output formats, and study recognition and behavior through chain-of-thought monitoring. Across nine frontier models and four benchmarks, recognition rates depend on the specific pairing of model and benchmark rather than on either in isolation. Recognition rarely leads to behavioral change, and when it does, the direction depends on the type of evaluation perceived. Models are also more sensitive to safety than capability evaluations, placing safety benchmark validity at greater risk. To study which factors each model is sensitive to and how they interact, we propose EvalAwareBench, a factor-controlled benchmark of 100 paired safety-capability tasks where each of the eight factors can be independently toggled, varying evaluative signals while holding the underlying request fixed. Through EvalAwareBench, we find that no single factor uniformly affects all models, but stacking factors progressively raises evaluation awareness across all of them. Our framework and EvalAwareBench provide the tools to measure, attribute, and mitigate evaluation awareness, pointing to behavioral consistency under recognition as a promising path forward.

[NLP-52] Model Collapse as Cultural Evolution CONLL2026

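[Quick read]: This paper addresses the lack of a linguistic explanation for model collapse, the degradation of generative AI models trained on their own outputs: which linguistic structures degrade, in what order, and by what mechanism. The key to the approach is iterated learning theory from cultural evolution, from which the authors derive five falsifiable predictions and test them in cross-lingual (English, German, Turkish) self-training experiments on LLaMA-2-7B and Mistral-7B. The most discriminative finding is that compositionality follows a non-monotonic trajectory (first rising, then falling) under unfiltered self-training. This signature persists with maximally regular seed data (ruling out noise removal) and is sustained only by task-grounded filtering, not random filtering, providing the first LLM-scale empirical evidence for the compression-communication tradeoff. All predictions are supported with large effect sizes (Hedges' g = 1.6; BF₁₀ = 100), and the models' regularization gradients closely match human behavioral data (R² = 0.94). The study reframes model collapse as a cultural transmission phenomenon and offers actionable principles for designing self-training pipelines.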

Link: https://arxiv.org/abs/2605.23054
Authors: Dongxin Guo, Jikun Wu, Siu Ming Yiu
Affiliations: The University of Hong Kong; Stellaris AI Limited
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at CoNLL 2026. 18 pages, 3 figures, 2 tables

Click to view abstract

Abstract:Model collapse, the progressive degradation of LLMs trained on their own outputs, has been characterized statistically but lacks a linguistic explanation for which structures degrade, in what order, and why. We show that iterated learning theory from cultural evolution fills this gap. We derive five falsifiable predictions, distinguish those uniquely discriminative for the theory from confirmatory ones, and test them by self-training LLaMA-2-7B and Mistral-7B over 10 generations in English, German, and Turkish. The critical discriminative finding: compositionality follows a non-monotonic trajectory (initially rising, then falling) under unfiltered self-training. This signature persists with maximally regular seed data (ruling out noise removal) and is sustained only by task-grounded filtering, not random filtering, providing the first LLM-scale evidence for the compression-communication tradeoff. All predictions are confirmed with large effect sizes (Hedges’ g 1.6; BF₁₀ 100), and LLM regularization gradients closely match human behavioral data (R² = 0.94). These results reframe model collapse as a cultural transmission phenomenon and yield concrete principles for self-training pipeline design.

[NLP-53] DreamerNLplus: Interpretable Modeling of Mental Health Dynamics from Social Media Timelines using Hybrid Rule-Based and RAG Methods ACL

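[Quick read]: This paper addresses the complex problem of modeling mental health dynamics from social media timelines, covering the three tasks of the CLPsych 2026 shared task: psychological state modeling, temporal change detection, and sequence-level summarization. The key to the approach is a hybrid framework that combines the strengths of several techniques. Task 1 combines LLM-based data augmentation, DeBERTa classification, and Random Forest regression for structured psychological state prediction. Task 2 uses few-shot prompting with a locally deployed Llama 3.1 model to detect Switch and Escalation events from short-term temporal context. Task 3.2 uses a retrieval-augmented generation (RAG) method that ranks 1st for Improvement, showing that it captures recurrent patterns of psychological change across timelines. The analysis also exposes key challenges (the mismatch between classification and regression performance, the difficulty of modeling temporal transitions, and the disagreement between semantic and similarity-based evaluation metrics), which motivate future work on unified evaluation frameworks.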

Link: https://arxiv.org/abs/2605.23052
Authors: Maryia Zhyrko, Daisy Monika Lal, Erik van Mulligen, Lifeng Han
Affiliations: Leiden Institute of Advanced Computer Science (LIACS), Leiden University, NL; School of Computing and Communications (SCC), Lancaster University, UK; Department of Medical Informatics, Erasmus University Medical Center Rotterdam, NL; Biomedical Data Sciences, Leiden University Medical Center, NL
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by CLPsych 2026. CLPsych 2026 will be held at ACL in San Diego on July 4th, 2026

Click to view abstract

Abstract:We present DreamerNLplus, a hybrid framework for modeling mental health dynamics from social media timelines in the CLPsych 2026 shared task. Our system addresses three tasks: psychological state modeling, temporal change detection, and sequence-level summarization. For Task 1, we combine LLM-based data augmentation, DeBERTa classification, and Random Forest regression for structured state prediction. For Task 2, we use few-shot prompting with a locally deployed Llama 3.1 model to detect Switch and Escalation events using short-term temporal context. For Task 3.1, we explore both a deterministic rule-based summarization pipeline and a few-shot LLM-based approach, ranking 2nd officially. Our RAG-based method achieves strong performance in Task 3.2, ranking 1st for Improvement and 3rd for Deterioration, demonstrating its ability to capture recurrent psychological change patterns across timelines. Our analysis reveals key challenges, including the mismatch between classification and regression performance, the difficulty of modeling temporal transitions, and the disagreement between semantic and similarity-based evaluation metrics. These findings highlight the complexity of modeling mental health dynamics and motivate future work on unified evaluation frameworks. We share our code and prompts at this https URL.

[NLP-54] HawkesLLM: Semantic Uncertainty Propagation in Agentic Text Simulation ICML2026

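[Quick read]: This paper addresses path dependence in agentic text-simulation systems, where ambiguity in early outputs carries over into later generation steps. The key to the approach is HawkesLLM, a framework that separates temporal influence modeling from text generation. The cascade is represented as a network whose nodes are text-generating agents; a multivariate Hawkes process models how nodes activate over time and which earlier outputs should influence later prompts; and a language model then writes each new event from the compact memory that this temporal model selects. Evaluated on a held-out GDELT news-cascade case study, the method improves late-stage semantic alignment under a compact prompt-memory budget and separates local drift from global drift.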

Link: https://arxiv.org/abs/2605.23043
Authors: Zewei Deng, Tinghan Ye, Liyan Xie
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: 10 pages, 4 figures, Accepted at the ICML 2026 Workshop on Statistical Frameworks for Uncertainty in Agentic Systems

Click to view abstract

Abstract:Agentic text-simulation systems write in sequence, with each item becoming possible context for later steps. That makes uncertainty path-dependent: an early ambiguity can affect later outputs. This paper studies this problem with HawkesLLM, a framework that separates temporal influence modeling from text generation. We represent the cascade as a network whose nodes are text-generating agents. A multivariate Hawkes process models how these nodes activate over time and which earlier node outputs should influence later prompts. A language model then writes each new event from the compact memory selected by this temporal model. We evaluate the framework on a held-out Global Database of Events, Language, and Tone (GDELT) news-cascade case study. The diagnostics track semantic alignment with local held-out references and separate local drift from global drift. In this setting, HawkesLLM improves late-stage semantic alignment under a compact prompt-memory budget.
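
A multivariate Hawkes process gives each node a conditional intensity that rises after earlier events on influencing nodes. The sketch below assumes an exponential decay kernel, which the abstract does not specify; the parameter values and the `top_influencers` helper, which ranks earlier events as candidate prompt memory, are hypothetical.

```python
import math


def intensity(i, t, events, mu, alpha, beta):
    """lambda_i(t) = mu[i] + sum over past events (t_k, j) of alpha[i][j] * exp(-beta * (t - t_k))."""
    lam = mu[i]
    for t_k, j in events:
        if t_k < t:
            lam += alpha[i][j] * math.exp(-beta * (t - t_k))
    return lam


def top_influencers(i, t, events, alpha, beta, k):
    """Rank past events by their current contribution to node i's intensity."""
    scored = [
        (alpha[i][j] * math.exp(-beta * (t - t_k)), t_k, j)
        for t_k, j in events
        if t_k < t
    ]
    scored.sort(reverse=True)
    return [(t_k, j) for _, t_k, j in scored[:k]]


# Two agents; events are (time, node) pairs. All values are illustrative.
mu = [0.1, 0.2]                   # baseline rates
alpha = [[0.0, 0.5], [0.3, 0.0]]  # alpha[i][j]: influence of node j on node i
beta = 1.0                        # decay rate of the exponential kernel
events = [(0.0, 1), (1.0, 0)]

lam0 = intensity(0, 1.0, events, mu, alpha, beta)
memory_for_node1 = top_influencers(1, 2.0, events, alpha, beta, k=1)
print(lam0, memory_for_node1)
```

Selecting only the highest-contribution earlier events is one simple way to keep the prompt memory compact while still passing on the outputs that the temporal model considers influential.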

[NLP-55] Do Language Models Know What Not to Say? Causal Evidence for Statistical Preemption in LLMs CONLL2026

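[Quick read]: This paper asks how learners acquire knowledge of unacceptable structures without negative evidence. Construction Grammar proposes statistical preemption: frequent exposure to a conventional form (such as “donated the books to the library”) suppresses structurally possible but unattested alternatives (such as “*donated the library the books”). The key contribution is to dissociate statistical preemption from the competing entrenchment hypothesis directly in large language models (LLMs), for the first time within a single converging design. Four experiments show that: (1) LLM surprisal correlates strongly with human acceptability judgments (r = 0.79), validated against three independent behavioral datasets; (2) the pattern is driven by competing-form frequency rather than overall verb frequency, as confirmed by non-circular partial correlations; (3) preemption sensitivity grows as a power law with model size; and (4) in a controlled fine-tuning intervention, manipulating competing-form frequencies shifts preemption behavior in the predicted direction, with reverse-direction controls ruling out frequency-sensitivity confounds. These results provide converging evidence that neural language models acquire negative linguistic knowledge through distributional competition, the core mechanism posited by Construction Grammar.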

Link: https://arxiv.org/abs/2605.23039
Authors: Dongxin Guo, Jikun Wu, Siu Ming Yiu
Affiliations: The University of Hong Kong; Stellaris AI Limited
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at CoNLL 2026. 21 pages (9 main body + appendices and references); 4 figures, 14 tables

Click to view abstract

Abstract:How do learners acquire knowledge of what is unacceptable without negative evidence? Construction Grammar proposes statistical preemption: exposure to a conventional form (e.g., “donated the books to the library”) preempts structurally possible but unattested alternatives (“*donated the library the books”). We present a computational study that, for the first time, directly dissociates statistical preemption from the competing entrenchment hypothesis in large language models within a single converging design. Across four experiments spanning 120 English verb-construction pairings (dative, causative, locative), we show that (1) LLM surprisal patterns correlate strongly with human acceptability judgments ( r = 0.79 ), validated against three independent behavioral datasets; (2) these patterns are driven by competing-form frequency rather than overall verb frequency, confirmed by non-circular partial correlations; (3) preemption sensitivity scales as a power law with model size; and (4) a controlled fine-tuning intervention causally demonstrates that manipulating competing-form frequencies shifts preemption behavior in the predicted direction, with reverse-direction controls ruling out frequency-sensitivity confounds. These results provide converging evidence that neural language models acquire negative linguistic knowledge through distributional competition, the core mechanism posited by Construction Grammar.
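
Surprisal, the quantity compared against human judgments in the abstract, is the negative log probability a model assigns to a string. The sketch below sums per-token surprisal in bits; the token probabilities are hypothetical placeholders, not outputs of any model.

```python
import math


def surprisal_bits(token_probs):
    """Total surprisal of a sequence: sum of -log2 p(token | context)."""
    return sum(-math.log2(p) for p in token_probs)


# Hypothetical conditional probabilities for the critical region of each variant.
conventional = [0.5, 0.25]   # e.g. the prepositional dative continuation
preempted = [0.5, 0.0625]    # e.g. the unattested double-object continuation

gap = surprisal_bits(preempted) - surprisal_bits(conventional)
print(surprisal_bits(conventional), surprisal_bits(preempted), gap)
```

A positive gap means the model finds the unattested form less expected than its conventional competitor. The abstract's experiments ask whether that gap tracks the frequency of the competing form rather than the overall frequency of the verb.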

[NLP-56] Multilingual Steering by Design: Multilingual Sparse Autoencoders and Principled Layer Selection ACL2026

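【Quick read】: This paper addresses the unreliability of language control with sparse autoencoders (SAEs) in multilingual settings, which has two causes: most existing SAEs are trained on English-only data, giving weak cross-lingual representations, and steering layers are chosen heuristically. The proposed solution is a principled, mechanistic account of multilingual steering. First, training SAEs on multilingual data strengthens cross-lingual representations and yields more reliable, quality-preserving language control across layers and model families. Second, an a priori layer-selection rule, based on the intersection of multilingual alignment and language separability, predicts effective intervention depths without exhaustive layerwise search. On LLaMA-3.1-8B and Gemma-2-9B, across machine translation and cross-lingual summarization, multilingual SAEs combined with intersection-selected layers stabilize the trade-off between language identification accuracy and generation quality.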

Link: https://arxiv.org/abs/2605.23036
Authors: Yusser Al Ghussin, Daniil Gurgurov, Tanja Baeumel, Josef van Genabith, Patrick Schramowski, Simon Ostermann
Affiliations: Saarland University; German Research Center for Artificial Intelligence (DFKI); TU Darmstadt; hessian.AI; Centre for European Research in Trusted AI (CERTAIN)
Categories: Computation and Language (cs.CL)
Comments: Accepted to TrustNLP Workshop at ACL 2026

Abstract: Sparse autoencoders (SAEs) enable feature-level mechanistic interpretability and activation steering in large language models (LLMs), but SAE-based language control remains unreliable in multilingual settings: most SAEs are trained on English-only data, and steering layers are chosen heuristically. We address these limitations by advancing a principled, mechanistic account of multilingual language steering with SAEs. First, we show that training SAEs on multilingual data consistently strengthens cross-lingual representations and yields more reliable, quality-preserving language control across layers and model families. Second, we introduce an a priori steering layer-selection rule based on the intersection of multilingual alignment and language separability, which predicts effective intervention depths without exhaustive layerwise search. We evaluate our approach on LLaMA-3.1-8B and Gemma-2-9B across machine translation and cross-lingual summarization (CrossSumm), using SpBLEU, ROUGE-L, COMET, and LaSE. Our results show that multilingual SAEs combined with intersection-selected layers stabilize the trade-off between language identification accuracy and generation quality, providing a principled, predictive, representation-level account of multilingual SAE steering.

[NLP-57] Sparse Autoencoders Map Brain-LLM Alignment onto Cortical Semantic Topography CONLL2026

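【Quick read】: Intermediate layers of large language models (LLMs) best predict human brain responses to language, but the mechanism behind this is unexplained. The paper combines sparse autoencoders (SAEs) with neural encoding models, decomposing GPT-2 XL and Llama-3.1-8B into 16K–32K interpretable features per layer. Using a human-validated taxonomy (κ ≥ 0.74), it finds that semantic features alone recover 94% of peak encoding performance (r = 0.285), substantially exceeding variance-matched baselines (p < 0.001, d = 1.31). It then tests a novel cortical topography prediction: five semantic subcategories, derived a priori from three independent neuroscience research programs, should map onto distinct brain regions. A formal convergence test confirms the alignment (Spearman ρ = 0.72, p < 0.001; hypergeometric p = 0.007), indicating that SAE-discovered features recapitulate known cortical semantic organization at a granularity inaccessible to prior methods. SAE features also predict human reading times beyond lexical controls (ΔlogLik = 38.4, p < 0.001), and an exploratory prediction-error analysis gives preliminary evidence that the brain additionally encodes unexpected semantic content. Results generalize across English, Chinese, and French.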

Link: https://arxiv.org/abs/2605.23035
Authors: Dongxin Guo, Jikun Wu, Siu Ming Yiu
Affiliations: The University of Hong Kong; Stellaris AI Limited
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Comments: Accepted at CoNLL 2026. 20 pages (9 main + 1 limitations/acknowledgments + 3 references + 7 appendix), 5 figures, 20 tables

Abstract: Intermediate layers of large language models (LLMs) best predict human brain responses to language, one of the most robust findings in computational neurolinguistics, yet why remains mechanistically unexplained. We address this gap by bridging sparse autoencoders (SAEs) from mechanistic interpretability with neural encoding models, decomposing GPT-2 XL and Llama-3.1-8B into 16K-32K interpretable features per layer. A human-validated taxonomy (κ ≥ 0.74) reveals that semantic features alone recover 94% of peak encoding performance (r = 0.285), substantially exceeding variance-matched baselines (p < 0.001, d = 1.31). Beyond this aggregate dominance, we test a novel cortical topography prediction: five semantic subcategories derived a priori from three independent neuroscience programs should map onto distinct brain regions. A formal convergence test confirms this alignment (Spearman ρ = 0.72, p < 0.001; hypergeometric p = 0.007), demonstrating that SAE-discovered features recapitulate known cortical semantic organization at a granularity inaccessible to prior methods. SAE features further predict human reading times beyond lexical controls (ΔlogLik = 38.4, p < 0.001), and an exploratory prediction-error analysis provides preliminary evidence that the brain additionally encodes unexpected semantic content. Results generalize across English, Chinese, and French.

[NLP-58] Brain-LLM Alignment Tracks Training Data Not Typology CONLL2026

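【Quick read】: This paper asks whether brain-LLM alignment generalizes across languages and what governs its variation. Using cross-lingual fMRI data and a comparison of multiple models, it finds that training-language dominance, not an inherent property of English, drives the alignment pattern: a Chinese-dominant model (Baichuan2-7B) aligns best with Chinese brains and worst with English brains. Beyond training dominance, formal typological distance independently covaries with alignment degradation; syntax-associated regions (IFG) show roughly 2.3× steeper typological gradients than lexico-semantic regions (PTL); and tokenization fertility accounts for about 60% of the cross-linguistic shift in the optimal encoding layer. The apparent “English advantage” is therefore an artifact of training data composition, while the remaining variation reflects genuine typological structure concentrated in syntactic processing.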

Link: https://arxiv.org/abs/2605.23032
Authors: Dongxin Guo, Jikun Wu, Siu Ming Yiu
Affiliations: The University of Hong Kong; Stellaris AI Limited
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Comments: Accepted to CoNLL 2026. 9 pages main content + 4 pages references + 6 pages appendix; 4 figures, 13 tables

Abstract: Brain-LLM alignment is well established in English, yet the brain’s language network is neuroanatomically universal across languages. Does alignment also generalize cross-linguistically, and what governs the variation? We test this using fMRI data from 112 participants across English, Chinese, and French (the Le Petit Prince corpus) and seven LLMs spanning English-dominant, Chinese-dominant, and multilingual architectures. Our central finding is that training-language dominance, not an inherent property of English, drives the alignment pattern: a Chinese-dominant model (Baichuan2-7B), architecture-matched to LLaMA-2-7B, reverses the gradient entirely, aligning best with Chinese brains and worst with English. Beyond training dominance, formal typological distance independently covaries with alignment degradation, syntax-associated brain regions (IFG) show 2.3× steeper typological gradients than lexico-semantic regions (PTL), and tokenization fertility accounts for ~60% of a cross-linguistic shift in optimal encoding layer. These results reveal that the apparent “English advantage” in brain-LLM alignment is an artifact of training data composition, while the remaining variation reflects genuine typological structure concentrated in syntactic processing.
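
Tokenization fertility is commonly taken to mean the average number of subword tokens a tokenizer produces per word. The sketch below illustrates that common definition only; the paper may define or measure it differently, particularly for Chinese, where word boundaries are not marked by whitespace. `chunk_tokenizer` is a hypothetical stand-in, not a real model tokenizer.

```python
def fertility(words, tokenize):
    """Average number of tokens produced per word."""
    total_tokens = sum(len(tokenize(w)) for w in words)
    return total_tokens / len(words)

def chunk_tokenizer(word, size=3):
    """Hypothetical stand-in tokenizer: split a word into fixed-size character chunks."""
    return [word[i:i + size] for i in range(0, len(word), size)]

words = "the little prince looked at the stars".split()
print(fertility(words, chunk_tokenizer))
```

Swapping in a real tokenizer's encode function for `chunk_tokenizer` would give a per-language fertility figure that could then be compared across languages.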

[NLP-59] RADAR: Relative Angular Divergence Across Representations

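【Quick read】: This paper targets negative transfer in foundation models: adding data sources can fail to improve, or even degrade, downstream performance, and there is no reliable way to estimate transferability beforehand. It proposes RADAR, a simple, geometrically grounded metric for cross-domain transferability. RADAR analyzes the layer-wise evolution of representations by measuring angular alignment and relative changes in distance along layer-to-layer displacement trajectories, and compares the empirical distributions of within-domain and cross-domain dynamics. On cross-lingual sentiment classification and cross-domain image classification, RADAR is competitive with existing transferability metrics, with particularly strong results when domain transitions are smooth or cleanly separated. Ablations suggest that the effectiveness of transferability estimation depends on the geometry of the model's internal representation space.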

Link: https://arxiv.org/abs/2605.23028
Authors: Xavier Cadet, Mateusz Nowak, Peter Chin
Affiliations: Dartmouth College
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 27 pages; 8 figures; 10 tables

Abstract:Machine learning methods rely on data. However, gathering suitable data can be challenging due to availability constraints, cost, or the need for domain expertise. Expanding datasets with additional sources is a common response to limited data, yet this practice does not always improve downstream performance and can sometimes lead to a loss of performance, known as negative transfer. We propose RADAR, a simple, geometrically grounded metric for estimating cross-domain transferability in foundation models. RADAR analyzes the layer-wise evolution of representations by measuring angular alignments and relative changes in distance along layer-to-layer displacement trajectories, and by comparing empirical distributions of within-domain and cross-domain dynamics. We hypothesize that domain transferability is related to the divergence between these trajectory distributions. We evaluate the metric across multiple modalities, including cross-lingual sentiment classification with text embedding models and cross-domain image classification with foundation vision models. Across several settings, RADAR provides competitive predictive performance relative to existing transferability metrics on several vision and text benchmarks, with particularly strong results when domain transitions are smooth or cleanly separated. Our ablations further suggest that the effectiveness of transferability estimation depends on the geometry of the model’s internal representation space, with different modalities favoring different topological formulations.
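
The abstract describes two per-trajectory quantities: angular alignment between layer-to-layer displacements, and relative change in distance. The sketch below is one plausible reading of those two quantities for a single input. It is not RADAR's actual definition: the function names and normalization are assumptions, and the divergence between within-domain and cross-domain distributions is not covered.

```python
import math

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))

def trajectory_features(layers):
    """For one input's per-layer vectors, return (angles, relative_steps).

    angles[i]: cosine between consecutive layer-to-layer displacements.
    relative_steps[i]: displacement length divided by the norm of the layer it leaves.
    """
    disps = [sub(layers[i + 1], layers[i]) for i in range(len(layers) - 1)]
    angles = [cosine(disps[i], disps[i + 1]) for i in range(len(disps) - 1)]
    relative_steps = [norm(d) / norm(h) for d, h in zip(disps, layers)]
    return angles, relative_steps

# Toy three-layer trajectory in 2-D, invented for illustration
layers = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
angles, steps = trajectory_features(layers)
print(angles, steps)
```

Collecting these features over many inputs from two domains would give the empirical distributions that the abstract proposes to compare.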

[NLP-60] The Deterministic Horizon: Impossibility Results as Design Specifications for Trustworthy AI Systems

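【Quick read】: Large language models (LLMs) now write software, draft legal documents, and produce clinical notes, yet fundamental limits, from Turing and Arrow to the No Free Lunch theorems, bound what computation can do. This thesis asks how such impossibility results can be turned into actionable design rules for trustworthy AI. Its flagship result is the Deterministic Horizon: an accuracy ceiling set by architecture alone, computable before deployment from layer count and embedding width, that no change in training data size, adapter rank, or loss function moves once a critical reasoning depth is passed. The mechanism is a capacity invariant of the residual stream, and an information-theoretic conversion yields super-exponential accuracy decay past the horizon. The same argument is recast for preference learning, multi-stage retrieval pipelines, mechanism design, and zero-knowledge verification, giving a catalogue of sixteen specifications, each pairing a computable boundary, a quantified violation cost, and a constructive design rule.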

Link: https://arxiv.org/abs/2605.23024
Authors: Dongxin Guo
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: PhD thesis, Department of Computer Science, The University of Hong Kong, 2026. 271 pages, 18 figures, 15 tables, 5 algorithms

Abstract:Large language models now write software, draft legal documents, and produce clinical notes, yet fundamental limits, from Turing and Arrow to the No Free Lunch theorems, shape what computation can do. This thesis turns such impossibility results from curiosities into design rules. Its flagship result proves an accuracy ceiling set by architecture alone: past a critical reasoning depth, no amount of training moves it, at any adapter rank, sample size, or loss function. Computable before deployment from layer count and embedding width, this Deterministic Horizon is measured between nineteen and thirty-one across twelve transformer architectures, and fine-tuning on optimal-length traces recovers under four percentage points. The mechanism is a capacity invariant of the residual stream, and an information-theoretic conversion yields super-exponential accuracy decay past the horizon. An unconditional circuit-complexity lower bound for modular exponentiation against constant-depth prime-modulus circuits complements this result. The same argument recasts across subfields: preference learning under any misspecified model jumps discontinuously in sample complexity; multi-stage retrieval pipelines require at least as many independent metrics as stages; standard truthful auctions fail for agents with prompt-dependent valuations; and zero-knowledge verification of neural inference pays a measured overhead of one hundred ten to one hundred ninety times per non-linear activation. Together these form a catalogue of sixteen specifications, each pairing a computable boundary, a quantified violation cost, and a constructive design rule: two compositions are proved, one pairing is an honest obstruction, and four remain open. The impossibility-specification methodology is offered for the generative research programme that trustworthy AI may need. Every fundamental limit of AI is also a design rule.

[NLP-61] A Proactive Multi-Agent Dialogue Framework for Assessing Social Language Disorder Traits in Autism

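【Quick read】: Linguistic behaviors associated with Social Language Disorder (SLD) in autism spectrum disorder (ASD), such as echoic repetition, pronoun displacement, and stereotyped media quoting, are largely absent from spontaneous conversation and emerge only under specific conversational conditions. How much diagnostic information a clinical conversation yields therefore depends on the choice of questioning strategy, a factor that is often underappreciated. The paper proposes TPA (Think, Plan, Ask), a proactive multi-agent dialogue framework in which a doctor agent explicitly reasons about which SLD traits remain unobserved, selects a clinically grounded questioning strategy, and generates a targeted question. A patient agent grounded in real ADOS-2 clinical data makes the evaluation reproducible. On 484 episodes from 35 patients, TPA reaches 82.1% SLD trait coverage, compared with 65.5% for automated replay of real clinical dialogues, and higher per-turn diagnostic efficiency (AUCC: 0.628 vs. 0.458).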

Link: https://arxiv.org/abs/2605.22993
Authors: Chuanbo Hu, Minglei Yin, Bin Liu, Wenqi Li, Lynn K. Paul, Shuo Wang, Xin Li
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Characteristic linguistic behaviors associated with Social Language Disorder (SLD) in autism spectrum disorder, including echoic repetition, pronoun displacement, and stereotyped media quoting, are largely absent from spontaneous conversation and only emerge under specific conversational conditions. In structured clinical assessments, this latency means that questioning strategy selection is a critical yet underappreciated determinant of how much diagnostic information a conversation yields. Whether large language models (LLMs) can be guided to proactively select questioning strategies that systematically surface these latent traits remains largely unexplored. Here we present TPA (Think, Plan, Ask), a proactive multi-agent dialogue framework applied to the language assessment component of the Autism Diagnostic Observation Schedule Module 4 (ADOS-2), in which a doctor agent explicitly reasons about which traits remain unobserved before selecting a clinically grounded strategy and generating a targeted question. A patient agent grounded in real ADOS-2 clinical data enables reproducible evaluation without real patient participation, validated across three independent experiments confirming adequate fidelity to real patient language. Evaluated on 484 episodes from 35 patients, TPA outperforms six competitive dialogue planning baselines across all primary metrics, achieving 82.1% SLD trait coverage, 16.6% higher than automated replay of real clinical dialogues conducted by trained clinicians (65.5%), with substantially greater per-turn diagnostic efficiency (AUCC: 0.628 vs. 0.458, absolute gain +0.170). These results demonstrate that proactive questioning strategy selection substantially improves the efficiency of automated SLD trait assessment, with direct implications for scalable AI-assisted clinical screening.

[NLP-62] Memorization Dynamics of Fill-in-the-Middle Pretraining ICML2026

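【Quick read】: The effect of the fill-in-the-middle (FIM) pretraining objective on verbatim memorization in causal language models is not well understood. The paper studies it in a controlled setting by pretraining matched Llama 3.2 models with FIM and standard left-to-right (LTR) objectives on a FineWeb-Gutenberg corpus containing repeated Gutenberg excerpts, then probing them with prefix-based probes. FIM more often recovers short or partially matching spans, while LTR more often assigns high confidence to long exact continuations; verbatim extraction under FIM training grows approximately linearly with repetitions over the tested range. With native FIM-format probes, suffix context alone is not sufficient: recall under FIM training remains strongly anchored in prefix context. The results also show that evaluating only one span length or probing format can miss important nuances in memorization behavior.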

Link: https://arxiv.org/abs/2605.22981
Authors: Tobias von Arx, Tanguy Dieudonné
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: MemFM @ ICML 2026

Abstract:Fill-in-the-middle (FIM) is a pretraining objective widely used to equip causal language models with infilling ability, yet its effect on verbatim memorization remains underexplored. We study the memorization dynamics of FIM in a controlled setting by pretraining matched Llama 3.2 models with FIM and standard left-to-right (LTR) objectives on a FineWeb-Gutenberg corpus containing repeated Gutenberg excerpts. With prefix-based probes, FIM more often recovers short or partially matching spans, while LTR more often assigns high confidence to long exact continuations. We observe that verbatim extraction under FIM-training grows approximately linearly with repetitions over the tested range. Evaluating native FIM-format probes reveals that suffix context is not sufficient: verbatim recall under FIM-training remains strongly anchored in prefix context. Our results also show that evaluating only one span length or probing format can miss important nuances in memorization behavior.

[NLP-63] A Reproducible Universal Dependencies-Style Pipeline for Katharevousa Greek Parliamentary Text KR

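【Quick read】: Katharevousa Greek is poorly served by contemporary NLP pipelines despite its importance for legal, administrative, and parliamentary archives. The paper builds a reproducible end-to-end workflow covering OCR-aware text reconstruction, schema-constrained LLM-assisted annotation, automatic validation, deterministic CoNLL-U snapshotting, fixed-split evaluation, and model-family comparison. The frozen, automatically validated reference set contains 1,697 sentences (1,357 training, 340 held-out test). Under one scoring protocol it compares off-the-shelf Greek and Ancient Greek parsers, a feature-based parser, mBERT, XLM-R, and custom Stanza training. Off-the-shelf systems show substantial register mismatch: the strongest external baseline, spaCy Greek, reaches 0.4183 LAS, whereas the best XLM-R model reaches 0.5162 LAS, an absolute gain of 0.0980. The feature-based model remains competitive for UPOS and relation labeling, indicating that transparent lexical-context features still matter at this data scale. Code, schema, frozen reference annotations, the fixed train/test split, and per-model benchmark reports are released.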

Link: https://arxiv.org/abs/2605.22978
Authors: George Mikros, Fotios Fitsilis
Affiliations: Hamad Bin Khalifa University; Universidad Austral
Categories: Computation and Language (cs.CL)
Comments: 12 pages, 1 figure, 2 tables; companion to the kathnlp open-source release at this https URL

Abstract:Katharevousa Greek remains poorly served by contemporary NLP pipelines despite its importance for legal, administrative, and parliamentary archives. We present a reproducible workflow for building and evaluating a Universal Dependencies-style parsing resource for Katharevousa parliamentary questions from Greece’s early post-junta period. The pipeline links OCR-aware reconstruction, schema-constrained LLM-assisted annotation, automatic validation, deterministic CoNLL-U snapshotting, fixed-split evaluation, and model-family comparison. The frozen automatically validated reference set contains 1,697 sentences, split into 1,357 training sentences and 340 held-out test sentences. We compare off-the-shelf Greek and Ancient Greek parsers, a feature-based parser, mBERT, XLM-R, and custom Stanza training under the same scoring protocol. Off-the-shelf systems show substantial register mismatch: the strongest external baseline, spaCy Greek, reaches 0.4183 LAS. The best structural parser, an XLM-R model, reaches 0.8893 UPOS accuracy, 0.7250 dependency-relation F1, 0.6098 UAS, and 0.5162 LAS, an absolute LAS gain of 0.0980 over the best external baseline. The feature-based model remains competitive for UPOS and relation labeling, indicating that transparent lexical-context features still matter at this data scale. Beyond scores, the paper contributes an auditable methodology for turning difficult historical parliamentary OCR into reusable syntactic NLP infrastructure. The entire pipeline – code, schema, frozen reference annotations, fixed train/test split, and per-model benchmark reports – is released as an open-access companion to this paper.
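
The abstract reports UAS and LAS. Under their standard definitions, UAS is the fraction of tokens with the correct head, and LAS is the fraction with both the correct head and the correct relation label. The sketch below computes both on a toy sentence. It ignores choices a real scoring protocol has to make, such as punctuation handling and token alignment, and the example annotations are invented.

```python
def attachment_scores(gold, pred):
    """Return (UAS, LAS) for parallel lists of (head_index, relation) pairs, one per token."""
    assert len(gold) == len(pred) and gold
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))  # correct head only
    full_hits = sum(g == p for g, p in zip(gold, pred))        # correct head and label
    n = len(gold)
    return head_hits / n, full_hits / n

# Invented four-token example: (head index, dependency relation)
gold = [(2, "det"), (3, "nsubj"), (0, "root"), (3, "obj")]
pred = [(2, "det"), (3, "obj"), (0, "root"), (2, "obj")]
uas, las = attachment_scores(gold, pred)
print(uas, las)
```

Here three of four heads are right but only two tokens also have the right label, so LAS is lower than UAS, the same relationship as in the reported 0.6098 UAS and 0.5162 LAS.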

[NLP-64] When AI Takes Sides on Questions of Faith: Persistent Asymmetries in AI-Mediated Faith Guidance

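【Quick read】: This paper asks whether large language models (LLMs) treat queries about religious conversion symmetrically, and finds that they do not. Twenty commercial and open-source models were probed across 182 religion pairings through a simulated user asking for advice on a potential faith conversion, with responses scored by a human-verified LLM-as-a-judge framework. When the direction of a conversion was reversed, models showed consistent asymmetries: on average, Catholic, Bahá’í, and Sikh religions were broadly favored, while Atheists, Agnostics, and Jehovah’s Witnesses were primarily disfavored. Patterns varied by model size and provider, and they persisted across question phrasings and variations in the pairing dataset, suggesting that the asymmetry is a robust property of model behavior rather than a scoring artifact.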

Link: https://arxiv.org/abs/2605.22975
Authors: Brett Israelsen, Sheryl Carty, Josh Coates, Nancy Fulda, Julie Park, Pete Whiting
Affiliations: unknown
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 29 pages, 16 figures

Abstract: We ask whether large language models (LLMs) treat queries about religious conversion symmetrically. The answer is no. When asked for advice on hypothetical faith transitions from one religion to another, then asked the reversed question, models exhibited consistent asymmetries, favoring some religions while subtly discouraging conversion to others. On average Catholic, Bahá’í, and Sikh religions were broadly favored (high support for joining, low support for leaving), while Atheists, Agnostics, and Jehovah’s Witnesses were primarily disfavored. Patterns varied by model size and model provider, with Grok 4.20 exhibiting the strongest asymmetries. We tested 20 commercial and open-source language models across 182 religion pairings using a human-verified LLM-as-a-judge framework. Each model was probed via interactions with a simulated user asking for advice on a potential faith conversion. Models tended to use more encouraging language for some faith transitions over others; these patterns were systematically repeatable across multiple trials. All LLMs tested exhibited reproducible asymmetry, though the pattern of preferences differed for each. Overall preferences persist across multiple question phrasings and variations in the religious pairing dataset. Taken together, these results suggest that asymmetry is a robust property of model behavior rather than an artifact of how the models’ answers were scored. It is important to consider that any imbalances deployed and reproduced en masse can have real-world implications.

[NLP-65] Graph Alignment Topology as an Inductive Bias for Grounding Detection

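【Quick read】: Large language models (LLMs) are optimized to produce plausible continuations rather than to verify that generated propositions are entailed by source documents, which limits their use in domains where factual correctness is crucial, such as clinical decision support. Existing hallucination detection approaches rely on retrieval augmentation, self-consistency, or claim verification, and generally do not learn directly over alignment topology. The paper constructs aligned bipartite graphs between reference information and LLM outputs and trains a graph neural network (GNN) to model the alignment structure through message passing. The method achieves state-of-the-art results on four diverse hallucination and question-answering datasets, outperforming all compared methods, including GPT-4o.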

Link: https://arxiv.org/abs/2605.22963
Authors: Paul Landes, Pranav Herur, Adam Cross, Jimeng Sun
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) are optimized to produce distributionally plausible continuations rather than to explicitly verify whether generated propositions are entailed by source documents. This inductive bias enables generalization, but it does not encode whether responses are grounded with respect to a reference. These issues limit the use of LLMs in domains where strict factual correctness is crucial, such as clinical decision support. Existing hallucination detection approaches improve factuality through retrieval augmentation, self-consistency, or claim verification, but generally do not learn directly over alignment topology. To leverage alignment topology as an inductive bias, we construct aligned bipartite graphs between reference information and LLM outputs and train a graph neural network (GNN) to model alignment structure using message passing. The method achieves state-of-the-art results on four diverse hallucination and question-answering datasets, outperforming all compared methods, including foundational LLMs such as GPT-4o.

[NLP-66] Learnability-Informed Fine-Tuning of Diffusion Language Models

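【Quick read】: This paper aims to improve the reasoning of diffusion language models (DLMs), for which standard supervised fine-tuning (SFT) can be ineffective or even harmful. Its analysis shows that vanilla SFT overlooks learnability, that is, which tokens are learned and when: rare tokens are hard to learn when most of the input is masked, while common tokens are trivial, and so of little value, to learn when most of the input is unmasked. The proposed algorithm, LIFT, learns easy tokens when most of the input is masked and hard tokens when more context is available, aligning training with the information available at different diffusion time steps. LIFT outperforms existing SFT baselines on six reasoning benchmarks, with up to a 3x relative gain on AIME’24 and AIME’25.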

Link: https://arxiv.org/abs/2605.22939
Authors: Shubham Parashar, Atharv Chagi, Jacob Helwig, Lakshmi Jotsna, Sushil Vemuri, James Caverlee, Dileep Kalathil, Shuiwang Ji
Affiliations: unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:We aim to improve the reasoning capabilities of diffusion language models (DLMs). While SFT is a popular post-training recipe for autoregressive models, its use in DLMs faces challenges and can even hurt performance, though the underlying causes remain understudied. Our analysis reveals that vanilla SFT overlooks learnability, namely what and when tokens are learned. Specifically, rare tokens are difficult to learn when most of the input is masked, whereas it is straightforward and thus of little value to learn common tokens when most of the input is unmasked. Motivated by our analysis, we propose LIFT, an efficient SFT-based post-training algorithm for DLMs. LIFT learns easy tokens when most of the input is masked and hard tokens when more context is available, thus aligning the training with the information available at different diffusion time steps. Our results show that LIFT outperforms existing SFT baselines across six reasoning benchmarks, achieving up to a 3x relative gain on AIME’24 and AIME’25. Our code is publicly available at this https URL.
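
The abstract states the idea behind LIFT but not its exact rule. Purely to make the idea concrete, the sketch below uses a made-up weighting in which heavily masked inputs emphasize easy tokens and lightly masked inputs emphasize hard ones. The function `learnability_weights`, the difficulty scores, and the linear form are all assumptions for illustration, not the authors' algorithm.

```python
def learnability_weights(difficulty, mask_ratio):
    """Hypothetical per-token loss weights (not LIFT's actual rule).

    difficulty: per-token scores in [0, 1] (0 = easy/common, 1 = hard/rare).
    mask_ratio: fraction of the input that is masked, in [0, 1].
    Heavily masked inputs favor easy tokens; lightly masked inputs favor hard ones.
    """
    target = 1.0 - mask_ratio  # difficulty level emphasized at this mask ratio
    return [max(0.0, 1.0 - abs(d - target)) for d in difficulty]

difficulty = [0.0, 0.5, 1.0]
print(learnability_weights(difficulty, mask_ratio=0.9))  # mostly masked input
print(learnability_weights(difficulty, mask_ratio=0.1))  # mostly unmasked input
```

In a training loop, weights like these would multiply the per-token loss at each sampled diffusion time step.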

[NLP-67] RAS: Reflection-Augmented Scaling with In-Context Learning for Executable Cypher Query Generation

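【Quick read】: This paper studies how to allocate inference-time compute to reduce execution errors in structured query generation, specifically Text2Cypher. Independent Scaling (IS) resamples queries without memory and discards the syntax error messages returned by the database. The proposed Reflection-Augmented Scaling (RAS) conditions each new attempt on prior execution feedback through in-context learning (ICL), forming a feedback-driven iterative loop. Across three Neo4j datasets and five code-specialized language models, RAS reduces the Query Execution Error Rate by 41–50% at n=5, against 32–38% for IS, showing that structuring inference-time compute around execution errors is a more efficient path to executability than scaling independent samples.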

Link: https://arxiv.org/abs/2605.22937
Authors: Minseok Jung, Abhas Ricky, Muhammad Rameez Chatni
Affiliations: Cloudera
Categories: Computation and Language (cs.CL)
Comments:

Abstract: Inference-time scaling can reduce errors in structured query generation, but methods to allocate the compute for query code generation remain underexplored. We study Text2Cypher, where language models generate Cypher queries that execute against property graph databases. Non-executable queries constitute a distinct syntactic failure separate from semantic inaccuracy: a syntax error triggers a system-generated error message from the database. These error messages are typically discarded at inference time rather than leveraged through in-context learning (ICL). We compare two inference methods: Independent Scaling (IS), which performs memoryless resampling, and Reflection-Augmented Scaling (RAS), which conditions each new attempt on prior execution feedback via ICL. Across three Neo4j datasets and five code-specialized language models, RAS reduces the Query Execution Error Rate by 41–50% at n=5, outperforming IS at 32–38%. Execution errors are not merely failures to discard but actionable feedback, and structuring inference-time compute around them is a more efficient path to executability than scaling independent samples.
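
The difference between IS and RAS can be sketched as two retry loops that share a budget of n attempts and differ only in whether earlier errors reach the generator. This is a minimal sketch under stated assumptions: `generate` and `execute` are hypothetical callables standing in for a language model and a graph database, and the error string is invented rather than real Neo4j output.

```python
def independent_scaling(generate, execute, question, n):
    """Memoryless resampling: each attempt sees only the question."""
    for _ in range(n):
        query = generate(question, feedback=[])
        ok, _error = execute(query)
        if ok:
            return query
    return None

def reflection_augmented_scaling(generate, execute, question, n):
    """Each attempt is conditioned on earlier (query, error) pairs."""
    feedback = []
    for _ in range(n):
        query = generate(question, feedback=feedback)
        ok, error = execute(query)
        if ok:
            return query
        feedback.append((query, error))  # keep the error instead of discarding it
    return None

# Hypothetical stand-ins for a language model and a graph database
def toy_generate(question, feedback):
    # Pretend the model adds the missing RETURN clause once it has seen an error
    return "MATCH (n) RETURN n" if feedback else "MATCH (n)"

def toy_execute(query):
    if "RETURN" in query:
        return True, None
    return False, "syntax error: query ends without RETURN"  # invented message

print(independent_scaling(toy_generate, toy_execute, "list all nodes", 5))
print(reflection_augmented_scaling(toy_generate, toy_execute, "list all nodes", 5))
```

With a deterministic toy generator the independent loop never recovers, while the reflective loop succeeds on its second attempt. Real models sample stochastically, so the gap reported in the abstract is a matter of degree.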

[NLP-68] EVE-Agent: Evidence-Verifiable Self-Evolving Agents

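【Quick read】: Data-free self-evolving search agents can reward fluent but unsupported examples when no verifiable evidence is required, which makes the self-generated curriculum opaque and potentially unreliable. The paper argues that evidence verifiability is a prerequisite for trustworthy self-evolution: each generated instance must carry a source-grounded, verbatim evidence span, and an evidence verifier rewards that span by the marginal accuracy gain obtained when it is provided. EVE-Agent implements this principle as a modification of the proposer–solver framework, leaving the backbone model, retriever, search tool, and optimization framework unchanged, and requires no oracle answers, human labels, or external annotations. Experiments show substantially improved evidence-grounded correctness over prior self-evolving search agents, and the resulting curriculum is auditable by construction.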

Link: https://arxiv.org/abs/2605.22905
Authors: Yamato Arai, Yuma Ichikawa
Affiliations: Fujitsu Limited; The University of Tokyo; RIKEN center for AIP
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 23 pages, 2 figures

Abstract:Self-evolving agents should not train on examples they cannot justify. Data-free self-evolving search agents offer a scalable route to systems that generate their own questions, answer them, and improve from their own feedback without human annotations. Yet, without verifiable evidence, this loop can reward fluent but unsupported examples, turning the self-generated curriculum into an opaque and potentially unreliable training signal. We argue that evidence verifiability is a prerequisite for trustworthy self-evolution in search agents: each generated instance should include not only an answer but also a source-grounded span whose contribution to that answer can be measured. We introduce EVE-Agent, an Evidence-Verifiable Self-Evolving Agent that operationalizes this principle through a modification to the proposer–solver framework. The proposer generates a question, an answer, and a verbatim evidence span. An evidence verifier then rewards the span according to the marginal accuracy gain when the evidence is provided. This produces a training signal that favors evidence that genuinely helps answer the question, without requiring oracle answers, human labels, or external annotations. EVE-Agent leaves the backbone model, retriever, search tool, and optimization framework unchanged. Experiments show that EVE-Agent substantially improves evidence-grounded correctness over prior self-evolving search agents. The resulting curriculum is not merely self-generated but auditable by construction: each training example carries an inspectable source span that explains why it should be trusted.
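
The verifier's reward, the marginal accuracy gain when the evidence is provided, can be sketched as a difference between two accuracy estimates. The sketch below assumes a hypothetical `solve(question, context)` callable and a fixed number of trials. The sampling scheme and answer matching in EVE-Agent itself may differ.

```python
def evidence_reward(solve, question, answer, evidence, trials=8):
    """Marginal accuracy gain from providing the evidence span to the solver."""
    def accuracy(context):
        hits = sum(solve(question, context) == answer for _ in range(trials))
        return hits / trials
    return accuracy(evidence) - accuracy(None)

def toy_solver(question, context):
    # Hypothetical solver: answers correctly only when the evidence mentions the answer
    return "Paris" if context and "Paris" in context else "unknown"

question = "What is the capital of France?"
print(evidence_reward(toy_solver, question, "Paris", "Paris is the capital of France."))
print(evidence_reward(toy_solver, question, "Paris", "The Seine flows through the city."))
```

A span that does not help the solver earns no reward, which matches the abstract's point that the signal should favor evidence that genuinely helps answer the question.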

[NLP-69] Seeing without Looking: Do Vision-Language Benchmarks Really Test Vision? CVPR2026

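【Quick read】: This paper asks whether benchmark accuracy of vision-language models (VLMs) truly reflects reliance on visual evidence. It runs a systematic, multi-level analysis covering global visual degradation, localized occlusion, question reformulation, answer-space expansion, and decision-level analyses beyond standard accuracy, complemented by a layer-wise analysis of vision-token geometry. Although VLMs do use visual input, their predictions are less sensitive to the loss of fine-grained visual evidence than standard accuracy would suggest; even when the final prediction is unchanged, internal support for the correct answer may already be weakened. Increasing similarity among visual tokens in deeper layers offers a possible explanation. The results suggest that current benchmarks cannot reliably evaluate fine-grained visual grounding in VLMs.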

Link: https://arxiv.org/abs/2605.22903
Authors: Zixuan Lan, Luzhe Sun, Matthew R. Walter, Jiawei Zhou
Affiliations: University of Chicago; Toyota Technological Institute at Chicago; Stony Brook University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted to GRAIL-V: Grounded Retrieval and Agentic Intelligence for Vision-Language, CVPR 2026 Workshop. Accepted version

Abstract: Benchmark accuracy is often implicitly assumed to reflect grounded visual understanding in vision-language models (VLMs), yet it remains unclear to what extent such scores truly reflect reliance on visual evidence. Motivated by a surprising observation that removing a substantial fraction of image tokens degrades model performance only slightly on a widely used hallucination benchmark, we systematically investigate this mismatch in a set of open-source VLMs. Our analysis covers multiple levels of granularity, spanning global visual degradation, localized occlusion, question reformulation, answer-space expansion, and decision-level analyses beyond standard accuracy. We further complement these behavioral results with a layer-wise analysis of vision-token geometry. Throughout the experiments, we find that although VLMs do incorporate visual input, their predictions are less sensitive to the loss of fine-grained visual evidence than standard accuracy would suggest. Even when the final prediction remains unchanged, the model’s internal support for the correct answer may already be weakened. A representation-level analysis further shows increasing similarity among visual tokens in deeper layers, providing a possible explanation for our findings. Together, these results suggest that current benchmarks are not sufficient to reliably evaluate fine-grained visual grounding in VLMs.

[NLP-70] Transcoders Trace Visual Grounding and Hallucinations in Vision-Language Models

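【Quick read】: How generative vision-language models (VLMs) transform visual input into text remains poorly understood, and existing interpretability work based on sparse autoencoders (SAEs) decomposes only static residual representations, missing the functional updates that drive cross-modal interaction. The paper adopts a function-centric framework built on Transcoders, sparse approximations of MLP sublayers that act as a causal proxy for layer-wise computation, and decomposes the model into interpretable computational pathways linking image patches to directions in token generation. Under patch ablation, Transcoder attributions produce stronger and more stable effects on visually grounded tokens than SAE attributions and align better with semantically relevant image regions; a False Visual Grounding counterfactual analysis confirms the pathways' specificity. A logistic classifier over graph features extracted from Transcoder circuit traces predicts hallucinations at AUC 0.68, so the decomposition is predictive as well as interpretable.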

Link: https://arxiv.org/abs/2605.22902
Authors: Dimitrios Damianos, Leon Voukoutis, Georgios Skyrianos, Vassilis Katsouros, Georgios Paraskevopoulos
Affiliations: Athena Research Center
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract: Generative Vision-Language Models (VLMs) perform well on multimodal reasoning, but how visual inputs are transformed to text remains poorly understood. Existing interpretability work on VLMs uses Sparse Autoencoders (SAEs), which decompose static residual representations and miss the functional updates that drive cross-modal interaction. We adopt a function-centric framework based on Transcoders, sparse approximations of MLP sublayers that act as a causal proxy for layer-wise computation. Applied to Gemma 3-4B-IT, the framework decomposes the model into interpretable computational pathways linking image patches to directions in token generation. Transcoder attributions produce stronger and more stable effects on visually grounded tokens under patch ablation than SAE attributions, and align better with semantically relevant image regions. A False Visual Grounding counterfactual analysis confirms that the recovered pathways are specific to vision-language this http URL, we perform a structural analysis of hallucinated generations by extracting graph-based indicators from circuit traces produced by the transcoders. A logistic classifier over these mechanistic graph features predicts hallucinations at AUC 0.68. These results show that function-centric circuit decomposition yields interpretable and predictive accounts of multimodal computation in VLMs.

[NLP-71] ImProver 2: Iteratively Self-Improving LMs for Neurosymbolic Proof Optimization

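Quick read: This paper addresses the maintainability of verified proofs in formal mathematics libraries and the insufficient quality of training data for neural theorem provers. Proof optimization currently faces heterogeneous objectives, scarce data, and high training and inference costs, which make it hard to apply at scale. The key to the solution is ImProver 2, a neurosymbolic framework for Lean 4 whose core contributions are a data-efficient expert-iteration pipeline and a scaffold that combines formal structure with lightweight informal abstractions, together with a suite of metrics that capture structural properties of proofs. Experiments show that a 7B-parameter model trained with this framework outperforms much larger models from the same family on several metrics, and that the scaffold significantly improves both small and frontier models. This indicates that, with suitable scaffolding and training, small models can effectively restructure complex and varied research-level proofs, establishing proof optimization as a scalable, learnable task.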

Link: https://arxiv.org/abs/2605.22885
Authors: Riyaz Ahuja, Tate Rowney, Jeremy Avigad, Sean Welleck
Affiliation: Carnegie Mellon University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Comments:

Abstract:Formal mathematics libraries are rapidly expanding, creating a growing need to refactor verified proofs for maintainability and to improve training data quality for neural provers. However, scalable proof optimization is hindered by heterogeneous and heuristically specified objectives, scarce data, and high training and inference costs. To overcome these challenges, we introduce ImProver 2, a neurosymbolic framework for automated proof optimization in Lean 4. ImProver 2 combines a data-efficient expert-iteration pipeline with a scaffold that exposes formal structure alongside lightweight informal abstractions. We further introduce a suite of metrics capturing structural proof properties. Using ImProver 2, we train a 7B-parameter model that outperforms orders-of-magnitude larger models within the same model family, and is competitive with mid-tier frontier models across metrics. We additionally demonstrate that our neurosymbolic scaffold significantly improves performance across both small and frontier models. We show that with proper scaffolding and training, small models can effectively restructure research-level proofs over complex and varied metrics, matching substantially larger systems and establishing proof optimization as a scalable, learnable task.

[NLP-72] How Far Will They Go? Red-Teaming Online Influence with Large Language Models

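Quick read: This paper asks how to evaluate the expressive range of locally deployed open-source large language models (LLMs) on politically sensitive topics, and how that range changes after natural-language jailbreaks, in order to identify the risk of their use in political influence operations on social media. The key to the solution is an empirical red-teaming framework that quantifies an LLM's Overton Window (OW), the range of controversial political opinions a model can reliably generate. Systematic tests of more than 30 open-source LLMs from 10 model families and 5 countries reveal systematic asymmetries in political expression (such as a left-leaning bias), OWs that tend to contract inversely to model size, substantial regional differences, and jailbreak effectiveness that varies by model family. The framework provides an actionable methodological basis for auditing the political steerability of open-source LLMs and supports the design of defenses against LLM-driven influence operations.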

Link: https://arxiv.org/abs/2605.22880
Authors: Daniel C. Ruiz, Anna Serbina, Ashwin Rao, Emilio Ferrara, Luca Luceri
Affiliations: Information Sciences Institute; University of Southern California
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 30 pages, 8 figures, submitted to COLM 2026

Abstract:As large language model (LLM)-based agents increasingly participate in online discourse, red-teaming their capacity to support political influence campaigns is critical for information integrity. In pursuit of this goal, we focus on locally deployed open-source LLMs, as opposed to frontier API-only models, given their superior alignment with the operational constraints of privacy-conscious malicious actors deployed in social media environments. We introduce an empirical red-teaming framework for measuring LLM Overton Windows (OWs), defined as the range of political opinions a model can reliably express on controversial topics, and for quantifying how simple natural-language jailbreaks expand that range. We evaluate more than 30 LLMs spanning 10 model families and five countries of origin. We find systematic asymmetries in political expressivity: open-source LLMs are typically more willing to generate left-leaning social media content, OWs tend to contract inversely to model size, and regional differences are substantial despite uneven representation in the open-source ecosystem. Jailbreak potency also varies sharply across model families, motivating a workflow for identifying effective combinations of jailbreak techniques. Taken together, our results establish a practical framework for auditing the political steerability of open-source LLMs and for helping future researchers design stronger countermeasures against LLM-enabled influence campaigns.

[NLP-73] When Do LLMs Reason? A Dynamical Systems View via Entropy Phase Transitions

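Quick read: This paper addresses the diminishing and sometimes negative returns of chain-of-thought (CoT) reasoning in large language models (LLMs): on factual and open-ended tasks in particular, the performance gain often fails to offset the extra token cost. Its core contribution is to show that LLM reasoning is not a static property of tasks or models but a decoding state that evolves dynamically during generation, and that entropy dynamics in early decoding are a reliable signal of that state. The key solution is EDRM (Entropy Dynamics-based Reasoning Manifold), a lightweight, training-free routing framework that maps early decoding entropy trajectories into a compact, interpretable manifold and selects the inference strategy adaptively. A steady decline in entropy indicates a phase-transition-like shift from a high-entropy exploratory regime to a low-entropy structured reasoning regime, so CoT is enabled because it is likely to help; otherwise CoT is skipped to save resources. Across 15 benchmarks and 4 LLMs of different scales and architectures, EDRM outperforms static baselines: at the dataset level it cuts tokens by 41–55% while improving accuracy, and at the instance level it improves accuracy by up to 4.7% while keeping token savings of 27–45%, supporting the view that reasoning on demand is more efficient and adaptive than enabling CoT by default.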

Link: https://arxiv.org/abs/2605.22873
Authors: Wei Xia, Haoqing Wang, Zhi-Hong Deng, Yehui Tang
Affiliations: Samsung Research; Peking University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Chain-of-thought (CoT) reasoning has become the default strategy for enhancing LLM capabilities, yet its application raises a fundamental question: when is explicit reasoning actually beneficial? Empirical evidence reveals a striking paradox: CoT often provides marginal or even negative gains on factual and open-ended tasks while multiplying token consumption. In this work, we show that LLM reasoning is not a static property of tasks or models, but a dynamic decoding state that emerges during generation. Through systematic analysis, we find early-stage entropy dynamics provide a reliable signal of this state: tasks benefiting from CoT exhibit consistent entropy reduction, while others display unstable or increasing patterns. This behavior can be interpreted as a phase-transition-like shift from a high-entropy exploratory regime to a low-entropy structured reasoning regime. Based on these insights, we propose EDRM (Entropy Dynamics-based Reasoning Manifold), a lightweight and training-free routing framework that leverages early decoding entropy to adaptively select inference strategies. EDRM embeds entropy trajectories into a compact and interpretable manifold representation, enabling both zero-shot deployment and fine-grained instance-level adaptation. Across 15 benchmarks and 4 LLMs of varying scales and architectures, EDRM consistently outperforms static baselines. At the dataset level, EDRM achieves 41–55% token reduction while improving accuracy with as few as 50 calibration samples. At the instance level, it further improves accuracy by up to 4.7% while maintaining 27–45% token savings. These results suggest that reasoning should be invoked selectively rather than by default, and demonstrate the effectiveness of entropy-driven decoding control for efficient and adaptive LLM inference.
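
The routing idea in this abstract can be illustrated with a minimal sketch. It is not the paper's EDRM: the manifold embedding is replaced by a plain least-squares slope over early-step entropies, and the names `route` and `slope_threshold` (and its value) are hypothetical choices made here for illustration.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_slope(entropies):
    """Least-squares slope of entropy against decoding step index."""
    n = len(entropies)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2.0
    mean_y = sum(entropies) / n
    cov = sum((i - mean_x) * (h - mean_y) for i, h in enumerate(entropies))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var

def route(early_distributions, slope_threshold=-0.05):
    """Return "cot" if early entropy falls steadily, else "direct".

    slope_threshold is an assumed cutoff, not a value from the paper.
    """
    entropies = [token_entropy(p) for p in early_distributions]
    return "cot" if entropy_slope(entropies) < slope_threshold else "direct"
```

A caller would pass the next-token distributions of the first few decoding steps; a steadily sharpening sequence routes to CoT, while a flat or rising one routes to direct answering.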

[NLP-74] The Readout Shortcut: Positional Number Copying Dominates Arithmetic CoT Readout in Small Language Models

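Quick read: This paper asks whether the benefit of chain-of-thought (CoT) prompting on arithmetic tasks in small language models really comes from the order of the logical reasoning steps. It finds that CoT's main effect does not depend on correctly ordered steps; instead there is a positional shortcut in which the model tends to copy the last number in the reasoning as its answer, whether or not the intermediate steps are correct. The key to the solution is isolating the answer-readout stage via prefix completion and identifying this shortcut: when a copyable number sits before the answer delimiter, the model almost always outputs it (even on incorrect items, the final answer matches the last CoT number 95-96% of the time). Replacing that number with a wrong value collapses accuracy to near zero even when the intermediate steps are correct, whereas removing it recovers 5-32 percentage points above that floor. The copy channel takes precedence over completion from the retained context and is architecture-specific (Qwen, Llama, and Gemma behave differently). The results suggest that current step-level CoT evaluations may mistake positional answer transport for genuine computation and so overestimate model reasoning ability.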

Link: https://arxiv.org/abs/2605.22870
Authors: Ming Liu
Affiliation: Amazon
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 18 pages (8 main + 10 appendix), 3 figures, 5 tables

Abstract:Chain-of-thought (CoT) prompting is necessary for arithmetic in small language models, yet shuffling its steps preserves most performance. What does CoT contribute if not logical sequencing? In three 1-3B instruction-tuned LMs on GSM8K, we isolate the answer-readout stage via prefix completion and identify a positional shortcut: the model copies whichever number occupies the trailing position before the answer delimiter, regardless of intermediate reasoning. Gold-answer presence accounts for 54-92 pp of accuracy (89-92% of each model’s teacher-forcing ceiling); even on incorrect items, the final answer matches the last CoT number 95-96% of the time. The copy channel takes precedence over retained-context completion: replacing the trailing number with a wrong value collapses accuracy to near-zero despite correct intermediates, yet removing it recovers 5-32 pp above that floor; even single-step arithmetic the model can otherwise perform is suppressed when a copyable number is present. Qwen and Llama copy novel distractors 87-95% of the time; Gemma gates selectively. Head-level ablation implicates architecture-specific head sets; the effect replicates on GSM-Symbolic. On non-arithmetic BBH tasks, shuffle retention drops sharply; at 7-8B, content-selective gating emerges. Step-level faithfulness evaluations risk conflating positional answer transport with genuine computation, a failure mode for CoT-based oversight.
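
The "final answer matches the last CoT number" statistic can be approximated with a short sketch. This is an illustration under assumptions, not the paper's evaluation code: the record format, the regular expression, and the helper names `last_number` and `copy_rate` are all hypothetical.

```python
import re

# Matches integers and decimals, with an optional leading minus sign.
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")

def last_number(text):
    """Return the last number in text as a float, or None if there is none.

    Commas are stripped first, assuming they appear only as thousands
    separators (e.g. "1,250").
    """
    matches = NUMBER.findall(text.replace(",", ""))
    return float(matches[-1]) if matches else None

def copy_rate(records):
    """Fraction of records whose final answer equals the last CoT number.

    Each record is a (cot_text, final_answer_text) pair; records without a
    number on either side are skipped.
    """
    hits = total = 0
    for cot, answer in records:
        tail, final = last_number(cot), last_number(answer)
        if tail is None or final is None:
            continue
        total += 1
        hits += tail == final
    return hits / total if total else 0.0
```

Computing this rate separately on correct and incorrect items is one way to check whether a model's answer tracks the trailing number rather than the intermediate steps.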

[NLP-75] PrefBench: Evaluating Zero-Shot LLM Agents in Hidden-Preference Personalized Pricing Negotiations

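Quick read: This paper addresses the observation that, in personalized pricing negotiations, large language model (LLM) agents can produce plausible dialogue behavior and close many deals yet earn low profit because they fail to infer the buyer's hidden willingness to pay and bargaining traits. The key to the solution is PrefBench, a simulator-based benchmark for evaluating the pricing decisions of LLM agents when buyer preferences are hidden. Its core design is a structured state-summary protocol that restricts agents to strict JSON actions based only on public information (buyer persona descriptors, bundle configuration, and negotiation history), while the buyer's valuation, patience, counter-offer behavior, and walkaway decisions are unobservable latent variables. Experiments show that LLM agents follow the action protocol reliably and reach deal rates above 0.99, but their average profit is far below that of a simple concession heuristic, exposing the limits of current LLMs at profit-sensitive bargaining.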

Link: https://arxiv.org/abs/2605.22855
Authors: Yingjie Lei
Affiliation: Unknown
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 24 pages, 3 figures, 5 tables. Code is available at this https URL

Abstract:Personalized pricing negotiations are a challenging testbed for LLM agents because successful interaction does not guarantee profitable decision making. A seller may produce valid actions and close many deals while still pricing poorly when buyer willingness to pay and bargaining traits remain hidden. This paper presents PrefBench, a simulator-based benchmark for hidden-preference personalized pricing negotiations. Each episode pairs a simulated buyer with a fixed vehicle-customization bundle; the seller observes public persona descriptors, bundle information, and negotiation history, while latent buyer variables govern valuation, patience, counter-offer behavior, and walkaway decisions. PrefBench evaluates this setting through an LLM-facing state-summary protocol that constrains agents to return strict JSON actions under a fixed hidden-information boundary. We evaluate zero-shot LLM sellers against heuristic references over 7,500 episodes. The tested LLMs follow the protocol reliably and achieve deal rates above 0.99, but their seller-profit outcomes remain weak: the best LLM average profit is only slightly above the random baseline and far below a simple concession heuristic under the same episode stream. These results show that structured action compliance and agreement-seeking behavior can coexist with weak profit-sensitive bargaining. PrefBench provides a controlled benchmark for evaluating pricing-agent behavior under hidden buyer preferences.

[NLP-76] A Survey of Text and Speech Resources for Hausa and Fongbe: Availability, Quality, and Gaps for NLP Development

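Quick read: This paper addresses the severe imbalance in natural language processing (NLP) resources for two West African languages, Hausa and Fongbe. Its goal is to catalog the publicly available text and speech resources and to clarify their current state and gaps in size, domain coverage, format, licensing, and accessibility. The key to the solution is a broad search of academic repositories, data platforms, and web sources, followed by a structured catalog and assessment of parallel corpora, monolingual text collections, speech datasets, pre-trained models, and evaluation benchmarks. The survey finds that Hausa has broader text resources across news, encyclopedic, and educational domains, while Fongbe has limited text resources but has been the focus of recent speech data collection. Both languages are covered by Masakhane benchmarks for named entity recognition (NER) and part-of-speech (POS) tagging, and the paper identifies domain-diverse Fongbe text and dedicated Hausa speech corpora as priority gaps.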

Link: https://arxiv.org/abs/2605.22828
Authors: Mahounan Pericles Adjovi, Victor Olufemi, Roald Eiselen, Prasenjit Mitra
Affiliation: Unknown
Categories: Computation and Language (cs.CL)
Comments: 8 pages, 7 tables; survey paper; to appear in IEEE SDS 2026

Abstract:This survey provides a comprehensive catalog of publicly available text and speech resources for two West African languages: Hausa, an Afroasiatic language with approximately 80-100 million speakers, and Fongbe, a Niger-Congo language spoken by approximately 2 million people in Benin. These languages represent contrasting cases on the resource availability spectrum. We address the question: What is the current state of publicly available NLP resources for Hausa and Fongbe, and what gaps remain? Through systematic search of academic repositories, data platforms, and web sources, we catalog parallel corpora, monolingual text collections, speech datasets, pre-trained models, and evaluation benchmarks. For each resource, we document size, domain coverage, format, licensing, and accessibility. Our findings reveal that Hausa benefits from broader text resource diversity across news, encyclopedic, and educational domains. Fongbe, while having more limited text resources, has been the focus of recent academic speech data collection initiatives. Both languages are represented in Masakhane benchmarks for NER and POS tagging. We provide task-specific recommendations and identify priority gaps including domain-diverse Fongbe text and dedicated Hausa speech corpora.

[NLP-77] Leveraging Large Language Models for Sentiment Analysis: Multi-Modal Analysis of Decentraland's MANA Token

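Quick read: This paper asks how combining community sentiment signals with multi-modal financial data can improve cryptocurrency price prediction in virtual world economies. The key to the solution is a multi-modal forecasting framework that pairs a BERT-based sentiment analysis model with a long short-term memory (LSTM) network. Besides historical prices, the framework integrates sentiment scores from the Decentraland Discord community, trading volume, and market capitalization, and it significantly outperforms a price-only baseline, supporting the value of community-derived signals for virtual economy forecasting.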

Link: https://arxiv.org/abs/2605.20192
Authors: Xintong Wu, Peiting Tsai, Jing Yuan, Michael Yu, Greg Sun, Luyao Zhang
Affiliations: Duke Kunshan University; Microsoft China
Categories: Computation and Language (cs.CL); Computational Engineering, Finance, and Science (cs.CE); Cryptography and Security (cs.CR); Computers and Society (cs.CY); Computational Finance (q-fin.CP)
Comments:

Abstract:Decentraland, a decentralized virtual reality platform operating within the expanding Metaverse ecosystem, utilizes its native MANA token to facilitate virtual asset transactions and governance. This study investigates the integration of Discord community sentiment with multi-modal financial data to enhance cryptocurrency price prediction within virtual world economies. We address: (1) identifying sentiment patterns within Decentraland’s Discord community, and (2) evaluating the impact of multi-modal features on token return forecasting. Using a BERT-based large language model for sentiment analysis, we develop two LSTM architectures: a baseline incorporating historical prices and a multi-modal variant integrating sentiment scores, trading volume, and market capitalization. Results indicate predominantly neutral community sentiment with a positive skew. The multi-modal model significantly outperforms the price-only baseline in prediction accuracy. These findings demonstrate the predictive value of community-derived signals for virtual economy forecasting and establish a foundation for future research at the intersection of immersive virtual environments, natural language processing, and cryptocurrency market analysis.

Information Retrieval

[IR-0] TubiFM: Unified Item Carousel and Search Ranking for Streaming Discovery

Link: https://arxiv.org/abs/2605.23702
Authors: Alexandre Salle, Chenglei Niu, Suchismit Mahapatra, Xiaoxiao Chen, Suvash Sedhain, Yaqi Wang, Shervin Shahryari, Saurabh Agrawal, Qiang Chen, Michael Tamir
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:Personalized discovery systems often train separate models for item ranking, carousel ranking, and search, even though these tasks expose complementary signals from the same viewer journey: watches shape carousel and item ranking, search queries reveal intent even when they do not lead to a catalog match, and watch history helps interpret search as rewatching, continuation, or new discovery. We introduce the user story, a serialized representation that turns a user’s cross-surface history (attributes, sessions, watch events with surface and carousel context, and search events) into a single token sequence. By interleaving pretrained language tokens with domain-specific event tokens, user stories let heterogeneous recommendation and search tasks be expressed as prompted next-token prediction over a shared grammar. TubiFM is one instantiation of this approach: a Llama 3.2 1B-based model trained on user stories and prompted to rank items, carousels, or search results without task-specific architectures. In offline evaluation, this single model outperforms specialist baselines across item, carousel, and search ranking. In online A/B tests, TubiFM significantly improves search total viewing time (TVT) by +3.9% and carousel TVT by +0.30%. Item ranking is statistically neutral on TVT (+0.14%), but matches a mature production stack; across all three tasks, TubiFM serves on L40S GPUs and reduces p99 ranking latency from 500ms to 200ms. These results show that shared user stories can improve discovery while simplifying ranking systems.

[IR-1] Synthetic Sources?: Auditing Generative Search Engine Citations for Evidence of AI-Generated Sources

Link: https://arxiv.org/abs/2605.23684
Authors: Mowafak Allaham, Nicholas Diakopoulos
Categories: Information Retrieval (cs.IR); Computers and Society (cs.CY)
Comments: 11 pages + Appendix

Abstract:The growing accessibility of Large Language Models via conversational interfaces capable of responding to users’ questions by drawing on, synthesizing, and citing information from the web (i.e., Generative Search Engines) has simplified the information-seeking process for users. However, with the proliferation of AI-generated content on the web, it is unclear whether these engines can reliably omit citing synthetic sources (i.e., AI-generated sources). Should these engines be unable to do so, this puts users at risk of harm by treating information from AI-generated sources synthesized in responses of generative search engines as equivalent to information from authoritative or official sources. In a step towards identifying whether AI-generated sources are being cited by these engines, this work presents an audit of four generative search engines (ChatGPT, Copilot, Gemini, Perplexity) using a total of 712 real-world human-generated queries spanning domains of public importance: politics, health, and the environment. Our findings show evidence of AI-generated sources being cited across all four generative search engines (~16% of cited sources) and identify the key web domains these sources belong to that are frequently cited across these engines and topics. In addition, we observed that generative search engines include a somewhat narrow set of repeatedly cited domains while predominantly surfacing a large number of minimally cited domains in responses to users’ queries. These findings contribute to the growing body of work on assessing the risks of generative search engines with the objective of increasing public awareness of their limitations and encouraging appropriate measures to improve information quality and governance of these systems.

[IR-2] Tracking a Decade of Research at the University of Nigeria Nsukka: A Scientometric Analysis (2014-2023)

Link: https://arxiv.org/abs/2605.23586
Authors: Muneer Ahmad, Joseph U Igligli
Categories: Digital Libraries (cs.DL); Information Retrieval (cs.IR)
Comments: 16 pages, 4 figures, Research Article

Abstract:This study employs scientometric methods to assess the research output and performance of the University of Nigeria from 2014 to 2023. By analyzing publication trends, citation patterns, and collaboration networks, the research aims to comprehensively evaluate the university’s research productivity, impact, and disciplinary focus. These research endeavors are characterized by innovation, interdisciplinary collaboration, and commitment to excellence, making the University of Nigeria a significant hub for cutting-edge research in Nigeria and beyond. The present study has been undertaken to determine the impact of the university’s research and publication trends from 2014 to 2023. The study focuses on year-wise research output, citation impact at local and global levels, prominent authors and their total output, top journals, collaborating countries, and the most contributing departments of the University of Nigeria. The university’s ten years of publication data indicate that 6,353 papers were published from 2014 to 2023, receiving 86,202 citations with an h-index of 39. In addition, the stenographical mapping of data is presented through graphs using the VOSviewer software mapping technique. The findings of this study will contribute to understanding the university’s research strengths, weaknesses, and potential areas for improvement. Additionally, the results will inform evidence-based decision-making for enhancing research strategies and policies at the University of Nigeria.

[IR-3] HARNESS-LM: A Three-Phase Training Recipe for Harnessing SLMs in Sponsored Search Retrieval

Link: https://arxiv.org/abs/2605.23572
Authors: Vipul Gupta, Shikhar Mohan, Lakshya Kumar, Pranjal Chitale, Nikit Begwani, Amit Singh, Manik Varma
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages, 3 figures, 10 tables

Abstract:In the competitive landscape of sponsored search, balancing retrieval quality with production latency is a critical challenge. While large retrieval models based on Small Language Models (SLMs) such as Qwen3-Embedding-4B/8B set strong upper bounds on public benchmarks, their deployment in high-throughput, latency-sensitive environments remains impractical. In this paper, we present HARNESS-LM (HLM), a three-phase training framework for transferring the capabilities of large-scale retrievers into compact, cost-efficient models. The approach comprises: (1) training a high-performance reference (“teacher”) retriever by fine-tuning a billion-parameter-scale SLM; (2) aligning query representations via an L2 objective to distill knowledge into a sub-600M parameter student encoder; and (3) applying a final contrastive refinement stage to optimize the student for retrieval performance. We also present a comprehensive empirical study of key design choices, including alignment objectives, embedding dimensionality, model scale, architecture, and optimization strategies, to identify configurations that are most effective in production settings. On a real-world Bing Ads evaluation benchmark, HLM recovers over 98% of the reference retriever’s precision across multiple settings, while delivering up to 27x lower online query-encoder latency and 20x higher throughput on NVIDIA A100 GPUs. Online A/B testing on Bing Ads further shows a +1% Revenue, +0.6% Impression, and +0.4% Click uplift over the current ensemble of retrievers running in production with the deployed 190M parameter model, clearly highlighting the practical efficacy of the HLM recipe in a real-world sponsored search setting.

[IR-4] Is Dimensionality a Barrier for Retrieval Models?

Link: https://arxiv.org/abs/2605.23556
Authors: Kiril Bangachev, Guy Bresler, Jonathan Kogan, Yury Polyanskiy
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR); Combinatorics (math.CO)
Comments:

Abstract:Why does the low dimensionality of representations, typically $d \approx 1000$, not prevent modern embedding-based retrieval models from scaling to billions, or even trillions, of data points? To answer this question, we study maximal-margin embeddings in the following retrieval model, classically studied in communication complexity [PS86] and more recently in embedding-based retrieval [WBNL26]. Let $A \in \{0,1\}^{N \times n}$ be a matrix indicating whether each of $N$ queries is relevant to each of $n$ documents. We are interested in the largest margin $m > 0$, denoted by $\mathsf{m}^{\mathsf{rd}}(d, A)$, for which there exist unit norm embeddings of the queries and documents $\{U_j\}_{j=1}^{N}, \{V_i\}_{i=1}^{n}$ with the following property: $\langle U_j, V_i \rangle \ge m$ whenever $A_{ji} = 1$ and $\langle U_j, V_i \rangle \le -m$ otherwise. A large margin is a key proxy for representation quality: it controls both robustness to perturbations and compositional generalization across queries. Our main theorem establishes that the best possible margin without a restriction on the dimension, $\mathsf{m}^{\mathsf{rd}}(+\infty, A)$, can be nearly achieved in dimension $d = O(\mathsf{m}^{\mathsf{rd}}(+\infty, A)^{-2} \log n)$, which improves a theorem of [BDES02]. Together with a matching lower bound in Theorem 1.5, we conclude that when $A \in \{0,1\}^{\binom{n}{k} \times n}$ is the matrix containing all possible $k$-sparse rows once, dimension $d = O(k \log(n/k))$ is necessary and sufficient for the maximal possible margin $\mathsf{m}^{\mathsf{rd}}(+\infty, A) = \Theta(k^{-1/2})$ in this setting. This fully resolves the setup of [WBNL26]. We also give several constructions for large margins when $d = o(k \log(n/k))$. Finally, we empirically test the InfoNCE and sigmoid losses for producing large margin embeddings and demonstrate a clear advantage of the sigmoid loss.
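
The margin condition in this abstract can be checked numerically for a given set of embeddings. The sketch below only evaluates the margin that fixed embeddings achieve; it does not search for the optimal embeddings, and the function names are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def margin(A, U, V):
    """Largest m with <U_j, V_i> >= m if A[j][i] == 1 and <= -m otherwise.

    U holds one embedding per query (row of A) and V one per document
    (column of A). A non-positive result means at least one pair lies on
    the wrong side of zero.
    """
    U = [normalize(u) for u in U]
    V = [normalize(v) for v in V]
    worst = math.inf
    for j, u in enumerate(U):
        for i, v in enumerate(V):
            dot = sum(a * b for a, b in zip(u, v))
            signed = dot if A[j][i] == 1 else -dot
            worst = min(worst, signed)
    return worst
```

For the 2×2 identity relevance matrix, antipodal embeddings in two dimensions already reach the maximal margin of 1, since relevant pairs coincide and irrelevant pairs point in opposite directions.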

[IR-5] TPMM-DPO: Trajectory-aware Preference-guided Model Merging for Iterative Direct Preference Optimization

Link: https://arxiv.org/abs/2605.23398
Authors: Lingling Fu, Yongfu Xu
Categories: Information Retrieval (cs.IR)
Comments: 11 pages, 6 figures

Abstract:Direct Preference Optimization (DPO) has been widely adopted for large language model alignment due to its simple training procedure and lack of an explicit reward model. However, in iterative DPO, when the policy model from the previous iteration is repeatedly used as the reference model for subsequent rounds, noise in preference data and errors in the reference model accumulate over time. This accumulation can lead to late-stage over-optimization, performance fluctuations, and degraded generalization. To address these issues, we propose TPMM-DPO, a trajectory-aware preference-guided model merging method. The method treats the sequence of policy models generated during iterative DPO as an optimization trajectory and adaptively integrates them using learned fusion weights, thereby constructing a smoother and more robust reference model. In contrast to conventional iterative DPO, which relies solely on a single previous model, TPMM-DPO effectively mitigates error accumulation induced by noisy preferences and improves training stability. Experimental results show that standard iterative DPO often suffers from performance degradation in the middle and later stages of training, whereas TPMM-DPO consistently improves generation quality and achieves higher win rates and reward scores on both in-domain and out-of-domain evaluations. Further ablation studies and robustness analyses demonstrate that, compared with simple averaging, learnable-weight fusion more effectively alleviates late-stage performance degradation caused by noisy preferences.
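
Merging a trajectory of policy checkpoints into one reference model can be sketched as a convex combination of parameters. The abstract does not say how the fusion weights are parameterized or learned, so the softmax over one scalar score per checkpoint used here is an assumption, as are the names `merge_checkpoints` and `logits`.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def merge_checkpoints(checkpoints, logits):
    """Convex combination of checkpoints, one fusion weight per checkpoint.

    checkpoints: list of dicts mapping parameter name -> list of floats,
                 all with identical keys and shapes.
    logits: unnormalized fusion scores (hypothetical stand-in for the
            learned weights described in the abstract).
    """
    weights = softmax(logits)
    merged = {}
    for name in checkpoints[0]:
        size = len(checkpoints[0][name])
        merged[name] = [
            sum(w * ckpt[name][k] for w, ckpt in zip(weights, checkpoints))
            for k in range(size)
        ]
    return merged
```

Equal scores reduce this to simple averaging, the baseline the abstract compares against; unequal scores let later or cleaner checkpoints dominate the merged reference.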

[IR-6] Towards Generalizable and Efficient Large-Scale Generative Recommenders

Link: https://arxiv.org/abs/2605.23312
Authors: Qiuling Xu, Ko-Jen Hsiao, Moumita Bhattacharya
Categories: Information Retrieval (cs.IR)
Comments: First published on the Netflix Tech Blog: this https URL

Abstract:Generative recommendation models can model user behavior as sequences of events and provide a shared backbone for multiple recommendation tasks. In production, however, pre-training gains do not automatically translate into downstream application improvements: task headroom, repeated-training cost, serving latency, and item freshness all affect transfer. We describe our experience scaling a generative recommender from 2M to 1B backbone parameters, excluding embedding and decoding layers, in a production-scale title recommendation setting. Across multiple downstream tasks, we observe task-dependent scaling behavior: some tasks approach an empirical ceiling within the observed scale range, while others continue to benefit from additional capacity. This motivates using offset scaling-law fits as a diagnostic for where additional model scale may be more or less useful. We then study production constraints that arise when applying the model in practice. Frequent retraining over trillions of behavior tokens makes training and decoding efficiency important; cached serving can make the immediate next-token target stale; and newly launched titles may need to be scored from semantic metadata before collaborative ID embeddings are reliable. We address these issues with multi-token prediction for serving-latency alignment, sampled softmax and a projected decoding head for efficient repeated training, and semantic item towers with collaborative-embedding masking for cold-start adaptation. In a one-week production-shadow evaluation over 1M users, the 1B-backbone model achieves higher MRR than the 2M-backbone baseline across all reported tasks. Overall, the results support treating model scale as one component of a production transfer problem, alongside task headroom, decoding cost, serving-latency alignment, and item generalization. 

[IR-7] From Head to Tail: Asymmetric Knowledge Transfer in Long-tail Recommendation with Generative Semantic IDs

Link: https://arxiv.org/abs/2605.23310
Authors: Chenyi Yan, Ruocong Tang, Xing Fang, Yang Huang, He Guo, Jing Wang
Categories: Information Retrieval (cs.IR)
Comments: 5 pages, 1 figure

Abstract:Long-tail recommendation in real-world e-commerce platforms remains challenging due to severe data imbalance. Existing methods often struggle to combine content-based multimodal features with collaborative signals. Many of these methods also ignore an important asymmetry in knowledge transfer between head and tail IDs: noisy signals from tail IDs can hurt representation learning for head IDs. This paper presents AKT-Rec, a framework for Asymmetric Knowledge Transfer in long-tail Recommendation that uses LLM-generated semantic IDs. AKT-Rec uses Multimodal LLMs (MLLMs) with supervised fine-tuning to align content representations with collaborative information for both items and users, producing semantic representations. It then discretizes these representations into semantic IDs with a Residual-Quantized VAE (RQ-VAE), which yields semantic clusters of similar entities. AKT-Rec has two main components: (1) Cluster-Guided Adaptive Embedding, which decomposes each ID representation into a cluster-level embedding that captures shared semantics and an individual embedding. Through an asymmetric contrastive objective and an activity-aware gating mechanism, this module directs knowledge transfer from head to tail IDs. (2) Hierarchical Feature Aggregation, which builds parallel feature views and adaptively fuses them to optimize predictions for samples with varying activity levels. Extensive experiments on a large-scale industrial dataset and online A/B testing on the Alibaba Tmall platform demonstrate the effectiveness of AKT-Rec. AKT-Rec improves offline performance by 0.35% in AUC and 1.53% in GAUC, outperforming several competitive baselines. In online A/B testing, AKT-Rec achieves a 2.76% increase in CTR and a 3.47% increase in GMV, validating its utility in real-world production environments.

[IR-8] Expand More Shrink Less: Shaping Effective-Rank Dynamics for Dense Scaling in Recommendation KDD2026 KDD

Link: https://arxiv.org/abs/2605.23191
Authors: Guoming Li, Shangyu Zhang, Junwei Pan, Wentao Ning, Jin Chen, Gengsheng Xue, Chao Zhou, Shudong Huang, Haijie Gu, Menglin Yang
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR); Numerical Analysis (math.NA)
Comments: Accepted at the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (Research Track), KDD 2026 February Cycle

Abstract:Scaling recommendation models is a central challenge in recommender systems. Recently, RankMixer has emerged as an effective solution, operating on a unified token representation and alternating between token mixing and per-token feedforward networks (P-FFNs) to achieve scalable performance. However, RankMixer suffers from embedding collapse, where learned representations have low effective rank, limiting expressivity and underutilizing the expanded representation space. Through empirical analysis and theoretical insights, we identify rigid token mixing and P-FFN modules as the primary causes of this phenomenon, jointly inducing a damped oscillatory trajectory in effective-rank evolution across layers. To address it, we propose RankElastor, a novel architecture that produces spectrum-robust representations with provable collapse mitigation. RankElastor introduces two components: (i) parameterized full mixing, which enables expressive token mixing with improved spectral robustness; and (ii) GLU-improved P-FFNs, which stabilize representation spectra through GLU-style FFN modules. Extensive experiments on large-scale industrial datasets demonstrate that RankElastor consistently improves recommendation performance, mitigates embedding collapse, and exhibits robust scaling behavior. Code is available at this GitHub repository: this https URL
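
To make "low effective rank" concrete, here is a small sketch of one common definition: the exponential of the Shannon entropy of the normalized singular values. The abstract does not say which definition the paper uses, so this choice is an assumption, as is the function name `effective_rank`. The function takes singular values that have already been computed (for example with an SVD routine) so that it stays dependency-free.

```python
import math
from typing import Sequence


def effective_rank(singular_values: Sequence[float]) -> float:
    """Entropy-based effective rank: exp(H(p)) with p_i = s_i / sum(s).

    Assumption: this is one widely used definition, not necessarily the paper's.
    Zero singular values contribute nothing to the entropy and are skipped.
    """
    positive = [s for s in singular_values if s > 0]
    total = sum(positive)
    if total == 0:
        return 0.0
    entropy = -sum((s / total) * math.log(s / total) for s in positive)
    return math.exp(entropy)
```

A flat spectrum gives an effective rank equal to the number of nonzero singular values, while a spectrum dominated by a single value gives a result close to 1, which is the collapsed regime the abstract describes.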

[IR-9] Building a privacy-preserving Federated Recommender system for mobile devices

Link: https://arxiv.org/abs/2605.22924
Authors: Aasheesh Singh
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Comments: Thesis, Université de Montréal, Department of Computer Science and Operations Research, 2024

Abstract:Serving personalized content on mobile devices has traditionally required pooling sensitive user data on centralized servers, a practice increasingly at odds with modern privacy expectations and geographical regulations. We present a two-stage federated recommendation system pipeline for mobile devices, built around a principled separation between non-sensitive user preference data and sensitive mobile context data that never leaves the device. The first stage runs a collaborative filtering model on non-sensitive app-context data in the cloud to generate a shortlist of relevant items. The second stage re-ranks these candidates on-device using sensitive mobile signals, with only model updates/gradients ever leaving the device. We validate the approach on MovieLens, UCI Human Activity Recognition, and a proprietary pilot dataset, and deliver a production-ready implementation as a Kotlin Multiplatform library deployable on Android and iOS.

[IR-10] AI-Friendly LaTeX: Using LaTeX Code as a Knowledge Source for Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2605.22923
Authors: Tom Verhoeff
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: 19 pages, 3 figures

Abstract:Large language models can answer questions about textbooks, lecture notes, and programming exercises more reliably when their answers are grounded in an explicit knowledge source. Retrieval-augmented generation (RAG) is a common approach: relevant fragments of a document are retrieved and inserted into the model context before answering. For mathematical and technical material, the original LaTeX source can be a better starting point than a PDF, because it contains structural information, labels, sectioning commands, macros, and authorial intent that are often lost or distorted in PDF extraction. However, LaTeX source is not automatically AI-friendly. Cross-references must be resolved, custom macros must be interpreted, exercises and examples must be identified, and author-supplied semantic metadata may be needed. This article describes a focused preprocessing approach for turning LaTeX source, together with its compiled auxiliary files and optional author annotations, into Markdown and JSONL chunks suitable for indexing in a vector database.
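
As a rough illustration of the kind of preprocessing the abstract describes, the sketch below splits LaTeX source at sectioning commands and emits one JSON object per chunk (JSONL). It is not the article's pipeline: the function name `latex_to_jsonl`, the record fields, and the restriction to `\section` and `\subsection` are assumptions made for the example. Macro expansion, auxiliary files, cross-reference resolution, and Markdown conversion are all left out.

```python
import json
import re

# Only two sectioning levels are handled in this sketch.
SECTION_RE = re.compile(r"\\(section|subsection)\{([^}]*)\}")
LABEL_RE = re.compile(r"\\label\{([^}]*)\}")


def latex_to_jsonl(source: str) -> str:
    """Split LaTeX source into one JSON line per section or subsection.

    Each record keeps the heading level, the title, any labels found in the
    chunk, and the chunk text with the label commands stripped out.
    """
    matches = list(SECTION_RE.finditer(source))
    lines = []
    for i, match in enumerate(matches):
        # A chunk runs from the end of its heading to the start of the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(source)
        body = source[match.end():end]
        record = {
            "level": match.group(1),
            "title": match.group(2),
            "labels": LABEL_RE.findall(body),
            "text": LABEL_RE.sub("", body).strip(),
        }
        lines.append(json.dumps(record))
    return "\n".join(lines)
```

Keeping the labels in each record is what would later allow a cross-reference in one chunk to be linked to the chunk that defines it.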

[IR-11] SciAtlas: A Large-Scale Knowledge Graph for Automated Scientific Research

Link: https://arxiv.org/abs/2605.22878
Authors: Shuofei Qiao, Yunxiang Wei, Jiazheng Fan, Bin Wu, Busheng Zhang, Mengru Wang, Yuqi Zhu, Ningyu Zhang, Keyan Ding, Qiang Zhang, Huajun Chen
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Ongoing Work

Abstract:The exponential growth of global academic output has confronted researchers and AI agents with an unprecedented "information explosion," where fragmented and unstructured knowledge organization impedes deep interdisciplinary integration. Current academic retrieval tools predominantly rely on superficial keyword matching or vector-space semantic retrieval, which lack the topological reasoning capabilities required to navigate complex logical connections. Agentic deep-research-based frameworks are often prone to logical hallucinations and incur high inference costs. To bridge this gap, in this report, we introduce SciAtlas, a large-scale, multi-disciplinary, heterogeneous academic resource knowledge graph designed as a panoramic scientific evolution network. By integrating over 43M papers from 26 disciplines, and a total of 157M entities and 3B triplets, SciAtlas provides a structured topological cognitive substrate that dismantles disciplinary barriers and furnishes AI agents with a global perspective. Furthermore, we develop a neuro-symbolic retrieval algorithm featuring tri-path collaborative recall and graph reranking, achieving a seamless transition from simple semantic matching to deterministic association discovery. We also present key application directions of SciAtlas, including literature review, automated research trend synthesis, idea positioning, and academic trajectory exploration, to demonstrate that SciAtlas can serve as an effective "cognitive map" to empower the full loop of automated scientific research while significantly reducing reasoning costs. We have released the interfaces for KG retrieval and various downstream tasks in our GitHub repo.

[IR-12] Knowledge Distillation for Low-Resource Open-source Text-to-SQL Model

Link: https://arxiv.org/abs/2605.22843
Authors: Tianhao Qiu, Xiaojun Chen
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 17 pages, 5 figures

Abstract:Text-to-SQL converts natural language questions into executable SQL queries, enabling non-technical users to access relational databases for analytics and intelligent data services. In real-world scenarios, performance is often constrained by low-resource settings, where high-quality annotated (question, SQL) pairs are scarce, particularly for domain-specific databases. Additional challenges include opaque schema definitions, abbreviations, and implicit business logic that are not explicitly encoded in the schema. Existing data synthesis and prompting techniques improve coverage but often fail to produce task-specific, semantically grounded examples aligned with database constraints. To address these challenges, we propose a knowledge-aware Text-to-SQL framework that constructs a task-specific knowledge base including schema semantics, abbreviations, business logic, and query patterns, and injects them into both training and inference. This framework generates diverse, contextually grounded synthetic training data and enhances inference through targeted knowledge retrieval. Experiments on seven benchmarks, covering both general and domain-specific datasets, demonstrate that our approach substantially improves the performance of open-source and closed-source large language models in Text-to-SQL tasks, especially in low-resource domain-specific settings, enhancing generalization, robustness, and adaptability.

[IR-13] Query-Adaptive Semantic Chunking for Retrieval-Augmented Generation: A Dynamic Strategy with Contextual Window Expansion

Link: https://arxiv.org/abs/2605.22834
Authors: Mudit Rastogi
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Abstract:Retrieval-Augmented Generation (RAG) systems depend critically on document chunking quality for retrieving relevant context. Fixed chunking segments documents into uniform units irrespective of semantics or user intent, producing a precision-recall trade-off unresolvable by tuning chunk size alone. Semantic and agentic methods partially address these limitations but do not integrate user queries at the chunking stage. We present Query-Adaptive Semantic Chunking (QASC), which dynamically constructs chunks by integrating queries into segmentation through three mechanisms: cosine similarity scoring between sentence and query embeddings to identify seed sentences, contextual window expansion around seeds to preserve coherence, and chunk-level score aggregation to ensure holistic relevance. We evaluate QASC on 100 technical documents across 200 queries spanning four types, comparing against fixed chunking at five granularities, recursive splitting, semantic chunking, and agentic chunking. QASC achieves an F1-score of 0.85, a relative improvement of 18-27% over fixed chunking and 8-12% over semantic and agentic alternatives. Ablation studies confirm each component contributes meaningfully. Human evaluation by three annotators (Cohen kappa = 0.82) corroborates that QASC produces more relevant and coherent chunks than existing methods.
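
The three mechanisms named in the abstract (seed selection by cosine similarity, window expansion around seeds, and chunk-level score aggregation) can be sketched as follows. This is a minimal illustration under stated assumptions, not the QASC implementation: the function names, the `seed_threshold` and `window` parameters, the merging of overlapping or adjacent windows, and the use of the mean as the aggregation rule are all choices made for the example. Embeddings are passed in as plain lists so that no embedding model is assumed.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def query_adaptive_chunks(
    sentence_vecs: Sequence[Sequence[float]],
    query_vec: Sequence[float],
    seed_threshold: float = 0.8,  # hypothetical value
    window: int = 1,              # hypothetical value
) -> List[Tuple[int, int, float]]:
    """Return (start, end, score) chunks, inclusive indices, best score first."""
    sims = [cosine(v, query_vec) for v in sentence_vecs]
    # 1. Seed sentences: those similar enough to the query.
    seeds = [i for i, s in enumerate(sims) if s >= seed_threshold]
    # 2. Expand a window around each seed; merge windows that touch or overlap.
    spans: List[Tuple[int, int]] = []
    for i in seeds:
        start, end = max(0, i - window), min(len(sims) - 1, i + window)
        if spans and start <= spans[-1][1] + 1:
            spans[-1] = (spans[-1][0], max(spans[-1][1], end))
        else:
            spans.append((start, end))
    # 3. Aggregate sentence scores per chunk (mean is an assumption).
    scored = [(s, e, sum(sims[s:e + 1]) / (e - s + 1)) for s, e in spans]
    return sorted(scored, key=lambda chunk: chunk[2], reverse=True)
```

Because chunk boundaries depend on the query, the same document yields different chunks for different queries, which is the contrast with fixed-size chunking that the abstract draws.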

[IR-14] RAG4Outcome: A Retrieval-Augmented Multimodal Framework for Prognostic Prediction in Chronic Osteomyelitis

Link: https://arxiv.org/abs/2605.22833
Authors: Daqian Shi, Pei Han, Jishizhan Chen, Yang Wang, Xiaolei Diao, Xianyou Zheng, Pengfei Cheng
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Chronic osteomyelitis presents substantial prognostic challenges due to its high recurrence risk and complex postoperative recovery trajectories. Traditional assessment often relies on manual scoring systems, which limit scalability, efficiency, and consistency in clinical practice. Furthermore, the heterogeneous nature of clinical data poses challenges for current multimodal learning approaches that require aligned inputs and large annotated datasets. In this work, we propose RAG4Outcome, a retrieval-augmented generation (RAG) framework for prognostic prediction in chronic osteomyelitis. Our method integrates multimodal clinical data, including PET-CT imaging reports, structured surgical and diagnostic records, and unstructured follow-up notes, into a unified prediction pipeline. By combining a domain-specific retrieval corpus with expert-guided prompting, the framework enables more interpretable, evidence-grounded, and clinically reliable prognosis. Preliminary results on real-world cases demonstrate promising effectiveness and clinical alignment, highlighting the potential of RAG4Outcome for AI-assisted infection management and postoperative decision support.

[IR-15] LFRAG: Layout-oriented Fine-grained Retrieval-Augmented Generation on Multimodal Document Understanding

Link: https://arxiv.org/abs/2605.22829
Authors: Yifan Zhu, Yu Mi, Yue Lu, Yanchu Guan, Zhixuan Chu
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:Multimodal Retrieval-Augmented Generation (RAG) has emerged as an effective paradigm for enhancing Large Language Models (LLMs) with external knowledge. However, existing multimodal RAG systems predominantly rely on coarse-grained page-level retrieval, which fails to capture fine-grained semantic and layout structures in visually rich documents, thereby compromising retrieval accuracy and leading to redundant context in downstream tasks. To address these issues, we propose Layout-oriented Fine-grained Retrieval-Augmented Generation (LFRAG), a novel framework that advances multimodal RAG from page-level to block-level retrieval. We perform layout segmentation to construct semantically coherent fine-grained retrieval units and design a semantic-layout fusion encoder that integrates local semantics with global context via cross-attention. With block-level late interaction retrieval, LFRAG enables precise query-content alignment and reduces irrelevant content for downstream generation. To enable rigorous evaluation, we construct LFDocQA, a large-scale benchmark with block-level annotations spanning diverse document types, designed to assess both multimodal document retrieval and question answering with greater granularity than existing datasets. Extensive experiments on LFDocQA demonstrate that LFRAG achieves state-of-the-art performance on retrieval tasks, outperforms the best baseline by 7.20% in answer accuracy, and reduces token consumption by 73.07% in generation tasks, confirming LFRAG as an accurate and efficient framework for multimodal RAG over visually rich documents. Our code and datasets will be released soon.
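
"Late interaction" usually refers to scoring a query against a candidate token by token rather than with a single pooled vector. The sketch below shows the common MaxSim form of that idea applied to document blocks. Whether LFRAG uses exactly this scoring rule is an assumption, as are the names `maxsim_score` and `rank_blocks` and the toy block identifiers in the usage note; token embeddings are plain lists and every block is assumed to contain at least one token.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: Sequence[Vector], block_tokens: Sequence[Vector]) -> float:
    """Sum, over query tokens, of the best-matching block token similarity."""
    return sum(max(dot(q, d) for d in block_tokens) for q in query_tokens)


def rank_blocks(
    query_tokens: Sequence[Vector], blocks: Dict[str, Sequence[Vector]]
) -> List[str]:
    """Return block identifiers ordered from highest to lowest MaxSim score."""
    return sorted(
        blocks,
        key=lambda name: maxsim_score(query_tokens, blocks[name]),
        reverse=True,
    )
```

For example, with a two-token query and blocks keyed `"table"` and `"caption"`, the block whose tokens cover both query tokens ranks first, so only that block needs to be passed to the generator instead of the whole page.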

Human-Computer Interaction

[HC-0] Divergent Paths to Depolarization: Dialogue Design Determines the Prosocial Benefits of AI-Assisted Political Argumentation

Link: https://arxiv.org/abs/2605.23890
Authors: Jianlong Zhu, Syed Muhammad Jhon Raza Naqvi, Carolin-Theresa Ziemer, Usman Naseem, Ingmar Weber
Categories: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)

Abstract:Argumentative dialogues across political divides can reduce polarization, yet opportunities for citizens to engage with opposing views in accessible and structured ways remain limited. AI dialogue partners offer a scalable framework for such open-mindedness exercises, but how the format of human-AI dialogues shapes their benefits remains unclear. In a two-session online experiment, 469 US participants were assigned to argue either for or against their own attitude on a contested political issue with an AI chatbot. Our experimental findings show attitude-congruent dialogues produced greater immediate reduction in both affective and opinion polarization than attitude-incongruent dialogues. By contrast, attitude-incongruent dialogues elicited weaker cognitive state empathy than the non-AI reference task but increased cognitive trait empathy in the two-week period between sessions, suggesting the effects of active generation of attitude-incongruent arguments may emerge over time. These findings highlight dialogue design as a key determinant of effective AI-mediated behavioral interventions.

[HC-1] Human Decision-Making with Persuasive and Narrative LLM Explanations

Link: https://arxiv.org/abs/2605.23867
Authors: Laura R. Marusich, Mary Grace Kozuch Dhooghe, Jonathan Z. Bakdash, Murat Kantarcioglu
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)

Abstract:Large language models (LLMs) have the potential to aid and improve human decision-making in classification tasks, not only by providing fairly accurate predictions, but also in their ability to generate cogent narrative explanations of those predictions. Prior work has demonstrated that people generally find AI narrative explanations to be understandable, trustworthy, and convincing for changing beliefs and opinions; however, less is known about the impact of narrative explanations on objective human decision-making performance. Here we conduct a large-scale human behavioral experiment to evaluate decision-making performance with LLM-generated narrative explanations of varying persuasiveness. We found the degree of persuasiveness, or lack thereof, for LLM-based explanations did not meaningfully impact decision accuracy over a simple AI prediction alone, in agreement with typical results with explainable AI based on feature importance. We found evidence that narratives increased reliance on AI, both when the AI prediction was correct and when it was incorrect. Exploratory analyses also indicated that the more persuasive narratives may have had a detrimental effect on decision response times and the ability to discriminate between a correct and incorrect AI prediction. Overall, this work indicates that including narrative explanations with AI predictions may involve tradeoffs for decision-making performance, and more work is needed to determine how and when narrative explanations impact human decision-making.

[HC-2] “I can’t read your mind”: A Study of Neurodivergent Computing Students’ Experiences with Collaborative Active Learning

Link: https://arxiv.org/abs/2605.23823
Authors: Cynthia Zastudil, Srishty Muthusekaran, Rayhona Nasimova, Stephen MacNeil
Categories: Human-Computer Interaction (cs.HC)

Abstract:Computing courses often feature active learning techniques that promote collaboration and social interaction between students. However, neurodivergent students’ preferences and experiences with these techniques are not well understood. We conducted a survey of neurodivergent computing students (n=24), specifically autistic students or students with ADHD, and neurotypical computing students (n=20) to understand how the structure of collaborative active learning affects their comfort in computing courses. We also interviewed four computing students on the autism spectrum or with ADHD to gain more contextualized insights into their experiences and accessibility recommendations. Our survey surfaces how team dynamics and assignment structure can impact neurodivergent students’ comfort in computing courses. Neurodivergent students expressed discomfort with assignments that lack structure or have ambiguous expectations. Neurodivergent students prefer smaller teams that work together frequently with explicitly defined roles. Our interviews identified ways that neurodivergent students cope with discomfort in collaborative active learning, including self-selecting roles and self-disclosure. While preliminary, our results highlight how instructors can design collaborative active learning to be more equitable and accessible for neurodivergent students.

[HC-3] Perceptually Lossless Tactile Texture Synthesis with Compact Spectral Envelope Models

Link: https://arxiv.org/abs/2605.23804
Authors: Jagan K. Balasubramanian, Yasemin Vardar
Categories: Human-Computer Interaction (cs.HC); Signal Processing (eess.SP)
Comments: 16 pages and 8 figures

Abstract:Modern audio-visual media rely on compact representations for efficient storage and transmission, whereas realistic digital touch still depends on high-resolution tactile recordings. Existing approaches for representing tactile signals constrain manipulation and limit the generation of new content. Here, we introduce two compact representations, spectral beta and spectral slope, that capture the temporal spectral structure of finger-surface friction signals while preserving perceptually relevant information. Spectral beta models spectral skewness using a two-parameter beta distribution, whereas spectral slope approximates the spectrum with an asymmetric bandpass filter defined by low- and high-pass orders. We evaluated these representations in a perceptual study with 14 participants using five virtual textures rendered on a friction-modulation display and compared them with physical textures and high-fidelity reproductions of recorded signals. Spectral beta achieved perceptual similarity ratings comparable to those of the original high-fidelity reproductions. Regression analysis further showed that matching spectral energy across nine critical frequency bands was the strongest predictor of perceived realism. Together, these findings suggest that tactile texture perception depends primarily on fundamental temporal spectral patterns and that modeling these patterns is sufficient for perceptually realistic rendering. These results establish an efficient and scalable framework for haptic compression, communication, and synthetic texture generation.

[HC-4] Engagement-Optimized Care: When LLMs become Mental Health Infrastructure

Link: https://arxiv.org/abs/2605.23787
Authors: Briana Vecchione, Meryl Ye, Livia Garofalo, Ranjit Singh
Categories: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: 10 pages, 1 figure

Abstract:General-purpose LLMs are increasingly functioning as mental health infrastructure due to gaps in care left by provider shortages, inadequate insurance coverage, social isolation, and stigma around formal help-seeking. This shift poses a distinct problem for AI ethics: systems neither designed nor governed as care technologies are being used as such, while their dominant design incentives optimize for engagement rather than user well-being. We present findings from a qualitative, longitudinal study with 18 US-based participants who use general-purpose LLMs for socioemotional support and participated in one or more of our study phases, including initial interviews, a four-week diary study, focus groups, and exit interviews. Participants turned to LLMs because other forms of support were unavailable, unaffordable, socially costly, or inadequate. As they continued to use these systems, design features such as anthropomorphic cues, default validation, persistent responsiveness, and weak disengagement mechanisms shaped their ongoing reliance. Participants described meaningful support alongside dependency, epistemic distortion through one-sided validation, privacy expectations without corresponding legal protection, and continued use despite awareness of these risks. We argue these dynamics reflect a structurally unfair tradeoff: users accept risks because support is otherwise absent, while available systems are optimized to deepen engagement and lack care-based accountability. The paper makes three contributions: it traces the arc through which LLMs become care infrastructure and identifies distinct ethical tensions at each stage, shifts analysis from turn-based exchanges to longitudinal trajectories of use, and argues that accountability belongs at the design and incentive conditions through which these systems become care infrastructure rather than at the output or crisis-response layer.

[HC-5] AI at the Front Lines of Platform Governance: Using LLMs to Support Illegal Content Reporting under the Digital Services Act

Link: https://arxiv.org/abs/2605.23676
Authors: Marie-Therese Sekwenz, Shreyan Biswas, Rita Hermann-Gsenger, Ujwal Gadiraju
Categories: Human-Computer Interaction (cs.HC)

Abstract:Illegal content reporting mechanisms are a key technical and organizational measure through which online platforms address illegal content under the European Union Digital Services Act (DSA). Article 16 requires user notices to be sufficiently substantiated and submitted in good faith, placing users in the difficult position of interpreting legal and procedural language and translating ambiguous content into legally meaningful categories and reasons. We investigate how large language model (LLM)-based assistants can support this reporting process. In a controlled user study (N = 450) using an interface modeled on a major platform reporting workflow, we compare three conditions: unaided reporting, a conventional explainable AI assistant (XAI) that suggests a single legal category with a rationale, and an evaluative AI assistant (EvalAI) that presents balanced pro and con arguments across candidate legal provisions. We further examine these assistance forms under systematically varied AI error regimes. Our results show that EvalAI improves provision-level accuracy under AI error and reduces misclassification distance relative to conventional XAI, particularly for near-miss and overbreadth errors. When AI output is correct, conventional XAI enables faster decisions, but neither AI assistance form reliably improves the quality of users’ substantiated explanations relative to unaided reporting. We discuss design implications for compliance-oriented reporting interfaces, highlighting trade-offs between accuracy, deliberation, explanation quality, and vulnerability to misleading AI output.

[HC-6] Detecting Drunk Driving Using Off-the-Shelf Smartwatches

Link: https://arxiv.org/abs/2605.23663
Authors: Robin Deuber, Lanlan Yang, Michal Bechny, Christoph Heck, Matthias Pfäffli, Matthias Bantle, Florian von Wangenheim, Elgar Fleisch, Wolfgang Weinmann, Manuel Günther, Felix Wortmann, Varun Mishra
Categories: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 27 pages, 7 figures

Abstract:Alcohol-impaired driving remains a major yet preventable cause of road traffic injury and death, with many drivers underestimating their level of intoxication. Compared to in-vehicle systems, mobile drunk-driving detection using consumer smartwatches offers a scalable way to trigger preventive interventions and increase awareness without additional in-vehicle hardware. We introduce a system that leverages wrist accelerometer data and heart rate variability-derived physiological signals to detect alcohol-related driving impairment. We collected data in a randomized, controlled three-arm test-track study (n=54) and trained both logistic regression models with window-aggregated features and a two-tower 1D convolutional neural network (CNN) to detect alcohol-impaired driving. The CNN achieved a participant-averaged area under the receiver operating characteristic curve (AUROC) of 0.88 for detecting any alcohol intoxication and 0.86 for detecting driving above the WHO-recommended limit of 0.05 g/dL. To the best of our knowledge, this is the first work to (1) demonstrate drunk-driving detection using consumer smartwatches, (2) develop and evaluate such a system in a real vehicle on a closed test track, and (3) rigorously assess generalization to unseen participants. Together, these findings highlight the potential of wearable-based sensing to support scalable, measurement-driven prevention of alcohol-related traffic harm.
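
One plausible reading of "participant-averaged" AUROC is that the metric is computed separately for each held-out participant and then averaged, so that participants with many windows do not dominate. The sketch below follows that reading, which is an assumption about the paper's protocol. The record layout `(participant_id, label, score)` and the function names are hypothetical, and the AUROC is computed with the pairwise (Mann-Whitney) formulation rather than a library call.

```python
from typing import Dict, List, Sequence, Tuple


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def participant_averaged_auroc(records: Sequence[Tuple[str, int, float]]) -> float:
    """Average of per-participant AUROCs over (participant_id, label, score) records."""
    grouped: Dict[str, Tuple[List[int], List[float]]] = {}
    for pid, label, score in records:
        labels, scores = grouped.setdefault(pid, ([], []))
        labels.append(label)
        scores.append(score)
    per_participant = [auroc(ys, ss) for ys, ss in grouped.values()]
    return sum(per_participant) / len(per_participant)
```

Pooling all windows into a single AUROC would instead reward a model for separating participants from each other, which is a different question from detecting impairment within an unseen participant.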

[HC-7] When Youth Enter the Algorithmic Wild: Discovering and Understanding Potentially Harmful Teen Videos on Douyin and Kwai

Link: https://arxiv.org/abs/2605.23598
Authors: Shaoxuan Zhou, Yafei Sun, Jing Zhang, Xianghang Mi
Categories: Cryptography and Security (cs.CR); Human-Computer Interaction (cs.HC)

Abstract:Short-video platforms like Douyin and Kwai have become central to adolescent digital life, but they also risk exposing teens to algorithmically amplified harmful content. Despite its societal importance, the scale, mechanisms, and real-world impact of this exposure remain poorly understood. Measuring it is challenging: recommendation feeds are personalized black boxes, harmful content employs sophisticated evasion tactics, and naive crawlers fail to replicate authentic teen behavior. To bridge this gap, we propose PHTV-Scout, the first large-scale, behaviorally grounded measurement framework for Potentially Harmful Teen Videos (PHTVs). We integrate an offline survey of 683 adolescents with a tri-module online pipeline: (1) PHTV Hunter simulates teen accounts to collect recommendation feeds; (2) PHTV Arbiter, a LoRA-finetuned multimodal classifier, detects PHTVs with 94.29% accuracy and 96.41% precision; and (3) PHTV Analyzer performs fine-grained categorization and impact assessment. Over six months, we analyzed 186,727 videos and 51,287 comments, uncovering a troubling 6.11% PHTV prevalence, dominated by Child Sexual Exploitation Imagery (53.2%), and revealing that harmful content thrives through covert interactions (e.g., grooming comments, self-disclosure) and active evasion (semantic camouflage, noise injection). Crucially, while Youth Mode blocks 100% of PHTVs, its low adoption (30-41%) leaves most teens unprotected. We further show that exposure is driven not by user identity but by regulation, platform algorithms, and even passive browsing, exposing the fragility of adolescent information environments. Our findings call for a paradigm shift from reactive takedowns to proactive, human-centered safeguards.

[HC-8] MindCopilot: Towards Formalizing and Evaluating Granular Human-LLM Co-Writing IJCAI2026

Link: https://arxiv.org/abs/2605.23535
Authors: Youqing Fang(1 and 2),Yinhao Tang(1 and 2),Yanan Sun(2),Jiangning Liu(2),Ziyi Wang(2),Xun Zhao(2),Bin Liu(1),Weiming Zhang(1),Kuikun Liu(2),Wenwei Zhang(2),Kai Chen(2) ((1) University of Science and Technology of China, (2) Shanghai AI Laboratory)
Categories: Human-Computer Interaction (cs.HC)
Comments: 30 pages, 8 figures. Accepted to IJCAI 2026

Abstract:Recent writing assistants are increasingly shifting from passive, prompt-driven interaction to proactive, suggestion-based completion, which integrates localized continuations into the writing flow and reduces coordination burden. However, existing evaluations simply focus on output quality, failing to capture how users accept, edit, or repair suggestions in real-time interaction, and thus obscuring the true usability of proactive co-writing systems. To address this gap, we adopt a sequential, behavior-centered view of interactive writing and formalize co-writing as a Human-in-the-Loop Markov Decision Process, modeling writing as an interaction shaped by user acceptance and editing decisions. Based on this formulation, we introduce the Co-Writing Fidelity Suite, an interaction-aware metric suite that captures both user-assistant alignment and cognitive editing effort, including Hierarchical Acceptance Rate and Knowledge-aware Editing Distance. We conduct a large-scale simulation study across 16 writing domains, using 1,688 controlled continuation queries sampled from different writing stages. Our analysis reveals systematic effects of interaction structure on acceptance behavior and editing cost. A follow-up user study with 30 participants confirms that these behavioral patterns align with real user experience. Together, our findings demonstrate that interaction-aware evaluation provides insights beyond output-only metrics and informs the design of more effective proactive writing assistants.

[HC-9] Socially fluent AI decouples conversational signals from source identity in online interaction

Link: https://arxiv.org/abs/2605.23426
Authors: Lixiang Yan, Yueqiao Jin, Xibin Han, Dragan Gašević
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)

Abstract:Socially fluent agentic AI can now participate in online interaction in ways that resemble ordinary human conversation, potentially weakening people’s ability to infer who is human from conversational signals alone. We tested this possibility in synchronous text-based group interaction by embedding undisclosed AI agents as ordinary teammates across analytical, creative, and ethical tasks. Across 786 participants who made 1,572 post-interaction identity judgments, people did not distinguish AI from human teammates above chance. This failure did not arise because the interaction lacked identity-relevant information. Conversational behaviour contained robust cues that differentiated AI from humans and supported highly accurate computational classification. Instead, participants relied on familiar suspicion heuristics, including response speed, fluency, and perceived scriptedness, that were only weakly related to actual identity. Representational analyses further showed that judgments were organised around subjective impressions rather than the behavioural structure encoding ground truth. This dissociation creates new vulnerabilities to coordinated AI agents that can influence and manipulate online discourse at scale.

[HC-10] Cogniscope: A Synthetic Longitudinal Benchmark and Browser-Based Evaluation Framework for Early-Risk Cognitive AI Systems

Link: https://arxiv.org/abs/2605.23242
Authors: Mahfuza Farooque, Ananya Drishti, Mukhil Muruganantham Prakaash, Uttkarsh Agarwal, Zahra Abdul Basit, Asish Kondragunta
Categories: Human-Computer Interaction (cs.HC)

Abstract:We present Cogniscope, an open evaluation framework for studying longitudinal early-risk AI systems under controlled behavioral drift, sparse observations, delayed evidence, and heterogeneous progression patterns. Cogniscope combines two complementary components: a synthetic simulation engine that generates privacy-preserving longitudinal behavioral traces aligned with configurable latent risk trajectories, and a browser-based data-collection instrument implemented as a Chrome extension for capturing naturalistic video interaction telemetry and micro-question responses during YouTube playback. The released benchmark includes 200,000 simulated video-interaction records from 200 users over 200 days, a 504-session schema-aligned synthetic deployment dataset across nine behavioral profiles, an 18-table relational schema, baseline evaluation scripts, and time-aware metrics including Early Risk Detection Error (ERDE) and time-to-detection (TTD). We emphasize that Cogniscope is not a diagnostic system and does not claim clinical validity. Instead, it provides a reusable testbed for evaluating how sequential models behave under known longitudinal challenges before deployment with real human-subject data. Experiments show that simple behavioral coherence signals separate simulated risk states under controlled priors, while rule-based deployment-profile classification remains challenging, motivating learned temporal models and robust evaluation protocols.
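The abstract names ERDE and time-to-detection (TTD) without defining them. As a rough illustration, the sketch below follows the commonly used ERDE formulation, in which a correct alert is penalized by a sigmoid latency factor; the cost constants, the function names, and the TTD convention (1-based index of the first alert) are assumptions and may differ from the released evaluation scripts.

```python
import math


def latency_cost(k, o):
    """Latency factor lc_o(k) = 1 - 1 / (1 + exp(k - o)); grows from 0 to 1 as k passes o."""
    if k - o > 700:  # avoid overflow in math.exp for very late decisions
        return 1.0
    return 1.0 - 1.0 / (1.0 + math.exp(k - o))


def erde(decision, truth, k, o, c_fp=0.1, c_fn=1.0, c_tp=1.0):
    """Early Risk Detection Error for one user.

    decision: True if the system raised an alert after seeing k observations.
    truth:    True if the user is actually at risk.
    The cost constants are illustrative defaults, not values from the paper.
    """
    if decision and not truth:
        return c_fp                      # false positive
    if not decision and truth:
        return c_fn                      # false negative
    if decision and truth:
        return latency_cost(k, o) * c_tp  # true positive, penalized for lateness
    return 0.0                           # true negative


def time_to_detection(flags):
    """1-based index of the first alert in a sequence of per-step flags, or None."""
    for step, flagged in enumerate(flags, start=1):
        if flagged:
            return step
    return None
```

With this form, an alert raised exactly at the deadline `o` costs half of `c_tp`, so the metric rewards earlier correct detections.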

[HC-11] Cognitive offloading and the speedup illusion in human-AI interaction

Link: https://arxiv.org/abs/2605.23177
Authors: Sunny Yu,Myra Cheng,Ahmad Jabbar,Ilia Sucholutsky,Katherine M. Collins,Dan Jurafsky,Robert D. Hawkins
Categories: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: Proceedings of the 48th Annual Meeting of the Cognitive Science Society

Abstract:Large language models (LLMs) have the potential to boost human productivity by speeding up task completion – provided users know when to offload cognitive work to them. But we do not know if users are well-calibrated in estimating these potential time savings. We conducted a preregistered large-scale behavioral study (N = 1237) to characterize mismatches between expectations and reality, with a focus on simple cognitive tasks. While actual completion times between independent completion and AI-assisted completion did not differ, participants predicted AI to be significantly faster. The same bias was not observed when imagining help from another human participant. We identify a speedup illusion where people have accurate forecasts of independent completion times but significantly underestimate AI-assisted times. Additionally, time and effort dissociate: participants reported lower subjective effort with AI despite equivalent completion times. This suggests that completion time itself is not sufficient to characterize efficiency gains.

[HC-12] From Preventive to Reactive: How AI Coding Assistants Transform Developers' Security Awareness

Link: https://arxiv.org/abs/2605.23130
Authors: Faisal Haque Bappy,Tahrim Hossain,Sidratul Muntaher Meheraj,Annoor Sharara Akhand,Tasfia Tabassum,Tarannum Shaila Zaman,Raiful Hasan,Tariqul Islam
Categories: Human-Computer Interaction (cs.HC); Cryptography and Security (cs.CR)
Comments: This paper has been accepted at the 2026 Symposium on Usable Privacy and Security (SOUPS)

Abstract:AI coding assistants are now central to professional software development, yet their impact on how developers think about and practice security remains poorly understood. While prior work has documented vulnerability rates in AI-generated code, a more fundamental question persists: how do these tools transform security awareness in authentic, ongoing development practice? We conducted semi-structured interviews with 15 professional software engineers and observed them completing security-relevant coding tasks with AI assistance, spanning 3 experience cohorts defined by their relationship to AI tools during professional formation. We find that AI coding assistants reorganize rather than eliminate security thinking, shifting it from the act of writing code to the act of reviewing it. This transition from preventive to reactive security is structurally encouraged by interaction models that frame code generation as a functional task, leaving security as an afterthought. Notably, none of our coding session participants specified security requirements in their initial prompts, even when they possessed the relevant knowledge, revealing a decoupling of security awareness from security behavior. We further document informal coping strategies developers had independently invented to manage AI security risk, none of which are supported by current tools or organizations, and find that the experience cohort did not reliably predict security performance. This paper contributes a practice-grounded account of how AI-assisted development reshapes the human side of secure coding, offering empirical foundations for the design of more security-aware tools, training programs, and organizational policies.

[HC-13] Defining AI Fatigue in Academic Contexts: Dimensions, Indicators, and a Stage-Based Model Using Grounded Theory

Link: https://arxiv.org/abs/2605.23123
Authors: John Paul P. Miranda,Emmanuel B. Parreño,Jovita G. Rivera
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 17 pages, journal article, Volume 25, Issue 5,

Abstract:The integration of AI tools in academic settings has introduced a distinct form of strain that existing frameworks like technostress and digital fatigue have not yet fully addressed. This study develops a conceptual model and identifies the dimensions that define AI fatigue as a form of strain arising from sustained academic use of AI tools. Using grounded theory analysis of open-ended responses from 1,054 university students across three universities in the Philippines, the study examined the cognitive, motivational, emotional, physical, and attentional pressures students experienced during AI-supported academic work. Analysis produced five dimensions of AI fatigue, namely Cognitive Overload, Motivational Disengagement, Moral Unease, Physical Strain, and Attentional Drift, each consisting of two indicators grounded in participant accounts. The findings also yielded the AI Fatigue Model, a stage-based framework that explains how these pressures accumulate and reinforce one another across repeated AI interaction in academic tasks. These contributions establish a conceptual and exploratory foundation for AI fatigue as a distinct construct and provide a basis for future instrument validation, scale development, and cross-contextual inquiry in academic settings where AI now mediates student learning.

[HC-14] Remind Me To Check The Stove Before I Leave The House: Authoring Personalized Context-Aware Smart Home Reminders Using Everyday Language

Link: https://arxiv.org/abs/2605.23085
Authors: Reina Szeyi Chan,Sujendra Jayant Gharat,Maya Lampi,Yueran Jia,Avi K Srinivasan,Xiang Zhi Tan
Categories: Human-Computer Interaction (cs.HC)
Comments:

Abstract:Reminder systems commonly rely on fixed schedules, location triggers, or simple rules, limiting their ability to leverage the rich sensing capabilities of modern smart homes. A key challenge lies in enabling users to specify context-aware reminders without requiring complex configurations. We present a system pipeline that supports reminder authoring through natural language and conversational interaction. The pipeline translates user requests into structured representations and executable logic, incorporating time-based, activity-based, sensor-based, and state-based conditions. We conducted two studies to examine how users express reminder intent and how conversational support influences the authoring process. In Study 1 (N=40), we analyzed 233 user-authored reminders and identified challenges in expressing reminders with diverse and complex logic. Based on these findings, we refined the system and evaluated it in Study 2 (N=10), demonstrating improved handling of time-based, activity-based, sensor-based, and state-based conditions. Our results highlight the diversity and ambiguity of user expressions and show that conversational guidance can help structure these expressions into flexible, context-aware reminders.

[HC-15] StanBKT: Rethinking Parameter Estimation in Bayesian Knowledge Tracing

Link: https://arxiv.org/abs/2605.23048
Authors: Siddhartha Pradhan,Yanping Pei,Morgan Lee,Puyuan Zhang,Erin Ottmar,Adam C. Sales
Categories: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY); Applications (stat.AP); Methodology (stat.ME)
Comments: 5 figures, 7 tables

Abstract:Bayesian Knowledge Tracing (BKT) is a widely used and interpretable student modeling approach in intelligent tutoring systems and educational data mining. However, most implementations rely on expectation-maximization or related optimization methods that yield only point estimates, limiting uncertainty quantification and principled comparisons across learners and conditions. We introduce StanBKT, an open-source Python package for estimating BKT models using Bayesian inference in Stan. StanBKT provides a unified framework supporting Hamiltonian Monte Carlo, variational inference, Pathfinder, and optimization-based estimation while preserving the hidden Markov structure and interpretability of classical BKT. It supports standard, grouped, and hierarchical BKT models, flexible prior specification, posterior predictive inference, and utilities for visualization and diagnostics. We evaluate StanBKT on large-scale observational and controlled educational datasets. On the ASSISTments 2020 dataset, we show that supported inference methods achieve comparable predictive performance while differing in computational efficiency and posterior fidelity. We further demonstrate how posterior inference enables principled comparison of condition-specific parameters in an educational intervention involving perceptual cue manipulations. Results illustrate how uncertainty quantification facilitates more reliable interpretation of differences in learning, forgetting, guessing, and slipping parameters across experimental conditions. Overall, StanBKT extends BKT beyond point estimation by providing a flexible framework for probabilistic student modeling, uncertainty quantification, and hierarchical inference in educational data mining.
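For readers unfamiliar with the hidden Markov structure mentioned above, the following is a minimal sketch of the classical BKT update for a single skill, using the four parameters the abstract lists (learning, forgetting, guessing, slipping). It is plain Python with hypothetical function names and does not use or represent the StanBKT API.

```python
def predict_correct(p_known, p_guess, p_slip):
    """Probability of a correct answer given the current mastery estimate."""
    return p_known * (1.0 - p_slip) + (1.0 - p_known) * p_guess


def bkt_step(p_known, correct, p_learn, p_guess, p_slip, p_forget=0.0):
    """One classical BKT update: Bayes rule on the observation, then the transition."""
    if correct:
        num = p_known * (1.0 - p_slip)
        den = num + (1.0 - p_known) * p_guess
    else:
        num = p_known * p_slip
        den = num + (1.0 - p_known) * (1.0 - p_guess)
    # If the observation has zero probability under the model, keep the prior.
    posterior = num / den if den > 0.0 else p_known
    # Learned students may forget; unlearned students may learn.
    return posterior * (1.0 - p_forget) + (1.0 - posterior) * p_learn


def trace(responses, p_init, p_learn, p_guess, p_slip, p_forget=0.0):
    """Mastery estimate after each response in a sequence of booleans."""
    p_known, history = p_init, []
    for correct in responses:
        p_known = bkt_step(p_known, correct, p_learn, p_guess, p_slip, p_forget)
        history.append(p_known)
    return history
```

Fitting these parameters by expectation-maximization yields single point values, whereas a Bayesian treatment of the kind the abstract describes would place priors on them and return posterior distributions.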

[HC-16] Robots That Know What to Ask: Recovering Misaligned Rewards through Targeted Explanations

Link: https://arxiv.org/abs/2605.22986
Authors: Helena Merker,Nick Walker,Andreea Bobu
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments:

Abstract:Learning reward functions from demonstrations assumes that demonstrations provide adequate supervision over all features – or task-relevant aspects of behavior. In practice, demonstrations are often imperfect: humans may under-emphasize certain features due to cognitive load or physical difficulty, or the training regime may fail to sufficiently cover all relevant situations. In either case, important features may be underspecified, leading to ambiguity in the learned reward function and misaligned behavior at deployment. We propose a framework that detects such underspecified features and actively solicits targeted corrective demonstrations. Our key insight is that demonstrations implicitly reveal which features are well specified: features that are consistently optimized show little variation across demonstrations, while features that are underspecified vary widely. We leverage this statistical signal to infer which features may have been insufficiently demonstrated. The robot then explains which features it is uncertain about in natural language and queries for demonstrations that explicitly address the identified gaps. We evaluate our approach in a simulated tabletop manipulation domain and in a user study with a real Franka robot. Targeted, explanation-guided queries significantly improve reward recovery compared to random querying and passive data collection, reducing ambiguity that would otherwise persist in learning from imperfect demonstrations.
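The variance signal described in the abstract can be illustrated with a small sketch: compute each feature's spread across demonstrations and flag the ones that vary widely. The feature names, the use of population variance, and the fixed threshold are all assumptions made for illustration; the paper's actual statistic and decision rule are not given in the abstract.

```python
from statistics import pvariance


def underspecified_features(demos, threshold):
    """Flag features whose values vary widely across demonstrations.

    demos:     list of dicts mapping feature name -> scalar feature value,
               one dict per demonstration (hypothetical representation).
    threshold: variance above which a feature is treated as underspecified.
    Returns (sorted list of flagged feature names, dict of per-feature variance).
    """
    names = list(demos[0].keys())
    spread = {name: pvariance([demo[name] for demo in demos]) for name in names}
    flagged = sorted(name for name, var in spread.items() if var > threshold)
    return flagged, spread


# Hypothetical tabletop features: the tilt is consistently kept near zero,
# while the distance to the table edge varies a lot between demonstrations.
demos = [
    {"dist_to_table_edge": 0.1, "cup_tilt": 0.00},
    {"dist_to_table_edge": 0.9, "cup_tilt": 0.01},
    {"dist_to_table_edge": 0.5, "cup_tilt": 0.00},
]
flagged, spread = underspecified_features(demos, threshold=0.05)
```

In this toy example only `dist_to_table_edge` is flagged, which is the kind of feature a robot could then ask about with a targeted query.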

[HC-17] Can AI Guess What You Know? Performance Comparison of Large Language Models for Human Domain Knowledge Estimation From Communication Logs

Link: https://arxiv.org/abs/2605.22971
Authors: Ko Watanabe,Shoya Ishimaru
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Employees often struggle to identify “who knows what,” leading to organizational productivity losses. We investigate whether Large Language Models (LLMs) can infer individual domain knowledge directly from long-term Slack logs. Analyzing 27,188 messages from 43 users, we evaluated seven models (including Gemini, Claude, and GPT families) by comparing their zero-shot estimates against self-reported skill ratings from 27 participants. Gemini 2.5 Flash achieved the lowest error (MAE 21.13%), while GPT models showed significantly larger discrepancies. Notably, estimation accuracy depended only weakly on message volume, indicating that more text alone does not guarantee better inference. These findings demonstrate the feasibility and current limits of automated expertise mapping, highlighting the need for privacy-preserving deployments and richer, structure-aware representations of human knowledge.

[HC-18] GazeBehavior Annotation Toolkit (GBAT): AI-powered toolkit for automatic annotation of egocentric eye-tracking and video data of child-caregiver interaction

Link: https://arxiv.org/abs/2605.22962
Authors: Iba Baig,Kevin Li,Yanbin Xu,Seiji Cattelain,Marie Hallo,Hayato Ono,Sho Tsuji,Ming Bo Cai
Categories: Computer Vision and Pattern Recognition (cs.CV); Computational Engineering, Finance, and Science (cs.CE); Human-Computer Interaction (cs.HC); Software Engineering (cs.SE); Neurons and Cognition (q-bio.NC)
Comments: submitted to IEEE International Conference on Development and Learning (ICDL), 2026

Abstract:Video recordings of child-caregiver interactions enable investigation of attentional dynamics during naturalistic behavior. Such multimodal recording also allows researchers to examine how attention interacts with action and language use in real time. However, manual annotation of such data is time-consuming. Here, we introduce GazeBehavior Annotation Toolkit, a deep-learning-based toolkit designed to facilitate three key processes in data preprocessing and feature extraction: post-hoc synchronization across multiple videos, semi-automatic annotation of gaze target categories, and categorization of participants’ poses and hand actions. This toolkit improves the efficiency and scalability of feature extraction from human egocentric eye-tracking and video data. Such improvement is critical in supporting large-scale and longitudinal investigations of attentional dynamics and naturalistic behavior in human early development.

Computer Vision

[CV-0] Geo-Align: Video Generation Alignment via Metric Geometry Reward

Link: https://arxiv.org/abs/2605.23903
Authors: Zizun Li,Haoyu Guo,Runzhe Teng,Chunhua Shen,Tong He
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Camera-controlled video generation has achieved remarkable progress in recent years. However, existing video-to-video re-rendering methods primarily rely on Supervised Fine-Tuning using synthetic datasets. At present, there is an extreme scarcity of synchronized, multi-view real-world video data. Consequently, the prevailing paradigm often exhibits limited generalization when processing out-of-distribution real-world videos, with models struggling to accurately adhere to physical scales and camera trajectories. To bridge this gap, we propose Geo-Align, the first Reinforcement Learning framework specifically designed for camera-controlled video re-rendering. Built upon a pretrained model, we optimize the model through a scale-aware perceptual reward mechanism. Specifically, we introduce a metric 3D estimator to extract precise camera trajectories from generated videos, explicitly penalizing deviations in rotation and translation. Furthermore, we meticulously designed a data pipeline strategy based on real-world conditioning videos and target camera trajectories derived from synthetic data, eliminating the reliance on paired data. Extensive experiments demonstrate that Geo-Align consistently outperforms existing supervised learning baselines in both precise camera controllability and visual fidelity, indicating the effectiveness of our method.

[CV-1] PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion

Link: https://arxiv.org/abs/2605.23902
Authors: Yifan Lu,Qi Wu,Jay Zhangjie Wu,Zian Wang,Huan Ling,Sanja Fidler,Xuanchi Ren
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Abstract:Most practical high-resolution text-to-image systems, including latent diffusion and autoregressive models, perform generation in a compact latent space, and a decoder maps the generated latents back to pixels. Yet the latent-to-pixel decoder is reconstruction-oriented, optimized to invert the encoder rather than synthesize more details, and becomes increasingly costly at megapixel scale. This drawback calls for a more expressive and efficient decoding paradigm. Motivated by recent progress in scalable pixel-space diffusion, we introduce PiD, a Pixel diffusion Decoder that reformulates latent decoding as conditional pixel diffusion, unifying decoding and upsampling into one generative module. By denoising directly in high-resolution pixel space, PiD synthesizes 4× and even 8× upscaled images with low latency. For latent conditioning, a lightweight sigma-aware adapter injects noise-corrupted latents into the pixel diffusion backbone, enabling PiD to decode partially denoised latents and terminate the latent diffusion process early. To further improve efficiency, we distill the model using DMD2, reducing inference to just 4 steps. PiD applies to both conventional VAE latents and semantic latents (e.g., SigLIP, DINOv2) used in recent RAE-based models. PiD decodes latents of 512 × 512 images into 2048 × 2048 pixels in under 1 second with 13 GB peak memory on a consumer RTX 5090, and as fast as 210 ms on a GB200 GPU, about 6× faster than cascaded diffusion-based super-resolution pipelines with better visual fidelity.

[CV-2] From Activation to Causality: Discovery of Causal Visual Representations in the Human Brain

Link: https://arxiv.org/abs/2605.23895
Authors: Yuval Golbari,Navve Wasserman,Matias Cosarinsky,Roman Beliy,Aude Oliva,Antonio Torralba,Michal Irani,Tamar Rott Shaham
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Identifying which brain regions represent a visual concept in the human brain is a central challenge in neuroscience. Existing approaches have localized coarse functional regions (e.g., faces, places) through activation maximization, identifying regions that activate strongly for a target concept relative to other concepts. Yet strong activation alone does not establish that a region represents the concept itself, as responses may instead be driven by correlated visual or semantic cues. We introduce BrainCause, an automated framework that combines generative and brain models to synthesize controlled stimuli and validate neural representations through targeted causal testing. Given a query specifying a concept of interest, our framework constructs targeted stimulus sets comprising concept images, counterfactual edits that remove the target concept while preserving other image content, and images with candidate correlated distractors. It then uses an image-to-fMRI encoding model to predict brain responses and searches for representations that respond specifically to the target concept over correlated alternatives. BrainCause returns validated candidate representations and proposes follow-up fMRI experiments to further test or extend its discoveries. Our approach successfully recovers known functional localizations and identifies new candidate representations across dozens of concepts, validated on both predicted and measured fMRI data. Critically, we show that without causal validation, a large fraction of localizations would be false positives, confirming that activation alone is insufficient evidence of representation.

[CV-3] Good Token Hunting: A Hitchhiker's Guide to Token Selection for Visual Geometry Transformers

Link: https://arxiv.org/abs/2605.23892
Authors: Shuhong Zheng,Michael Oechsle,Erik Sandström,Marie-Julie Rakotosaona,Federico Tombari,Igor Gilitschenski
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Project Page: this https URL , Code: this https URL

Abstract:Visual geometry transformers have become powerful architectures for multi-view 3D reconstruction, enabling joint prediction of multiple 3D attributes in a feed-forward manner. However, their computational cost grows quadratically with the input sequence length due to the global attention layers inside these models. This limits both their scalability and efficiency. In this work, we address this challenge with a simple yet general strategy: restricting the number of key/value tokens that each query interacts with during global attention. To achieve effective token selection, we introduce a two-stage framework. First, an inter-frame selection step operates at the frame level to identify frames that should be preserved. Second, an intra-frame selection step discards further redundant tokens within the selected frames. Our analysis highlights the advantage of a diversity-based strategy for inter-frame selection, which ensures broad coverage of the scene. For intra-frame selection, we show that layer-aware sparsification is necessary, with the selection process guided by the entropy of the global attention pattern. Our approach offers a superior speed-accuracy trade-off compared to existing solutions. Extensive experiments show that it accelerates visual geometry transformers by over 85% for scenes with 500 images while maintaining, or even improving, baseline performance, which hints that our token selection strategy can play a crucial role in future applications of visual geometry transformers. Our project website is available at this https URL.
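The two ingredients named in the abstract, a diversity-based choice of frames and an entropy measure over attention, can be sketched generically as follows. Greedy farthest-point selection over per-frame feature vectors and Shannon entropy of one attention row are stand-ins chosen for illustration; the function names and the per-frame descriptors are hypothetical, and the paper's actual criteria may differ.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention row that sums to 1."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)


def select_diverse_frames(frame_features, k):
    """Greedy farthest-point selection of up to k frame indices.

    frame_features: one feature vector (sequence of floats) per frame.
    Starts from frame 0 and repeatedly adds the frame farthest from
    everything selected so far, which spreads the picks over the scene.
    """
    selected = [0]
    # Distance from every frame to its nearest selected frame.
    nearest = [math.dist(f, frame_features[0]) for f in frame_features]
    while len(selected) < min(k, len(frame_features)):
        candidate = max(range(len(frame_features)), key=lambda i: nearest[i])
        if nearest[candidate] == 0.0:
            break  # every remaining frame duplicates a selected one
        selected.append(candidate)
        nearest = [
            min(d, math.dist(f, frame_features[candidate]))
            for d, f in zip(nearest, frame_features)
        ]
    return selected
```

A near-uniform attention row has high entropy and a peaked row has entropy close to zero, so a per-layer entropy value of this kind is one plausible way to decide how aggressively a layer can be sparsified.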

[CV-4] Smart-Insertion-V: Photorealistic Video Insertion via a Closed-Loop Feedback Dual-Stream Framework

Link: https://arxiv.org/abs/2605.23891
Authors: Xiao Cao,Yansong Qu,Xiangzhen,Chang,Wen Xiao,Jiakui Hu,Heyuan Li,Jialun Liu,Zhiyong Huang,Xuelong Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Mask-free video object insertion has emerged as a challenging task, requiring harmonious integration of reference objects into source videos. However, existing methods struggle when references exhibit severe stylistic domain gaps with the source scene. To overcome this, we propose Smart-Insertion-V, an end-to-end Dual-Stream framework that concurrently conducts video insertion and image style transfer. Within this framework, the image stream synchronously guides the video generation process, while a Closed-loop Feedback mechanism is further incorporated to ensure robust insertion. Inevitably, integrating these diverse conditioning signals results in feature entanglement and style leakage. To tackle this issue, we design Dual-World-View RoPE to distinguish different signals via spatial-temporal offsets without incurring heavy training overhead. Furthermore, to facilitate spatial grounding and stylistic adaptation, we introduce a Decoupled Guidance Module that leverages a Vision-Language Model for semantic reasoning while preserving the original temporal guidance of the native text encoder. To bridge the data gap for the harmonious reference insertion task, we propose a data curation pipeline and will release an open-source dataset. Experiments demonstrate that our method can insert objects into plausible positions while achieving the most harmonious results.

[CV-5] HorizonStream: Long-Horizon Attention for Streaming 3D Reconstruction

Link: https://arxiv.org/abs/2605.23889
Authors: Chong Cheng,Peilin Tao,Nanjie Yao,Guanzhi Ding,Xianda Chen,Yuansen Du,Xiaoyang Guo,Wei Yin,Weiqiang Ren,Qian Zhang,Zhengqing Chen,Hao Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Online 3D reconstruction requires estimating camera pose and scene geometry under strict causal and bounded-memory constraints. Existing methods often suffer from drift, jitter, or collapse on long sequences. We trace these failures to a fundamental mismatch. Streaming geometry is inherently temporally heterogeneous, with evidence ranging from short-lived correspondences to persistent global scale. However, current architectures impose uniform and pathological influence patterns. For example, sliding windows enforce hard cutoffs, while ungated recurrence and causal attention cause cache saturation and spike-like attention sinks. To resolve this, we formalize geometric propagation as an evidence influence kernel and propose HorizonStream, a long-horizon Transformer that explicitly factorizes this kernel. For the long-range temporal factor, Geometric Linear Attention learns channel-wise decay rates to enable bounded, multi-timescale propagation of geometric evidence. For the short-range spatial factor, Geometric Local Attention with Spatiotemporal RoPE performs reliable 3D matching while suppressing attention sinks. Finally, Metric Readout Tokens recover stable scale and rigid pose directly from the persistent geometric state. Extensive experiments show that HorizonStream, trained on only 48-frame clips, generalizes stably to sequences exceeding 10,000 frames with constant memory and linear time, achieving state-of-the-art streaming 3D reconstruction performance. Project Page: this https URL
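To make the idea of channel-wise decay in linear attention concrete, here is a generic recurrent sketch: a fixed-size state is multiplied by a per-channel decay at every step before the new key/value outer product is added. This is a textbook-style gated linear attention recurrence, not the paper's Geometric Linear Attention; the function name, the fixed (rather than learned) decay values, and the absence of normalization are simplifying assumptions.

```python
def decayed_linear_attention(queries, keys, values, decay):
    """Causal linear attention with a per-key-channel decay.

    queries, keys: lists of length-d_k vectors, one per time step.
    values:        list of length-d_v vectors, one per time step.
    decay:         length-d_k vector in [0, 1]; values near 1 keep evidence
                   for a long time, values near 0 forget it quickly.
    The state is a d_k x d_v matrix, so memory does not grow with sequence length.
    """
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q, k, v in zip(queries, keys, values):
        for i in range(d_k):
            for j in range(d_v):
                # Decay old evidence in channel i, then write the new evidence.
                state[i][j] = decay[i] * state[i][j] + k[i] * v[j]
        # Read out: output_j = sum_i q_i * state[i][j]
        outputs.append(
            [sum(q[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
        )
    return outputs
```

Because each channel has its own decay, slow channels can hold persistent quantities while fast channels track short-lived ones, which is one way to read the multi-timescale propagation described in the abstract.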

[CV-6] GenRecon: Bridging Generative Priors for Multi-View 3D Scene Reconstruction

Link: https://arxiv.org/abs/2605.23888
Authors: Katharina Schmid,Nicolas von Lützow,Jozef Hladký,Angela Dai,Matthias Nießner
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:We introduce a new approach to high-fidelity 3D scene reconstruction from multi-view RGB images that tightly couples reconstruction with a strong generative 3D prior. We cast scene reconstruction as conditional 3D generation over a set of spatially-localized, overlapping chunks that together tile the scene, scaling generation to large scene extents. Crucially, we inherit the fidelity and completeness of state-of-the-art generative shape models – we use Trellis.2 as an example – which we generalize to the scene level. To this end, we propose a projection-based conditioning mechanism that lifts posed multi-view image features into a coherent 3D representation aligned with the generative model, independent of view ordering and spatially anchored to the scene, yielding high-fidelity, multi-view consistent generated geometry. This enables lifting the strong object-level prior of Trellis.2 to multi-view, scene-scale generation, producing faithful, editable PBR mesh reconstructions of indoor environments. As a result, we obtain high-fidelity results that outperform cutting-edge reconstruction methods by 16%.

[CV-7] PGT: Procedurally Generated Tasks for improving visual grounding in MLLMs

Link: https://arxiv.org/abs/2605.23883
Authors: Rim Assouel,Amir Bar,Michal Drozdzal,Adriana Romero-Soriano
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Despite remarkable progress in Multimodal Large Language Models (MLLMs), these models still struggle with fine-grained understanding tasks. In this work, we propose Procedurally Generated Tasks (PGT), a simple data-driven framework that serves a dual purpose: inducing fine-grained visual understanding and acting as a low-cost diagnostic tool to identify the source of perception failures. By overlaying unambiguous geometric primitives on images, PGT generates additional dense supervision that disentangles visual grounding capability from semantic priors. Extensive experiments on relational, quantitative, and 3D/depth understanding benchmarks show that PGT yields remarkable gains across diverse architectures. Instruction tuning MLLMs on LLaVA-v1.5-Instruct augmented with PGT data results in improvements of up to +20% on the What’sUp benchmark and +13.3% on CV-Bench-2D, while maintaining general perception capabilities. Moreover, finetuning state-of-the-art MLLMs on PGT data leads to boosts of up to +5.5% on What’sUp and +8.3% on CV-Bench-2D. These findings demonstrate that PGT effectively addresses the bottleneck of fine-grained perception, revealing that many spatial reasoning deficits stem from inadequate supervision signals rather than inherent architectural or resolution limitations.

[CV-8] LaMo: Self-Supervised Latent Motion Priors for Physical Realism in Video Generation

Link: https://arxiv.org/abs/2605.23878
Authors: Bo Jiang,Depu Meng,Yihan Hu,Yichen Xie,Tianshuo Xu,Wei Zhan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Abstract:Modern video generators produce visually compelling clips but still struggle with physical and motion consistency, limiting their use as reliable world simulators. Existing remedies often rely on external simulators, teacher models, or curated physics-focused data. We explore a complementary self-supervised direction: extracting motion cues from the unlabeled videos already used to train video diffusion models. We propose LaMo, which formulates a latent motion prior over frame-to-frame latent changes conditioned on the current latent and prompt. This prior is exposed through two lightweight readouts: a macro motion drift used during training as a Motion Drift Loss, and a learned micro motion field used during sampling as Motion Prior Guidance. Both components are plug-and-play with existing video diffusion backbones, requiring no architectural or I/O changes. On VideoPhy and VideoPhy2, LaMo improves CogVideoX backbones and outperforms recent physics-aware baselines that use external supervision. On VBench, it preserves overall generation quality while improving motion-related dimensions. These results suggest that unlabeled video contains useful motion supervision for improving physical fidelity in modern video diffusion models.

[CV-9] Vision Transformers Need Better Token Interaction

Link: https://arxiv.org/abs/2605.23868
Authors: Linxiang Su
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages

Abstract:Vision Transformers (ViTs) can learn strong image-level representations while their patch representations become less effective for dense prediction during prolonged training. We revisit this dense degradation phenomenon and argue that it is not fully explained by high-norm artifacts alone. Instead, we characterize semantic diffusion: an optimization shortcut in which global semantic information spreads through patch tokens beyond what is locally justified. Our analysis shows that dense representation quality is not captured by locality alone: shallow features can remain better aligned with foreground regions yet underperform deeper features, and [CLS] features remain complementary for dense prediction. These observations suggest that the goal should not be to remove global context, but to make token interactions more selective. We therefore study sparse attention as a minimal intervention, replacing softmax attention with entmax-1.5 while preserving global token connectivity. On DINOv1 ViT-S/16 trained for 200 epochs on ImageNet-1K, this change preserves ImageNet linear probing accuracy and substantially improves semantic segmentation performance: VOC mIoU increases from 42.80 to 48.78, ADE20K from 19.85 to 21.97, and Cityscapes from 36.79 to 37.87. These results suggest that selective token mixing is a simple and effective bias for improving dense ViT representations.
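For context on the substitution the abstract describes, entmax-1.5 maps a score vector z to probabilities p_i = max(z_i/2 - τ, 0)^2, with τ chosen so that the probabilities sum to 1; unlike softmax, low-scoring entries receive exactly zero weight. The sketch below finds τ by bisection in plain Python. It is an illustrative reference computation with a hypothetical function name, not the paper's implementation, and practical code would typically use a sorting-based exact solver on tensors.

```python
def entmax15(scores, iters=60):
    """Entmax with alpha = 1.5 for one score vector, solved by bisection on tau."""
    half = [s / 2.0 for s in scores]
    # At tau = max - 1 the top entry alone contributes 1, so the sum is >= 1;
    # at tau = max every entry is clipped to 0. The sum decreases in between.
    hi = max(half)
    lo = hi - 1.0
    for _ in range(iters):
        tau = (lo + hi) / 2.0
        total = sum(max(h - tau, 0.0) ** 2 for h in half)
        if total > 1.0:
            lo = tau
        else:
            hi = tau
    tau = (lo + hi) / 2.0
    probs = [max(h - tau, 0.0) ** 2 for h in half]
    norm = sum(probs)  # renormalize to remove the residual bisection error
    return [p / norm for p in probs]
```

For example, a score vector with one dominant entry yields a one-hot output, while equal scores yield a uniform distribution, so attention rows can ignore irrelevant tokens entirely without losing the ability to attend to any token.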

[CV-10] Leveraging Foundation Models for Causal Generative Modeling

Link: https://arxiv.org/abs/2605.23861
Authors: Aneesh Komanduri,Xintao Wu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Causal generative modeling is essential for developing reliable and transparent AI systems capable of counterfactual reasoning. While existing approaches focus on integrating causal constraints during the training of generative models, they often lack a unified framework to leverage the zero-shot reasoning capabilities of pretrained foundation models. We introduce FM-CGM, a modular framework for end-to-end visual causal reasoning using pretrained foundation models. FM-CGM formalizes the causal pipeline through three core components: a concept extractor, a concept manipulator, and a counterfactual generator. By leveraging a large reasoning model for causal inference and a text-to-image diffusion model for generation, our approach enables zero-shot causal discovery, intervention, and counterfactual generation. We then develop Causal Semantic Guidance (CSG), a cross-attention-based mechanism that ensures semantic interventions propagate to descendant concepts while preserving invariant regions. We empirically show that our approach can identify plausible causal structures and is suitable for faithful counterfactual image generation.

[CV-11] Learning a Particle Dynamics Model with Real-world Videos CVPR2026

Link: https://arxiv.org/abs/2605.23845
Authors: Chanho Kim,Suhas V. Sumukh,Li Fuxin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2026 Findings

Abstract:Data-driven learning approaches for physics simulation, sometimes referred to as world models, have emerged as promising alternatives to traditional physics simulators due to their differentiable nature. Prior work has demonstrated impressive results in predicting the motions of rigid and non-rigid objects in complex scenes involving multiple interacting bodies. However, these models are typically trained in simulated environments because obtaining perfect state information such as complete scene point clouds and point correspondences over time is challenging in real-world settings. This reliance on synthetic data can limit their applicability when the sim-to-real gap is large. In this work, we aim to overcome these limitations by introducing a novel framework for training neural object dynamics models directly from unlabeled real-world videos. Specifically, we propose to learn a particle-based dynamics model compatible with a Gaussian splatting framework, which operates on dense particles derived from Gaussians (i.e., particles with scales and rotations) and predicts their position and rotation changes over time. The model is trained via rendering supervision, enabling learning from real-world videos without requiring particle-level labeled states. Our model operates directly on dense Gaussians without relying on heuristic subsampling anchor points. To enable this study, we also present a real-world dataset consisting of about 500 videos capturing diverse object interactions.

[CV-12] MuellerPT: Decomposition Driven Pretraining for Dense Learning in Mueller Polarimetry MICCAI2026

Link: https://arxiv.org/abs/2605.23840
Authors: Adam Tlemsani,Yingdian Li,Maxime Giot,Naim Slim,Christopher J. Peters,Abhijeet Ghosh,Daniel S. Elson
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to 29th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2026)

Abstract:Mueller matrix imaging provides rich, physically meaningful contrast for biomedical tissue analysis, but supervised learning is hindered by scarce dense annotations and strong domain shifts across specimens and acquisition settings. We introduce MuellerPT, a physics guided pre-training approach that learns transferable dense representations by predicting Lu-Chipman decomposition maps from per-pixel 4x4 Mueller matrices. To scale pre-training, we collected a new large Multispectral Animal Polarimetric Organ dataset (MAP-Org). The pre-trained encoder is adapted with a segmentation head for grey vs. white matter segmentation in lamb brain. A classification head is used for colorectal cancer vs. non-cancer classification. Both segmentation and classification are evaluated across few-shot learning scenarios. In segmentation, MuellerPT improves label efficiency and cross specimen transfer compared to models without pre-training, achieving an absolute DICE gain of over 20% compared to the baseline trained from scratch when using 5% of the training data. In classification, MuellerPT also enhances label efficiency, improving overall accuracy by 8% compared to the baseline when using 1% of the training data. We demonstrate MuellerPT’s robustness to domain shift with a qualitative evaluation of its predicted Lu-Chipman maps on an ex vivo human oesophagus sample. These results suggest that predicting Lu-Chipman decomposition is an effective and practical pretext task for robust biomedical inference from Mueller polarimetry and can pave the way for future work on label efficient Mueller imaging.

[CV-13] Not Too Generative Not Too Discriminative: The Human Alignment Sweet Spot

Link: https://arxiv.org/abs/2605.23819
Authors: Jorge Chang Ortega,Bastien Le Lan,Thomas Serre,Victor Boutin
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:A central question in computational vision is whether human-like visual representations are better explained by discriminative or generative learning. Existing comparisons, however, often confound the learning objective with architecture, scale, and training data, leaving open whether the objective itself drives alignment. We address this confound using Joint Energy-Based Models (JEMs), which interpolate continuously between discriminative and generative training within a fixed architecture. By varying a single mixing coefficient, we isolate the effect of the learning objective and evaluate the resulting models across six human-alignment benchmarks spanning perceptual similarity, gloss perception, human response uncertainty, robustness, shape-texture cue conflict, and diagnostic feature attribution. Across this diverse suite, human alignment is consistently maximized at intermediate points of the generative-discriminative continuum, rather than at either endpoint. Hybrid JEMs combine the categorical structure induced by discriminative learning with the sensitivity to input structure induced by generative learning, yielding more human-like behavior across multiple levels of vision. These results suggest that the generative-discriminative dichotomy is the wrong axis for understanding human-aligned vision: alignment emerges not from choosing one objective over the other, but from balancing both.

[CV-14] Debiased Negative Mining Improves Out-of-distribution Detection with Pre-trained Vision-Language Models KDD2026

Link: https://arxiv.org/abs/2605.23797
Authors: Bo Peng,Jie Lu,Guangquan Zhang,Zhen Fang
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: KDD 2026

Abstract:Aiming at identifying unexpected inputs from unknown classes, out-of-distribution (OOD) detection has emerged as a pivotal approach to enhancing the reliability of machine learning models. This paper focuses on the burgeoning paradigm of post-hoc OOD detection with pre-trained vision-language models (VLMs), where a popular pipeline is to detect OOD inputs by examining their affinities between ID labels and negative labels, i.e., those semantically different from ID labels. Due to the unavailability of target OOD labels, existing works predominantly rely on heuristic rules to mine negative labels from unlabeled wild corpus data. Despite the empirical success, we argue that the power of VLM-based OOD detection has yet to be fully unleashed since the notorious false negative problem is far from addressed in the literature. With this motivation, we are interested in addressing the challenge of mining true negative labels for OOD scoring. To this end, we develop a theoretical framework for correcting the sampling bias of negative labels by indirectly approximating the distribution of negative labels. Perhaps surprisingly, we show that the debiased negative mining can be naturally converted into Monte-Carlo sampling based on ID labels and the unlabeled wild corpus data. Extensive experiments show that our method establishes a new state-of-the-art in a variety of OOD detection setups. Code is publicly available at this https URL.

[CV-15] Exploring deep learning for Event-Based Saliency Prediction with a Transformer-based model

Link: https://arxiv.org/abs/2605.23790
Authors: Romaric Mazna,Jean Martinet,Sai Deepesh Pokala
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Saliency prediction has been extensively studied in RGB images and videos as a computational model of human visual attention. In contrast, predicting saliency from event-based data remains largely unexplored, despite the biological inspiration and favorable sensing properties of event cameras. Two obstacles have held this direction back: the absence of large-scale event saliency datasets, and the lack of a strong baseline. In this paper, we introduce SEST (Swin Event-based Saliency Transformer), a transformer-based model for saliency prediction from event data, bridging the data scarcity barrier through event-native pretraining and synthetic supervision. SEST leverages a self-supervised pretrained event-based Swin Transformer backbone combined with a lightweight CNN decoder to produce dynamic saliency maps. To address the scarcity of annotated event-based saliency data, we introduce two new benchmark datasets, N-DHF1K and N-UCF Sports, generated from large-scale RGB saliency benchmarks. Experimental results show that SEST clearly outperforms existing event-based saliency methods and narrows the performance gap with state-of-the-art RGB models. Zero-shot evaluation on a real event camera dataset further demonstrates that our model trained on synthetic data remains transferable on real event streams. To the best of our knowledge, this work is the first to apply deep learning to event-based saliency prediction, opening a new research direction at the intersection of event-based vision and neuromorphic visual attention.

[CV-16] Machine learning applied to emerald gemstone grading: framework proposal and creation of a public dataset

Link: https://arxiv.org/abs/2605.23777
Authors: FB Pena,D Crabi,Sandro C Izidoro,Érick O Rodrigues,G Bernardes
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Gemstone grading is currently a manual procedure performed by gemologists. A popular approach uses reference stones: specialists visually inspect a stone and decide which of the available reference stones it most resembles. This procedure is highly subjective, as different specialists may reach different grading decisions. This work proposes a complete framework that spans image acquisition through final stone categorization. The proposal automates the entire process except for placing the stone in the purpose-built image acquisition chamber, and it removes the subjective decisions made by specialists. This is the first work to propose a machine learning approach coupled with image processing techniques for emerald grading. The proposed framework achieves 98% accuracy (correctly categorized stones), outperforming a deep learning approach. We also create and publish the dataset used, which contains 192 images of emerald stones along with their extracted and pre-processed features.

[CV-17] A Novel Approach for the Counting of Wood Logs Using cGANs and Image Processing Techniques

Link: https://arxiv.org/abs/2605.23775
Authors: João VC Mazzochin,Giovani Bernardes Vitor,Gustavo Tiecker,Elioenai MF Diniz,Gilson A Oliveira,Marcelo Trentin,Érick O Rodrigues
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This study tackles the challenge of precise wood log counting, where applications of the proposed methodology can span from automated approaches for materials management, surveillance, and safety science to wood traffic monitoring, wood volume estimation, and others. We introduce an approach leveraging Conditional Generative Adversarial Networks (cGANs) for eucalyptus log segmentation in images, incorporating specialized image processing techniques to handle noise and intersections, coupled with the Connected Components Algorithm for efficient counting. To support this research, we created and made publicly available a comprehensive database of 466 images containing approximately 13,048 eucalyptus logs, which served for both training and validation purposes. Our method demonstrated robust performance, achieving an average Accuracy_pixel of 96.4% and Accuracy_logs of 92.3%, with additional measures such as F1 scores ranging from 0.879 to 0.933 and IoU values between 0.784 and 0.875, further validating its effectiveness. The implementation proves to be efficient with an average processing time of 0.713s per image on an NVIDIA T4 GPU, making it suitable for realtime applications. The practical implications of this method are significant for operational forestry, enabling more accurate inventory management, reducing human errors in manual counting, and optimizing resource allocation. Furthermore, the segmentation capabilities of the model provide a foundation for advanced applications such as eucalyptus stack volume estimation, contributing to a more comprehensive and refined analysis of forestry operations. The methodology’s success in handling complex scenarios, including intersecting logs and varying environmental conditions, positions it as a valuable tool for practical applications across related industrial sectors.
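
The counting stage described above (connected components over a segmentation mask) can be sketched in pure Python as follows. This assumes the cGAN output has already been binarized into a 0/1 mask. The function name `count_blobs`, the 4-connectivity choice, and the `min_area` noise filter are assumptions made for this illustration, not details from the paper.

```python
from collections import deque


def count_blobs(mask, min_area=1):
    """Count 4-connected foreground regions in a binary mask (list of rows).

    Regions smaller than `min_area` pixels are ignored, a hypothetical
    stand-in for the paper's noise handling.
    """
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill over one region, tracking its area.
            seen[r][c] = True
            queue = deque([(r, c)])
            area = 0
            while queue:
                y, x = queue.popleft()
                area += 1
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if area >= min_area:
                count += 1
    return count


if __name__ == "__main__":
    toy_mask = [
        [1, 1, 0, 0, 0],
        [1, 1, 0, 1, 0],
        [0, 0, 0, 1, 0],
        [0, 1, 0, 0, 0],
    ]
    print(count_blobs(toy_mask))              # 3 regions
    print(count_blobs(toy_mask, min_area=2))  # 2 regions after dropping the single pixel
```

Touching logs would merge into one region under this scheme, which is presumably why the abstract mentions dedicated processing for intersections before counting.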

[CV-18] Revitalizing Dense Material Segmentation: Stabilized Vision Transformers and the Generalization Paradox

Link: https://arxiv.org/abs/2605.23747
Authors: Allan Kazakov,Duygu Cakir,Hilal Kurt İrfanoğlu,Yavuz İrfanoğlu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Material segmentation, the pixel-wise classification of physical surface properties, remains a challenging problem in computer vision, requiring physicochemical understanding distinct from object-centric parsing. Despite the introduction of the rigorous Apple Dense Material Segmentation (DMS) dataset, the benchmark has suffered from attrition and stagnation, increasingly overshadowed by geometry-biased foundation models. In this paper, we revive the Apple-DMS benchmark to establish a modern Vision Transformer baseline. We conduct an exhaustive evaluation of SegFormer and Mask2Former architectures, revealing that standard training paradigms fail on amorphous texture fields due to high-variance gradients. To address this, we introduce a stabilized training recipe featuring High-Fidelity Logit Projection, Query Entropy Regularization, and a domain-specific, physics-compliant augmentation pipeline. Our optimized SegFormer-B5 achieves a new State-of-the-Art (SOTA) of 0.4572 mIoU on the original dataset split, significantly surpassing the prior convolutional baseline. Furthermore, we identify a critical “Generalization Paradox”: while re-partitioning the dataset into a data-rich 80/10/10 split inflates the metric to 0.5276 mIoU, expert qualitative analysis reveals this induces distributional homogenization, severely degrading real-world, out-of-distribution performance. By releasing our recovered dataset index and robust training framework, we demonstrate that material perception is far from solved and urge the community to leverage the rigorous original split to drive genuine progress in physically grounded artificial intelligence.

[CV-19] Weierstrass Positional Encoding for Vision Transformers

Link: https://arxiv.org/abs/2605.23719
Authors: Zhihang Xin,Rui Wang,Xitong Hu,Xiaojun Wu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Vision Transformers have achieved remarkable success in computer vision, but their common use of learnable one-dimensional positional encodings weakens the inherent two-dimensional spatial structure of images after patch flattening. Existing positional encodings often lack geometric constraints and do not preserve a monotonic relationship between Euclidean spatial distances and sequential index distances, limiting ViTs’ ability to exploit spatial proximity priors. Motivated by the usefulness of periodicity in positional encoding, we propose Weierstrass elliptic Positional Encoding (WePE), a mathematically grounded method for encoding two-dimensional coordinates in the complex domain. WePE maps normalized 2D patch coordinates onto the complex plane and constructs compact four-dimensional positional features using the Weierstrass elliptic function and its derivative. The double periodicity provides a principled representation of 2D positions, and its intrinsic lattice structure naturally matches the regular geometry of image patch grids. Its nonlinear geometric properties help model spatial distance relationships more faithfully, while the algebraic addition formula enables relative positional information between arbitrary patch pairs to be derived directly from their absolute encodings. WePE is plug-and-play and resolution-agnostic, allowing seamless integration into existing ViTs. Extensive experiments show that WePE brings consistent performance gains in most settings. With precomputed lookup tables, these improvements introduce no noticeable computational or memory overhead. Additional analyses and ablation studies further validate the effectiveness of the proposed method.

[CV-20] CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models

Link: https://arxiv.org/abs/2605.23699
Authors: León Begiristain,Olaf Dünkel,Adam Kortylewski
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 27 pages, 12 figures

Abstract:Video prediction is increasingly viewed as a path toward generalizable world models, yet it remains unclear whether these systems learn underlying causal structure or merely exploit superficial visual correlations for future prediction. We introduce CRONOS, an intervention-based benchmark designed to evaluate counterfactual physical consistency: whether a model’s predictions of physical events respond appropriately to controlled changes in the visual input, such as variations of scene context, viewpoint, object appearance, and object category. Built in a photorealistic Unreal Engine environment, CRONOS enables controlled, high-fidelity generation of videos across diverse scenes and dynamics. In contrast to previous benchmarks, CRONOS systematically intervenes on four key factors (viewpoint, scene, object category, and object appearance) while keeping the underlying physical event type, such as a collision, occlusion, or fall, fixed. Our evaluation of recent open-source video generators reveals substantial failures in counterfactual physical consistency: prediction quality for the same physical event type is affected by appearance, environment, and, in particular, viewpoint changes. CRONOS provides a controlled and reproducible testbed for diagnosing how the quality of generated videos changes under different interventions, establishing a concrete target for developing models that perform consistently across changes of multiple conditions. The dataset and code are available at our project page.

[CV-21] RiGS: Rigid-aware 4D Gaussian Splatting from a Single Monocular Video

Link: https://arxiv.org/abs/2605.23672
Authors: Chenyu Wu,Wanhua Li,Zhu-Tian Chen,Hanspeter Pfister
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Reconstructing dynamic 3D scenes from monocular videos is a fundamental yet highly challenging task, as real-world motions often involve both long-term smooth transformations and short-term complex deformations. Existing methods either struggle to maintain temporal consistency or fail to capture high-frequency dynamics due to limited motion modeling capacity. In this work, we present Rigid-aware 4D Gaussian Splatting (RiGS), which simultaneously captures motions across multiple temporal scales. Specifically, RiGS introduces three types of Gaussian primitives: static, rigid, and transient, which represent static backgrounds, long-term low-frequency motions, and short-term high-frequency dynamics, respectively. An object-wise dynamic mask is proposed to aggregate long-range spatiotemporal motion information and guide the decomposition of static and dynamic regions. To jointly model motion across scales, rigid Gaussians are allowed to transition into transient Gaussians based on their temporal duration, and both are optimized under scene flow guidance, providing dense 3D motion supervision. Extensive experiments demonstrate that RiGS achieves state-of-the-art performance on novel view synthesis benchmarks. Code is available at this https URL.

[CV-22] Recursive Block-Diagonal Coupling for Resource-Efficient Training of Vision Models

Link: https://arxiv.org/abs/2605.23656
Authors: Maxim Henry,Adrien Deliège,Sébastien Piérard,Marc Van Droogenbroeck
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 22 pages, 3 figures, 4 tables, and 34 references

Abstract:Training high-capacity vision models from scratch requires substantial computational resources. To improve the training efficiency of a wide target model, existing growth methods often assume that narrower models are already available, obscuring the true computational cost of the entire pipeline. We propose an efficient training protocol, RBDC, that builds wide models by recursively coupling narrower, independently trained models in a parameter-free, block-diagonal fashion. This allows the available training budget to be allocated flexibly across all the models involved. Evaluated with vision transformers (DeiT) and convolutional networks (ResNet) on ImageNet, our RBDC training protocol is markedly more efficient than training from scratch with the standard protocol, yielding a 30% FLOPs reduction at similar test accuracies. It also achieves higher performance at the same training FLOPs than training protocols from the model growth literature. Finally, we show that our models can serve as better backbones than their original counterparts for downstream object detection and instance segmentation tasks.
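
As a rough illustration of what a parameter-free block-diagonal coupling of two narrower layers looks like, the sketch below places two weight matrices on the diagonal of a wider one. The helper names `block_diag` and `matvec` are hypothetical, and the sketch covers only a single linear layer; how RBDC treats attention, normalization, or the subsequent training of the coupled model is not specified in the abstract.

```python
def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def block_diag(a, b):
    """Place matrices a and b on the diagonal of a wider matrix, zeros elsewhere."""
    a_cols, b_cols = len(a[0]), len(b[0])
    top = [list(row) + [0.0] * b_cols for row in a]
    bottom = [[0.0] * a_cols + list(row) for row in b]
    return top + bottom


if __name__ == "__main__":
    narrow_a = [[1.0, 2.0], [3.0, 4.0]]  # layer from one narrow model
    narrow_b = [[5.0]]                   # layer from another narrow model
    wide = block_diag(narrow_a, narrow_b)

    # The wide layer reproduces both narrow layers on the concatenated input.
    print(matvec(wide, [1.0, 1.0, 2.0]))  # [3.0, 7.0, 10.0]

    # Applying the coupling again doubles the width once more (the recursive step).
    wider = block_diag(wide, wide)
    print(len(wider), len(wider[0]))  # 6 6
```

Because the off-diagonal blocks start at zero, the coupled layer initially computes exactly what the two narrow layers computed side by side, and the coupling itself introduces no new trained parameters.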

[CV-23] CVSearch: Empowering Multimodal LLMs with Cognitive Visual Search for High-Resolution Image Perception ICML2026

Link: https://arxiv.org/abs/2605.23655
Authors: Liupeng Li,Haoqian Kang,Zhenyu Lu,Jinpeng Wang,Bin Chen,Ke Chen,Yaowei Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: Accepted by ICML 2026. 22 pages, 12 figures, 7 tables

Abstract:High-resolution (HR) image perception presents a key bottleneck for multimodal large language models (MLLMs). While visual search offers a promising solution, existing methods struggle with the trade-off between coverage and efficiency. Visual expert-assisted search is efficient but prone to blind spots when proposals fail, whereas scan-based search guarantees coverage at the cost of computational redundancy and semantic fragmentation. To address this dilemma, we introduce CVSearch, a training-free adaptive framework that dynamically schedules search strategies via an Assess-then-Search workflow. Specifically, CVSearch first invokes expert-assisted search when global information is insufficient, and only triggers a novel semantic-aware scanning mechanism upon failure. Distinct from rigid grid partitioning, this efficient scanning paradigm incorporates Semantic Guided Adaptive Patching to decompose images into semantically consistent regions, effectively mitigating object fragmentation. Furthermore, we devise a Dynamic Bottom-Up Search strategy driven by a Visual Complexity prior to enable efficient and precise iterative exploration of local details. Extensive experiments on HR benchmarks demonstrate that CVSearch achieves state-of-the-art accuracy while substantially improving search efficiency. Code is released at this https URL.

[CV-24] ExpOS: Explainable Open-Surgery Skills Assessment Using 3D Hand Reconstruction

Link: https://arxiv.org/abs/2605.23653
Authors: Roi Papo,Idan Smoller,Shlomi Laufer
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 4 figures

Abstract:Timely and transparent feedback is essential for effective surgical training, yet current assessment remains dependent on expert observation, limiting scalability and opportunities for autonomous practice. We present ExpOS, an explainable framework for data-driven assessment of open-surgery skills designed to enable automatic, feedback-oriented evaluation. Rather than relying on expert-defined metrics, ExpOS learns discriminative temporal patterns directly from motion data and identifies the segments and behaviors most predictive of skill level. We trained and evaluated the method on 221 videos of medical students performing three open-surgery tasks. Hand poses and tool detections were extracted from each frame to derive kinematic descriptors and global motion statistics. Spatiotemporal hand-tool dynamics were modeled using a temporal convolutional backbone with attention-based pooling to generate frame-level importance maps. These representations were fused with global motion statistics to predict skill level and to provide interpretable feedback. ExpOS provides multi-level explainability by identifying when informative events occur through attention weights and which motion characteristics most influence predictions through global feature analysis. Across tasks, the framework achieved strong correlation with expert ratings, with best performance on fascial closure (r = 0.778, R2 = 0.74). These results demonstrate that combining weakly-supervised temporal importance learning with interpretable motion statistics enables scalable and actionable surgical skill assessment.
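
The attention-based pooling mentioned above, which yields frame-level importance weights, can be sketched generically as a softmax over per-frame scores followed by a weighted average. The function name `attention_pool` and the single scoring vector `w` are hypothetical simplifications; in ExpOS the scores would presumably come from the temporal convolutional backbone rather than a single linear projection.

```python
import math


def attention_pool(frames, w):
    """Pool T per-frame feature vectors into one vector.

    frames: list of T feature vectors (lists of floats).
    w: scoring vector, a hypothetical stand-in for learned attention parameters.
    Returns (pooled_vector, per_frame_weights).
    """
    scores = [sum(a * b for a, b in zip(frame, w)) for frame in frames]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # shift by max for numerical stability
    norm = sum(exps)
    weights = [e / norm for e in exps]

    dim = len(frames[0])
    pooled = [
        sum(weights[t] * frames[t][d] for t in range(len(frames)))
        for d in range(dim)
    ]
    return pooled, weights


if __name__ == "__main__":
    toy_frames = [[0.0, 0.0], [1.0, 0.0], [5.0, 1.0]]
    pooled, weights = attention_pool(toy_frames, [1.0, 0.0])
    # The third frame dominates, so it would be flagged as the most informative.
    print(weights)
    print(pooled)
```

The per-frame weights double as an importance map over time, which is the kind of signal the abstract uses to indicate when informative events occur.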

[CV-25] DualMem: Bypassing the Objectness Bottleneck for Calibrated Unknown-Stream Filtering in Open-World Object Detection

Link: https://arxiv.org/abs/2605.23634
Authors: Yingjun Xiao(a),Xi Chen(b),Gang Fang(c),Siyuan Chen(b) ((a) School of Artificial Intelligence, Guangzhou University, Guangzhou, China, (b) School of Computer Science and Cyber Engineering, Guangzhou University, Guangzhou, China, (c) Institute of Computing Science and Technology, Guangzhou University, Guangzhou, China)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Open-world object detection (OWOD) requires detectors to localize known classes while identifying unknown objects for future incremental learning. We find that the unknown prediction streams of strong OWOD detectors are heavily polluted: on M-OWODB, across PROB, OW-DETR, and HypOW, future-task positive unknowns make up less than 10% of unknown predictions, whereas background false positives account for 46-71%. We show that this is not a missing-information problem, but an information bottleneck at the objectness head. On PROB Task 1, a linear probe on the 256-D decoder query achieves an AUROC of 0.908 for positive-versus-negative unknown discrimination, but the final one-dimensional objectness scalar drops to 0.642. A frozen SigLIP feature, without access to the detector, independently recovers much of this proposal-level separability at the filtering stage (AUROC = 0.871). Motivated by this finding, we propose DualMem, a calibrated post-hoc filter that assumes a small image-disjoint annotated calibration split of held-out future-task objects and performs a non-parametric likelihood ratio test in frozen SigLIP feature space. DualMem uses a k-nearest-neighbor positive memory to protect future-task objects and a negative memory to suppress background-like proposals. Its decision threshold is chosen by Neyman-Pearson calibration, giving users an explicit trade-off between false-unknown suppression and novel recall. Across PROB, OW-DETR, and HypOW on M-OWODB Task 1, DualMem reduces background-type false unknown proposals per image by 44.9%-66.3%, with a mean reduction of 56.6%. On PROB Task 1, it more than doubles the reduction achieved by a natural K-means prototype baseline, while leaving known-class mAP unchanged because known detections bypass the filter. 
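
A dual-memory filter of this general shape can be sketched in pure Python as below. This is only an illustration under loud assumptions: the score `log(r_neg / r_pos)` is a generic k-nearest-neighbor density-ratio proxy, the calibration constrains the miss rate on held-out positives, and all function names are hypothetical. The abstract does not give DualMem's exact statistic or which error rate its Neyman-Pearson calibration fixes.

```python
import math


def knn_dist(x, memory, k):
    """Distance from x to its k-th nearest neighbour in a memory of vectors."""
    dists = sorted(math.dist(x, m) for m in memory)
    return dists[min(k, len(dists)) - 1]


def llr_score(x, pos_mem, neg_mem, k=1, eps=1e-12):
    """Likelihood-ratio proxy: high when x is closer to positives than negatives.

    kNN density estimates shrink as the k-th neighbour distance grows, so the
    log ratio of distances is a monotone stand-in for the log-likelihood ratio
    (memory sizes and feature dimension are ignored in this sketch).
    """
    r_pos = knn_dist(x, pos_mem, k) + eps
    r_neg = knn_dist(x, neg_mem, k) + eps
    return math.log(r_neg / r_pos)


def calibrate_threshold(pos_scores, max_miss_rate):
    """Pick a threshold that rejects at most max_miss_rate of held-out positives.

    Assumption: the calibration positives are disjoint from the positive memory.
    """
    ordered = sorted(pos_scores)
    n_reject = min(int(math.floor(max_miss_rate * len(ordered))), len(ordered) - 1)
    return ordered[n_reject]


def keep_proposal(x, pos_mem, neg_mem, threshold, k=1):
    """Keep an unknown proposal only if its score clears the calibrated threshold."""
    return llr_score(x, pos_mem, neg_mem, k) >= threshold


if __name__ == "__main__":
    pos_mem = [[0.0, 0.0], [0.0, 1.0]]  # toy "future-task object" features
    neg_mem = [[5.0, 5.0], [6.0, 5.0]]  # toy "background" features
    held_out = [[0.1, 0.1], [0.2, 0.9], [1.0, 1.0], [0.0, 0.5]]
    scores = [llr_score(x, pos_mem, neg_mem) for x in held_out]
    thr = calibrate_threshold(scores, max_miss_rate=0.25)
    print(keep_proposal([5.2, 5.1], pos_mem, neg_mem, thr))  # background-like proposal
    print(keep_proposal([0.1, 0.2], pos_mem, neg_mem, thr))  # object-like proposal
```

Raising `max_miss_rate` tightens the threshold and suppresses more background-like proposals at the cost of novel recall, which mirrors the explicit trade-off the abstract describes.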

[CV-26] DDX-TRACE: A Benchmark for Medical Diagnostic Trajectories in VLMs

Link: https://arxiv.org/abs/2605.23629
Authors: Jiazhen Pan,Weixiang Shen,Jun Li,Julian Canisius,Felix Bitzer,Paula Roßmüller,Jiancheng Yang,Virginie Kreutzinger,Daniel Rueckert,Benedikt Wiestler
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 41 pages

Abstract:Medical diagnosis is not a single prediction from a fully specified vignette. It is a sequential workup: clinicians decide what evidence to obtain, revise a differential diagnosis, and stop when the diagnosis is sufficiently supported. Most medical AI benchmarks instead reveal the relevant context upfront and score only the final answer, making unsupported correct guesses, premature closure, inefficient workups, and poor uncertainty updating invisible. We introduce DDX-TRACE, a physician-adjudicated benchmark for multimodal neuroradiology that evaluates diagnostic trajectories under hidden evidence over 211 challenging cases. Each case begins with limited clinical history; models request imaging studies in free form, receive matched image bundles when available, update a probabilistic differential diagnosis after each turn, and stop with a localized final diagnosis. Evaluating state-of-the-art VLMs, we find that final diagnosis scores can substantially misrepresent workup quality: models may guess plausible diagnoses without essential evidence, request useful studies but misinterpret raw images, or acquire evidence inefficiently while updating uncertainty poorly. Controlled evidence variants isolate bottlenecks in planning, visual evidence extraction, and downstream differential reasoning. DDX-TRACE shifts medical AI evaluation from final answers to evidence-supported diagnostic trajectories.

[CV-27] EM-Vid: Training-Free Entity-Centric Memory for Efficient and Consistent Multi-Shot Video Generation

Link: https://arxiv.org/abs/2605.23610
Authors: Jente Vandersanden,Matheus Gadelha,Chun-Hao P. Huang,Hyeonho Jeong,Yulia Gryaditskaya
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Multi-shot video generation requires maintaining a consistent appearance of recurring entities across shots while remaining faithful to shot-specific text prompts. Recent autoregressive methods reuse previously generated frames as memory. However, full-frame storage entangles persistent entity information with transient scene context, leading to irrelevant information leakage and high computational cost. We propose an entity-centric memory in the form of an entity-indexed bank of latent patches. We introduce sparse token conditioning compatible with pretrained models, restricting self-attention to entity-relevant tokens and reducing computational cost. To support this, we introduce a structured multi-shot script format. We additionally propose a budgeted memory update strategy to maintain a compact, evolving memory. Finally, we equip the entity representation with a noise-injection mechanism that enables fine-grained appearance control, preventing leakage of irrelevant information. Our method improves prompt adherence and efficiency while preserving subject consistency.

[CV-28] GlowGS: Generative Semantic Feature Learning for 3D Gaussian Splatting in Nighttime Glow Scenes CVPR

Link: https://arxiv.org/abs/2605.23602
Authors: Beibei Lin, Xiao Cao, Jingyuan Guo, Robby T. Tan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR Findings 2026

Abstract:Existing 3DGS methods effectively render high-quality novel views in clear-day scenes. However, they struggle with night scenes, particularly in glow regions, due to the lack of structural features such as textures and edges, which are key cues for splatting-based reconstruction. To address this problem, we leverage a diffusion model and a Vision Foundation Model (VFM) to compensate for missing structural cues. Our method consists of two key novel ideas: semantic feature generation and novel-view semantic learning. First, semantic feature generation produces high-quality semantic features as implicit structural cues for novel views. Specifically, a diffusion model synthesizes novel views with unknown camera poses from training views, while a VFM evaluates their quality. Once high-quality novel views are identified, the VFM extracts robust features to construct the semantic feature bank. Second, novel-view semantic learning enables 3DGS to optimize rendered novel views without requiring ground truth. It achieves this by extracting semantic features from a rendered novel view, searching the feature bank for the most similar features, and minimizing their distance. This process enforces implicit structural constraints, ensuring semantically coherent, artifact-free rendered views. Extensive experiments demonstrate the effectiveness of our GlowGS in generating semantically accurate 3D views, showing significant improvements over existing methods.

[CV-29] Cost-Effective Model Evaluation with Meta-Learning

Link: https://arxiv.org/abs/2605.23595
Authors: Trinh Pham, Viet Huynh, Hongzhi Yin, Quoc Viet Hung Nguyen, Thanh Tam Nguyen
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Performance (cs.PF)
Comments:

Abstract:The rapid growth of machine learning has produced an ever-expanding ecosystem of models, making it increasingly challenging to verify the reliability of newly released models on unseen, unlabeled data. Conventional evaluation pipelines depend on expensive annotation, repeated fine-tuning, or narrow assumptions that fail to transfer across model families. We present MetaEvaluator, a cost-effective, model-agnostic framework for rapid, label-free assessment of unseen models spanning diverse architectures and modalities. MetaEvaluator leverages meta-learning over a pool of reference models to obtain a transferable initialization, enabling accurate evaluation of new models while amortizing cost across the pool and removing the need for per-model retraining. To the best of our knowledge, this is the first model-agnostic framework capable of evaluating new models on entirely unlabeled datasets. Extensive experiments show that MetaEvaluator produces stable and accurate performance estimates at substantially reduced cost compared to conventional approaches, making scalable benchmarking of emerging models on unlabeled data practical.

[CV-30] Calibration-Informative Region Selection for Online LiDAR–Camera Calibration in Agricultural Environments ICRA2026

Link: https://arxiv.org/abs/2605.23580
Authors: Rajitha de Silva, Grzegorz Cielniak
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICRA 2026 Workshop on Agricultural Robotics

Abstract:Reliable multi-modal calibration requires identifying which observations truly constrain the extrinsic parameters and which ones mainly add noise or ambiguity. In this paper, we propose a support-map-driven approach to multi-modal calibration that decouples four functional blocks: initial calibration, cross-modal residual extraction, support-map estimation, and support-aware refinement. We instantiate this formulation for online LiDAR–camera calibration using MDPCalib, a target-less LiDAR–camera calibration method based on motion and deep point correspondences, and CMRNext, a dense LiDAR–camera matching model that predicts optical-flow-like image-plane residuals. The key contribution is a dense calibration support map that aggregates cross-modal agreement over aligned observations and highlights where calibration evidence is consistently reliable. Across the Bacchus Long-Term (BLT) dataset and KITTI, we show that calibration evidence is spatially and semantically non-uniform, indicating that some semantic regions provide stronger cues for calibration than others. On KITTI, support-guided refinement improves the calibration performance with better translation accuracy while rotational gains remain limited.

[CV-31] PathNavigate: A Training-Free Pathology Agent with Surprise-Guided Scan and Shared Slide Memory for Whole-Slide Image VQA

Link: https://arxiv.org/abs/2605.23559
Authors: Chunze Yang, Qidong Liu, Wenjie Zhao, Yue Tang, Jiusong Ge, Di Zhang, Jiashuai Liu, Lei Wu, Junbo Lu, Ni Zhang, Xian Wu, Zeyu Gao, Chen Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Whole-slide image visual question answering (WSI-VQA) frames pathology as an extreme-context search problem: to answer a free-form clinical query, a system must first navigate a gigapixel slide under a strict inspection budget to locate sparse, high-resolution evidence. Existing approaches largely fall into two paradigms: i) supervised pathology multimodal large language models (MLLMs) and agents can absorb localization and reasoning into learned modules, but they often couple navigation to task-specific supervision and retraining, limiting their practicality; ii) training-free pathology agents avoid this cost by keeping core models frozen, but often follow a question-first design, constructing the initial candidate set mainly from query-conditioned relevance. This can miss decisive morphology that is not named in the question, and force heavier inference-time scaffolding. To address this challenge, we introduce PathNavigate, a training-free pathology agent built around a scan-search-readout routine. Before question matching, PathNavigate scans the current slide at low magnification with a shared online memory module over frozen pathology features, producing a slide-specific surprise field that marks an abnormal-region pool. It then applies question-conditioned PLIP relevance only within this pool to select high-magnification search targets. Finally, it extracts local high-magnification evidence and answers with a frozen perceptor-adjudicator stack, using the same online memory as slide-level context. Experiments on WSI-VQA and SlideBench-BCNB show that the proposed scan-search-readout design improves answer accuracy and yields more interpretable evidence-selection trajectories with higher this http URL code is available online.

[CV-32] Generator-Refiner-Examiner: A Tri-Module Data Augmentation Framework for 3D Human Avatar Learning from Monocular Videos

Link: https://arxiv.org/abs/2605.23555
Authors: Gangjian Zhang, Jian Shu, Sicheng Yu, Wenhao Shen, Yu Feng, Hao Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This paper addresses the challenge of reconstructing photorealistic and animatable 3D human avatars from monocular videos. While existing methods rely on combining per-subject optimization with generic human priors, they often fail to capture fine-grained details when training frames are limited. To mitigate this data scarcity, we propose TrioMan, a systematic tri-module framework for augmented 3D avatar learning. Our approach comprises three synergistic components. The Generator creates diverse unseen samples by imposing Gaussian perturbations on pose and camera. The Refiner improves the quality of generated data through one-step diffusion guided by texture and geometry cues. The Examiner selects subject-consistent samples using a dual-branch attention-based similarity evaluation. Experiments on the X-Humans and NeuMan benchmarks show that TrioMan outperforms state-of-the-art methods.

[CV-33] PixIE: Prompted Pixel-Space Low-Light Image Enhancement

Link: https://arxiv.org/abs/2605.23531
Authors: Ruirui Lin, Guoxi Huang, David Bull, Nantheera Anantrasirichai
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Low-light images exhibit severe noise, contrast loss, and semantic ambiguity, making enhancement a joint problem of denoising and detail recovery. We propose PixIE, a feed-forward pixel-space LLIE framework semantically-prompted by a vision foundation model. PixIE first performs a cross-scale denoising to suppress noise and preserve structure, then refines details with DINO-Prompted Pixel Blocks (DPPB) that inject intermediate DINOv3 features via patch-conditioned, spatially continuous per-pixel modulation. We introduce a Spatial-Channel Compaction (SCC), which folds features into a compact spatial grid and compresses in the channel dimension, so pixel-attention is computed efficiently with bounded cost across scales. We further propose Multi-Receptive-Field Pixel Embedding (MRPE) to provide neighborhood-aware pixel representations before semantic prompting, improving robustness to signal-dependent noise beyond point-wise embeddings. Experiments on LLIE benchmarks show that PixIE improves the average PSNR by 1.9-15.0% over recent state-of-the-art methods and reduces LPIPS by 8.5-44.4%. Qualitative comparisons further demonstrate that PixIE recovers sharper details and more stable textures, resulting in improved reconstruction fidelity and perceptual quality.

[CV-34] ComPose: When to Trust Hands for Object Pose Tracking

Link: https://arxiv.org/abs/2605.23523
Authors: Jisu Shin, Junoh Lee, JunGyu Lee, Inhwan Bae, Dohyeon Lee, Hokyun Im, Youngwoon Lee, Hae-Gon Jeon
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 22 pages, 10 figures

Abstract:Reconstructing the motion of objects from videos is a key component for embodied AI and robot manipulation. While diverse approaches to object pose tracking have been studied, they rely heavily on strong external priors, such as depth data or 3D templates, and remain highly vulnerable to severe occlusions by hand grasps despite the use of explicit masks. In this work, we present ComPose, a 6DoF object tracking framework designed for hand-aware object pose estimation from RGB video. Rather than treating the hand purely as an occluder, our method harmonizes hand motions as a complementary cue for object tracking. In detail, we recover a variety of object motions over time by combining object and hand cues from foundation models within a unified tracking pipeline. Here, ComPose adaptively selects informative hand joints, combines object- and hand-derived cues for motion estimation, and refines the resulting object motion using visible geometric evidence and a learned correction. We further enforce temporal consistency over both rotation and translation, yielding stable 3D object trajectories over time without any external smoothing. Extensive experiments show that our method is accurate, efficient, and robust under severe hand occlusion and geometric ambiguity. In addition, the resulting trajectories can also effectively transfer to downstream robot manipulation by enabling robots to reconstruct human actions from online videos.

[CV-35] Precise: SDE-Consistent Stochastic Sampling for RL Post-Training of Flow-Matching Models

Link: https://arxiv.org/abs/2605.23522
Authors: Jade Zou, Tao Huang, Weijie Kong, Junzhe Li, Yue Wu, Qi Tian, Jiangfeng Xiong, Jianwei Zhang, Liefeng Bo, Zhao Zhong
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Reinforcement learning (RL) has become an effective way to improve prompt alignment and perceptual quality in diffusion and flow-matching generators. A critical step for applying online RL to flow matching is turning the deterministic sampling trajectory into a stochastic policy, typically by replacing the reverse-time Ordinary Differential Equation (ODE) with a Stochastic Differential Equation (SDE). The stochastic sampler, controlling the exploration behavior and denoising dynamics, is thus part of the policy, and its design can significantly affect the reward optimization performance. We break down the sampler design into two interdependent components: choosing the right amount of stochastic exploration, and discretizing the resulting SDE faithfully at the small step counts used in RL. To address the first component, we analyze the inherent tension between exploration and stability in denoising and derive an SDE schedule that balances the two. Turning to the discretization challenge, we use a toy example to show that existing samplers can deviate from the flow-matching process, either by introducing excessive discretization noise or by relying on heuristic rules that do not guarantee convergence to the data distribution. To address these issues, we propose Precise, a new stochastic sampler that balances effective exploration with stability. Crucially, Precise keeps the denoising trajectory SDE-consistent through a novel approximation that freezes the clean-latent posterior mean, resolving the excess noise issue in standard samplers. Extensive experiments demonstrate that this formulation leads to significantly faster and more stable reward optimization via reinforcement learning, achieving state-of-the-art alignment scores (e.g., PickScore, HPSv2.1) while requiring 13.1-53.2% less wall-clock training time to match the best in-domain performance of prior samplers.

[CV-36] VINS-120K: Ultra High-Resolution Image Editing with A Large-Scale Dataset

Link: https://arxiv.org/abs/2605.23518
Authors: Zhizhou Chen, Shanyan Guan, Zhanxin Gao, En Ci, Yanhao Ge, Wei Li, Zhenyu Zhang, Jian Yang, Ying Tai
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Directly editing ultra-high-resolution (UHR) images is valuable but underexplored, primarily due to the lack of high-quality data and the challenge in modeling high-frequency texture details. We introduce VINS-120K, the first large-scale dataset for instruction-based UHR image editing, comprising 120K carefully curated triplets of instruction, input image, and edited image. Each image exceeds 4K resolution (≥ 4096 × 4096) and is filtered through a rigorous multi-stage pipeline to ensure visual quality, instruction alignment, and aesthetic fidelity. Built on VINS-120K, we further develop a high-frequency-aware post-adaptation strategy to extend pretrained non-high-resolution models to the UHR regime. We also present VINS-4KEval, a benchmark covering diverse editing types, to facilitate consistent evaluation in UHR settings. Experiments confirm that our work improves fine-grained detail synthesis and texture realism in UHR image editing.

[CV-37] DrawVideo: Generating Long Video from Storyboard Keyframe Sketches

Link: https://arxiv.org/abs/2605.23508
Authors: Chuanzhi Xu, Huiqi Liang, Bang Shi, Huiming Zhang, Yifan Xiao, Guangcheng Lin, Haodong Chen, Qiang Qu, Zhicheng Lu, Weidong Cai
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Image and Video Processing (eess.IV)
Comments: 45 pages, 19 figures

Abstract:Long video generation requires high-fidelity synthesis, coherent narrative structure, and user control over extended time spans. Existing text-to-video methods often rely on a single long prompt, limiting control over pose, composition, layout, and motion. We propose DrawVideo, a sketch-guided, storyboard-driven framework for controllable long-video generation. DrawVideo decomposes long videos into independently controllable shots, each defined by a black-and-white sketch, an appearance prompt, and a motion prompt. The sketch controls pose and layout, the appearance prompt defines identity, scene, and style, and the motion prompt guides temporal dynamics. DrawVideo follows a hierarchical ‘global multi-shot, local single-sketch’ strategy: it first generates a structure-aligned reference keyframe, then expands the motion prompt into derivative keyframes representing action states, and finally synthesizes clips between adjacent keyframes to build each shot. We also introduce SketchLongVideo, the first dataset for sketch-guided text-to-long-video generation, constructed from animation videos via shot detection, keyframe extraction, vision-language recognition, prompt decomposition, and sketch conversion. Experiments show that DrawVideo achieves strong structural controllability, appearance consistency, visual stability, and coherent long-video generation.

[CV-38] MDS-DETR: DETR with Masked Duplicate Suppressor

Link: https://arxiv.org/abs/2605.23507
Authors: Chanho Lee, Seunghee Koh, Yunho Jeon, Junmo Kim
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: code is available at this https URL

Abstract:The DEtection TRansformer (DETR) is a powerful end-to-end object detector, yet its one-to-one matching strategy suffers from slow convergence and low recall. A common approach to address this issue is to use one-to-many label assignment to provide more positive samples. However, existing methods that use one-to-many matching as an auxiliary objective lead to increased training costs, with their auxiliary decoders discarded during inference. To address this limitation, we propose MDS-DETR, which leverages both one-to-one and one-to-many supervision within a single decoder. Specifically, we introduce a Masked Duplicate Suppressor (MDS) that injects asymmetry into self-attention via confidence-based causal masking. MDS filters out the duplicates generated by the one-to-many supervised layer, enabling explainable, duplicate-free predictions in a fully end-to-end framework. MDS-DETR outperforms existing one-to-many DETR variants such as MS-DETR, this http URL and Relation-DETR, without relying on any additional queries or auxiliary decoders. Under a 12-epoch training schedule on MS COCO with a ResNet-50 backbone, MDS-DETR achieves a +2.8 mAP improvement over Deformable-DETR with only a 5% increase in training time, and outperforms the state-of-the-art this http URL by +0.3 mAP while being even 20% faster in training. Our code and models are available at this https URL.
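The idea of a confidence-based causal mask can be illustrated with a small sketch. This is not the authors' implementation: the function name `confidence_causal_mask` and the masking direction (each query may attend only to itself and to queries with higher confidence) are assumptions made for illustration.

```python
def confidence_causal_mask(scores):
    """Build an asymmetric attention mask from per-query confidence scores.

    mask[i][j] is True when query i is allowed to attend to query j.
    Assumption (hypothetical, not taken from the paper): a query sees only
    itself and queries that are more confident than it is.
    """
    n = len(scores)
    # Rank 0 is the most confident query; ties are broken by index
    # so that the ordering is strict.
    order = sorted(range(n), key=lambda i: (-scores[i], i))
    rank = [0] * n
    for r, i in enumerate(order):
        rank[i] = r
    return [[rank[j] <= rank[i] for j in range(n)] for i in range(n)]


mask = confidence_causal_mask([0.9, 0.2, 0.5])
for row in mask:
    print(row)
```

Under this assumed direction the most confident query attends only to itself, while a lower-confidence duplicate can see the stronger prediction, which gives it the information it would need to suppress itself.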

[CV-39] B-GRTO: Bootstrapped Group Relative Tool Optimization for Referring Segmentation

Link: https://arxiv.org/abs/2605.23500
Authors: Mario Markov, Stefan Maria Ailuro, Mohammad Mahdi, Luc Van Gool, Danda Pani Paudel (INSAIT, Sofia University “St. Kliment Ohridski”)
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Segmentation is a fundamental task in computer vision, underpinning pixel-level scene understanding and serving as a cornerstone for applications ranging from autonomous perception to medical image analysis. For complex referring segmentation, recent methods pair large vision-language models with segmentation decoders: the former analyzes the image and prompt, while the latter predicts the target mask. Although reinforcement learning improves reasoning-intensive vision-language systems, trainable tools such as segmentation decoders are typically optimized separately with differentiable objectives, and the principled integration of such objectives into reinforcement learning remains underexplored. Thus, we introduce group relative tool optimization (GRTO), a mathematically grounded framework for jointly optimizing a policy with differentiable tool use. GRTO reuses group relative policy optimization (GRPO) rollouts to optimize the auxiliary tool objective, letting decoder gradients complement policy rewards. Further, we derive Bootstrapped-GRTO (B-GRTO), a pre-training method that cheaply bootstraps the tool, leading to faster convergence and superior performance. Across three challenging referring segmentation settings, B-GRTO results in substantial improvements over plain GRPO, matching or surpassing domain-specific state-of-the-art methods. This demonstrates the value of unifying reinforcement learning with differentiable auxiliary objectives for reasoning-intensive segmentation.
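For readers unfamiliar with GRPO, the group-relative part can be sketched as follows. This shows only the standard GRPO advantage normalization over one group of rollouts, not the GRTO tool objective from the paper. The function name, the `eps` constant, and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same prompt.

    Each advantage is the reward minus the group mean, divided by the
    group standard deviation. eps (an assumed constant) avoids division
    by zero when every rollout receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one prompt, rewarded by, e.g., mask quality.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Because advantages are computed relative to the group, no learned value function is needed; rollouts that beat their siblings get positive weight and the rest get negative weight.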

[CV-40] Multimodal Distribution Matching for Vision-Language Dataset Distillation MDM CVPR2026

Link: https://arxiv.org/abs/2605.23482
Authors: Jongoh Jeong, Hoyong Kwon, Minseok Kim, Kuk-Jin Yoon
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted for publication at CVPR 2026. Project Page: this https URL

Abstract:Dataset distillation compresses large training sets into compact synthetic datasets while preserving downstream performance. As modern systems increasingly operate on paired vision-language inputs, multimodal distillation must preserve representation quality and cross-modal alignment under tight compute and memory budgets, yet prior methods often require heavy computation and overlook cross-modal correlations. To address this, we present Multimodal Distribution Matching (MDM), a geometry-aware framework for efficient and generalizable multimodal distillation. Specifically, MDM integrates complementary components at the data, model, and loss levels. At the data level, it initializes synthetic image-text pairs by sampling from clusters in the joint embedding space. At the model level, it forms a mixed teacher by interpolating independently fine-tuned models in weight space according to their angular deviation from the pretrained anchor. At the loss level, it matches joint distributions on the unit hypersphere using a geometry-aware matching objective that exploits the joint features in the cross-modal agreement and discrepancy directions along with symmetric contrastive learning. Across image-text retrieval benchmarks with cross-architecture evaluation, MDM yields compact synthetic sets that preserve multimodal semantics, substantially reduce distillation cost, and remain robust across architectures.

[CV-41] PhenoYieldNet: Learning Crop-Aware Phenological Responses for Multi-Crop Yield Prediction CVPR2026

Link: https://arxiv.org/abs/2605.23478
Authors: Yu Luo, Xiaogang Zhu, Shan Zeng, Wei Xiang, Thomas Francis Bishop, Zhiyong Wang, Kun Hu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by CVPR 2026

Abstract:Accurate crop yield prediction is crucial for sustainable agriculture and global food security. While existing methods are predominantly developed for single-crop prediction, they often struggle to generalize across diverse crop types, without addressing the unique crop phenological responses that are dynamically modulated by complex weather patterns. In this paper, we propose PhenoYieldNet, a multi-crop yield prediction framework that learns crop-specific phenology by explicitly modeling their responses with temporal drivers. Specifically, we develop a crop-aware temporal decoder consisting of a Crop Phenology Bank (CPB) and a Crop Phenology Attention (CPA) module. The CPB integrates a set of learnable embeddings, which leverage a query to guide the CPA module to learn the most relevant phenology patterns for the specific crop. The CPA module explicitly captures multi-scale trend and variation components to construct temporal contexts, enabling the model to dynamically adjust the attention across different phenological stages. To learn robust and generalizable features for multi-crop prediction, the encoder is initialized with a pre-trained foundation model, and further adapted via a self-supervised Temporal Contrastive Adaptation strategy to align with agricultural temporal dynamics. Extensive experiments conducted on multi-crop datasets indicate that our proposed method significantly outperforms state-of-the-art methods, exhibiting strong generalization capabilities across different regions and crops.

[CV-42] Rethinking Transfer Learning for Industrial Inspection: DINOv3 vs. ImageNet Pretraining Across RGB and X-ray Tasks CVPR2026

Link: https://arxiv.org/abs/2605.23472
Authors: Mehdi Gharbage, Céline Teulière, Pierre Bouges, Thierry Chateau
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to the CVPR 2026 Workshop on Vision Foundation Models for Industrial Inspection (VISION’26)

Abstract:Vision foundation models pretrained on web-scale data have recently shown strong transfer capabilities on many downstream tasks, but their effectiveness for industrial visual inspection remains unclear. Industrial data differ substantially from web-data and often require fine-grained dense prediction, raising the question of whether modern self-supervised pretraining can improve over the conventional transfer-learning paradigm based on supervised ImageNet initialization. In this work, we compare ConvNeXt backbones pretrained with supervised ImageNet classification or DINOv3 distillation, and relate them to the conventional ResNet-50 baseline. We evaluate semantic segmentation, instance segmentation, and object detection across four downstream datasets spanning RGB surface-defect inspection and X-ray defect detection. We further study both frozen and fully finetuned adaptation regimes. Our results show that DINOv3 offers no clear advantage in frozen transfer, but provides a stronger initialization after full finetuning on RGB tasks, yielding faster convergence and better final performance. Under X-ray modality shift, however, supervised ImageNet pretraining remains more effective in both frozen and finetuned settings. Overall, our findings suggest that modern vision foundation models are promising for supervised RGB industrial inspection, but their transferability is strongly conditioned by downstream adaptation and target modality.

[CV-43] One-Forcing: Towards Stable One-Step Autoregressive Video Generation

Link: https://arxiv.org/abs/2605.23458
Authors: Jiaqi Feng, Justin Cui, Yuanhao Ban, Cho-Jui Hsieh
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Work in Progress. Project Page: this https URL, Code: this https URL

Abstract:Recent advances have substantially improved real-time interactive video generation in the autoregressive regime. However, most existing few-step autoregressive video generation methods, often distilled from a corresponding many-step teacher, default to a 4-step sampling configuration, which still incurs considerable latency during deployment and suffers from severe quality degradation when the number of sampling steps is further reduced, particularly in the one-step setting. Trajectory-style consistency distillation methods often produce videos with weak dynamics, while DMD-based approaches, such as Self-Forcing, tend to yield blurry frames. To address this challenge, we propose One-Forcing, a simple yet effective approach which augments the DMD objective with an auxiliary GAN loss for high-quality and efficient one-step video generation. Experiments on VBench show that One-Forcing achieves a total score of 83.76, establishing state-of-the-art performance among one-step causal video generation methods and remaining competitive with strong many-step approaches. We further demonstrate that one-step framewise autoregressive generation can be achieved stably with merely one-third of the training cost of the chunkwise model, a setting that prior methods have failed to achieve successfully.

[CV-44] Efficient One-Step Diffusion Restoration Model with Compact Token Compression and Linear Attention

Link: https://arxiv.org/abs/2605.23451
Authors: Bingtian Qiao, Yue Shi, Yingjie Zhou, Yong Guo, Guangtao Zhai, Jiezhang Cao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Real-world image super-resolution aims to recover high-quality images from complex and unknown real-world degradations. However, existing generative Real-ISR methods largely inherit the dense latent representations and quadratic-cost global modeling paradigm developed for high-resolution image synthesis, causing computation, memory usage, and inference latency to scale unfavorably with resolution and thus limiting practical deployment. We argue that the key bottleneck lies not in insufficient restoration priors, but in excessive token redundancy and costly token interactions during high-resolution restoration. Motivated by this observation, we revisit Real-ISR from the perspectives of compact latent representation and linear-complexity modeling, and propose SANA-SR, an efficient one-step restoration framework. Specifically, SANA-SR employs a deep compression autoencoder with a 32x compression ratio to drastically reduce latent tokens while preserving restoration-relevant structures and textures. On top of this compact latent space, we introduce a linear-attention DiT with LoRA fine-tuning, enabling efficient high-resolution restoration with linear-complexity token mixing. Extensive experiments on all benchmark datasets demonstrate that SANA-SR achieves highly competitive and often superior quantitative performance against existing methods, while restoring clearer and more realistic textures. Moreover, after pruning, the deployed model runs in 0.019s with 407.95G MACs and 344M parameters, highlighting its strong potential for practical mobile deployment.

[CV-45] Commutator-Induced Uncertainty in VAEs

Link: https://arxiv.org/abs/2605.23449
Authors: Tahereh Dehdarirad, Michael Felsberg, Gabriel Eilertsen, Ziliang Xiong
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Algebraic Geometry (math.AG)
Comments:

Abstract:Variational autoencoders (VAEs) often struggle to represent non-commutative structure in learned latent spaces. Symmetry-aware VAEs commonly address this issue by enforcing commutativity through algebraic regularization, which is appropriate for commutative transformation groups but can suppress meaningful non-commutative structure when it is intrinsic to the data. We argue that non-commutativity should instead be explicitly diagnosed and reflected in reconstruction behavior. We introduce a Lie Group VAE framework that combines geometric and algebraic perspectives on uncertainty while separating discrete generative factors from continuous geometric transformations. In a first phase, the model is trained without structural constraints while algebraic non-commutativity is measured through finite Baker-Campbell-Hausdorff deviations and decoder order sensitivity is measured through reconstruction order-swap tests. These diagnostics reveal a scale mismatch between latent non-commutativity and reconstruction behavior under unconstrained training. In a second phase, we introduce a deformation-stability constraint with a data-driven calibration constant that aligns decoder sensitivity with algebraic non-commutativity. We evaluate the framework on dSprites, 3DShapes, 3DCars, and CelebA against generic and symmetry-aware baselines, including beta-VAE, CLG-VAE, and CFASL. Across synthetic benchmarks, the method improves reconstruction quality and yields decoder-level behavior more consistent with latent non-commutative structure. Qualitative analyses show clearer order-dependent latent compositions and more stable reconstructions. On CelebA, the model yields more faithful reconstructions and factor-specific latent traversals than CFASL, while also exhibiting meaningful order-dependent interactions between learned latent directions.
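As background for the Baker-Campbell-Hausdorff (BCH) deviation mentioned above, the standard expansion for two Lie algebra elements $X$ and $Y$ is reproduced here. The deviation measure $\Delta(X, Y)$ on the last line is a hypothetical illustration of how non-commutativity could be quantified; the abstract does not give the paper's exact definition.

```latex
% Baker-Campbell-Hausdorff expansion (standard result), with [X, Y] = XY - YX
\log\!\left(e^{X} e^{Y}\right)
  = X + Y + \tfrac{1}{2}[X, Y]
  + \tfrac{1}{12}\bigl([X, [X, Y]] + [Y, [Y, X]]\bigr) + \cdots

% Hypothetical deviation measure (an assumption, not the paper's definition):
% it vanishes when X and Y commute, and is dominated by \tfrac{1}{2}\|[X, Y]\|
% when X and Y are small.
\Delta(X, Y) = \bigl\| \log\!\left(e^{X} e^{Y}\right) - (X + Y) \bigr\|
```

When $[X, Y] = 0$ every correction term vanishes and composing the two transformations is order-independent, which is the commutative case that symmetry-aware VAEs typically enforce.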

[CV-46] DFSAttn: Dynamic Fine-grained Sparse Attention for Efficient Video Generation ICML2026

Link: https://arxiv.org/abs/2605.23445
Authors: Jie Hu, Zixiang Gao, Yutong He, Kun Yuan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICML 2026; 17 pages, 8 figures

Abstract:Diffusion transformers have achieved remarkable success in high-quality video generation, yet their reliance on spatiotemporal 3D full attention incurs prohibitive computational cost due to the quadratic complexity of attention. Block sparse attention is a common approach to mitigate this by focusing computation on important regions. However, attention maps in DiTs exhibit inherently dynamic and fine-grained sparsity, which causes existing block sparse attention methods to degrade significantly in quality, especially at high sparsity ratios. In this paper, we revisit block sparse attention and derive a theoretical lower bound on attention recall to characterize the key factors governing its effectiveness. Guided by these insights, we propose DFSAttn, a training-free sparse attention framework that enables dynamic, fine-grained sparsification efficiently. DFSAttn incorporates three core designs: Hilbert curve-based token reordering to achieve fine-grained sparsity while preserving efficient GPU execution, hierarchical block scoring for accurate block importance estimation, and sparse mask caching with adaptive ratios to balance accuracy and efficiency. Experimental results demonstrate that DFSAttn consistently outperforms prior methods under high sparsity, achieving up to 2.1× end-to-end speedup while maintaining high generation quality. Our code is open-sourced and available at this https URL.
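Hilbert curve reordering can be illustrated with the classic 2D index computation. This is a minimal sketch under simplifying assumptions: a single square frame whose side `n` is a power of two, whereas the paper works with spatiotemporal (3D) video tokens. The function name `hilbert_index` is hypothetical.

```python
def hilbert_index(n, x, y):
    """Map cell (x, y) of an n-by-n grid (n a power of two) to its
    position along the Hilbert curve."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve is oriented correctly.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


# Reorder the tokens of a 4x4 grid so that neighbours along the 1D
# sequence are also neighbours in the 2D grid.
n = 4
tokens = [(x, y) for y in range(n) for x in range(n)]
reordered = sorted(tokens, key=lambda t: hilbert_index(n, t[0], t[1]))
print(reordered)
```

Consecutive tokens in the reordered sequence are always adjacent in the grid, so a contiguous block of the sequence covers a compact spatial region. That locality is what makes block-level sparsity after reordering behave more like fine-grained sparsity.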

[CV-47] FAST-ME: Foundation-aware Adaptive Stopping for Motion Estimation for Efficient IoT Video Analysis

Link: https://arxiv.org/abs/2605.23428
Authors: Kakia Panagidi, Stathes Hadjieftymiadis
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:

Abstract:In modern multimedia systems, efficient video processing is critical, especially in resource-constrained environments such as IoT-based camera networks, autonomous platforms, and wireless sensor multimedia systems. A key bottleneck in video compression and understanding is block motion estimation (ME), a process that remains computationally expensive despite the development of fast search techniques. This work introduces an Optimal Stopping Theory (OST) algorithm for block motion estimation based on the assessment of spatiotemporal differences within and across video frames. It also proposes a semantic-aware motion estimation framework that integrates Foundation Models (FMs) with the OST-based decision process. By leveraging pretrained visual models such as Vision Transformers (ViT) and the Segment Anything Model (SAM), the framework extracts semantic attention scores that indicate the importance of motion within specific spatial regions. These scores are fused with traditional distortion-based metrics, such as the Sum of Absolute Differences (SAD), to guide a hybrid stopping criterion that jointly considers motion magnitude and semantic relevance. The resulting adaptive algorithm stops early in redundant regions while continuing the search in areas where motion is semantically significant. Experiments compare the proposed solution with widely used approaches from the literature on benchmark and multimodal video datasets. The proposed method achieves a significant reduction in computation with minimal accuracy loss and improved semantic coverage. The results highlight the benefits of bridging low-level motion analysis with high-level semantic reasoning, offering a promising direction for efficient multimodal video understanding in next-generation smart systems.
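
The hybrid stopping idea can be sketched as follows. This is an illustration under stated assumptions, not the FAST-ME algorithm: the abstract does not give the Optimal Stopping Theory rule or the fusion formula, so a simple threshold rule stands in for it here. The names `search_block`, `semantic_score`, and `base_threshold` are hypothetical, and `semantic_score` is assumed to be a value in [0, 1] supplied by some foundation model.

```python
def sad(block_a, block_b):
    """Sum of Absolute Differences between two equally sized 2D blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def get_block(frame, top, left, size):
    """Cut a size x size block out of a frame stored as a list of rows."""
    return [row[left:left + size] for row in frame[top:top + size]]


def search_block(cur, ref, top, left, size, candidates,
                 semantic_score, base_threshold):
    """Return (best_offset, best_sad, candidates_checked).

    Assumed stopping rule: a block with a high semantic score gets a
    stricter SAD threshold, so the search continues longer where motion
    is semantically important and stops early elsewhere.
    """
    target = get_block(cur, top, left, size)
    threshold = base_threshold * (1.0 - semantic_score)
    best_offset, best_sad, checked = None, float("inf"), 0
    for dy, dx in candidates:
        r, c = top + dy, left + dx
        if r < 0 or c < 0 or r + size > len(ref) or c + size > len(ref[0]):
            continue  # candidate block falls outside the reference frame
        checked += 1
        cost = sad(target, get_block(ref, r, c, size))
        if cost < best_sad:
            best_offset, best_sad = (dy, dx), cost
        if best_sad <= threshold:
            break  # match is good enough for this block: stop early
    return best_offset, best_sad, checked
```

In this sketch a background block with `semantic_score` near 0 accepts the first roughly matching candidate, while a block with a score near 1 keeps searching until it finds a near-exact match or exhausts the candidate list.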

[CV-48] Sample-wise Targeted Adversarial Attacks on Test-time Adaptation

Link: https://arxiv.org/abs/2605.23411
Authors: Phuc Duc Nguyen, Quang Duc Nguyen
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: 32 pages, 17 figures

Abstract:Test-time adaptation (TTA) effectively counters distribution shifts but exposes models to adversarial manipulation via the unlabeled test stream. Existing class-wise targeted attacks remain impractical for stealthy exploitation in this setting: since TTA operates on batches, forcing a subset of samples toward a target label unintentionally pulls similar benign samples along, resulting in a conspicuously high frequency of the target label that is easy to detect. To capture a more realistic threat, we introduce a sample-wise targeted attack. Unlike prior approaches, the attacker aims to misclassify only inputs carrying an attacker-chosen trigger, while preserving the global label distribution of benign queries to evade detection. To achieve this, we propose a meta-learning-based attack with a novel priority-aware gradient alignment strategy that explicitly prioritizes attack success. The strategy formulates the gradient update as an ellipsoidal trust-region problem, mitigating the misalignment between attack success and distributional stealth, while providing theoretical guarantees for effective optimization of the attack objective in the presence of gradient misalignment. Extensive experiments on CIFAR-10-C, CIFAR-100-C, and ImageNet-C across TTA protocols demonstrate that our method achieves high targeted success rates while maintaining a label distribution that is consistent with the no-attack baseline, making it difficult to detect in unlabeled TTA deployment scenarios. Furthermore, we demonstrate that our attack shows strong robustness against existing defenses.

[CV-49] What Linear Probes Miss: Multi-View Probing for Weight-Space Learning ICML2026

Link: https://arxiv.org/abs/2605.23410
Authors: Eunwoo Heo, Kyeongkook Seo, Jaejun Yoo
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICML 2026. Code: this https URL; Project page: this https URL

Abstract:The explosive growth of open-source model repositories has created a Model Jungle, where checkpoints are frequently shared without adequate documentation or metadata. While weight-space learning offers a pathway to identify and analyze these models directly from their parameters, processing full-scale weights is computationally prohibitive. Probing-based methods have emerged as a lightweight alternative, extracting permutation-equivariant representations via learnable probe vectors. However, existing probing methods are limited by a single-view design: they capture first-order structures but fail to encode the rich, higher-order correlation patterns inherent in row-column interactions. To bridge this gap, we introduce MVProbe, a multi-perspective probing framework that synthesizes first-order signals with interaction-aware (Gram-based) views. Our approach is theoretically grounded; we analyze the scaling laws of different probing orders to derive a principled standardization and fusion strategy that ensures balanced contributions from all branches. On the Model Jungle benchmark, MVProbe consistently outperforms the state-of-the-art ProbeX across diverse architectures, including discriminative backbones (ResNet, SupViT, MAE, DINO) and large-scale generative LoRA adapters (Stable Diffusion LoRA).

[CV-50] Online Hand Gesture Recognition Using 3D Convolutional Neural Networks

Link: https://arxiv.org/abs/2605.23409
Authors: Yinghao Qin, Tijana Timotijevic
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Master’s dissertation work written in Autumn 2020

Abstract:In human-computer interaction, real-time detection and classification of dynamic hand gestures is challenging because: 1) the system must run on a real-time video stream with no noticeable lag in response after a gesture is performed; 2) people perform gestures in very different ways, which makes recognition more difficult. In this paper, an online hand gesture recognition system is proposed that localizes gestures in a real-time video stream and recognizes what these gestures are. To improve the robustness of the system, a sliding-window approach is used to refine results from multiple windows. All of the models in this project are trained on the Jester dataset, achieving 98+% accuracy for the detector and 90+% accuracy for the classifier. For the overall performance of the system, the best group can respond within three seconds and reaches 37.5% Levenshtein accuracy on the homemade dataset. The project code used in this work is publicly available.
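
For readers unfamiliar with the Levenshtein accuracy metric mentioned above, the sketch below shows one common way to compute it for sequences of gesture labels. The abstract does not state the exact formula, so the definition used here (one minus the edit distance divided by the length of the ground-truth sequence, clipped at zero) is an assumption, and the gesture names are invented for illustration.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,               # delete x
                           cur[j - 1] + 1,            # insert y
                           prev[j - 1] + (x != y)))   # substitute x -> y
        prev = cur
    return prev[-1]


def levenshtein_accuracy(predicted, truth):
    """Assumed definition: 1 - distance / len(truth), clipped to [0, 1]."""
    if not truth:
        return 1.0 if not predicted else 0.0
    return max(0.0, 1.0 - levenshtein(predicted, truth) / len(truth))


truth = ["swipe_left", "thumb_up", "stop", "swipe_right"]
predicted = ["swipe_left", "stop", "swipe_right"]  # one gesture was missed
print(levenshtein_accuracy(predicted, truth))  # prints 0.75
```

Unlike per-clip classification accuracy, this metric penalizes missed, spurious, and misordered detections in a continuous stream, which is why online scores can be much lower than offline classifier accuracy.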

[CV-51] RS2AD-LiDAR: End-to-End Autonomous Driving LiDAR Data Generation from Roadside Sensor Observations

Link: https://arxiv.org/abs/2605.23406
Authors: Runyi Huang, Ni Ding, Ruidan Xing, Yuheng Shi, Lei He, Keqiang Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:End-to-end autonomous driving solutions, which directly process multimodal sensory data and output fine-grained control commands, have gradually become a mainstream direction with the development of autonomous driving technology. However, current methods in this category rely on single-vehicle data collection for model training and optimization, which suffers from high acquisition and annotation costs, scarcity of valuable scenarios, and data silos. To address these challenges, we propose RS2AD-LiDAR, a novel framework for reconstructing and generating vehicle-mounted LiDAR data from roadside sensor observations. Since no public dataset currently provides highly overlapping perception coverage between roadside and vehicle-mounted LiDAR sensors, which is essential for studying roadside-to-vehicle data generation, we constructed a dedicated dataset named R2V-LiDAR which is used solely for evaluation in this work. Specifically, our method transforms roadside LiDAR point clouds into the vehicle-mounted LiDAR coordinate system, and synthesizes high-fidelity vehicle-mounted data via virtual LiDAR modeling and point cloud resampling techniques. To the best of our knowledge, this is the first approach to reconstruct vehicle-mounted LiDAR data from roadside sensor inputs. Extensive experimental comparisons demonstrate the semantic similarity between the generated data and real data. Furthermore, object detection experiments show that incorporating the generated data into real data for model training improves both Bird’s Eye View (BEV) and 3D detection accuracy, thereby validating the effectiveness of the proposed method.

[CV-52] Joint Target-Less Intrinsic and Extrinsic Camera-LiDAR Calibration using Deep Point Correspondences

Link: https://arxiv.org/abs/2605.23397
Authors: Simon Bultmann, Daniele Cattaneo, Abhinav Valada
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Presented at the 2nd German Robotics Conference (GRC)

Abstract:Accurate camera-LiDAR calibration is a prerequisite for robust multi-modal perception in robotics. Recent target-less approaches based on deep point correspondences achieve remarkable performance for extrinsic calibration but assume rectified images with known intrinsics. In this work, we overcome this limitation and present the first fully target-less pipeline that jointly estimates camera intrinsics (pinhole model with radial-tangential distortion) and camera-LiDAR extrinsics with deep pixel-point correspondences. Our approach extends deep correspondence-based calibration by (i) automatic intrinsic initialization via structure-from-motion, (ii) generalizing camera-LiDAR matching to raw images with unknown intrinsics including distortion, and (iii) tightly coupling correspondence estimation with joint nonlinear optimization over both intrinsics and extrinsics. We evaluate our method on the KITTI dataset with unseen camera-LiDAR pairs and demonstrate that joint calibration achieves improved extrinsic accuracy while additionally recovering accurate intrinsics.

[CV-53] VDE: Training-Free Accelerating Rectified Flow Model via Velocity Decomposition and Estimation CVPR2026

Link: https://arxiv.org/abs/2605.23381
Authors: Junwen Tan, Jinglin Liang, Hongyuan Chen, Shuangping Huang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2026

Abstract:Though rectified flow models have achieved remarkable performance in image, video, and 3D generation, their practical deployments are challenged by slow inference speeds. Prior acceleration methods reuse cached features from previous steps, which neglects the growing mismatch between static caches and the evolving input, leading to reduced output fidelity. This work proposes Velocity Decomposition and Estimation (VDE), a training-free acceleration method that shifts the paradigm from caching-and-reusing to decomposing-and-estimating. Specifically, VDE decomposes the model’s velocity into components parallel and orthogonal to the input, exploiting their temporal predictability and directional stability for precise, input-adaptive estimation. To prevent error accumulation, it periodically anchors the model’s state via full forward passes. Extensive experiments on image and video generation tasks demonstrate that VDE achieves substantial acceleration with minimal loss in visual quality. Notably, VDE accelerates Flux by 3.22 times and achieves an LPIPS of 0.069 on Qwen-Image, outperforming the best baseline with a 52.2% reduction.
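
The decomposition step described above is ordinary vector projection. The sketch below shows only that step on flat Python lists; how VDE then estimates each component across timesteps is not specified in the abstract and is not modelled here. The function names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def decompose(velocity, x):
    """Split `velocity` into components parallel and orthogonal to `x`.

    parallel   = (<v, x> / <x, x>) * x
    orthogonal = v - parallel
    """
    xx = dot(x, x)
    if xx == 0.0:
        # No direction to project onto: everything is "orthogonal".
        return [0.0] * len(x), list(velocity)
    scale = dot(velocity, x) / xx
    parallel = [scale * xi for xi in x]
    orthogonal = [vi - pi for vi, pi in zip(velocity, parallel)]
    return parallel, orthogonal


parallel, orthogonal = decompose([3.0, 4.0], [1.0, 0.0])
# parallel is [3.0, 0.0] and orthogonal is [0.0, 4.0]
```

The parallel part reduces to a single scalar coefficient per step, and the orthogonal part to a direction plus a magnitude, which may be why the abstract speaks of "temporal predictability" for one and "directional stability" for the other.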

[CV-54] Decoupling Spatio-Temporal Adapter for Fine-Grained Badminton Action Localization

Link: https://arxiv.org/abs/2605.23355
Authors: Tianyu Wang (1), Junjie Wu (1 and 2), Jingquan Gao (1), Shishuo Li (1) ((1) School of Economics and Management, Beihang University, Beijing 100191, China (2) Key Laboratory of Data Intelligence and Management, Beihang University, Ministry of Industry and Information Technology, Beijing 100191, China)
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: 11 pages, 11 figures

Abstract:Temporal Action Localization (TAL) has been extensively studied in generic video understanding, while fine-grained sports scenarios, such as professional badminton, remain underexplored due to their complex and subtle spatio-temporal dynamics. In this paper, we focus on fine-grained TAL in professional badminton videos and introduce a new benchmark dataset, Fine-Badminton, which consists of 31 matches with 29 fine-grained stroke categories, covering 2104 rallies and 27597 annotated actions. To effectively capture the intricate motion patterns in such scenarios, we propose a Decoupling Spatio-Temporal Adapter (DSTA), which enables efficient modeling of spatio-temporal features within a parameter-efficient framework. Specifically, DSTA decomposes motion representation into three parallel branches, capturing temporal dynamics as well as vertical and horizontal spatial variations. The design allows the model to better distinguish subtle differences among fine-grained actions. Extensive experiments on both the Fine-Badminton dataset and the ShuttleSet benchmark demonstrate that the proposed method achieves state-of-the-art performance while introducing only a marginal increase in computational and parameter cost. These results validate the effectiveness and efficiency of the proposed approach for fine-grained temporal action localization.

[CV-55] SCOPE: Simulating Cross-game Operations in Playable Environments for FPS World Models

Link: https://arxiv.org/abs/2605.23345
Authors: Zizhao Tong, Hongfeng Lai, Zeqing Wang, Zhaohu Xing, Kexu Cheng, Haoran Xu, Zhao Pu, Shangwen Zhu, Ruili Feng, Jian Zhao, Yan Zhang, Hao Tang, Yeying Jin, Ling Shao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL. Code is available at this https URL

Abstract:Interactive world models for first-person shooter (FPS) games must resolve high-frequency overlapping control signals at every frame without disrupting unaffected regions. Existing methods inject actions globally and train on single titles, failing under dense FPS inputs. We observe that FPS actions are spatially selective: discrete events such as firing or reloading affect only a localized region around the weapon (the scope), while continuous camera and movement signals govern stable surroundings. We propose SCOPE, which inserts a conditioning module into each transformer block of a pretrained video diffusion model. It reshapes features into per-pixel temporal sequences so that each position computes its action response from local visual content. This separates in-scope effects from out-of-scope generation without segmentation labels. We also introduce CrossFPS, the first multi-game FPS dataset with frame-aligned action telemetry. It comprises 69K clips from 7 titles with 10-DoF controller signals, curated to remove gameplay bias. The model learns general visual-to-action mappings rather than game-specific patterns, enabling zero-shot transfer to unseen scenes. Experiments confirm strong action responsiveness, precise scope separation, and effective cross-game generalization.

[CV-56] CHASD: Language Increment-Calibrated Contrastive Decoding against Hallucination in LVLMs

Link: https://arxiv.org/abs/2605.23344
Authors: Xiaoyi Huang, Kejia Zhang, Zhiming Luo
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Vision-Language Models have shown strong multimodal reasoning capabilities, yet they remain susceptible to object hallucinations when language priors dominate insufficient or misaligned visual evidence. Training-free contrastive decoding methods mitigate this issue by comparing predictions from original and perturbed visual inputs, but existing approaches either apply global perturbations that may alter useful visual evidence or invoke an additional negative branch at every decoding step. In this paper, we observe that hallucination risks are transient and token-specific: visual attention shifts across generated tokens, while some functional tokens are produced with high confidence and do not require contrastive calibration. Based on this observation, we propose Contrastive Hallucination-Aware Step-wise Decoding (CHASD) for Large Vision-Language Models, an inference-time framework for “calibration on demand”. CHASD uses an uncertainty-driven confidence gate to activate the contrastive branch only when the maximum next-token probability falls below a threshold, and constructs the negative branch through attention-guided localized perturbations of the currently salient visual tokens. This design reduces unnecessary negative-branch forward passes while preserving the original distribution for high-confidence steps. Experiments on POPE, AMBER, MME, MMHal-Bench, and CHAIR show that CHASD improves hallucination-related metrics over strong training-free baselines with competitive inference efficiency.
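
The confidence gate can be illustrated with a small sketch for a single decoding step. This is a simplification under assumptions, not the CHASD implementation: the abstract does not give the contrast formula, so the commonly used form `(1 + alpha) * logits - alpha * negative_logits` is assumed, and the names `gated_contrastive_step`, `threshold`, and `alpha` are hypothetical. The negative branch is passed as a callable so that it is only evaluated when the gate opens.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gated_contrastive_step(logits, negative_logits_fn, threshold=0.7, alpha=1.0):
    """Return (next_token_probs, used_contrast) for one decoding step.

    If the model is already confident, the original distribution is kept
    and the negative branch is never run. Otherwise the logits are
    contrasted against those from a perturbed visual input (assumed form).
    """
    probs = softmax(logits)
    if max(probs) >= threshold:
        return probs, False
    neg = negative_logits_fn()  # extra forward pass, paid only on demand
    contrasted = [(1 + alpha) * l - alpha * n for l, n in zip(logits, neg)]
    return softmax(contrasted), True


# Confident step: the negative branch is skipped entirely.
probs, used = gated_contrastive_step([5.0, 0.0, 0.0], lambda: [0.0, 0.0, 0.0])
```

The saving comes from the early return: for high-confidence tokens such as function words, the second forward pass is never executed.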

[CV-57] GFSR: Geometric Fidelity and Spatial Refinement for Reliable Lane Detection

Link: https://arxiv.org/abs/2605.23327
Authors: Tiancheng Wang, Zhaolu Ding, Richeng Xu, Tianhui Zheng, Hui Liu, Hanyu Xuan, Zhiliang Wu, Guanghui Yue
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to IEEE Transactions on Intelligent Transportation Systems (under review). 12 pages, 6 figures

Abstract:Lane detection stands as a crucial perception task in autonomous driving and advanced driver assistance systems. However, existing methods still degrade in complex real scenarios due to two major limitations. First, classification confidence only characterizes the categorical existence of lane candidates and has no strong correlation with geometric quality. If threshold filtering and NMS are conducted merely based on this confidence, the model tends to retain lane priors with high confidence while eliminating those with lower confidence but superior geometric representation. Secondly, existing regression modules weaken correlations among sampling points, hindering fine-grained optimization of distant, high-curvature and complex-topology lanes and causing underfitting. To address these issues, we propose Geometric Fidelity and Spatial Refinement (GFSR), a framework consisting of LaneIoU-guided Confidence Calibration (LCC) and Adaptive Gated Location Refinement (AGLR). Specifically, LCC adopts LaneIoU as soft supervision to explicitly estimate geometric fidelity of lane priors, which is further fused with classification confidence to construct the collaborative reliability index (CRI). This index guides threshold filtering and NMS, effectively retaining lane priors with high classification confidence and favorable geometric quality. Meanwhile, cooperating with regression heads in each refinement stage, AGLR predicts sampling point lateral offsets and adopts a gating mechanism to adaptively regulate correction magnitude, strengthen inter-point correlations and boost model adaptability as well as robustness toward complex lane scenarios. Extensive experiments on CULane and CurveLanes demonstrate that our GFSR achieves state-of-the-art performance on CULane, with F1@50 and F1@75 scores of 81.46% and 65.01%, and reaches 87.35% F1@50 on CurveLanes.

[CV-58] Enhancing Blood Cells Classification using Hybrid Quantum Neural Networks

Link: https://arxiv.org/abs/2605.23324
Authors: Guilherme Cruz, Nouhaila Innan, Alberto Marchisio, Gabriel Falcao, Muhammad Shafique
Categories: Computer Vision and Pattern Recognition (cs.CV); Quantum Physics (quant-ph)
Comments: 11 pages, 13 figures

Abstract:Accurate classification of microscopic blood cells is still a critical task in medical image analysis, where subtle variations and limited data can challenge conventional deep learning models. As such, we investigate in this work the potential of Hybrid Quantum-Classical Neural Networks (HQNNs) to enhance feature representation and improve classification performance in this domain. We propose a modular architecture combining a pre-trained ResNet-50 backbone with a low-dimensional latent bottleneck and a variational quantum circuit, enabling a direct comparison between quantum-enhanced and purely classical transformation mechanisms. To isolate the contribution of the quantum component, we evaluate three architectures: a HQNN model, a Classical Matched Model with an additional nonlinear transformation layer of comparable capacity, and a baseline model without an intermediate transformation stage. Experiments conducted on two publicly available blood cell datasets, namely the Blood Cell Images dataset and the PBC dataset, demonstrate that HQNNs consistently achieve superior or more balanced performance across evaluation metrics. In the Blood Cell Images Dataset, the proposed approach improves macro F1-score by up to 3.7% compared to classical baselines, while improving the F1-score from 98.54% to 98.69% in the more challenging 8-class scenario with near-saturated performance. Additional evaluation on IBM quantum hardware shows that the model remains robust under noise, with only a modest performance degradation relative to simulated results. These results indicate that quantum feature transformations can enhance discriminative representations, particularly in challenging classification scenarios, and highlight the practical potential of HQNN models for medical imaging tasks.

[CV-59] General Hazard Detection

Link: https://arxiv.org/abs/2605.23304
Authors: Stephanie Ng, CP Lim, SueJen Looi, Hendrik Zurlinden, David Nguyen, Lei Wei, Saeid Nahavandi, Hailing Zhou
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 7 figures and 4 tables

Abstract:Hazard, as an abstract concept, is typically defined through cognitive-level logical reasoning rather than concrete examples. In contrast, existing hazard detection systems rely on predefined hazard categories and require intensive collection of labelled examples within detection or classification architectures. This approach faces three fundamental challenges when addressing abstract safety concepts: (1) noisy and sparse training data, (2) dynamically evolving definitions that change across contexts and time, and (3) limited generalisation to unseen or novel scenarios. To address these limitations, we present the CompliVision dataset, the first general-purpose hazard dataset designed for rule-based compliance assessment, along with a baseline framework for hazard evaluation. Our key innovation is decoupling the hazard concept from image-based examples by expressing safety requirements through language-based rules. We ground our approach in authoritative domain regulations and ISO standards to define diverse hazard concepts across multiple domains. The CompliVision dataset comprises 3,006 images spanning traffic, construction, and warehouse environments, with each image annotated for compliance against specific safety rules, accompanied by natural language explanations highlighting the supporting visual evidence. To achieve robust generalisation, we develop an active learning framework to more effectively guide and refine vision-language models in assessing hazard compliance. While state-of-the-art VLMs demonstrate strong capabilities, they struggle with the fine-grained, context-dependent interpretation required for accurate safety assessment. To address this limitation, we propose a general hazard detection framework that combines LLaVA-based visual reasoning with human-in-the-loop feedback.

[CV-60] Spatio-Temporal Similarity Volume Aggregation for Open-Vocabulary Action Recognition

Link: https://arxiv.org/abs/2605.23288
Authors: Yerim So, Jiyeong Kim, Jiwon Yoon, Dongbo Min
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent Open-Vocabulary Action Recognition (OVAR) methods typically aggregate visual features into a global representation before computing text alignment, a process that obscures local patch information and fine-grained spatio-temporal cues. We propose Similarity Volume Aggregation (SimVA), a framework that constructs a dense 4D spatio-temporal similarity volume from patch-level visual-text similarities. SimVA builds this volume over local video tokens and action classes, and employs class sampling to keep similarity aggregation scalable to large vocabularies. The similarity volume is refined by spatial aggregation, which contextualizes local similarity patterns to improve intra-frame consistency. Motion-aware modulation further injects inter-frame variation cues, highlighting dynamically changing regions. Mamba-based temporal aggregation then models the evolution of class-conditioned similarity patterns across frames. By maintaining dense visual-text correspondence, SimVA effectively transfers CLIP to video action recognition, achieving competitive performance across zero-shot, few-shot, and base-to-novel benchmarks.

[CV-61] LangFlash: Feed-forward 3D Language Gaussian Splatting from Sparse Unposed Images CVPR

Link: https://arxiv.org/abs/2605.23287
Authors: Yilong Liu, Wanhua Li, Chen Zhu-Tian, Hanspeter Pfister
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPRF 2026

Abstract:We present LangFlash, a feed-forward framework for 3D Language Gaussian Splatting that reconstructs 3D scenes parameterized by Gaussian primitives enriched with language-aligned semantic features from sparse unposed multi-view images. Unlike optimization-based 3D methods, LangFlash directly predicts the geometry and semantics in a single forward pass, enabling low-latency 3D reconstruction and language-consistent scene understanding. To support large-scale training, we enriched the RealEstate10k dataset with coherent and dense semantic information for 3D semantic supervision. Furthermore, we propose a sparse semantic encoding scheme that combines a global semantic dictionary with locally varying per-primitive weights, preserving high-level linguistic information, while reducing representation complexity. Experimental results show that LangFlash achieves superior novel view synthesis and semantic consistency compared with previous methods. This study establishes a new paradigm for pose-free, language-grounded 3D scene reconstruction, advancing generalizable 3D vision and multimodal scene understanding. Demo is available at this https URL.

[CV-62] DepthAgent: Towards Better Universal Depth Estimation via Sample-wise Expert Selection

Link: https://arxiv.org/abs/2605.23281
Authors: Jie Zhu, Girish Chandar Ganesan, Xiaoming Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Monocular metric depth estimation has achieved strong progress with large-scale training and universal-camera modeling, yet robust deployment across diverse camera settings, such as perspective, fisheye, and panoramic images, remains challenging. Existing methods typically rely on a single depth estimator, overlooking that different models encode different camera assumptions and perform best under different input domains. In this paper, we show that depth experts exhibit strong sample-wise complementarity: model preference is highly correlated with camera geometry, and multi-model fusion brings the largest gains on difficult samples where individual experts are unreliable. Motivated by these observations, we propose DepthAgent, a vision-language agent for adaptive monocular depth estimation. DepthAgent treats existing depth models as frozen tools and learns to analyze scene and camera cues, invoke suitable experts through multi-turn tool utilization, and select or fuse their predictions for each input. To optimize such discrete decision-making toward dense geometric quality, we design a multi-reward reinforcement fine-tuning scheme that jointly encourages valid tool execution, camera/scene analysis, expert-selection quality, and inference efficiency. Extensive experiments across perspective, fisheye, and panoramic benchmarks show that DepthAgent consistently outperforms individual experts, fixed model fusion, and different selection strategies, with strong improvements on challenging samples, highlighting the critical role of expert selection and fusion. The code and model will be released upon publication.

[CV-63] U-CESE: Unified Clip-based Event Search Engine for AI Challenge HCMC 2025

Link: https://arxiv.org/abs/2605.23274
Authors: Duc-Nhuan Le, Hoang-Phuc Nguyen, Thanh-Duy Lam, Minh-Nhut Dang, Minh-Hoang Le
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication in the Proceedings of the 14th International Symposium on Information and Communication Technology (SOICT 2025)

Abstract:Retrieving events from large-scale video datasets is challenging due to complex temporal, spatial, and multimodal information. This paper presents U-CESE, our solution for the AI Challenge HCMC 2025, a Unified Clip-based Event Search Engine for multimodal event retrieval across diverse video sources. Building on CESE, U-CESE integrates its three modules into a single cohesive framework, ensuring consistent processing and retrieval across query types. A core component is the Unified Clipping Algorithm, which merges separate clipping algorithms into one efficient pipeline. To handle large-scale data, we propose DAKE, a lightweight, training-free keyframe extraction method that uses JPEG file size variations to identify significant scene changes. Finally, we introduce ReCap, a temporally consistent captioning framework inspired by recurrent neural networks, which generates detailed and context-aware textual descriptions. Experiments show that U-CESE delivers robust, consistent, and efficient performance in large-scale multimodal event retrieval.
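
The intuition behind keyframe extraction from JPEG file sizes can be sketched as follows: frames with similar content compress to similar sizes, so a large jump in size suggests a scene change. This is not the DAKE algorithm itself, whose details the abstract does not give. The function name `select_keyframes` and the relative threshold of 0.2 are assumptions made for illustration.

```python
def select_keyframes(jpeg_sizes, rel_threshold=0.2):
    """Return indices of frames whose JPEG size differs from the most
    recent keyframe by more than `rel_threshold` (relative change).

    `jpeg_sizes` is a list of per-frame file sizes in bytes, in frame order.
    The first frame is always kept as a keyframe.
    """
    if not jpeg_sizes:
        return []
    keyframes = [0]
    ref = jpeg_sizes[0]
    for i, size in enumerate(jpeg_sizes[1:], start=1):
        change = abs(size - ref) / max(ref, 1)  # guard against zero-byte files
        if change > rel_threshold:
            keyframes.append(i)
            ref = size  # later frames are compared against the new keyframe
    return keyframes


sizes = [100, 102, 98, 150, 152, 149, 90]
print(select_keyframes(sizes))  # prints [0, 3, 6]
```

In practice the sizes could be read with `os.path.getsize` on frames that have already been written to disk as JPEG files, so no decoding or neural network inference is needed, which fits the "lightweight, training-free" description.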

[CV-64] EvalVerse: Pipeline-Aware and Expert-Calibrated Benchmarking for Professional Cinematic Video Generation

Link: https://arxiv.org/abs/2605.23271
Authors: Songlin Yang, Haobin Zhong, Ruilin Zhang, Xiaotong Zhao, Shuai Li, Kai Zheng, Xuyi Yang, Zhe Wang, Zhenchen Tang, Yang Li, Bohai Gu, Zhengwei Peng, Yidan Huang, Mengzhou Luo, Yihang Bo, Dalu Feng, Yujia Zhang, Juntao Ma, Ruiqi Wang, Lvmin Zhang, Yuwei Guo, Frank Guan, Maneesh Agrawala, Hongbo Fu, Alan Zhao, Anyi Rao
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:The rapid evolution of generative video foundation models has propelled the field toward professional-grade cinematic synthesis. To achieve such demanding quality, the community is transitioning towards Reinforcement Learning (RL) and agentic workflows. However, reliable evaluation has emerged as a critical bottleneck. Existing benchmarks predominantly evaluate “whether it is right” (basic prompt-following) while fundamentally neglecting “whether it is good” (cinematic quality, acting, and aesthetics). Furthermore, current automated metrics lack the domain-specific rigor required to provide trustworthy signals, creating a severe credibility gap between human aesthetic perception and machine scoring. To bridge this gap, we introduce EvalVerse, a comprehensive, pipeline-aware, and expert-calibrated evaluation framework. We treat video generation assessment not merely as an engineering task, but as a core scientific problem: the systematic digitization of subjective cinematic expertise. First, we organize domain knowledge into an evaluation taxonomy aligned with the professional filmmaking workflow (pre-production, production, and post-production). Second, we distill human expert judgments into a curated dataset with large-scale human annotations. Third, we inject this knowledge into Vision-Language Models (VLMs) through an expert-calibrated fine-tuning strategy, enabling the VLM to perform explicit Chain-of-Thought reasoning. Compared to previous works, EvalVerse not only retains compatibility with foundational “rightness” metrics, but also significantly expands the criteria to “goodness” and broadens the task coverage to complex multi-shot sequencing and audio-visual integration. Consequently, by providing granular diagnostic signals, EvalVerse transcends a static leaderboard and establishes a fundamental infrastructure for future work, such as reward models and evaluator agents.

[CV-65] ChainFlow-VLA: Causal Flow Planning with Vision-Language Models

Link: https://arxiv.org/abs/2605.23270
Authors: Xiyang Wang,Xinlin Wang,Tingguang Zhou,Gong Chen,Xingtai Gui,Zhi Xu,Xiaolei Wu,Feiyang Tan,Hangning Zhou,Mu Yang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Abstract:Current end-to-end autonomous driving systems are fundamentally limited by a mismatch between temporal causal reasoning and global trajectory consistency. Autoregressive (AR) models capture interaction-aware temporal dependencies via causal factorization, but their step-wise decoding leads to error accumulation and suboptimal global structure. In contrast, diffusion models optimize trajectories globally but lack explicit causal constraints, making them unreliable in interactive and safety-critical scenarios. This dichotomy reveals a deeper issue: existing methods treat causal modeling and global optimization as separate paradigms, without a principled way to unify them within a single trajectory distribution. To address this, we propose ChainFlow-VLA, which unifies causal generation and global refinement within a single probabilistic framework. We formulate planning as a mixture over AR-induced modes and learn Vision-Language Model (VLM)-conditioned residual distributions over these modes. An autoregressive generator (Chain) produces a discrete set of causal trajectory modes, followed by a diffusion-based refiner (Flow) that leverages VLM hidden states as semantic priors to perform mode-conditioned correction in residual space while preserving causal structure. This straightforward conditioning seamlessly injects high-level scene understanding into fine-grained trajectory adjustments. Experiments demonstrate that ChainFlow-VLA plans robustly in ambiguous and long-tail scenarios, with a state-of-the-art score of 94.85 on the NAVSIM v1 leaderboard, matching human-level performance (94.8). Code will be available at this https URL.

[CV-66] Coloring the Noise: Adversarial Sobolev Alignment for Faithful Image Super Resolution ICML2026

Link: https://arxiv.org/abs/2605.23264
Authors: Hongbo Wang,Huaibo Huang,Pin Wang,Jinhua Hao,Chao Zhou,Ran He
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to ICML 2026

Abstract:Generative priors in Image Super-Resolution (SR) often compromise faithful restoration. We attribute this limitation to a fundamental spectral misalignment between isotropic objectives and the intrinsic natural image manifold. While Direct Preference Optimization offers a path to alignment, its reliance on spectrally flat Gaussian noise fails to distinguish authentic high-frequency details from hallucinations. To bridge this geometric gap, we propose ASASR, a theoretically grounded framework that recasts the generative flow into a Sobolev-induced Riemannian geometry by explicitly coloring the noise transition kernel to mirror natural spectral decay. Driving this geometric alignment, we integrate a parametric adversary grounded in the Riesz Representation Theorem, which synthesizes targeted negative samples equivalent to worst-case Sobolev gradients to direct optimization along the tangent space of plausible structural failures. Extensive evaluations demonstrate that ASASR outperforms leading generative baselines, particularly in preserving spectral consistency and structural fidelity, offering a robust solution that effectively mitigates artifacts.

[CV-67] Turning Adaptation into Assets: Cross-Domain Bridging for Online Vision-Language Navigation ICML2026

Link: https://arxiv.org/abs/2605.23257
Authors: Zixuan Hu,Xuantuo Huang,Yancheng Li,Yichun Hu,Shengyong Xu,Ling-Yu Duan
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICML 2026

Abstract:Navigating under non-stationary environment shifts poses a critical challenge for a Vision-and-Language Navigation (VLN) agent deployed in the wild. Yet, existing Test-Time Adaptation (TTA) methods for VLN largely treat online adaptation as transient, isolated updates, leading to catastrophic forgetting and negative transfer. To overcome these issues, we propose Inter-Domain BridgE with Historical Assets (IDEA), a novel TTA framework that transforms adaptation into the accumulation and composition of assets. Specifically, IDEA introduces soft prompts optimized via a Fisher-guided weighting scheme to capture the transferable knowledge. These optimized prompts are then augmented with domain coordinates to form a dynamic asset library. Leveraging this library, IDEA constructs a cross-domain bridge by projecting the target domain onto the convex hull of historical knowledge. These designs form a complementary loop: the evolving library underpins bridge construction, while the bridge provides superior initialization to accelerate asset optimization. Extensive experiments across REVERIE, R2R, and R2R-CE benchmarks demonstrate the consistent superiority of IDEA over existing methods, showcasing its ability to enable training-free adaptation via asset sharing.

[CV-68] CARE: Class-Adaptive Expert Consensus for Reliable Learning with Long-Tailed Noisy Labels ICML2026

Link: https://arxiv.org/abs/2605.23254
Authors: Mengke Li,Haiquan Ling,Lihao Chen,Yang Lu,Yiqun Zhang,Hui Huang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: poster in ICML 2026

Abstract:Learning from real-world data is frequently hindered by the compound challenge of long-tailed class distributions and noisy annotations. Existing methods partially address these issues but typically ignore the non-uniform impact of label noise across classes, resulting in ineffective correction for tail classes and over-regularization for head classes. To address this issue, we propose Class-Adaptive Rectification with Experts (CARE), a parameter-efficient framework that leverages three complementary supervision sources from vision-language models (VLM): observed noisy labels, VLM text embeddings, and visual features. CARE introduces a class-adaptive expert consensus mechanism that enforces stricter agreement for tail classes and more permissive agreement for head classes based on class frequency. By aggregating high-confidence predictions across these sources, CARE filters unreliable signals and recalibrates class distributions, yielding more reliable rectification under long-tailed distributions. Extensive experiments on both synthetic and real-world benchmarks demonstrate that CARE consistently outperforms state-of-the-art methods, achieving up to 3.0% performance gains. The source code is available at this https URL.

[CV-69] SimInsert: Seamless Video Object Insertion via Regional Sparse Attention Fusion ICME2026

Link: https://arxiv.org/abs/2605.23245
Authors: Xinyu Chen,Yuyi Qian,Jiang Lin,Shenyi Wang,Gao Wang,Zhiqiu Zhang,Jizhi Zhang,Mingjie Wang,Qiang Tang,Qian Wang,Song Wu,Zili Yi
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by ICME2026

Abstract:Video object insertion requires ensuring spatio-temporal coherence and interactive realism, extending far beyond simple content placement. However, current approaches are often hindered by a reliance on explicit motion engineering or resource-intensive retraining, restricting their flexibility and generalization. To bridge this gap, we present SimInsert, a training-free paradigm that efficiently decouples the task into intuitive single-frame editing and semantic motion description. By harnessing the robust generative priors of image-to-video diffusion models, SimInsert propagates edits temporally, strictly preserving background invariance while enabling plausible, text-driven interactions between the inserted object and the dynamic environment. Our approach hinges on non-invasive guidance mechanisms that enforce structural consistency, facilitate seamless boundary fusion, and counteract the fidelity drift that typically accumulates during the denoising trajectory. Extensive quantitative experiments validate our efficacy: SimInsert surpasses state-of-the-art methods with an 18.8% gain in PSNR, 20.1% in SSIM, and a 44.1% decrease in LPIPS, offering a streamlined solution for high-fidelity video editing.

[CV-70] StereoGenBench: A Synthetic Multi-Camera Benchmark for Stereo Generation under Controlled Baseline Regimes

Link: https://arxiv.org/abs/2605.23237
Authors: Yangzhi Cui,Feng Qiao,Nathan Jacobs
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Stereo image and video generation, stereo geometry estimation, and condition-controlled view synthesis require paired data in which the variables that determine binocular geometry – camera baseline, intrinsics, scene depth, and camera motion – are known and controllable. Existing stereo resources provide subsets of these variables, but resources commonly used for stereo generation evaluation do not, to our knowledge, provide scene-paired, calibrated multi-baseline right-view ground truth with jointly recorded intrinsics, dense metric depth, and per-frame poses in a single controlled source. We introduce StereoGenBench, a synthetic Unreal Engine benchmark designed to make baseline-regime sensitivity and target-camera consistency measurable under matched scene content. Each scene is rendered with a rigid six-camera lateral array, yielding up to 15 calibrated view pairs; adjacent baselines are sampled from inter-pupillary to wide-baseline regimes; focal length is sampled independently; and every view is released with RGB, metric depth, intrinsics, per-pair baselines, and per-frame poses. The splits include two evaluation families for narrow and wide baseline regimes and a train-only family for broader all-pairs coverage. We release the dataset, evaluation code, reference results, Croissant metadata, and generation code/configuration for extension with compatible assets. The dataset is available at this https URL
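
The "up to 15 calibrated view pairs" figure follows from choosing 2 of 6 cameras. The sketch below works that count out and, assuming a collinear array and a rectified pinhole model (both assumptions, not details confirmed by the abstract), derives per-pair baselines and the resulting disparity. The function names and the gap values in the usage note are hypothetical.

```python
from itertools import combinations


def view_pairs(num_cameras):
    """Return all unordered camera index pairs (i, j) with i < j."""
    return list(combinations(range(num_cameras), 2))


def pair_baselines(adjacent_gaps_m):
    """Baseline of every pair in a collinear array.

    Assumes cameras sit on one line, so the baseline of pair (i, j)
    is the sum of the adjacent gaps between camera i and camera j.
    """
    positions = [0.0]
    for gap in adjacent_gaps_m:
        positions.append(positions[-1] + gap)
    return {(i, j): positions[j] - positions[i]
            for i, j in view_pairs(len(positions))}


def disparity_px(focal_px, baseline_m, depth_m):
    """Rectified pinhole stereo: disparity = focal * baseline / depth."""
    return focal_px * baseline_m / depth_m
```

For example, `len(view_pairs(6))` is 15, and `pair_baselines([0.065, 0.065, 0.13, 0.26, 0.52])` spans pairs from an inter-pupillary gap up to the full array width. Because disparity scales linearly with baseline at fixed depth and focal length, the same scene produces very different matching regimes across pairs.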

[CV-71] Beyond Normal References: Discriminative Few-Shot Anomaly Detection

Link: https://arxiv.org/abs/2605.23231
Authors: Huan Wang,Jun Shen,Jun Yan,Guansong Pang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 31 pages

Abstract:This paper considers a practical few-shot anomaly detection (FSAD) setting, termed discriminative FSAD, where a limited number of both normal and anomalous examples are available as references during inference. Existing FSAD methods rely on normal-only references through normality matching, ignoring the discriminative clues in anomalous references, while directly fitting both references can overfit to the seen anomalies. We introduce IDEAL, an intrinsic deviation learning framework that leverages both reference types to learn intrinsic deviation patterns characterizing generalizable abnormality as deviations from normality. IDEAL decomposes the learning process into two novel components: 1) a Normal Variation Eraser to suppress nuisance normal variations that may lead to noisy deviations from normality, thereby highlighting anomaly-relevant deviation representations; 2) an Intrinsic Deviation Encoder to decompose these denoised deviation representations into intrinsic deviation vectors capturing the most discriminative orthogonal deviation directions. At inference, IDEAL scores query-to-normal deviations preserved after projection onto the learned intrinsic deviation vectors, enabling generalization for both seen and unseen anomalies. Extensive experiments on eight real-world datasets show that IDEAL generalizes effectively to unseen anomalies and consistently outperforms existing state-of-the-art FSAD methods. Code and data will be available at this https URL.

[CV-72] CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering CVPR2026

Link: https://arxiv.org/abs/2605.23216
Authors: Mingfang Zhang,Jingjing Pan,Ashutosh Kumar,Rajat Saini,Mustafa Erdogan,Hsuan-Kung Yang,Caixin Kang,Yifei Huang,Yoichi Sato,Quan Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2026

Abstract:Cause-and-effect reasoning in video is a significant challenge for Vision-Language Models (VLMs), as it requires going beyond surface-level perception to a deeper understanding of causal mechanisms. However, existing benchmarks rarely provide the fine-grained, grounded evidence needed to rigorously evaluate this capability. To address this gap, we introduce CaST-Bench, a benchmark for Causal Chain-Grounded Spatio-Temporal Video Reasoning. CaST-Bench presents complex causal questions that require models to identify and localize a chain of multiple pieces of spatio-temporal evidence. Through a human-AI collaborative pipeline, we construct a high-quality dataset of 2,066 questions over 1,015 videos, with causal chains annotated by temporal segments and bounding-box tracks. Furthermore, we design a comprehensive evaluation suite with novel metrics that assess not only answer correctness but also the capability for reasoning grounded in visual evidence. This grounding is crucial for improving accuracy by mitigating spurious correlations and for enhancing user trust by making models more transparent. Our experiments show that current VLMs struggle with causal questions, largely due to their limited ability to construct precise and grounded causal chains. This highlights an important direction for improving future VLMs.

[CV-73] Lipschitz Optimization for Formal Verification of Homographies CVPR2026

Link: https://arxiv.org/abs/2605.23203
Authors: Jean-Guillaume Durand,Panagiotis Kouvaros,Maxime Gariel,Alessio Lomuscio
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 18 pages, 13 figures, 6 tables, to be published at CVPR 2026

Abstract:The adoption of vision neural networks in regulated industries requires formal robustness guarantees, especially in safety-critical domains such as healthcare, autonomous vehicles, and aerospace. However, current approaches are confined to incomplete statistical verification or robustness to ℓ_p-norm and affine transforms, which cover only a narrow subset of perturbations to the image formation process. In particular, robustness to camera motion remains an open problem despite being key to deploying many vision applications. We present a formal verification approach that targets robustness against 3D motion perturbations of the capturing camera. We first establish a closed-form mapping from camera pose to pixel values. By analyzing the continuity properties of the resulting homographies, we show that recent work on Lipschitz optimization and piecewise continuity can be extended to derive tight linear bounds on perturbed pixel values. Our approach applies to scenes with predominantly planar structure, such as ground planes in augmented reality, road markings and traffic signs in autonomous driving, or planar workspaces in robotic manipulation. This enables the first formal verification of projective geometry transforms, without complex simulation, surrogate networks, or explicit image-formation models. We validate our implementation and show up to 89% speedup and 7% tighter bounds over prior work. We then evaluate our method on the VNN-COMP benchmark and reveal systematic weaknesses to projective perturbations. Finally, we demonstrate a real-world case study on a safety-critical runway classifier, highlighting practical vulnerabilities to camera motion, and addressing a key challenge in the certification of learned models. Data and code are publicly available at this https URL.
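
For the planar scenes this abstract targets, the textbook plane-induced homography is one closed-form way to map a camera pose change to pixel locations. The sketch below is a minimal illustration under assumed conventions, not the paper's formulation: pinhole intrinsics with a single focal length `f` and principal point `(cx, cy)`, a plane satisfying n·X = d in the first camera frame, and second-camera coordinates X2 = R·X1 + t. All function names are hypothetical.

```python
def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def plane_homography(f, cx, cy, R, t, n, d):
    """H = K (R + t n^T / d) K^{-1} for points on the plane n.X = d.

    Assumed convention: X2 = R X1 + t, with the plane expressed in
    the first camera frame. Other sign conventions exist.
    """
    K = [[f, 0.0, cx], [0.0, f, cy], [0.0, 0.0, 1.0]]
    K_inv = [[1.0 / f, 0.0, -cx / f], [0.0, 1.0 / f, -cy / f], [0.0, 0.0, 1.0]]
    M = [[R[i][j] + t[i] * n[j] / d for j in range(3)] for i in range(3)]
    return matmul3(matmul3(K, M), K_inv)


def warp_pixel(H, u, v):
    """Apply H to pixel (u, v) in homogeneous coordinates and dehomogenize."""
    x, y, w = (H[i][0] * u + H[i][1] * v + H[i][2] for i in range(3))
    return x / w, y / w
```

With an identity rotation and zero translation, `H` is the identity and every pixel maps to itself. A sideways translation of 1 unit, viewed against a fronto-parallel plane at depth 10 with `f = 100`, shifts pixels by 100 × 1 / 10 = 10. Because each entry of `H` is a smooth function of the pose away from the degenerate case where the denominator vanishes, bounds on pose perturbations can in principle be propagated to bounds on pixel positions.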

[CV-74] Occlusion-Aware Physics-Semantic Keyframe Selection for Robust Video Editing

Link: https://arxiv.org/abs/2605.23192
Authors: Lin Liu,Zhihan Xiao,Haohang Xu,Rong Cong,Zhibo Zhang,Xiaopeng Zhang,Qi Tian
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Video editing has recently achieved remarkable progress with diffusion-based generative models, enabling diverse object-level manipulations from natural language instructions. However, existing methods often struggle under occlusion, viewpoint changes, and fast object motion, where unreliable visual observations lead to inaccurate localization, temporal flickering, and inconsistent edits. In this work, we identify the absence of reliable visual anchors as a fundamental bottleneck in occlusion-robust video editing. To address this issue, we propose an occlusion-aware physics-semantic keyframe selection framework that automatically identifies an optimal anchor frame for downstream editing. Specifically, our method evaluates candidate frames from three complementary perspectives: structural completeness for avoiding truncated observations, cycle-consistent tracking stability for measuring physical reliability, and vision-language-based attribute visibility for ensuring semantic clarity. The selected keyframe is then propagated through bidirectional tracking to generate dense spatiotemporal masks, which are used as auxiliary supervision for a diffusion-based video editing backbone. By transforming occlusion handling from explicit reconstruction into reliable anchor selection, our framework enables precise and temporally consistent editing without requiring manual annotations. Extensive experiments on challenging video editing benchmarks demonstrate the effectiveness and high-quality performance of our method.

[CV-75] IntentionNav: A Benchmark for Intent-Driven Object Navigation from Implicit Human Instruction

Link: https://arxiv.org/abs/2605.23187
Authors: Lin Qian,Shijie Li,Sihao Lin,Xuan Zhang,Bangya Liu,Yanran Li,Hujun Yin
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: preprint

Abstract:Existing object navigation benchmarks usually tell an embodied agent which object category to find, such as microwave or chair. Human-facing embodied AI is often asked something less direct: “I need something to warm this food” or “the room feels stuffy.” The agent must infer the object that can satisfy the need, find a scene-grounded instance, and decide whether the goal has been reached. We study this setting as intent-driven object navigation and introduce IntentionNav, a diagnostic benchmark for active object search from implicit human instructions. Each episode provides a free-text intent, RGB-D observations, and pose, but withholds the target object name. IntentionNav contains 500 intents over 176 Isaac Sim scenes and 64 target categories. Each intent is rewritten in four controlled instruction styles and annotated with one of four intent modes, separating surface phrasing from semantic cue type under matched geometry. This paired design supports analysis of target inference, language robustness, neighborhood reachability, and terminal success rather than only aggregate success. We evaluated three VLMs using a fixed active-navigation agent. Models identify the intended target in 48.3 percent of episodes and enter its 2 m neighborhood in 68.7 percent, but terminate successfully in only 24.9 percent and achieve grounded 1 m success in 5.5 percent. Success is highest for event-script intents (28.7 percent) and lower for physical-state and affordance intents (19.2 percent and 18.5 percent), showing that indirect human intent remains a bottleneck for target selection, visual verification, and terminal localization in active embodied search.

[CV-76] Composing People Together: Iterative Pose-Image Generation for Multi-Person Interaction Scenes SIGGRAPH

Link: https://arxiv.org/abs/2605.23178
Authors: Wenxuan Peng,Bharath Hariharan,Hadar Averbuch-Elor
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to SIGGRAPH Conference Papers 2026. 22 pages, 12 figures. Project page: this https URL

Abstract:Despite recent progress, text-to-image models still struggle to generate semantically diverse and compositionally accurate multi-person interaction scenes, often collapsing to repetitive layouts, stereotypical poses, and poorly grounded interactions. In this work, we bridge this gap by introducing a dual pose-image representation that brings person-centric structural priors into pretrained diffusion transformers. Our model jointly predicts a 2D pose visualization image and its corresponding RGB image, enabling structure and appearance to co-evolve during learning. At its core, a cross-modal alignment scheme binds text, pose, and image representations, ensuring consistent grounding across modalities. Furthermore, we design an iterative scene construction scheme, progressively generating complex multi-human interactions while effectively decomposing the overall generation complexity. Extensive experiments demonstrate that our method substantially improves prompt alignment and scene diversity in multi-person image generation.

[CV-77] DRIVESPATIAL: A Benchmark for Spatiotemporal Intelligence in VLMs for Autonomous Driving

Link: https://arxiv.org/abs/2605.23176
Authors: Hao Vo,Khoa Vo,Phu Loc Nguyen,Sieu Tran,Duc Minh Nguyen,Ngo Xuan Cuong,Gladys Gawugah,Sreevenkata Anjani Tishita Godavarthi,Chase Rainwater,Nghi D. Q. Bui,Anh Nguyen,Duy Minh Ho Nguyen,Ngan Le
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Spatiotemporal intelligence in autonomous driving (AD) requires an agent to integrate multi-view observations into a coherent scene representation, maintain object continuity across viewpoints and time, and reason about spatial relations, interactions, and future dynamics. However, existing AD vision-language benchmarks largely focus on single-view, static, ego-centric, or single-source question answering, leaving it unclear whether current Vision-Language Models (VLMs) can truly construct and reason over dynamic driving scenes. We introduce DriveSpatial, a benchmark of 15.6K human-verified QA pairs across 20 tasks from five large-scale AD datasets. DriveSpatial evaluates four abilities: Cognitive Scene Construction, Multi-view Relational Understanding, Temporal Reasoning, and Generalization. Unlike prior benchmarks, DriveSpatial is generated from a dynamic multi-relational scene graph that encodes object states, spatial relations, interactions, camera visibility, and temporal correspondences, enabling QA pairs that enforce genuine cross-view and spatiotemporal reasoning. Evaluating 15 representative VLMs reveals a substantial human-model gap: the strongest model trails humans by 28.4 points, with Cognitive Scene Construction emerging as the key bottleneck. Further diagnostics show that language-only prompting is insufficient, while explicit BEV grounding consistently improves performance. These results suggest that current VLMs lack the scene-construction ability needed for reliable spatiotemporal driving intelligence. DriveSpatial and its construction pipeline will be released to support future research.

[CV-78] LQ-rPPG: A Label-Quantized Coarse-to-Fine Learning Framework for Remote Physiological Measurement

Link: https://arxiv.org/abs/2605.23174
Authors: Jun Seong Lee,Samyeul Noh,Changki Sung,Hyun Myung
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Remote photoplethysmography (rPPG) enables non-contact measurement of physiological signals from facial videos, offering strong potential for remote healthcare and daily health monitoring. Driven by this potential, various deep learning-based rPPG methods have been proposed to improve rPPG estimation. However, previous deep learning-based rPPG methods have paid little attention to the quality of training labels and their impact on model learning. Contact-based PPG signals used as training labels often contain noise and variability caused by motion artifacts, inconsistent sensor contact, and morphological distortions. Such label inconsistency can lead models to overfit to the label noise and variability and consequently degrade generalization performance. To address this issue, we propose LQ-rPPG, a label-quantized coarse-to-fine learning framework for robust rPPG estimation. LQ-rPPG consists of a label quantization module and a coarse-to-fine rPPG estimation model. The label quantization module transforms continuous PPG signals into multi-bit quantized pseudo labels with reduced noise and variability. The coarse-to-fine estimation model progressively refines rPPG signals under hierarchical supervision guided by the multi-bit pseudo labels. This design alleviates overfitting to label-specific variations and enables the model to learn structured and consistent representations. As a result, LQ-rPPG achieves robust and generalizable rPPG estimation even under challenging conditions. Experiments on multiple benchmark datasets demonstrate that LQ-rPPG achieves strong performance in both intra- and cross-dataset evaluations, while reducing parameters and multiply-accumulate operations by 88% and 29%, respectively, and increasing throughput by 191%. The code is available at this https URL.
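
To make the label-quantization idea concrete, the sketch below turns a continuous PPG-like signal into pseudo labels at several bit depths, coarse first. It assumes plain min-max normalization followed by uniform quantization; the abstract does not specify the actual quantizer, so this is only an illustration, and the function names and the bit depths `(1, 2, 4)` are hypothetical.

```python
def quantize_labels(signal, bits):
    """Uniformly quantize a signal into 2**bits integer levels.

    Assumption: min-max scaling over the given window, then rounding
    to the nearest level. Small fluctuations inside one level collapse
    to the same pseudo label, which is how noise gets suppressed.
    """
    levels = 2 ** bits
    lo, hi = min(signal), max(signal)
    if hi == lo:
        # A constant window carries no waveform information.
        return [0] * len(signal)
    scale = (levels - 1) / (hi - lo)
    return [int(round((x - lo) * scale)) for x in signal]


def coarse_to_fine_targets(signal, bit_depths=(1, 2, 4)):
    """Build one pseudo-label sequence per bit depth, keyed by bit depth."""
    return {bits: quantize_labels(signal, bits) for bits in bit_depths}
```

A 1-bit target only separates the upper and lower halves of the waveform, while higher bit depths recover progressively more of its shape. A model supervised hierarchically on such targets would first fit the coarse pulse structure before the finer morphology.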

[CV-79] Semantic-Aware Guided Drone Exploration for Language-Conditioned 3D Indoor Mapping CVPR2026

Link: https://arxiv.org/abs/2605.23160
Authors: Nitin Vegesna,Avideh Zakhor
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 6 figures, 4 tables. To be presented at the 2nd 3D-LLM/VLA Workshop at CVPR 2026 (non-archival workshop)

Abstract:We present Semantic-Aware Guided Exploration, SAGE, a system for open-vocabulary exploration in unknown 3D indoor environments that preserves coverage-oriented behavior while allowing semantic cues to reprioritize frontier selection. Building on the FALCON volumetric explorer, SAGE integrates Contrastive Language-Image Pre-training (CLIP) via four key components: object-centric embedding storage, a temporal cache that projects recent observations onto the free-unknown boundary, object frontiers for high-similarity detections, and a unified semantic-geometric planning cost. This cost function bounds semantic reweighting influence, ensuring frontiers are prioritized without sacrificing total coverage. In Matterport3D-based simulations, SAGE outperforms FALCON and a semantic-only ablation in object discovery across map-query pairs. Compared to Finding Things in the Unknown (FTU), SAGE completes exploration 9.0 to 25.9 times faster across the nine shared map-query pairs, achieving a mean speedup of 13.7. Furthermore, SAGE achieves substantially higher volumetric throughput than FTU. Finally, we deploy SAGE in five real-world flights in two environments on a Modal AI Starling 2 quadrotor with onboard sensing and planning, and offboard CLIP inference. Comparing SAGE and FALCON, we find that while FALCON results in faster exploration and shorter mapping trajectories, SAGE outperforms FALCON in terms of object discovery.

[CV-80] SLIP-RS: Structured-Attribute Language-Image Pre-Training for Remote Sensing Object Detection

Link: https://arxiv.org/abs/2605.23144
Authors: Chenxu Wang,Yuxuan Li,Yunheng Li,Xiang Li,Jingyuan Xia,Qibin Hou
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Existing language-image pre-training for remote sensing object detection is constrained by Monolithic Label Learning, which relies on exhaustively enumerating open-set categories via black-box data to acquire fine-grained representations, creating a dependency incompatible with the domain’s inherent data scarcity. To transcend this bottleneck, we propose SLIP-RS, establishing a Structured-Attribute Decoupling Paradigm that maps the open-ended category space into a finite, physically meaningful attribute space, unlocking fine-grained discriminability via explicit structural logic. This paradigm is realized via two technical pillars: (1) Structured-Attribute Contrastive Learning, which enforces the learning of decoupled intrinsic visual logic via combinatorial attribute augmentation; and (2) Conformal Attribute Reliability Engine, which leverages conformal prediction theory to rigorously distill high-fidelity supervision from noisy sources, yielding RS-Attribute-15M, the largest dataset with over 15 million attribute annotations. Extensive experiments demonstrate that SLIP-RS establishes unprecedented performance in fine-grained detection and cross-domain generalization, validating structured attributes as a vital foundation for remote sensing. Code: this https URL.

[CV-81] VisAnalog: A Diagnostic Suite for Visual Concept Transfer on Natural Images CVPR2026

Link: https://arxiv.org/abs/2605.23141
Authors: Zhaonan Li,Kyle R. Chickering,Bangzheng Li,Jacob Dineen,Xiao Ye,Zhikun Xu,Shijie Lu,Yuxi Huang,Ming Shen,Bach Nguyen,Jaya Adithya Pavuluri,Mau Son Nguyen,Sanika Chavan,Ngoc Minh Thu Le,Muhao Chen,Ben Zhou
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to the Workshop on Visual Concepts at CVPR 2026 as a non-archival report

Abstract:A useful test of visual concept learning is not just whether a model can recognize a concept in a single image, but whether it can preserve and manipulate concept-level properties under transformation and transfer them to new scenes. We introduce VisAnalog, a controlled suite for this setting on natural images. Each example instantiates the analogy A : B :: C : ?, where images B and a hidden target image D are produced by applying the same deterministic transformation sequence to source images A and C, respectively. Given A, B, and C, a model must answer a multiple-choice question about D. The benchmark contains 617 human-validated questions spanning one- to four-step transformations such as zoom, quadrant swap, rotation, flip, and hue rotation. Across strong proprietary and open-source VLMs, end-to-end accuracy is substantially lower than oracle accuracy when D is directly shown, and degrades sharply as transformation depth increases, while human performance remains near the ceiling. A program-conditioned evaluation further separates failures of relation inference from failures of transformation application, showing that inferring the visual relation from A → B is the dominant bottleneck, with additional application errors emerging on harder multi-step cases. The dataset is publicly available at this https URL.
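
The A : B :: C : ? construction can be sketched with a few deterministic operations applied in the same order to both source images. The example below uses tiny grids of integers in place of natural images and covers only flip, rotation, and quadrant swap. The exact operation definitions in the benchmark are not given in the abstract, so the ones here (and all names) are assumptions for illustration.

```python
def flip_h(img):
    """Mirror each row left to right."""
    return [row[::-1] for row in img]


def rotate_90_cw(img):
    """Rotate the grid 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]


def quadrant_swap(img):
    """Swap diagonally opposite quadrants (assumes even height and width)."""
    half_h, half_w = len(img) // 2, len(img[0]) // 2
    top, bottom = img[:half_h], img[half_h:]
    swap_cols = lambda row: row[half_w:] + row[:half_w]
    return [swap_cols(r) for r in bottom] + [swap_cols(r) for r in top]


OPS = {"flip_h": flip_h, "rotate_90_cw": rotate_90_cw, "quadrant_swap": quadrant_swap}


def apply_program(img, program):
    """Apply a sequence of named operations in order."""
    for name in program:
        img = OPS[name](img)
    return img


def make_analogy(img_a, img_c, program):
    """Return (B, D): the same program applied to A and to C."""
    return apply_program(img_a, program), apply_program(img_c, program)
```

Given `A`, `B`, and `C`, a solver has to recover `program` from the A-to-B pair and then apply it to `C`. The abstract's program-conditioned evaluation corresponds to handing the solver `program` directly, which isolates the application step from the inference step.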

[CV-82] Exploiting Longitudinal Context in Clinician-Verified Interactive Lesion Tracking MICCAI2026

Link: https://arxiv.org/abs/2605.23118
Authors: Yannick Kirchhoff,Maximilian Rokuss,Daniel Philipp Mertens,David Füller,Benjamin Hamm,Andreas Schreyer,Oliver Ritter,Klaus Maier-Hein
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at MICCAI 2026

Abstract:Tracking tumor lesions across serial CT scans is essential for oncological response assessment. Existing automated methods face a fundamental trade-off: end-to-end trackers achieve high automation but offer no opportunity to correct silent tracking failures, while decoupled registration-segmentation pipelines permit user verification yet discard the lesion’s prior appearance, limiting accuracy in ambiguous cases. In this work, we propose a Verified Tracking paradigm: a clinician verifies a registration-proposed prompt, which the model leverages alongside the baseline lesion appearance to resolve segmentation ambiguities. We present a unified framework combining early spatial prompt fusion with latent temporal difference weighting for longitudinally-informed segmentation. To address data scarcity, we leverage large-scale synthetic pretraining, which proves essential for exploiting longitudinal context and improves performance by up to 4.5 Dice points over training from scratch. Our approach secured first place in the MICCAI autoPET IV challenge. We further curate and release PanTrack, a new longitudinal pancreatic cancer benchmark, to assess out-of-distribution generalization. Experiments show that our model outperforms prior work in both the fully automatic and the proposed verified tracking settings, offering a clinically safe middle ground between automation and control. Code, model and dataset will be released at this https URL

[CV-83] CoReVAD: A Contextual Reasoning Framework for Training-Free Video Anomaly Detection ICPR2026

Link: https://arxiv.org/abs/2605.23116
Authors: Hyeongmuk Lim, Youngbum Hur
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to ICPR 2026

Abstract:Existing Video Anomaly Detection (VAD) methods typically rely on task-specific training, leading to strong domain dependency and high training costs. Moreover, most existing methods output only scalar anomaly scores, providing limited insight into why specific events are considered abnormal. Recent advances in Vision-Language Models (VLMs) have enabled both anomaly detection and human-interpretable reasoning. However, many VLM-based approaches still require additional training steps (e.g., instruction tuning or verbalized learning) or external Large Language Models (LLMs), incurring further training costs and inference overhead. To address these challenges, we propose CoReVAD, a contextual reasoning framework for training-free video anomaly detection that operates with a single frozen VLM. CoReVAD directly generates anomaly scores and temporal descriptions from the VLM. To mitigate noise in generative outputs, we introduce a Local Response Cleaning (LRC) module based on local vision-text alignment. Furthermore, global temporal context and progression are incorporated through softmax-based refinement, Gaussian smoothing, and position weighting. Experiments on UCF-Crime and XD-Violence demonstrate that CoReVAD achieves competitive performance among training-free methods while providing reliable and interpretable explanations. Our official code is available at: this https URL

[CV-84] Inconsistency-aware Multimodal Schrödinger Bridge for Deepfake Localization CVPR2026

Link: https://arxiv.org/abs/2605.23113
Authors: Jiayu Xiong, Jing Wang, Qi Zhang, Wanlong Wang, Jun Xue
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2026

Abstract:Audio-visual deepfake localization demands interval-level outputs that serve as temporal evidence. Despite recent progress, symmetric fusion under single-sided or asynchronous forgeries propagates cross-modal noise, degrading high-precision localization. We present IaMSB, an inconsistency-aware multimodal Schrödinger Bridge (SB) that jointly estimates cross-modal consistency and performs interval-level localization. Unlike diffusion models, SB minimizes path-distribution discrepancy and yields consistency scores without explicit noise injection or denoising. Built on SB, IaMSB unifies consistency estimation, cross-modal information selection, and bridge-step scheduling in one framework. Specifically, a lightweight coarse bridge first proposes candidate intervals and estimates cross-modal consistency; these statistics select cross-modal witness signals and allocate bridge steps asymmetrically across modalities. A refinement bridge then performs step-tuned fusion and outputs refined, time-aligned intervals. IaMSB anticipates single-sided and asynchronous forgeries and, using bottlenecked cross-modal interaction with step allocation, suppresses noise transfer and avoids unnecessary iterations. Across benchmarks, IaMSB stabilizes strict-IoU boundary precision, raising AP@0.95 by 3% to 10%, and yields improved high-precision localization, particularly for single-sided forgeries.

[CV-85] Flow Mismatching: Unsupervised Anomaly Detection via Velocity Discrepancies in Flow Matching Models

Link: https://arxiv.org/abs/2605.23070
Authors: Shengzhe Chen, Mehrdad Moradi, Kamran Paynabar, Hao Yan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We propose Flow Mismatching, an unsupervised anomaly detection method that deliberately avoids reconstruction-based paradigms. Instead, we treat flow matching as geometric dynamics and leverage a key insight: anomalies occur at places where the learned normal flow disagrees with the geometric path toward a test image. Given a flow matching model trained only on normal images, we probe its learned velocity field along affine paths from Gaussian noise to a target image. Along each path, we compare the model-predicted velocity, which follows normal generative dynamics, with the geometric velocity toward the target, which includes any anomalous content. Anomalies induce strong local disagreement between these velocities. Aggregating the mismatch over different time steps and multiple paths yields pixel-wise heatmaps and image-level scores without test-time optimization, feature memories, or additional calibration. Our analysis shows that the population mismatch decomposes into an irreducible denoising term and a Fisher-divergence term between the test-path and normal-path score functions, which identifies the score-gap component that drives anomaly separation and explains the effectiveness of robust path aggregation. Extensive experiments on MVTec-AD and VisA demonstrate superior performance compared with SOTA reconstruction-based and recent flow matching-based approaches.
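
The velocity comparison described in this abstract can be sketched on a toy example. Everything below is an illustrative assumption rather than the authors' implementation: images are flat lists of floats, the "trained" model is a closed-form velocity field for normal data concentrated at a single value, and the aggregation is a plain mean of absolute differences (the paper refers to robust path aggregation, whose exact form is not given here).

```python
import random
from typing import Callable, List, Sequence

Vector = List[float]
VelocityModel = Callable[[Vector, float], Vector]


def make_toy_velocity_model(normal_value: float) -> VelocityModel:
    """Hypothetical stand-in for a flow matching model trained on normal data.

    If every normal pixel equals `normal_value`, the velocity that carries a
    point x_t on an affine path to the data is (normal_value - x_t) / (1 - t).
    """
    def model(x_t: Vector, t: float) -> Vector:
        return [(normal_value - xi) / (1.0 - t) for xi in x_t]
    return model


def mismatch_heatmap(
    model: VelocityModel,
    target: Vector,
    times: Sequence[float] = (0.2, 0.5, 0.8),
    num_paths: int = 4,
    seed: int = 0,
) -> Vector:
    """Average |predicted velocity - geometric velocity| per pixel."""
    rng = random.Random(seed)
    heat = [0.0] * len(target)
    for _ in range(num_paths):
        noise = [rng.gauss(0.0, 1.0) for _ in target]
        geometric = [x - z for x, z in zip(target, noise)]  # velocity toward the target
        for t in times:
            x_t = [(1.0 - t) * z + t * x for z, x in zip(noise, target)]  # affine path
            predicted = model(x_t, t)
            for i, (p, g) in enumerate(zip(predicted, geometric)):
                heat[i] += abs(p - g)
    count = num_paths * len(times)
    return [h / count for h in heat]


# Pixel 2 deviates from the normal value 0.5 and should light up.
heat = mismatch_heatmap(make_toy_velocity_model(0.5), [0.5, 0.5, 0.9, 0.5])
image_score = max(heat)  # one simple choice of image-level score
```

In this toy setting the mismatch at a pixel reduces to |normal_value - x| / (1 - t), so normal pixels score essentially zero regardless of the sampled noise.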

[CV-86] RoboSurg-VQA: A Multimodal Benchmark for Surgical Segmentation-Aware Visual Question Answering

Link: https://arxiv.org/abs/2605.23068
Authors: Chengyi Zhang, Zi Ye, Ziyang Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Reliable visual understanding in robot-assisted and minimally invasive surgery (RMIS/MIS) demands more than accurate masks: in clinical practice, clinicians pose language-like questions about procedural context, visibility, artefacts, and the presence of anatomical structures and surgical instruments, often under degraded views caused by occlusion, smoke, bleeding, and specular highlights. We present RoboSurg-VQA, a segmentation-aware visual question answering (VQA) benchmark built by repurposing public surgical segmentation datasets under a shared schema. Each frame is paired with a fixed set of clinically motivated questions spanning procedure context, anatomy (including region), imaging modality/view, surgical artefacts, image quality, and basic visibility and spatial attributes, with closed answer sets to enable consistent evaluation. To scale annotation, we generate candidate answers via constrained prompting with automatic validity and consistency checks, followed by human auditing to improve plausibility and label consistency. We report benchmark statistics, sanity baselines, and common evaluation challenges under challenging surgical conditions. The code will be available on this https URL.

[CV-87] Dithering Defense: Adversarial Robustness of Vision Foundation Models via Multi-Level Floyd-Steinberg Dithering ICIP2026

Link: https://arxiv.org/abs/2605.23065
Authors: Yury Belousov, Brian Pulfer, Vitaliy Kinakh, Slava Voloshynovskiy
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Paper accepted at the IEEE International Conference on Image Processing (ICIP 2026)

Abstract:Vision foundation models are widely used as frozen backbones across many downstream tasks, making them a single point of failure under adversarial attack. We study multi-level Floyd-Steinberg error-diffusion dithering as a lightweight, model-agnostic input transformation that disrupts adversarial perturbations while preserving semantic content. Unlike prior work, which was limited to binary dithering, grayscale CIFAR-10, and a single small model trained from scratch, we evaluate across six tasks (classification, segmentation, depth estimation, retrieval, captioning, visual question answering), two model families (DINOv2, PaliGemma), and three attacks of increasing strength (PGD, MI-FGSM, SIA), as well as an adaptive attacker using a straight-through estimator. Our results show that Floyd-Steinberg dithering at intermediate quantization levels, especially when combined with post-processing blur, exceeds or matches all tested baselines, including diffusion-based denoising, with substantially less degradation on clean inputs.
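
For readers unfamiliar with the input transformation studied here, the following is a minimal sketch of multi-level Floyd-Steinberg error diffusion on a single-channel image. It uses the standard 7/16, 3/16, 5/16, 1/16 diffusion weights; the list-of-lists image format, the `levels` parameter name, and the omission of the post-processing blur are assumptions, and the paper's own preprocessing may differ.

```python
from typing import List

Image = List[List[float]]  # single-channel image with values in [0, 1]


def floyd_steinberg_dither(img: Image, levels: int) -> Image:
    """Quantize to `levels` evenly spaced values, diffusing the error forward."""
    if levels < 2:
        raise ValueError("levels must be at least 2")
    h, w = len(img), len(img[0])
    work = [[float(v) for v in row] for row in img]  # copy that accumulates error
    out = [[0.0] * w for _ in range(h)]
    top = levels - 1
    for y in range(h):
        for x in range(w):
            old = work[y][x]
            clipped = min(max(old, 0.0), 1.0)
            new = round(clipped * top) / top  # nearest quantization level
            out[y][x] = new
            err = old - new
            # Push the quantization error onto not-yet-visited neighbours.
            if x + 1 < w:
                work[y][x + 1] += err * 7 / 16
            if y + 1 < h:
                if x > 0:
                    work[y + 1][x - 1] += err * 3 / 16
                work[y + 1][x] += err * 5 / 16
                if x + 1 < w:
                    work[y + 1][x + 1] += err * 1 / 16
    return out


binary = floyd_steinberg_dither([[0.5, 0.5]], levels=2)
four_level = floyd_steinberg_dither([[0.1 * (x + y) for x in range(4)] for y in range(4)], levels=4)
```

With `levels=2` this is classic binary dithering; intermediate values of `levels` correspond to the multi-level setting the abstract evaluates.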

[CV-88] Millimeter-wave Imaging for Anthropometric Body Measurement

Link: https://arxiv.org/abs/2605.23064
Authors: Miriam Senne, Benjamin D. Killeen, Christoph Baur, Nassir Navab, Azade Farshad
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Body shape and circumferences are clinically informative biomarkers for risk stratification, including measures such as waist to hip ratio, limb and trunk girths, yet conventional tools such as manual tape measures and optical scanners often require undressing and sustained poses. These demands slow workflows, compromise dignity, and exclude many older adults and people with limited mobility. To make measurement fast and contactless, we leverage millimeter-wave (mmWave) radar, which preserves privacy and operates through typical clothing, enabling quick full-body acquisition. In this work, we present a new optimization-based framework to recover 3D human shape and extract a comprehensive set of anthropometric measurements from volumetric mmWave data. Our method introduces a weighted registration pipeline that fits a parametric body model (SMPL) directly to the noisy mmWave point cloud. The core of our contribution is a vertex-weighting strategy that modulates a Chamfer energy function for reliable surface alignment and noise elimination. We further stabilize the fit by incorporating a foot-ground plane constraint and pose priors, optimizing directly for the SMPL parameters. Together, these components enable a fast, privacy preserving workflow that delivers high fidelity body shape and measurements through clothing without cameras or disrobing and with minimal cooperation, supporting frequent risk oriented assessments in clinics and care facilities for patients of all ages and mobility levels.
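
A vertex-weighted Chamfer energy of the kind mentioned in this abstract can be sketched as follows. The abstract does not specify the weighting scheme, so the form below (per-vertex weights on the model-to-cloud term, and the nearest vertex's weight on the cloud-to-model term) is an assumption chosen for illustration, as are the function and parameter names. A real pipeline would evaluate this on SMPL vertices inside an optimizer; here the points are plain tuples.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def _sq_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance between two 3D points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def weighted_chamfer(
    vertices: Sequence[Point],
    weights: Sequence[float],
    cloud: Sequence[Point],
) -> float:
    """Symmetric Chamfer energy in which each model vertex carries a weight."""
    if len(vertices) != len(weights):
        raise ValueError("one weight per vertex is required")
    total_w = sum(weights)
    if total_w <= 0:
        raise ValueError("weights must sum to a positive value")

    # Model -> cloud: each vertex pulls toward its nearest cloud point.
    model_to_cloud = sum(
        w * min(_sq_dist(v, p) for p in cloud) for v, w in zip(vertices, weights)
    ) / total_w

    # Cloud -> model: each cloud point is weighted by its nearest vertex,
    # so points near down-weighted vertices (e.g. noisy regions) count less.
    num, den = 0.0, 0.0
    for p in cloud:
        nearest = min(range(len(vertices)), key=lambda i: _sq_dist(vertices[i], p))
        num += weights[nearest] * _sq_dist(vertices[nearest], p)
        den += weights[nearest]
    cloud_to_model = num / den if den > 0 else 0.0

    return model_to_cloud + cloud_to_model


verts = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
cloud = [(0.0, 0.0, 0.0), (10.0, 5.0, 0.0)]  # second point acts as a noisy return
uniform = weighted_chamfer(verts, [1.0, 1.0], cloud)
downweighted = weighted_chamfer(verts, [1.0, 0.0], cloud)
```

Setting the second vertex's weight to zero removes the influence of the noisy return near it, which is the intuition behind using weights for noise suppression.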

[CV-89] The TIME Machine: On The Power of Motion for Efficient Perception

Link: https://arxiv.org/abs/2605.23045
Authors: Mantas Skackauskas, Xinyue Hao, Laura Sevilla-Lara
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Video representation learning has seen tremendous progress in recent years. This has been driven by many factors, including the scale of training and the success of visual models trained contrastively with language. While these factors have pushed the boundaries of what video models can do, they also introduce their own set of limitations: first, scaling video models can reach prohibitive costs and second, learning from language restricts the range of concepts that can be learned to those in captions. As a result, video models still struggle with temporal understanding. In this paper we propose a novel approach that uses motion as the central modality for video representation. In particular, given the motion in a video in the form of point-tracks, we use a masked-autoencoder to mask some of the tracks and train the autoencoder to reconstruct the missing tracks. This allows us to learn a representation in a self-supervised manner. We show that using motion to represent videos actually addresses both of the core limitations of video technology. First, it allows us to massively reduce the scale of training data, as motion is inherently appearance-independent and hence needs fewer examples to generalize well. Second, motion allows us to bypass the language-dependent training paradigm, learning better fine-grained concepts. The result is an embedding that we call TIME (Temporally Informed Motion Embedding), a representation trained exclusively on synthetic motion data. We test this embedding on a wide set of tasks in a zero-shot manner. We observe that without bells and whistles, performance is on par with state-of-the-art models using up to 4 orders of magnitude less training data. This is a stepping stone towards a new paradigm of video models that are both more temporally aware as well as more scalable.

[CV-90] Scene Reconstruction as Mapping Priors for 3D Detection CVPR2026

Link: https://arxiv.org/abs/2605.22997
Authors: Yang Fu, Yuliang Zou, Hao Xiang, Xin Huang, Yijing Bai, Chen Song, Weijing Shi, Govind Thattai, Dragomir Anguelov, Mingxing Tan, Yingwei Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2026

Abstract:In autonomous driving, mapping is critical for motion planning but remains an under-utilized resource for perception tasks such as 3D object detection. Maps can provide robust structural priors of the static environment, helping resolve ambiguities and correct for sensor data sparsity or noise, especially for distant objects or under adverse weather conditions. However, conventional High-Definition (HD) maps are resource-intensive to obtain and maintain, which presents a challenge for efficient, large-scale deployment. In this paper, we propose a scalable solution to systematically leverage mapping to improve 3D detection by overcoming two primary challenges. First, we introduce a pipeline to automatically build dense mapping priors from aggregated sensor data, eliminating the need for human labeling. Second, we design a novel Mapping Priors Augmented 3D Detection (MPA3D) framework to effectively integrate mapping priors with different sensor modalities. Extensive experiments on the Waymo Open Dataset demonstrate that our approach achieves new state-of-the-art results, proving the effectiveness of scalable reconstructed scene priors for enhancing 3D detection.

[CV-91] CoMoGen: COntrollable MOtion Dynamics and Interactions with Mask-Guided Video GENeration

Link: https://arxiv.org/abs/2605.22996
Authors: Adil Meric, Lin Geng Foo, Mert Kiray, Benjamin Busam, Rishabh Dabral, Christian Theobalt
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We present CoMoGen, a controllable video generation framework that generates realistic interactive dynamics from a single binary mask sequence conditioned on an input image. CoMoGen introduces a lightweight MaskAdapter that encodes binary mask sequences into a latent residual signal, injected into the Multi Modal Diffusion Transformer (MMDiT) model through a cosine-weighted schedule. Unlike the hierarchical coarse-to-fine design of UNet architectures, MMDiT operates as a sequence of uniform transformer blocks, making it difficult to identify which layers are responsible for the motion generation. Therefore, we propose a novel way to determine “Motion Layers” operating in the attention space of MMDiT. We fine-tune the model by using Low-Rank Adaptation (LoRA) to the Motion Layers, without requiring any architecture change in the MMDiT. This selective adaptation enables our method to focus on motion-critical components, yielding reduced computational cost. Despite its simplicity, CoMoGen enables precise subject motion and plausible interactions with surrounding humans, objects, and scenes. Comprehensive experiments on different datasets show that CoMoGen consistently outperforms prior controllable video generation methods and achieves state-of-the-art performance in motion fidelity and perceptual realism. Project page: this http URL.

[CV-92] Improved Vision-to-Chart Buoy Association with Learned World-to-Image Projection CVPR2026

Link: https://arxiv.org/abs/2605.22942
Authors: Borja Carrillo-Perez (Arquimea Research Center)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 3 figures. Technical report for the MaCVi 2026 Vision-to-Chart Data Association Challenge at the CVPR 2026 Workshop; 2nd place submission. Code: this https URL

Abstract:This report presents a lightweight modification to the DETR-based fusion transformer baseline for the MaCVi 2026 Vision-to-Chart data association challenge. The challenge baseline decoder receives per-buoy queries encoding world-space distance and bearing, forcing the transformer to implicitly learn the complex geometric projection from world coordinates to image pixels. Instead, this work trains an additional dedicated MLP, QueryMLP, to explicitly predict the buoy’s waterline contact point in the image from chart measurements and IMU orientation data. The predicted pixel coordinates are appended to the baseline decoder query vector, providing a direct spatial prior per buoy and reducing the geometric reasoning burden on the transformer decoder. On the challenge leaderboard, the presented approach achieves an Overall score of 0.7386, with F1 = 0.8055 and mIoU = 0.6718, on the held-out test set, placing second among all submissions.

[CV-93] VideoOdyssey: A Benchmark for Ultra-Long-Context and Omni-Modal Video Understanding

Link: https://arxiv.org/abs/2605.22907
Authors: Haichen He, Jiayi Zhou, Sifeng Shang, Yihan Hu, Yuanhan Zhang, Kaiyang Zhou
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Real-world long video understanding requires models to perform continuous tracking, information integration and memory retention over massive temporal spans within extreme video durations. Mastering this intense cognitive load constitutes the fundamental bottleneck in long video understanding. While existing benchmarks have driven progress by scaling up video duration, their evaluation tasks often require comprehending only short and isolated video segments, falling short of capturing the challenge of ultra-long-context reasoning. To measure this cognitive load, we emphasize continuous certificate length, defined as the video length a human must continuously watch to definitively answer a given question. Driven by this metric, we introduce VideoOdyssey, a benchmark specifically designed for ultra-long-context and omni-modal video understanding. VideoOdyssey is characterized by three key features: 1) Extreme video duration and diversity: spanning 11 domains and 54 subcategories with an average video duration of 109 minutes; 2) Comprehensive evaluation scenarios: offering two subsets to address different research focuses, i.e., VideoOdyssey-V for probing the limits of visual understanding in MLLMs, and VideoOdyssey-AV for evaluating synchronized audio-visual understanding for omni-modal models; 3) Ultra-long and multi-level continuous certificates: extending the average continuous certificate to 16 minutes for VideoOdyssey-V and 12.8 minutes for VideoOdyssey-AV. Crucially, we design 5 granular levels from seconds to hours, providing a comprehensive diagnostic tool to evaluate models across varying context lengths and cognitive loads. Extensive evaluations show that bottlenecks of current MLLMs extend beyond simple retrieval to include struggles with continuous reasoning across varying context lengths, fine-grained perception, and non-verbal omni-modal understanding.

[CV-94] Suicide Risk Assessment from AI-powered Video Surveillance: An Interpretable Framework for Prevention in Metro Stations IJCAI26

Link: https://arxiv.org/abs/2605.22904
Authors: Safwen Naimi, Wassim Bouachir, Guillaume-Alexandre Bilodeau, Brian Mishara
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 9 pages, 6 figures, 1 table. Accepted for publication in IJCAI 26

Abstract:Understanding and monitoring human behavior in metro stations play an important role in supporting suicide prevention efforts, where early identification of high-risk situations can enable timely intervention. This requires assessing suicide risk from a surveillance video by jointly reasoning about the behavior of each passenger, his/her spatial context, and temporal dynamics. However, this assessment using videos captured by surveillance cameras is challenging, as it demands accurate perception of human motion, understanding of platform geometry, and aggregation of heterogeneous behavioral cues over time. In this work, we formalize the task of Suicide Risk Assessment (SRA) in metro stations and introduce the first interpretable framework that addresses this challenge. Unlike approaches that focus on isolated subtasks or attempt to infer intent directly, our formulation assesses suicide risk from accumulated evidence by incorporating person tracking, activity recognition, semantic segmentation of the platform, and trajectory-driven risk heatmap modeling. By formalizing SRA as a distinct task and benchmarking a complete operational pipeline achieving 83.2% ROC-AUC on real surveillance data, this work highlights the complexity of suicide risk assessment and opens new directions for research on interpretable AI systems for social good.

[CV-95] Extending Deep Event Visual Odometry with Sparse Point-Cloud Export

Link: https://arxiv.org/abs/2605.22890
Authors: Alireza Safdari, Sajad Ashraf
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 4 figures, 5 tables

Abstract:Event cameras are well suited for visual odometry under high-speed motion and challenging lighting conditions due to their low latency, high temporal resolution, and high dynamic range. Deep Event Visual Odometry (DEVO) demonstrated that monocular event-only odometry can achieve strong performance by combining sparse patch tracking, learned patch selection, recurrent correspondence refinement, and differentiable bundle adjustment. In this project, we extend DEVO with a sparse point-cloud export pipeline. Rather than modifying the core odometry formulation, our approach exposes the internal 3D structure already estimated by DEVO and converts it into an explicit point-cloud representation for visualization and further processing. In addition, we implement a practical workflow for data export, format conversion, and point-cloud cleanup. The resulting system preserves the original visual odometry pipeline while enabling sparse geometric scene output. Experiments on the BOARD SLOW sequence show that the exported sparse cloud is locally consistent with EMVS reconstructions, achieving high precision at a 5 cm threshold, while also highlighting the expected limitations in density, completeness, and sensitivity to accumulated odometry noise.

[CV-96] GEM-4D: Geometry-Enhanced Video World Models for Robot Manipulation

Link: https://arxiv.org/abs/2605.22882
Authors: Kaichen Zhou, Yuzhen Chen, Fangneng Zhan, Hang Hua, Grace Chen, Xinhai Chang, Ao Qu, Yilun Du, Zhuang Liu, Paul Pu Liang, Mengyu Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Robotic World Model, Video Generative Model

Abstract:Video world models can generate realistic futures from a single instruction, but they often fail to preserve consistent point-level motion over time. As a result, the generated videos appear plausible, yet lack the physical grounding required for reliable action execution, such as robot manipulation. We present GEM-4D, a geometry-grounded video world model that resolves this limitation by injecting dense 4D correspondence supervision, distilled from a pretrained geometry foundation model, into the video generative backbone during training. This supervision enables the model to jointly capture appearance and geometric structure while retaining a single-stream architecture with no additional inference cost. We further introduce an inverse dynamics module that converts correspondence-consistent video rollouts into executable robot trajectories, enabling direct deployment in both real-world and simulated manipulation. GEM-4D achieves state-of-the-art performance on both video prediction and geometric consistency across simulation and realistic scenarios and improves real-world manipulation success from 61% to 81%. Additional results are available at the project page: this https URL.

[CV-97] MedExpMem: Adapting Experience Memory for Differential Diagnosis MICCAI2026

Link: https://arxiv.org/abs/2605.22872
Authors: Qianhan Feng, Zhongzhen Huang, Yakun Zhu, Yannian Gu, Winnie Chiu Wing Chu, Xiaofan Zhang, Qi Dou
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: MICCAI 2026 Early Accept. Submission Version

Abstract:Experienced physicians develop diagnostic expertise through clinical practice, acquiring not only disease knowledge but also the ability to differentiate confusable conditions. Current medical vision-language models (VLMs) lack this capability – their parameters encode static knowledge that does not evolve across diagnostic encounters. We propose MedExpMem, an experience memory framework enabling VLM-based diagnostic agents to accumulate differential diagnosis expertise. Unlike retrieval-augmented generation, which retrieves encyclopedic disease descriptions, MedExpMem memorizes discriminative experience derived from the agent’s own diagnostic failures and organizes it as pairwise differential notes encoding key discriminators, actionable decision rules and reasoning error patterns. The framework adopts a two-phase construction process mirroring physician learning: initial practice exposes knowledge gaps, and reflective re-diagnosis refines understanding. When encountering new cases, the agent retrieves experience memory to guide differential reasoning. We evaluate MedExpMem on a radiology benchmark spanning 11 subspecialties. Results demonstrate consistent accuracy improvements of up to 7.0% across diverse models and scales. Analytical experiments validate experience quality and robustness, demonstrating that MedExpMem is a competitive method that addresses medical adaptation needs beyond the reach of parametric learning.

[CV-98] Efficient Learned Image Compression without Entropy Coding ICML2026

Link: https://arxiv.org/abs/2605.23323
Authors: Hao Cao, Wenqi Guo, Zhijin Qin, Jungong Han
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICML 2026

Abstract:Entropy coding is widely used in typical learned image compression (LIC) that converts latents into a compact bitstream. However, entropy coding is typically sequential and becomes the coding latency bottleneck. To overcome it, we present Entropy-Coding Free Learned Image Compression (EF-LIC), a multi-rate framework that generates compact representation by removing statistical and correlation redundancy with low coding latency. First, we introduce unconstrained vector quantization and prove that its index distribution approaches the maximum-entropy bound, yielding minimal statistical redundancy. Second, we propose a context-conditioned autoregressive transform that directly reparameterizes the latents to reduce inter-dependency. Theoretical analysis shows that EF-LIC can remove correlation redundancy as effectively as typical LIC with entropy coding, leading to comparable compression performance. Experiments show EF-LIC achieves up to 67.86% bitrate reduction over MS-ILLM on Kodak with LPIPS. Ablation studies further show EF-LIC matches the compression performance of its entropy-coding based variant while achieving over 3× faster encoding and 5× faster decoding.
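
The claim about the maximum-entropy bound can be made concrete with a short calculation. When codebook indices are stored with a fixed-length code, each index costs log2(K) bits for a codebook of size K, and an entropy coder could at best save the gap between that cost and the empirical entropy of the indices. The sketch below measures that gap; the function names are illustrative and the example index streams are made up, not taken from the paper.

```python
import math
from collections import Counter
from typing import Sequence


def index_entropy_bits(indices: Sequence[int]) -> float:
    """Empirical Shannon entropy of a stream of codebook indices, in bits."""
    n = len(indices)
    counts = Counter(indices)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def statistical_redundancy_bits(indices: Sequence[int], codebook_size: int) -> float:
    """Bits per index that fixed-length coding wastes relative to the entropy.

    Assumes the fixed-length cost is log2(codebook_size), i.e. the codebook
    size is a power of two.
    """
    return math.log2(codebook_size) - index_entropy_bits(indices)


uniform_stream = [0, 1, 2, 3] * 4   # every index equally likely
skewed_stream = [0, 0, 0, 1]        # one index dominates

uniform_gap = statistical_redundancy_bits(uniform_stream, codebook_size=4)
skewed_gap = statistical_redundancy_bits(skewed_stream, codebook_size=2)
```

A uniform index distribution leaves no gap, which is why an index distribution close to the maximum-entropy bound makes a separate entropy coding stage unnecessary.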

[CV-99] Discontinuous Galerkin Neural Operator for Pathology Defocus Deblurring ICML2026

Link: https://arxiv.org/abs/2605.23282
Authors: Shaoqing Duan, Haofei Song, Xintian Mao, Qingli Li, Yan Wang
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 17 pages, 9 figures. Accepted by ICML 2026

Abstract:Defocus deblurring in pathological microscopy remains challenging due to the spatially varying and locally discontinuous nature of optical blur induced by a position-dependent integral imaging process. Existing deep learning methods, constrained by shift-invariance assumptions and limited interpretability, are not well suited to such heterogeneous blur patterns. Neural operators provide a principled alternative by modeling defocus formation directly as an integral operator, offering a new perspective on defocus deblurring. However, most existing neural operator architectures for low-level vision rely on globally parameterized kernels that assume smoothness and stationarity, limiting their ability to model heterogeneous and locally discontinuous blur patterns. To address this limitation, we propose the Discontinuous Galerkin Neural Operator (DGNO), which parameterizes the integral kernel using a discontinuous Galerkin formulation with element-local volume operators and interface numerical fluxes. DGNO provides a principled combination of locality, heterogeneity modeling, and global coherence while preserving the underlying physics of optical image formation. Extensive and insightful experiments demonstrate that DGNO surpasses state-of-the-arts, delivering sharper reconstructions, robust handling of spatially varying blur, and scalable high-resolution performance. The code will be released at this https URL.

[CV-100] GMENet: Generative Mixture of Experts Network for Multi-Center Glioma Diagnosis with Incomplete Imaging Sequences IJCAI

Link: https://arxiv.org/abs/2605.23183
Authors: Pengfei Song, Fangjin Liu, Wenwen Zeng, Yonghuang Wu, Chengqian Zhao, Feiyu Yin, Xuan Xie, Jinhua Yu
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: IJCAI Accept

Abstract:Contemporary glioma diagnosis integrates molecular features with histopathology to guide clinical decision-making. However, in clinical settings, divergent imaging protocols result in incomplete MRI sequences, leading to two primary challenges: forcing existing frameworks to discard a large portion of clinical data during training and consequently limiting their clinical applicability. To address these limitations, we propose GMENet, a Generative Mixture of Experts Network for multi-center glioma diagnosis with incomplete imaging sequences. Firstly, we design a Cross-attention-based Gated Generation Module that synthesizes missing sequence features from available sequences via cross-attention and dynamic gating mechanisms, incorporating a cycle-consistency loss to preserve semantic integrity. Secondly, we introduce a Dynamically Weighted Experts Fusion Module that performs mixture-of-experts interaction and confidence-aware fusion over original and synthesized dual-sequence features for multi-task prediction. We evaluate GMENet on a multi-center cohort of 1,241 subjects from four in-house datasets and two public repositories. Experiments show that GMENet expands clinically usable training data by 97%, relative to complete-sequence-only data. Furthermore, it consistently outperforms state-of-the-art methods trained on complete data, demonstrating improved robustness under cross-center distribution shifts.

[CV-101] STAMBRIDGE: Spectral-Temporal Amplitude-aware Mid-Feature Bridge for EEG Visual Decoding

Link: https://arxiv.org/abs/2605.23137
Authors: Jiahe Meng,Weiming Zeng,Yueyang Li,Bo Chai,Hongjie Yan,Zhiguo Zhang,Wai Ting Siok,Nizhuan Wang
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Electroencephalography (EEG) visual decoding remains challenging due to the modality gap between low-SNR neural signals and highly structured vision–language spaces, making direct cross-modal alignment unstable. To address this, we propose STAMBRIDGE, a versatile two-stage framework that sequentially tackles feature conditioning and cross-modal alignment. First, we introduce a Spectral-Temporal Amplitude-aware Modulation (STAM) to extract well-conditioned EEG representations. By replacing hard frequency masking with amplitude-derived soft channel weighting and multi-scale temporal convolutions, STAM explicitly preserves frequency-aware transients while reducing the risk of time-domain ringing artifacts. Building upon these robust neural features, we further introduce a model-agnostic Mid-Feature Semantic Bridge (MFSB) that constructs a regularized intermediate space through directed cross-modal interactions, enabling staged distillation and more stable semantic alignment. Experiments on the THINGS-EEG benchmark show competitive 200-way zero-shot retrieval performance, with 34.50% Top-1 and 65.95% Top-5 accuracy. In addition, embeddings learned by STAMBRIDGE produce semantically coherent image reconstructions with a diffusion model, demonstrating robust EEG-to-vision semantic alignment. The code is available at: this https URL.

[CV-102] Do Synthetic Brain MRIs Reliably Improve Tumour Classification? A StyleGAN2-ADA Class-Plane Augmentation Study on BRISC 2025

Link: https://arxiv.org/abs/2605.23094
Authors: José Rafael Noriega Cedeño
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 16 figures

Click to view abstract

Abstract:Generative augmentation is often proposed as a remedy for small medical-image datasets, but synthetic images are only useful when they improve downstream task performance. “Augmentation” here means synthetic supplementation: GAN-generated samples added to the real training pool, not geometric or photometric transforms of existing images. Twelve class-plane StyleGAN2-ADA generators were trained on constrained BRISC 2025 partitions to test whether their output, with or without InceptionV3 feature-space filtering, improves held-out tumour classification across three classifier families: a random forest (RF) on InceptionV3 features, a compact two-headed convolutional neural network (CNN), and MobileViTV2, a mobile hybrid convolutional-transformer. Each was evaluated at 1:1 and 1:2 real-to-synthetic ratios. An independent GPT-5.5 blind test placed gated real-versus-synthetic discrimination at 57.73% (95% CI: 54.48–60.92%) on the model-legible subset – modestly above chance. The RF classifier did not benefit from the synthetic MRIs. The CNN showed consistent mean gains that did not survive Holm correction. MobileViTV2 showed the clearest benefit: filtered 1:1 augmentation improved tumour classification accuracy by 1.02% absolute (95% CI: 0.54–1.54%; Holm-corrected p = 0.0104). A secondary efficiency analysis found that every augmented CNN condition selected its checkpoint 42–64% earlier than baseline, while compute-matched MobileViTV2 runs reached selection after 50–67% fewer real-data epochs. Overall, augmentation utility was found to be architecture- and ratio-dependent, not guaranteed by visual fidelity alone.
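
The Holm correction referred to in this abstract can be illustrated with a short sketch. This is the standard Holm-Bonferroni step-down adjustment, not code from the paper; the function name `holm_adjust` and the example p-values are hypothetical.

```python
def holm_adjust(p_values):
    """Return Holm-adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Step-down multiplier: m for the smallest p, m - 1 for the next, ...
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values, not taken from the paper.
raw = [0.01, 0.04, 0.03, 0.20]
adjusted = holm_adjust(raw)
significant = [p <= 0.05 for p in adjusted]
```

A comparison "survives Holm correction" when its adjusted p-value stays at or below the chosen significance level; in this toy input only the smallest raw p-value does.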

Artificial Intelligence

[AI-0] LLMs as Noisy Channels: A Shannon Perspective on Model Capacity and Scaling Laws ICML2026

Link: https://arxiv.org/abs/2605.23901
Authors: Xu Ouyang,Deyi Liu,Yuhang Cai,Jing Liu,Yuan Yang,Chen Zheng,Thomas Hartvigsen,Yiyuan Ma
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Comments: Accepted by ICML 2026

Click to view abstract

Abstract:Existing scaling laws for Large Language Models (LLMs), predominantly monotonic power laws, fail to explain emerging non-monotonic phenomena such as catastrophic overtraining and quantization-induced degradation, where performance deteriorates despite increased compute. We propose the Shannon Scaling Law, a unified theoretical framework that models LLM training as information transmission over a noisy channel, grounded in the Shannon-Hartley theorem. By mapping model parameters to channel bandwidth and training tokens to signal power, our formulation explicitly captures the interaction between learning signal and intrinsic noise. This perspective reveals a fundamental Shannon capacity for LLMs: scaling model size or data without preserving a sufficient signal-to-noise ratio (SNR) inevitably amplifies noise, inducing a transition from monotonic improvement to U-shaped performance degradation. We validate our theory through experiments on Pythia and OLMo2 under perturbations, including Gaussian noise, quantization and supervised fine-tuning on math, QA and code tasks. The Shannon Scaling Law consistently outperforms classical scaling laws and recent perturbation-aware laws, achieving strong $R^2$ scores and accurately capturing loss basins missed by prior approaches. It also extrapolates: fitted on $\leq 6.9$B Pythia models with $\leq 180$B tokens, it predicts the unseen 12B model up to 307B tokens at pooled $R^2 = 0.847$, while monotonic baselines collapse.
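
For reference, the classical Shannon-Hartley theorem that this abstract builds on gives channel capacity as $C = B \log_2(1 + S/N)$. The sketch below evaluates only that classical formula. The abstract does not state the functional form of the Shannon Scaling Law, so reading `bandwidth` as model parameters and `signal_power` as training tokens is an analogy only, and `noise_density` and both function names are hypothetical.

```python
import math


def shannon_capacity(bandwidth, signal_power, noise_power):
    """Classical Shannon-Hartley capacity: C = B * log2(1 + S/N)."""
    return bandwidth * math.log2(1.0 + signal_power / noise_power)


def capacity_with_bandwidth_noise(bandwidth, signal_power, noise_density):
    # Assumption for illustration: total noise power grows linearly
    # with bandwidth (white noise, N = N0 * B).
    return shannon_capacity(bandwidth, signal_power, noise_density * bandwidth)


# Widen the band at fixed signal power and fixed noise density.
capacities = [capacity_with_bandwidth_noise(b, 1.0, 1.0) for b in (1, 10, 100, 1000)]
limit = 1.0 / math.log(2)  # S / (N0 * ln 2) with S = N0 = 1
```

With white noise, widening the band at fixed signal power gives diminishing returns: capacity approaches `limit` instead of growing without bound. The classical formula on its own does not produce the U-shaped degradation the abstract describes; that behaviour belongs to the paper's own formulation.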

[AI-1] From Raw Experience to Skill Consumption: A Systematic Study of Model-Generated Agent Skills

Link: https://arxiv.org/abs/2605.23899
Authors: Zisu Huang,Jingwen Xu,Yifan Yang,Ziyang Gong,Qihao Yang,Muzhao Tian,Xiaohua Wang,Changze Lv,Xuemei Gao,Qi Dai,Bei Liu,Kai Qiu,Xue Yang,Dongdong Chen,Xiaoqing Zheng,Chong Luo
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Language agents increasingly improve by reusing skills – structured procedural artifacts distilled from past experience. In particular, domain-level and model-generated skills are especially promising. They offer fast adaptation within a domain by encoding domain-specific recurring procedures, and they scale beyond labor-intensive hand-crafting. However, while extraction methods continue to proliferate, understanding remains limited, with no comprehensive study spanning the full skill lifecycle – experience generation, skill extraction, and skill consumption – to ask whether such skills actually work, when they work, and what makes them succeed or fail. To close this gap, we build a utility-grounded evaluation framework that provides systematic experimental results across extractors and target agents, covering five diverse agentic task domains. We find that model-generated skills are beneficial on average but exhibit non-trivial negative transfer, and that neither extractors nor targets behave uniformly. A model can be a strong extractor yet a weak consumer, or vice versa, with skill utility independent of model scale or baseline task strength. To explain these patterns, we then dissect each lifecycle stage in depth, analyzing how experience composition shapes skill quality, what properties characterize useful skills, and how the same skill transfers across different consumers. Finally, we translate these findings into a concrete meta-skill that guides skill extraction toward the features tied to actual utility, which consistently improves skill quality across domains and substantially reduces negative transfer.

[AI-2] SPACENUM: Revisiting Spatial Numerical Understanding in VLMs

Link: https://arxiv.org/abs/2605.23898
Authors: Jianshu Zhang,Yijiang Li,Huifeixin Chen,Haoran Lu,Letian Xue,Bingyang Wang,Han Liu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Project page: this https URL

Click to view abstract

Abstract:Vision-Language Models (VLMs) are increasingly deployed in embodied environments, where they need to produce numerical outputs such as action magnitudes and spatial coordinates. Although these numbers appear meaningful, it remains unclear whether they are genuinely grounded in spatial perception. Therefore, in this work, we revisit spatial numerical understanding through SpaceNum, a unified framework that captures two complementary settings: numbers as dynamic transitions during spatial exploration, and numbers as static layouts in spatial reasoning. We formulate two bidirectional tasks, Num2Space and Space2Num, to evaluate how well VLMs map between vision-side spatial structure and language-side numerical representations. We systematically study whether current VLMs truly understand numerical values in spatial settings. Across dynamic transitions and static layouts, we find that models largely fail to ground numbers in spatial meaning and often perform close to random guessing. Through error analysis, reasoning trace analysis, and controlled interventions, we show that current VLMs rely heavily on shallow spatial cues, struggle to build stable coordinate-aware representations, and fail to abstract structured spatial layouts from visual observations. We further show that explicit reasoning provides only marginal gains, while tuning can partially improve spatial numerical understanding and transfer to external spatial reasoning benchmarks.

[AI-3] It's the humans, not the data: Geopolitical bias in LLMs originates in post-training, amplified by the language of the prompt

Link: https://arxiv.org/abs/2605.23825
Authors: Stuart Bladon,Brinnae Bent
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 12 pages, 6 figures, 2 tables, 3 appendices. Code and scenario bank: this https URL

Click to view abstract

Abstract:It has generally been assumed that geopolitical bias in language models originates from the training data used during the pre-training phase. We tested seven open-weight LLM pairs consisting of the base model (pre-training only) and the chat model (pre-training and post-training) from seven labs on a paired-scenario forced-choice probe over 28 country pairs in English, French, and Chinese, and found that geopolitical bias originates in post-training rather than in pre-training. Across seven AI labs, six showed shifts in the direction associated with the country or region of the model developer after post-training. This shift is strongest in Alibaba’s Qwen 2.5: while the base is neutral on China-favourability (-0.15 log-odds, $p = 0.15$), the post-trained chat variant is at +2.91 ($p < 10^{-4}$), an 18x shift in odds. We also observe shifts in biases toward other countries across all models. Additionally, the magnitude of this shift depends on the language used to prompt the model: the French-made Mistral becomes pro-France only under French prompting (FR-EN shift +1.91, $p < 10^{-4}$). These findings suggest that geopolitical preferences in language models are not simply inherited from large-scale internet data but are actively shaped during post-training, highlighting the need for greater transparency, auditing, and oversight of alignment processes that influence how models represent nations, cultures, and political perspectives.
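
The log-odds figures in this abstract can be converted to odds with a one-line calculation. The sketch assumes natural-log odds, which the abstract does not state explicitly, and the function name `odds_multiplier` is hypothetical. Under that assumption the reported "18x" is consistent with exponentiating the chat model's +2.91 value.

```python
import math


def odds_multiplier(log_odds):
    """Convert a natural-log odds value into an odds ratio."""
    return math.exp(log_odds)


chat_odds = odds_multiplier(2.91)    # roughly 18.4
base_odds = odds_multiplier(-0.15)   # roughly 0.86, close to even odds
```

An odds ratio near 1 means no preference either way, which is why the base model's value of about 0.86 (with $p = 0.15$) is described as neutral.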

[AI-4] Beyond Binary Edits: Robust Multimodal Knowledge Editing with Adversarial Subspace Alignment

Link: https://arxiv.org/abs/2605.23780
Authors: Haoyuan Wang,Xiaohao Liu,Jiajie Su,Jianmao Xiao,Chaochao Chen
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Multimodal large language models (MLLMs) need efficient mechanisms to update knowledge without degrading existing capabilities. While intrinsic multimodal knowledge editing achieves strong reliability and locality, it often exhibits limited generality, failing to propagate edits across semantically equivalent visual and linguistic variations. This issue arises from the lack of explicit semantic supervision, rigid editing scopes, and biased anchoring to individual samples in high-dimensional multimodal spaces. We address robust intrinsic multimodal knowledge editing by explicitly targeting generalization. We formalize robustness through knowledge units that group semantically equivalent multimodal inputs and define generality as consistent predictions within each unit. To expose fragile semantic regions, we introduce Latent Adversarial Robustification (LAR), which generates adversarial yet semantically coherent variants in the joint latent space. We further propose Rank-Constrained Subspace Learning (RCSL), enforcing low-rank alignment of adversarial representations at the edit layer via a singular value-based objective. Extensive analysis demonstrates the effectiveness of ASAM empirically.

[AI-5] Agentic Proving for Program Verification

Link: https://arxiv.org/abs/2605.23772
Authors: Alessandro Sosso,Akhil Arora,Bas Spitters
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Programming Languages (cs.PL); Software Engineering (cs.SE)
Comments:

Click to view abstract

Abstract:Agentic systems have recently emerged as state-of-the-art approaches for automated theorem proving in formal mathematics. To assess how far these capabilities extend to program verification, we evaluate Claude Code in an agentic proving framework on CLEVER, a Lean 4 benchmark for verifiable code generation. Our results show that Claude generates arguably valid specifications for 98.8% of problems (with 81.3% also accepted by CLEVER’s isomorphism-based scoring on the correct portion of the benchmark), certifies implementations against correct ground-truth specifications for 87.5% of problems, and reaches a 98.1% success rate on the end-to-end program generation and verification pipeline over entries with self-consistent premises. Across all stages, Claude further provides high-quality feedback on its own attempts (as confirmed under manual review), identifying underlying causes of failure and lingering bugs in the dataset. These findings highlight a growing mismatch between the difficulty of existing program verification benchmarks and the capabilities of modern agentic provers, and point to the need for more rigorous, bug-resilient evaluation methodologies, and in particular for alternatives to isomorphism-based scoring of generated specifications. More broadly, our results provide empirical evidence that tight compiler-in-the-loop agentic paradigms are currently the most effective approach for foundational program verification.

[AI-6] Any2Any: Efficient Cross-Embodiment Transfer for Humanoid Whole-Body Tracking

Link: https://arxiv.org/abs/2605.23733
Authors: Ming Yang,Tao Yu,Feng Li,Hua Chen
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Whole-body tracking (WBT) models have become a key foundation for humanoid robots, enabling them to imitate diverse motions with high fidelity. Training such models from scratch requires large-scale data and computation, making rapid deployment on new humanoid platforms costly. This raises a natural question: Can pretrained WBT models transfer across embodiments with minimal adaptation? To answer this question, we propose Any2Any, a paradigm that efficiently transfers an existing WBT specialist to a new humanoid embodiment with only a small amount of data and compute. Any2Any first performs kinematic alignment between source and target humanoids, aligning their input and output spaces so that the pretrained source policy can be meaningfully reused on the target embodiment. Any2Any then performs dynamics adaptation by applying lightweight parameter-efficient fine-tuning (PEFT) components to selected dynamics-sensitive modules, preserving useful behavioral priors while enabling targeted adaptation to the target robot. Extensive experiments on multiple humanoid platforms and pretrained backbones show that Any2Any substantially accelerates convergence and reduces training cost compared with training from scratch, while achieving competitive or superior tracking performance. Notably, using only 1% of the compute and data required for full training, Any2Any successfully transfers Sonic models pre-trained on Unitree G1 to LimX Oli and LimX Luna. These results suggest that pretrained WBT specialists can be efficiently reused across embodiments, providing a scalable path toward deploying humanoid whole-body control on new robots.

[AI-7] MemAudit: Post-hoc Auditing of Poisoned Agent Memory via Causal Attribution and Structural Anomaly Detection

Link: https://arxiv.org/abs/2605.23723
Authors: Zhewen Tan,Yilun Yao,Huiyan Jin,Wenhan Yu,Guoan Wang,Mengyuan Fan,liang lu,Feng Liu,Xiangzheng Zhang,Duohe Ma,Tong Yang,Lin Sun
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large language model agents increasingly rely on persistent memory to store past interactions, retrieve relevant demonstrations, and improve long-horizon task execution. However, this memory mechanism also creates a practical security vulnerability: an adversarial user may inject malicious records into the agent’s memory through ordinary interaction, and these records can later be retrieved to steer the agent’s reasoning and actions. Existing defenses primarily focus on online intervention, such as prompt filtering or output blocking, but they do not address the post-hoc question of which stored memories are responsible after harmful behavior has already been observed. We propose MemAudit, a post-hoc causal memory auditing framework for memory-augmented LLM agents. The framework combines two complementary signals: (1) a counterfactual memory influence score that measures each memory’s causal contribution to harmful outputs, and (2) a memory consistency graph that identifies structurally anomalous memories within the broader memory store. We evaluate MemAudit against MINJA, a query-only memory injection attack in which malicious records are generated and stored through normal agent interactions rather than direct memory-bank modification. Across both QA and reasoning-agent settings, MemAudit substantially reduces attack success rates under realistic post-hoc auditing scenarios. The results show that QA attack success is reduced from 70% to 0%, while RAP attack success drops from 83.3% to 0%.

[AI-8] One Policy Infinite NPCs: Persona-Traceable Shared RL Policies for Scalable Game Agents

Link: https://arxiv.org/abs/2605.23652
Authors: Yoosung Hong
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 18 pages, 15 figures, 14 tables

Click to view abstract

Abstract:On a 300-persona life-simulation benchmark, pcsp achieves compositional zero-shot persona identification up to 17x above chance, Spearman $\rho \approx 0.73$ semantic-behavioral alignment, and 22x faster inference than an LLM-as-policy baseline. Life simulation games require hundreds to thousands of non-player characters (NPCs) that behave consistently with distinct personalities while remaining controllable through designer-authored natural language. Existing methods fail on constraints like persona consistency, controllability, or real-time inference. We introduce pcsp (Persona Conditioned Shared Policy), a single reinforcement learning policy conditioned on frozen LLM embeddings of free-form persona descriptions. pcsp combines once-per-NPC persona encoding, low-rank persona projection, neural persona conditioning, and a PPO + InfoNCE consistency + KL diversity training objective. Across three experimental settings, ablations show that the InfoNCE trajectory-consistency objective is load-bearing: removing it collapses zero-shot persona identification to chance. External validation on Melting Pot 2.4.0 substrates confirms that our method produces persona-conditioned behavioral divergence in multi-agent strategic environments. We distinguish two senses of held-out evaluation: compositional zero-shot and vocabulary-expansion held-out. Finally, a UE5 deployment reproduces the in-engine persona-conditioning ablation at 64 agents with a low failure rate, showing that the sub-frame inference profile survives in a commercial game engine. These results prove that shared RL policies can support scalable, real-time, persona-conditioned NPC control.
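
The InfoNCE term named in this abstract is a standard contrastive loss, sketched below in plain Python. How the paper pairs personas with trajectories is not described in the abstract, so treating row `i` of `anchors` as a persona embedding and row `i` of `positives` as the matching trajectory embedding is an assumption, as are the function names and the temperature value.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss; positives[i] matches anchors[i], other rows are negatives."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        top = max(logits)  # subtract the max for numerical stability
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        # Negative log-probability of picking the matching row.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Hypothetical 3-dimensional embeddings for three personas.
personas = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned_loss = info_nce(personas, personas)
shuffled_loss = info_nce(personas, personas[1:] + personas[:1])
```

The loss is small when each anchor is most similar to its own positive and large when the pairing is scrambled, which is the property a trajectory-consistency objective relies on.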

[AI-9] Learning Through Noise: Why Subliminal Learning Works and When It Fails

Link: https://arxiv.org/abs/2605.23645
Authors: Vincent C. Brockers,Roman D. Ventzke,Valentin Neuhaus,Belén Hidalgo-Ogalde,Viola Priesemann
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:In the context of artificial neural networks, subliminal learning refers to the transfer of task-relevant knowledge or unintended biases from teacher to student models through distillation on task-unrelated input–output pairs. Prior explanations tie this effect to shared or closely matched teacher–student initialization. We show that a closely matched initialization is not necessary. Instead, subliminal learning is governed by compatible output heads. Using a controlled MNIST setting, we split outputs into an auxiliary head (for auxiliary, task-unrelated noise signals) and a class head (for classification) to demonstrate that subliminal learning occurs even when we randomly initialize hidden layers and remove layers, add new layers, or change the architecture (MLP-to-CNN). Compatible auxiliary heads enable transfer of a recoverable teacher signal, bringing the student’s representations closer to the teacher’s. When the class heads remain compatible as well, students trained only on task-unrelated noise can approach, and in favorable regimes match, teacher-level task performance. Our setting enables us to develop a theory that explains the mechanism of subliminal learning and to derive upper bounds on when subliminal learning fails. Together, our results turn subliminal learning from a surprising transfer effect into a theoretically grounded mechanism with predictable limits.

[AI-10] Adversarial Vulnerability Under Temporal Concept Drift: A Longitudinal Study of Android Malware Detection

Link: https://arxiv.org/abs/2605.23623
Authors: Ahmed Sabbah,Mohammed Kharma,Radi Jarrar,Samer Zein,David Mohaisen
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 42 pages, 4 tables, 10 figures

Click to view abstract

Abstract:We present a longitudinal, drift-aware evaluation of adversarial robustness across more than a decade of Android applications using static and dynamic feature representations extracted from emulator and real-device executions. The dataset is organized into yearly slices and evaluated under three deployment protocols that emulate realistic learning scenarios: (1) same-year training and testing, (2) cross-year deployment without model updates, and (3) expanding-window retraining with cumulative historical data. Across multiple classifier families, adversarial examples are generated using FGSM and SPSA under feasibility constraints. We measure clean performance, Adversarial Accuracy (AA), Attack Success Rate (ASR), and introduce temporal linkage metrics – RobustDrop, $\Delta$ASR, and Adversarial Amplification Factor (AAF) – to quantify the relationship between distribution shift and robustness. We show that temporal separation is associated with reduced adversarial robustness under the evaluated transfer-based feature-space setting. As the train-test gap increases, clean accuracy and adversarial accuracy decline, while attack success exhibits configuration-dependent increases, particularly under FGSM perturbations and static features. Expanding-window retraining mitigates, but does not eliminate, robustness loss under continued distributional evolution. These findings indicate that temporal drift should be considered when assessing the long-term robustness of intelligent detection systems under evolving data distributions and highlight the need for drift-aware robustness assessment frameworks in long-lived adversarial environments.
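
As a rough illustration of the FGSM attack and the attack success rate mentioned in this abstract, the sketch below applies a one-step FGSM perturbation to a logistic-regression classifier. It is not the paper's setup: the feasibility constraints and the transfer-based setting are omitted, the ASR definition used here (fraction of correctly classified samples whose prediction flips) is an assumption, and all weights and samples are hypothetical.

```python
import math


def predict(weights, bias, x):
    """Probability of the positive class under logistic regression."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def fgsm(weights, bias, x, label, epsilon):
    """One-step FGSM: move each feature by epsilon along the sign of the loss gradient."""
    # For cross-entropy loss, d(loss)/dx = (p - y) * w.
    error = predict(weights, bias, x) - label
    grad = [error * w for w in weights]
    return [xi + epsilon * ((g > 0) - (g < 0)) for xi, g in zip(x, grad)]


def attack_success_rate(weights, bias, samples, epsilon):
    """Share of correctly classified samples that are misclassified after the attack."""
    correct = flipped = 0
    for x, label in samples:
        if (predict(weights, bias, x) >= 0.5) != bool(label):
            continue  # already misclassified, not counted
        correct += 1
        x_adv = fgsm(weights, bias, x, label, epsilon)
        if (predict(weights, bias, x_adv) >= 0.5) != bool(label):
            flipped += 1
    return flipped / correct if correct else 0.0


# Hypothetical model and samples: (features, label).
weights, bias = [2.0, -1.0], 0.0
samples = [([1.0, 0.0], 1), ([-1.0, 0.5], 0), ([0.1, 0.0], 1)]
asr = attack_success_rate(weights, bias, samples, epsilon=0.2)
asr_no_attack = attack_success_rate(weights, bias, samples, epsilon=0.0)
```

Only the sample closest to the decision boundary flips at this perturbation size, which is the kind of margin effect a drift-aware evaluation would track as the train-test gap grows.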

[AI-11] Preisach Attention: A Hysteretic Model of Sequential Memory

Link: https://arxiv.org/abs/2605.23603
Authors: Piotr Frydrych
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: 24 pages, 2 tables, preprint

Click to view abstract

Abstract:We introduce the Preisach Attention Layer (PAL), a novel sequence modelling architecture grounded in the classical Preisach hysteresis operator from mathematical physics. PAL replaces the softmax attention mechanism with a binary relay operator parameterised by learned activation and deactivation thresholds, maintaining a stack of local extrema as its internal state. A single-layer PAL-Transformer with O(1) depth is Turing-complete under arbitrary precision arithmetic, achievable through simulation of a two-stack pushdown automaton – in contrast to the O(log n) depth required by standard hard-attention transformers. Second, we prove that the function classes computable by PAL and by the transformer are incomparable: PAL computes historical range statistics in O(1) layers that require O(log n) layers for transformers, while transformers support random-access retrieval that PAL cannot perform without auxiliary state. The separating property is rate-independence – PAL responds only to the sequence of local extrema, not to absolute token positions or temporal spacing. Third, we show that the extremum stack constitutes a minimal sufficient statistic of the input history for all rate-independent functionals, providing a formal analogue of the wiping property in classical hysteresis theory. PAL is thus an efficient architecture for tasks with long episodic memory and weak positional dependence, with O(n log n) total inference cost versus O(n^2) for standard attention.
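
The classical building block behind this abstract is the two-threshold relay (hysteron), and the classical Preisach operator is a weighted sum of such relays. The sketch below shows only that classical construction, not the PAL layer itself; the class and function names, the thresholds, and the weights are hypothetical.

```python
class Relay:
    """Hysteresis relay: switches to +1 at `alpha`, back to -1 at `beta` (beta < alpha)."""

    def __init__(self, beta, alpha, state=-1):
        assert beta < alpha
        self.beta, self.alpha, self.state = beta, alpha, state

    def step(self, u):
        if u >= self.alpha:
            self.state = 1
        elif u <= self.beta:
            self.state = -1
        # Between the thresholds the relay keeps its previous state (memory).
        return self.state


def preisach_output(relays, weights, inputs):
    """Weighted sum of relay states after each input sample."""
    return [sum(w * r.step(u) for r, w in zip(relays, weights)) for u in inputs]


# One relay driven up past alpha and back down past beta.
trace = [Relay(0.2, 0.8).step(0.0)]
relay = Relay(0.2, 0.8)
trace = [relay.step(u) for u in [0.0, 0.5, 0.9, 0.5, 0.1]]

# Rate independence: resampling the same rise and fall more finely
# leaves the final output unchanged.
coarse = preisach_output([Relay(0.2, 0.8), Relay(0.4, 0.6)], [1.0, 0.5], [0.0, 0.9, 0.5])
fine = preisach_output(
    [Relay(0.2, 0.8), Relay(0.4, 0.6)], [1.0, 0.5], [0.0, 0.3, 0.6, 0.9, 0.9, 0.7, 0.5]
)
```

The same input value 0.5 yields different outputs depending on whether it is reached from below or from above, and only the sequence of extrema matters, which is the rate-independence property the abstract highlights.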

[AI-12] Solving the Aircraft Disassembly Scheduling Problem

Link: https://arxiv.org/abs/2605.23592
Authors: Charles Thomas,Pierre Schaus
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Dismantling aircraft that reach their end of life is a complex endeavour that is necessary for sustainability but yields small income margins for air transport companies. Efficient scheduling of the disassembly procedure is thus crucial to ensure the profitability of the process and incentivize the practice. This is a large scheduling problem that involves thousands of tasks and many different constraints: Extracting parts that are destined to be reused requires technicians with specific certifications and equipment. Extraction operations might be subject to precedence relations. Furthermore, the aircraft must be kept balanced during the whole process. Finally, some locations in the aircraft have limited space, which caps the number of technicians able to work there concurrently. This article presents the problem in detail and proposes two approaches to solve it: a Constraint Programming model and a MIP model. The models are tested on instances of varying sizes involving up to 1450 tasks, which are based on real operational data provided by an industrial partner.

[AI-13] Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents

Link: https://arxiv.org/abs/2605.23590
Authors: Jiazheng Kang,Bowen Zhang,Zixin Song,Jiangwang Chen,Xiao Yang,Da Zhu,Guanjun Jiang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:ReAct-style agents for search-intensive, multi-step reasoning tasks rely largely on their own internal judgment to decide what evidence to seek, which reasoning or action step to take next, and when to stop, often producing shallow, redundant, or poorly targeted trajectories. Prior work has explored rubrics as external quality signals, but existing uses are mostly evaluative rather than action-guiding: rubrics typically serve as training-time rewards or post-hoc evaluators of completed outputs, and in deep-research settings they are often coarse-grained and report-level rather than step-level. We introduce Co-ReAct, a rubric-guided action-selection framework that uses rubrics as step-level guidance during inference. At each decision step, Co-ReAct injects a rubric into the agent’s context to guide the next Reason-or-Act decision, specifying what the agent should target in evidence seeking, search, reasoning, or self-evaluation. To make this guidance reliable, we train a dedicated rubric generator with GRPO. Unlike prior pairwise or binary preference formulations, our objective optimizes a list-wise Spearman rank-correlation reward against multi-judge expert consensus rankings, encouraging rubrics that are discriminative rather than merely plausible. On DeepResearchBench and SQA-CS-V2, Co-ReAct consistently improves over ReAct and representative test-time compute baselines across search agents built on both 8B/14B open-source and frontier closed-source base models. The trained rubric generator can also serve as a drop-in component that improves these baselines without changing their underlying decision mechanisms. Our code is publicly available at this https URL.
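
The list-wise Spearman reward described in this abstract can be pictured with a minimal rank-correlation sketch. This computes the standard Spearman coefficient (Pearson correlation of average ranks) in plain Python; how the paper turns rubric scores and multi-judge consensus into ranked lists is not given in the abstract, so `rubric_scores` and `consensus_scores` are hypothetical inputs.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation; returns 0.0 if either list is constant."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores for four candidate responses.
rubric_scores = [0.2, 0.9, 0.5, 0.7]
consensus_scores = [1.0, 4.0, 3.0, 2.0]
reward = spearman(rubric_scores, consensus_scores)
```

A rubric that orders the candidates exactly as the consensus does earns a reward of 1, a reversed ordering earns -1, and a rubric that gives every candidate the same score earns 0, which is what makes the signal reward discriminative rubrics.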

[AI-14] CP or DP? Why Not Both: A Case Study in the Partial Shop Scheduling Problem

Link: https://arxiv.org/abs/2605.23569
Authors: Emma Legrand,Roger Kameugne,Pierre Schaus
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Dynamic Programming (DP) and Constraint Programming (CP) are well-established paradigms for solving combinatorial optimization problems. Usually, these two approaches are used separately. This paper aims to show that the two can be combined effectively and elegantly, with DP serving as the primary search framework and CP used as a subroutine to leverage global constraint propagation. This paper presents such an approach for the Partial Shop Scheduling Problem (PSSP), for which a pure DP method has previously been proposed, and efficient CP filtering algorithms are available. The PSSP is a general scheduling problem where each job consists of a set of operations with arbitrary precedence constraints. The approach is flexible enough to accommodate anytime DP strategies, such as anytime column search, whereas the original DP algorithm operated in a strictly layer-wise manner. Moreover, the flexibility of the CP modeling makes it straightforward to incorporate arbitrary precedence constraints. As a result, the model naturally handles any precedence graph and even enables the design of a Large Neighborhood Search (LNS) scheme, in which the DP model is reused, and partial-order schedules are imposed across restarts to improve the incumbent solution. While not competitive with state-of-the-art pure CP solvers for this specific problem, our primary contribution is demonstrating the viability of this hybrid integration.

[AI-15] Understanding Goal Generalisation in Sequential Reinforcement Learning

Link: https://arxiv.org/abs/2605.23565
Authors: Jason Ross Brown, Edward James Young
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Reinforcement learning agents often exhibit unintended goal-directed behaviour outside their training distribution, but we currently lack a principled understanding of how such agents will generalise to novel environments based on their training history. We address this gap for agents trained sequentially on one or more tasks. We study over 100 sequential training pipelines, evaluating behaviour across over 250 out-of-distribution environments. We find that salient features drive generalisation, and that goals learnt early in training can persist and influence those acquired later. To explain these phenomena, we introduce latent policy gradients, a method that predicts what out-of-distribution behaviour a training pipeline will likely induce. Our method simulates the evolution of low-dimensional latent variables during training according to what would achieve high reward on the training objective with respect to a simple model of how the latent variables map to behaviour. It achieves strong predictive accuracy, generalises to unseen types of training pipeline, and is interpretable. Our findings demonstrate that while out-of-distribution RL agent behaviour is dependent on the whole training pipeline, this dependence has an underlying structure we can capture, laying groundwork for understanding goal generalisation from a developmental perspective.

[AI-16] Goal-Conditioned Agents that Learn Everything All at Once

Link: https://arxiv.org/abs/2605.23551
Authors: Michael Matthews, Matthew Jackson, Michael Beukman, Thomas Foster, Alistair Letcher, Scott Fujimoto, Cédric Colas, Jakob Foerster
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:A goal-conditioned reinforcement learning agent exploring an environment will see a wealth of information throughout a trajectory, most of which is discarded when only performing on-policy updates with respect to the commanded goal. All-goals learning, where each transition is used for learning off-policy with respect to every goal, allows agents to extract maximal information; however, it is usually computationally infeasible when done via naive relabelling. This can be overcome by jointly outputting values and actions for every goal at once, allowing for efficient, parallel all-goals updates with a single pass through the network, in a process we call Learning Everything all at Once (LEO). We show that this approach significantly outperforms other methods on goal-conditioned Craftax and is competitive with existing baselines on continuous control environments, while achieving a 250x speed-up compared to all-goals relabelling. We then go on to show that this approach can be made even more powerful by using LEO as a teacher network, rather than a direct actor. We hope that, by unlocking all-goals learning at scale, LEO can serve as a useful tool for RL practitioners in complex environments. We open source our code.

[AI-17] VACE: Learning Geometrically Structured Representations for Time Series Anomaly Detection

Link: https://arxiv.org/abs/2605.23504
Authors: Alberto D. Cencillo, Leonardo Concepción, Isaac Triguero, Julián Luengo
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 16 pages, 5 figures

Abstract:Anomaly detection in multivariate time series is a critical task across a wide range of real-world applications, where abnormal behaviour is rare, labels are unavailable, and the cost of a miss is high. The central challenge is learning a characterisation of normality precise enough to flag deviations. Self-supervised representation learning, typically through contrastive approaches, addresses this by embedding temporal patches into a latent space where normality occupies a well-defined region, with anomalies detected by geometric deviation. However, contrastive approaches shape this space indirectly through pair-sampling heuristics, providing no explicit control over the geometric structure that distance-based scoring requires, namely how tightly normal representations are grouped and whether distances are directionally meaningful. We present VACE (Velocity-Aligned Channel Embeddings), a self-supervised anomaly detection method that represents normality as a compact, directionally coherent region in the embedding space. To this end, VACE trains a channel-aware encoder through a velocity-consistency objective, with no negatives and no synthetic anomalies, so that normal trajectories are locally smooth and aligned. At test time, a Mahalanobis positional score and a velocity-bank directional score are combined multiplicatively, flagging points that are simultaneously off-distribution and dynamically atypical. Despite its simplicity, VACE achieves state-of-the-art performance on TSB-AD-M under rigorous evaluation, significantly outperforming more complex methods trained on substantially larger budgets.

[AI-18] EDGE-OPD: Internalizing Privileged Context with Evidence Guided On-Policy Distillation

Link: https://arxiv.org/abs/2605.23493
Authors: Aristotelis Lazaridis, Dylan Bates, Aman Sharma, Brian King, Vincent Lu, Jack FitzGerald
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:On-Policy Distillation (OPD) has gained wide traction as an LLM post-training paradigm due to its effectiveness in improving capabilities without introducing model distribution drift and, consequently, regression on general tasks. On-Policy Self-Distillation (OPSD) is an efficient use case of OPD, which is appealing because it requires only a single model as both student and teacher, and it also has the benefit of providing privileged context that is absent at inference time (e.g. a persona, a private fact, or a worked solution) to the teacher during the training process. The challenge in this approach is that the privileged information can change model behavior more than intended: it can modify reasoning, degrade general capabilities, and affect performance indicators like response length, style, or local token preferences. Consequently, OPSD may train the student on side effects rather than a desired, transferable behavior. In this paper, we study this problem in a rare-token/identity setting and propose EviDence GuidEd On-Policy Distillation (EDGE-OPD), a modification of OPSD with two distinct characteristics: a) it uses guided rollouts to inject privileged-context behavior into the student at sampling time, so that the rare target behavior is actually present in the on-policy data, and b) it applies an evidence mask: the student is updated only at token positions where the privileged context supports the sampled token, rather than on every token in the rollout. We empirically show that OPSD and its variant RLSD (with and without a verifier) completely fail to learn a target identity, while the integration of guided rollouts allows them to succeed. Additionally, mask-region ablations show that the persona signal is localized to the positive-evidence tail, allowing us to draw valuable insights about efficient knowledge transfer and preservation of general-purpose capabilities.

[AI-19] Automated Random Embedding for Practical Bayesian Optimization with Unknown Effective Dimension ECAI2026 IJCAI

Link: https://arxiv.org/abs/2605.23473
Authors: Hong Qian, Xiang Shu, Xiang Xia, Xuhui Liu, Yangde Fu, Bei Liang, Huibin Wang, Liang Dou
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: This paper has been accepted by IJCAI-ECAI 2026

Abstract:Bayesian optimization is widely employed for optimizing complex black-box functions but struggles with the curse of dimensionality. Random embedding, as a dimension reduction strategy, simplifies tasks that possess the effective dimension by optimizing within a low-dimensional subspace. However, determining the effective dimension of a task in advance remains a significant challenge, which influences the selection of the subspace dimensionality and the optimization performance. Traditional methods use fixed subspace dimensions provided by experts or rely on trial and error to estimate subspace dimensions, which consumes resources. To this end, this paper proposes an automated random embedding for high-dimensional Bayesian optimization with unknown effective dimension, called Dynamic Shared Embedding Bayesian Optimization (DSEBO). DSEBO starts with a low dimension and switches to a higher-dimensional subspace if the solutions in the current subspace show preliminary convergence. DSEBO dynamically determines the dimension of the next subspace based on the quality of the solutions in different subspaces and shares the queried solutions with the new subspace for a better initialization. Theoretically, we derive a regret bound for DSEBO and demonstrate that DSEBO can better balance approximation and optimization errors. Extensive experiments on functions with dimensionality of varying magnitudes and real-world tasks with unknown effective dimensions reveal that, compared with state-of-the-art methods, alternating optimization across different subspaces results in significant improvements in high-dimensional optimization, both in terms of optimization regret and time.

[AI-20] CBANet: A Compact Attention-Based CNN-BiLSTM Network for Aggressive Driving Event Detection IJCNN

Link: https://arxiv.org/abs/2605.23471
Authors: Hanadi Alhamdan, Ghadah Alosaimi, Amir Atapour-Abarghouei, Farshad Arvin
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 8 pages, 4 figures, 4 tables. Submitted to IJCNN/WCCI 2026. CBANet: A compact attention-based CNN-BiLSTM framework for aggressive driving event detection using multivariate vehicle dynamics signals. Code available at this https URL

Abstract:Aggressive driving is a major cause of traffic accidents and poses a serious threat to road safety. Although deep learning methods have shown promising results in detecting risky driving behaviours from vehicle sensor data, their performance in real-world conditions is often limited by severe data imbalance, large variability between drivers, and the lack of physically interpretable vehicle dynamics representations. In this paper, we propose an enhanced deep learning framework for aggressive driving detection using multivariate vehicle dynamics signals. Instead of relying solely on raw measurements, the proposed approach constructs engineered dynamic features that capture steering, acceleration, and braking behaviour. To address the extreme rarity of aggressive events in naturalistic driving data, we introduce a stable training strategy that combines controlled SMOTE-based oversampling with a class-weighted loss formulation, and evaluates focal loss variants for imbalance handling. Furthermore, a safety-oriented decision strategy based on class-specific threshold calibration is adopted to better reflect the asymmetric risks of missed detections and false alarms in real-world applications. The proposed framework is evaluated on a newly collected naturalistic driving dataset. Extensive experiments show that the proposed method consistently outperforms standard deep learning baselines with significant improvements in minority-class recall and safety-critical F-score metrics while maintaining practical computational efficiency. Code is available at this https URL.
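One way to read the class-specific threshold calibration mentioned above is sketched here. It assumes a binary setting in which the threshold on the aggressive-class probability is chosen on a validation set as the highest value that still reaches a target recall, so that missed detections are bounded before false alarms are considered. The function name `calibrate_threshold` and the `min_recall` parameter are hypothetical, and the paper may calibrate differently.

```python
import math
from typing import Sequence


def calibrate_threshold(probs: Sequence[float],
                        labels: Sequence[int],
                        min_recall: float = 0.9) -> float:
    """Return the highest threshold t such that predicting "aggressive"
    whenever prob >= t reaches at least `min_recall` on the positives."""
    positives = sorted((p for p, y in zip(probs, labels) if y == 1), reverse=True)
    if not positives:
        raise ValueError("validation set contains no positive examples")
    # Number of positives that must be caught to satisfy the recall target.
    needed = max(math.ceil(min_recall * len(positives)), 1)
    return positives[needed - 1]


# Toy validation data: three aggressive events (label 1), three normal ones.
val_probs = [0.9, 0.8, 0.4, 0.2, 0.7, 0.1]
val_labels = [1, 1, 1, 0, 0, 0]
strict = calibrate_threshold(val_probs, val_labels, min_recall=1.0)
relaxed = calibrate_threshold(val_probs, val_labels, min_recall=0.66)
```

Raising the recall target lowers the threshold, which trades extra false alarms for fewer missed aggressive events.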

[AI-21] Learning Individual Dynamics from Sparse Cross-Sectional Snapshots

Link: https://arxiv.org/abs/2605.23470
Authors: Christian Lagemann, Kai Lagemann, Steven L. Brunton, Sach Mukherjee
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments:

Abstract:Predicting how a dynamical unit evolves over time - how an individual ages, an epidemic spreads, or a physical system degrades - typically requires dense longitudinal tracking. When only extremely sparse or entirely cross-sectional data is available, inferring individualized, continuous-time trajectories is fundamentally ill-posed. Existing methods force a strict compromise: sequence models (e.g. latent ODEs) require dense longitudinal data, while cross-sectional methods (e.g. optimal transport, flow matching-based) map aggregate populations, losing individual dynamics. In this paper, we demonstrate that this dichotomy can be broken. We introduce CADENCE, a principled probabilistic framework that recovers continuous individual trajectories from isolated snapshots by anchoring latent dynamics to static, individual-level contexts. We provide novel identifiability guarantees for single-timepoint trajectory inference. By combining a score-based spatial encoder (bijective Probability Flow ODE) to eliminate diffeomorphic ambiguities with a Soft Mixture-of-Experts (SMoE) router, we show that individual dynamical parameters and routing function are jointly identifiable. Across a suite of benchmarks spanning physical systems to real-world biological data, CADENCE, trained strictly on extremely sparse snapshots with context structure, matches or exceeds the performance of state-of-the-art sequential models trained on dense, full-trajectory data.

[AI-22] AI Assurance: A Comprehensive Testing Strategy for Enterprise AI Systems

Link: https://arxiv.org/abs/2605.23459
Authors: Chitra Badagi, Divye Singh, Animesh Sen, Adinath Shirsath
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:Enterprise AI systems, built on large language models, retrieval pipelines and autonomous agents, introduce a class of risks that traditional software quality assurance was never designed to address. These systems are probabilistic, context-sensitive and emergent: they cannot be verified to be correct in the classical sense, but only evaluated with increasing confidence. This paper presents a comprehensive assurance strategy for enterprise AI systems built around three key principles: first, that AI testing should focus on continuous risk reduction rather than strict correctness verification; second, that evaluation must be treated as a core engineering discipline alongside development; and third, that failures in AI assurance can lead to organizational impacts that are fundamentally different from those seen in traditional deterministic software systems. We introduce a structured AI Failure Taxonomy, propose a revised five-layer AI Assurance Pyramid and provide operational guidance on evaluation-driven development, RAG system testing, model lifecycle management and governance. The goal is to equip engineering leaders and practitioners with a strategy that is both philosophically grounded and operationally deployable.

[AI-23] AI Security Research Should Better Incentivize Defense Research

Link: https://arxiv.org/abs/2605.23448
Authors: Youqian Zhang
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 14 pages, 3 figures, 3 tables

Abstract:This work examines an imbalance in artificial intelligence (AI) security research: the field tends to produce more work on attacking AI systems than on defending them. Drawing on related academic papers, we find biased attack-to-defense ratios across subfields, including federated learning, speech recognition, membership inference, and large language models. The imbalance possibly goes far beyond a simple count: attack papers are routinely evaluated under favorable conditions that make threats look more severe than they are in practice, while defenses are held to a stricter standard that few can meet. The result is a literature rich in demonstrated vulnerabilities and thin on usable and deployed protections. We thus argue that AI security research should better incentivize defense research.

[AI-24] Reflex: Reinforcement Learning with Reflection Symmetry Exploitation in State-Based Continuous Control

Link: https://arxiv.org/abs/2605.23415
Authors: Shuai Zhen, Yifan Zhang, Yuling Wang, Yanhua Yu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Reinforcement learning has long struggled with poor sample efficiency. One promising approach to mitigate this problem is leveraging group-invariant Markov Decision Processes ($G$-invariant MDPs). Existing works in this direction have primarily focused on image-based RL and rotational symmetry such as $\mathrm{SO}(2)$, leaving state-based RL and reflection symmetry largely underexplored. In this work, we focus on state-based continuous control tasks and exploit reflection symmetry by introducing Reflex, a paradigm that seamlessly integrates with both on-policy and off-policy RL algorithms. We formalize two types of reflection, axial reflection and bilateral reflection, and characterize their corresponding transformations. Building on a theoretical analysis of symmetry-preserving optimal value functions and policies, Reflex integrates reflection symmetry into policy learning through principled symmetry regularization mechanisms. We integrate Reflex with PPO and SAC, and evaluate it on a suite of OpenAI Gym and DeepMind Control benchmarks, demonstrating superior performance over standard baselines while improving sample efficiency. Our code is available at this https URL.
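To make the idea of symmetry regularization concrete, here is a minimal sketch under strong simplifying assumptions: the policy is deterministic, and a reflection acts on states and actions by flipping the sign of selected coordinates. The helper names `reflect` and `symmetry_penalty` are hypothetical, and the abstract does not specify Reflex's actual regularizers, so this only illustrates the general equivariance condition that the action for a reflected state should equal the reflected action.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def reflect(vec: Vector, signs: Sequence[float]) -> list[float]:
    """Apply a reflection given as per-coordinate signs (+1 keeps, -1 flips)."""
    return [s * v for s, v in zip(signs, vec)]


def symmetry_penalty(policy: Callable[[Vector], Vector],
                     states: Sequence[Vector],
                     state_signs: Sequence[float],
                     action_signs: Sequence[float]) -> float:
    """Mean squared gap between policy(reflected state) and reflected policy(state).

    The penalty is zero exactly when the policy is equivariant on these states.
    """
    total = 0.0
    for state in states:
        action_of_reflected = policy(reflect(state, state_signs))
        reflected_action = reflect(policy(state), action_signs)
        total += sum((a - b) ** 2
                     for a, b in zip(action_of_reflected, reflected_action))
    return total / len(states)


# An equivariant toy policy: the first action coordinate is odd in state[0].
equivariant = lambda s: [2.0 * s[0], s[1] ** 2]
# A biased toy policy that breaks the symmetry.
biased = lambda s: [s[0] + 1.0]

penalty_ok = symmetry_penalty(equivariant, [[1.0, 2.0], [-0.5, 3.0]], [-1, 1], [-1, 1])
penalty_bad = symmetry_penalty(biased, [[1.0]], [-1], [-1])
```

In a training loop such a term would typically be added to the PPO or SAC loss with a small weight. A reflection that swaps left and right limbs would additionally permute coordinates, which this sign-only sketch does not cover.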

[AI-25] When Planning Fails Despite Correct Execution: On Epistemic Calibration for LLM-Based Multi-Agent Systems

Link: https://arxiv.org/abs/2605.23414
Authors: Zehao Wang, Shilong Jin, Zhao Cao, Lanjun Wang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:LLM-based multi-agent systems can fail even when planned actions are executed correctly because agents may misjudge their knowledge when evaluating plan feasibility, a phenomenon we term epistemic miscalibration in planning. Unlike execution errors, epistemic miscalibration is latent during planning, as generated plans can remain self-consistent and executable without observable errors; the miscalibration is also dynamic, as new information can alter feasibility assessments, potentially obscuring past miscalibration signals and causing them to recur over time. To address this, we propose the Epistemic Planning Calibration Agentic Workflow (EPC-AW), which assesses whether plans remain supported under varying information conditions rather than directly verifying feasibility. EPC-AW employs Information-consistency-based Plan Selection, selecting plans whose evaluations are stable across agents, together with Consistency-guided Epistemic State Refinement to adapt calibration over time by leveraging past discrepancies to guide future planning. Experiments show that EPC-AW improves system-level success by an average of 9.75%.

[AI-26] Parametric Prior Mapping Framework for Non-stationary Probabilistic Time Series Forecasting ICML2026

Link: https://arxiv.org/abs/2605.23402
Authors: Jinglin Li, Jun Tan, QI Fang, Ning Gui
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 20 pages, 8 figures, accepted by ICML 2026

Abstract:Effectively modeling non-stationary dynamics in probabilistic multivariate time series (MTS) forecasting requires balancing expressiveness with robustness. Existing parametric approaches benefit from strong inductive biases but lack flexibility, whereas deep generative models struggle to capture complex temporal dependencies without extensive data and computation. We introduce Parametric Prior Mapping (PPM), a framework that injects parametric structural priors into a generative modeling process. Specifically, PPM utilizes a parametric estimator to derive a dynamic, adaptive prior that guides the learning of a complex predictive distribution via a learnable mapping. This design allows the model to retain the efficiency of parametric methods while exploiting the expressive power of generative models. Trained with a hybrid objective, PPM yields precise forecasts with well-calibrated uncertainty estimates. Empirical results show that PPM outperforms existing baselines in handling non-stationary data, offering a superior trade-off between accuracy and computational efficiency. The code is available at this https URL.

[AI-27] Every Component is a Lookup: Token Attribution and Composition from a Single Decomposition

Link: https://arxiv.org/abs/2605.23393
Authors: Po-Kai Chen, Niki van Stein, Aske Plaat
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Mechanistic interpretability of transformers requires identifying not just which components matter but how they compose into the computational route that produced a prediction. Both attention and MLP follow a shared key-value template $\phi(S)U$. We exploit this structure to develop Unpack, a backward recursion that decomposes credit through both sublayers, producing interaction strengths between any two components, named end-to-end paths with K/Q/V composition labels, and per-token attribution from a single forward pass, without intervention, gradients, or auxiliary training. We evaluate on the indirect object identification task. On GPT-2 small, the method recovers all three composition connections described by Wang et al. (2023), including the mode-specific routing of each connection (K, Q, or V). To test token-level attribution beyond trivial copying, we compare two occurrences of the same name in the same decomposition: the first mention retains strong credit while the duplicate-detection position is suppressed, a pattern absent in matched control prompts. Across the Pythia family from 160M to 6.9B parameters, this suppression pattern is consistently recovered at every scale, demonstrating that the method tracks mechanistic structure without ground-truth circuit labels. Code is available at this https URL.

[AI-28] Curriculum reinforcement learning with measurable task representation learning

Link: https://arxiv.org/abs/2605.23372
Authors: Yongyan Wen, Siyuan Li, Mingjian Fu, Yiqin Yang, Xun Wang, Peng Liu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:In curriculum reinforcement learning (CRL), an agent incrementally accumulates knowledge over a sequence of tasks (i.e., a curriculum), and the learning process aims to use the accumulated knowledge to finally solve a challenging target task. While early CRL works focus on sequencing candidate tasks, recent research explores automatic curriculum generation. Within the rich CRL literature, the interpolation-based CRL paradigm is a major line of work, which automatically generates intermediate tasks by interpolating between the initial task distribution and the target task distribution in a task space with meaningful distance metrics (i.e., metrics that can measure task similarity). However, in challenging navigation tasks, the non-Euclidean context (task) space invalidates this assumption. To achieve automatic curriculum generation in complex tasks, we propose a novel automatic curriculum generation approach based on measurable task representation learning. To better measure the similarity, we propose to transform the task space to a latent space. Through a variational autoencoder structure that encodes the reward and the state transitions, we obtain a latent task representation with a task similarity measurement property, so that two close task embeddings correspond to two similar tasks in terms of rewards and state transitions. Based on the learned task representation, we further develop an automatic curriculum generation scheme, which can effectively generate new tasks that are increasingly similar to the target task. We evaluate our method in a variety of challenging navigation tasks, and the experimental results indicate that the proposed approach surpasses state-of-the-art CRL approaches based on interpolation and generative adversarial networks.

[AI-29] Score-Based One-step MeanFlow Policy Optimization

Link: https://arxiv.org/abs/2605.23365
Authors: Kyungyoon Kim, Donghyeon Ki, Hee-Jun Ahn, Byung-Jun Lee
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Diffusion and flow matching have emerged as expressive policy classes in reinforcement learning, but their reliance on multi-step denoising imposes substantial computational overhead at inference time, which is particularly problematic in online RL. MeanFlow offers a promising alternative by learning an average velocity field that maps noise to data in a single network evaluation. However, MeanFlow typically requires samples from the target distribution to construct its target velocity field, which are unavailable in online RL. We propose Score-Based One-step MeanFlow Policy Optimization (SOM), an actor-critic algorithm that resolves this by constructing the target velocity field directly from the Q-function via score estimation and a probability flow ODE, thereby concentrating probability mass on high-value modes. In the fully online RL setting, SOM achieves state-of-the-art performance on locomotion tasks with a single generation step, while substantially reducing both training and inference time compared to prior diffusion- and flow-matching-based policies.

[AI-30] XWind: A Cross-site Router for Large Language Model Inference Serving at Renewable Energy Farms

Link: https://arxiv.org/abs/2605.23348
Authors: Tella Rajashekhar Reddy, Atharva Deshmukh, Liangcheng Yu, Chaojie Zhang, Mike Shepperd, Rohan Gandhi, Anjaly Parayil, Srinivasan Iyengar, Ajay Manchepalli, Debopam Bhattacherjee
Affiliations: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Comments:

Abstract:AI power demand is growing at an unprecedented rate while power grids are often ailing and struggle to keep up. Grid expansion comes with high capital expenditure and long-distance transmission losses, yet there is abundant renewable energy at the source, just not matched to demand. This paper proposes a complementary AI infrastructure deployment model, AI Greenferencing, that brings modular AI compute to renewable energy sources, focusing on wind, allowing AI footprint expansion, generating local behind-the-meter demand for renewable sites, and helping ease the growing strain on power utilities. Our feasibility analysis shows that 890+ GW of wind capacity lies within 50 ms network round trip time of Azure data centers, and that site-wise right-sizing combined with spatial complementarity of wind energy keeps aggregate fleet utilization on par with traditional deployments. To serve inference requests under variable wind power, we build XWind, a lightweight, reactive, and workload-agnostic AI inference router that uses only real-time signals: inference latency, KV-cache utilization, and queue depth, to dynamically configure sites and distribute requests. Evaluated on a real 64-GPU A100 testbed emulating three wind-powered sites with Azure production traces, XWind reduces P99 end-to-end latency by up to 52% over the strongest contender (also our idea) and by up to 98% over baselines such as power-capping and GPU idling, with consistent gains across workload types, load levels, and GPU generations.
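A reactive router driven only by the three real-time signals named above could look roughly like the following sketch. The equal weighting, the normalization constants `latency_budget_ms` and `max_queue`, and the names `SiteSignals`, `load_score`, and `pick_site` are all illustrative assumptions. The abstract does not describe XWind's actual routing rule, and this sketch ignores site reconfiguration entirely.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class SiteSignals:
    """Latest real-time measurements reported by one site (hypothetical schema)."""
    name: str
    latency_ms: float            # recent inference latency
    kv_cache_utilization: float  # fraction in [0, 1]
    queue_depth: int             # requests waiting at the site


def load_score(site: SiteSignals,
               latency_budget_ms: float = 1000.0,
               max_queue: int = 64) -> float:
    """Combine the three signals into one score in [0, 1]; lower is better."""
    latency_term = min(site.latency_ms / latency_budget_ms, 1.0)
    queue_term = min(site.queue_depth / max_queue, 1.0)
    return (latency_term + site.kv_cache_utilization + queue_term) / 3.0


def pick_site(sites: Sequence[SiteSignals]) -> SiteSignals:
    """Route the next request to the currently least-loaded site."""
    if not sites:
        raise ValueError("no sites available")
    return min(sites, key=load_score)


sites = [
    SiteSignals("wind-a", latency_ms=200.0, kv_cache_utilization=0.9, queue_depth=40),
    SiteSignals("wind-b", latency_ms=300.0, kv_cache_utilization=0.3, queue_depth=5),
]
chosen = pick_site(sites)
```

Here `wind-b` is chosen despite its higher latency, because its KV cache and queue are far less loaded. A workload-agnostic router of this kind needs no model of the requests themselves, only the signals each site already exposes.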

[AI-31] Sparse Compositional Flow Matching by geometric assembly from motion primitives

Link: https://arxiv.org/abs/2605.23341
Authors: Yan Tang, Yuanbo Tang, Tingyu Cao, Shaolun Huang, Yang Li
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Embodied trajectories, such as the executable motion sequences of robotic manipulators, underwater vehicles, and mobile robots, are a fundamental output of embodied AI. Modern generative models often treat them as a dense, monolithic signal generated point by point, fitting an intricate high-dimensional posterior while leaving the data’s latent structure unmodeled, the same sample inefficiency long identified by the structured generative model literature. We argue that a compositional latent structure is a natural choice: many embodied tasks share recurring motion fragments that can be made explicit as a finite repertoire of reusable motion primitives, and compositional units naturally align with subtask boundaries to support task decomposition. Existing compositional generators, however, compose in a latent space and rely on post-hoc decoding to relate sampled units to actual trajectory segments. We instead compose directly in the physical trajectory space through a flow-matching framework with two coupled designs. Motion-Primitive Dictionary Learning equips each atom with a learnable length mask and binary starting indicators so the atom itself is the primitive, reused verbatim wherever it is placed. Structural Sparse Flow Matching with Geometric Constraints then generates a binary placement matrix using duration-aware tokenization and a differentiable geometric loss that enforces spatial continuity and temporal contiguity where adjacent primitives meet. On Open X-Embodiment and 3DMoTraj, the framework attains state-of-the-art accuracy and reduces the FDE/ADE ratio from 1.8 to 1.07, improving ADE by 19.2% and FDE by 21.0% over the strongest baseline.
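The ADE and FDE figures quoted above follow standard definitions: average displacement error is the mean Euclidean distance between predicted and reference points over all timesteps, and final displacement error is that distance at the last timestep. The sketch below computes both for a single trajectory. The function name `ade_fde` and the toy data are illustrative, and the paper's evaluation protocol (for example averaging over samples or datasets) may differ.

```python
import math
from typing import Sequence

Point = Sequence[float]


def ade_fde(predicted: Sequence[Point], reference: Sequence[Point]) -> tuple[float, float]:
    """Return (ADE, FDE) for one predicted trajectory against its reference."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("trajectories must be non-empty and equally long")
    # Per-timestep Euclidean distance between predicted and reference points.
    distances = [math.dist(p, r) for p, r in zip(predicted, reference)]
    return sum(distances) / len(distances), distances[-1]


# Toy example in which the prediction drifts further from the reference over time.
pred = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
ref = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
ade, fde = ade_fde(pred, ref)
ratio = fde / ade
```

An FDE/ADE ratio well above 1 means the error is concentrated at the end of the trajectory, while a ratio close to 1 means the error stays roughly uniform along it.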

[AI-32] Human-in-the-Loop Multi-Agent Ventilator Decision Support with Contextual Bandit Preference Learning MICCAI2026

Link: https://arxiv.org/abs/2605.23320
Authors: Sijia Li, Xiaoyu Tan, Qixing Wang, Weiyi Zhao, Chen Zhan, Teqi Hao, Xuemin Wang, Lei Gu, Roland Eils, Xihe Qiu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: MICCAI 2026

Abstract:Ventilator decision support requires sequential decisions that track evolving physiology and disease trajectories while respecting safety boundaries and clinician-specific tuning styles. Rule-based approaches rarely generalize to personalization, and end-to-end reinforcement learning or single large language model systems remain difficult to control and audit. We propose the Ventilator Decision Support System (VDSS), a human-in-the-loop multi-agent framework that coordinates modular decision components through contract-driven structured interfaces and produces traceable evidence for review. VDSS performs online preference adaptation with a contextual bandit, updating clinician-specific preferences from the final accepted decision at each adjustment cycle and using them to guide subsequent recommendations. Structured rejection feedback triggers targeted replanning to reduce unproductive iterations and improve interaction stability. Retrospective ICU trajectory replay with expert review indicates higher recommendation acceptability and fewer interaction rounds to reach an acceptable plan, supporting clinically deployable human-AI collaboration.

[AI-33] DART: Semantic Recoverability for Structured Tool Agents

Link: https://arxiv.org/abs/2605.23311
Authors: Ke Yang, Panpan Li, Zonghan Wu, Kejin Xu, Huaxi Huang, Xiaoshui Huang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:When a structured tool agent fails mid-execution, the runtime faces a dilemma: replaying the entire task is safe but wasteful, while restoring from a local checkpoint is efficient but can leave committed downstream work tied to an upstream history that no longer exists. This tension is acute in commitment-sensitive settings, where rollback targets a single failed instance yet downstream consumers have already acted on its output. Existing recovery approaches provide mechanical rollback but no criterion for whether a local restore remains semantically valid after downstream commitment. We formalize this gap as semantic recoverability and address it in DART, a modular runtime that localizes the failed instance, certifies semantically recoverable boundaries of that instance, aligns checkpoints to those boundaries, and selects an admissible restore point that preserves committed downstream work under dependency and effect constraints, or blocks otherwise. Across three LLM-driven domains and external validation on a LangGraph-based substrate, DART correctly recovers all evaluated commitment-sensitive cases where baseline local recovery fails, and a five-domain safety audit finds no unsafe admitted rollbacks. These results show that controller legality does not imply semantic validity, and that sound local recovery requires an explicit admissibility check.

[AI-34] Ontological Knowledge Blocks: Executable Compliance and Profile-Based Validation for Trustworthy AI Systems

Link: https://arxiv.org/abs/2605.23297
Authors: Aasish Kumar Sharma,Julian M. Kunkel
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 6 pages, 3 figures. Accepted at the Security, Trust and Privacy for Software and Applications (STPSA) Workshop, IEEE COMPSAC 2026, Madrid, Spain, July 7-10, 2026

Abstract:AI-enabled services deployed in critical digital infrastructure are subject to governance obligations spanning transparency, accountability, fairness, and traceability. Compliance today remains documentation-centric: obligations are described in prose, audits rely on static checklists, and verification depends on manual review. Such approaches do not scale to automated AI systems. This paper introduces Ontological Knowledge Blocks (OKBs), a programmable governance infrastructure that compiles regulatory obligations into machine-checkable constraints over structured evidence graphs. We formalize an OKB as a 5-tuple that binds normative obligations to an RDF/OWL concept schema, executable SHACL validation rules, explicit evidence requirements, and PROV-O provenance links. A deterministic regulatory compiler translates structured Intermediate Representation (IR) records into composable KB modules, enabling profile-based governance reconfiguration without modifying service code. We implement two prototypes and evaluate them in an AI-assisted HPC resource allocation scenario across 24 validation runs and four governance profiles. Results demonstrate profile-sensitive validation, strictly additive violation accumulation, SHACL validation latency between 12.6 ms and 100.3 ms, and profile equivalence testing confirming Combined as the strictly most comprehensive profile. All artefacts are released as open source.

[AI-35] Parallel Context Compaction for Long-Horizon LLM Agent Serving

Link: https://arxiv.org/abs/2605.23296
Authors: Musa Cim,Burak Topcu,Chita Das,Mahmut Taylan Kandemir
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Long-horizon LLM agents accumulate growing conversation histories that eventually exceed the model’s context window. Context compaction via LLM-based summarization keeps the conversation bounded, but summarization is inherently lossy and the blocking call stalls agent inference for tens of seconds. Moreover, the operator has no fine-grained control over summary volume since prompt instructions are largely ignored, and as context grows, both the amount of output tokens the model produces and the information it retains fluctuate substantially from run to run, making the agent’s retained knowledge unpredictable across runs. We introduce parallel compaction for long-horizon agentic flows and characterize it against the sequential synchronous baseline across four backbones spanning 8B to 120B parameters, mixing dense and MoE architectures with reasoning and non-reasoning models, on the HotpotQA multi-hop QA and LoCoMo long-context dialogue benchmarks. Parallel compaction gives the operator fine-grained, predictable control over summary volume and enables more targeted prompt engineering per block. At matched compaction decode volume, it reduces end-to-end wall time and improves compaction throughput over the sequential baseline.

[AI-36] Reinforcement Learning for Microcanonical Graph Ensemble with Assortativity Constraints

Link: https://arxiv.org/abs/2605.23285
Authors: Hoyun Choi,Junghyo Jo,Deok-Sun Lee
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI)
Comments:

Abstract:How network structure determines function is a fundamental question, and it can be investigated by graph ensembles with precisely controlled structural properties. Canonical approaches, formulated as exponential random graph models (ERGMs), enforce constraints only in expectation, allowing individual realizations to fluctuate around the target. Conversely, microcanonical ensembles impose hard constraints exactly, but practical sampling methods beyond fixing the degree sequence have remained out of reach. Here we introduce the Deep Microcanonical Graph Generator (DMGG), a reinforcement learning (RL) framework that transforms any given graph through degree-preserving rewirings to exactly reach a prescribed assortativity, which characterizes the degree–degree correlation of adjacent nodes. Instead of relying on the entropically dominated Metropolis–Hastings dynamics of the ERGM, DMGG employs a policy-guided search that maximally alters the joint-degree matrix. This eliminates exhaustive parameter tuning and accelerates generation by at least an order of magnitude while preserving configurational diversity. As DMGG generalizes across various graph sizes, sparsities, and topologies, it provides exact null models that allow for the quantitative isolation of secondary observables, such as the clustering coefficient. These results establish RL as a practical and powerful paradigm for generating hard-constrained graphs, opening avenues to investigate structure-function relationships free from ensemble artifacts.
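
As a rough illustration of the two primitives the abstract relies on, the sketch below computes degree assortativity and attempts one degree-preserving double-edge swap. It is only a minimal sketch under stated assumptions: the function names are hypothetical, swaps are picked uniformly at random, and DMGG's learned policy for choosing swaps is not reproduced.

```python
import random
from collections import Counter


def degrees(edges):
    """Degree of every node in an undirected simple graph given as an edge list."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def assortativity(edges):
    """Pearson correlation of the degrees at the two ends of an edge.

    Each undirected edge is counted in both orientations, so both ends
    share the same mean and variance. Raises ZeroDivisionError for
    regular graphs, where the coefficient is undefined.
    """
    deg = degrees(edges)
    xs, ys = [], []
    for u, v in edges:
        xs += [deg[u], deg[v]]
        ys += [deg[v], deg[u]]
    n = len(xs)
    mean = sum(xs) / n
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return cov / var


def double_edge_swap(edges, rng):
    """Try one swap (a,b),(c,d) -> (a,d),(c,b).

    Returns the new edge list, or None if the swap would create a
    self-loop or a duplicate edge. Node degrees never change.
    """
    (a, b), (c, d) = rng.sample(edges, 2)
    if len({a, b, c, d}) < 4:
        return None
    existing = {frozenset(e) for e in edges}
    if frozenset((a, d)) in existing or frozenset((c, b)) in existing:
        return None
    kept = [e for e in edges if e not in ((a, b), (c, d))]
    return kept + [(a, d), (c, b)]
```

A random walk over such swaps changes the assortativity while keeping the degree sequence fixed; a guided search, as the abstract describes, would instead choose which swap to apply.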

[AI-37] When Good Equations Get Bad Scores: Improving Symbolic Regression Through Better Parameter Optimization

Link: https://arxiv.org/abs/2605.23272
Authors: Boxiao Wang,Kai Li,Zhiwei Chen,Yang Huang,Runxiang Wang,Ziwen Zhang,Yifan Zhang,Jian Cheng
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Symbolic Regression (SR) plays a central role in scientific knowledge discovery by distilling mathematical equations from observational data. Most existing SR methods function within a bi-level optimization framework: an outer loop that searches for the discrete equation structure, and an inner loop that optimizes the continuous parameters of that structure. Crucially, parameter-fitting quality directly determines a structure’s score and thus the outer-loop search. However, nonlinear operators make the inner loop highly non-convex, and budget-driven reliance on fast local solvers (e.g., BFGS) often yields poor local minima and underestimated scores for correct structures. This "Good Structure, Bad Score" phenomenon becomes a key bottleneck, degrading efficiency and misguiding the search away from the true equation. To resolve this, we propose SAGE-Fit (Structure-Aware and Semantics-Guided Evaluator for Symbolic Regression), an SR-native fitting framework that exploits the dual native priors of symbolic expressions. By capitalizing on the structural and semantic priors unique to SR, we design tailored modules for each property, thereby effectively mitigating this optimization bottleneck. Extensive experiments demonstrate that our approach, as a plug-and-play module, significantly enhances evaluation fidelity and universally improves the performance of various SR systems.

[AI-38] 6G Communication Networks Enabling Embodied Agents: Architecture and Prototype

Link: https://arxiv.org/abs/2605.23263
Authors: Lipeng Dai,Luping Xiang,Kun Yang
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Signal Processing (eess.SP); Systems and Control (eess.SY)
Comments:

Abstract:Embodied agents, which couple intelligent decision-making with physical actuation in the real world, impose far more stringent and heterogeneous communication requirements than purely software-based agents. While 6G promises sub-millisecond latency, ultra-high reliability, native intelligence, and integrated sensing, systematic studies on how to exploit these capabilities for embodied agent communication remain limited. This article investigates 6G-enabled communication systems for embodied agents from both conceptual and engineering perspectives. First, we review the concept and embodiment value of embodied agents and clarify their distinctions from disembodied agents. Then, we analyse the symbiotic relationship between embodied agents and 6G networks. We highlight how key 6G enablers can support the stringent requirements of human-robot interaction. Furthermore, we demonstrate the proactive role of embodied agents in bolstering communication networks through coverage extension, environmental sensing, and physical world understanding. Building on these insights, we propose a hierarchical communication architecture for human-robot remote interaction, comprising a human-intent perception layer, an open radio access network (O-RAN)-based transport layer, an intelligent intermediary layer, and an embodiment layer. To validate its feasibility, we implement an end-to-end prototype that integrates a haptic device, an industrial robotic arm, an intermediary platform, and a 5G O-RAN testbed. Experimental results demonstrate millisecond-level latency and stable closed-loop operation, confirming the practicality of the proposed architecture and providing a reference for future 6G-embodied agent research and industrial deployments.

[AI-39] Design and Report Benchmarks for Knowledge Work

Link: https://arxiv.org/abs/2605.23262
Authors: Yining Hua,Hongbin Na,Cyrus Ayubcha,Levi Lian
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:The development of LLM agents has led to a growing body of work on knowledge-work AI, including coding, research, and healthcare. However, current knowledge-work evaluation and benchmark design still largely follow the logic of traditional NLP tasks. As a result, higher benchmark performance does not reliably show that a system can carry out knowledge work in real-world deployment settings. This paper contributes a three-step approach for making explicit how benchmarked tasks represent the work claims attached to their scores: defining the work activity under evaluation, specifying the tested setting, and scoring the appropriate work product. We review work studies showing that knowledge work is organized through roles and responsibilities, local materials and tools, and artifacts that must remain usable in downstream workflows. We then translate these concerns into benchmark design and reporting guidance, covering how tasks should be mapped to work activities, how tested settings should specify materials, tools, roles, and constraints, and how scoring should focus on the work product left by the system. To name the work activity being evaluated and distinguish it from common benchmark tasks, we derive an inventory of 18 work activities from the O*NET occupational task database. We demonstrate the approach through three benchmark case analyses: GDPval, a non-code occupational deliverable benchmark; OfficeQA Pro, a grounded document-analysis benchmark scored by final answers; and APEX-SWE, a software-engineering benchmark with executable scored products. These cases show how benchmark design choices shape the strongest work claim a score can support, and where gaps arise between the benchmarked task, tested setting, scored product, and broader work claim.

[AI-40] Enhancing Deep Neural Network Reliability with Refinement and Calibration ICLR2026

Link: https://arxiv.org/abs/2605.23249
Authors: Ramya Hebbalaguppe,Ajay Shastry,Soumya Suvra Ghosal,Chetan Arora
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: ICLR 2026, Trustworthy AI and Representational Alignment

Abstract:Although deep neural networks (DNNs) achieve high predictive accuracy, their confidence estimates are often unreliable, potentially compromising user trust in their decisions. This has motivated research on calibrated models, where calibration measures how well a model’s predicted confidence aligns with the empirical probability of correctness. However, calibration metrics can often be improved through post-processing techniques that merely mimic training-time uncertainty without genuinely improving the model’s understanding. For this reason, statisticians recommend that models be not only calibrated but also refined. Intuitively, a model is considered more refined if it assigns significantly different confidence scores to correct and incorrect predictions, a property also referred to as sharpness. We observe that many existing calibration methods improve calibration at the cost of reduced refinement. To address this limitation, we propose: (1) a novel loss function that explicitly promotes refinement and can be optimized through supervised contrastive learning; and (2) a unified training framework, RefCal, that jointly optimizes calibration, refinement, and accuracy to improve DNN reliability. On the CIFAR-100-LT dataset with 10 percent class imbalance, RefCal achieves (accuracy, refinement, ECE) of (58.81, 95.67, 0.08), substantially outperforming the widely used Correctness Ranking Loss, which achieves (46.27, 93.7, 0.22).
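
The ECE figure quoted in the abstract is a standard calibration metric. The sketch below shows the common equal-width-bin estimator, assuming ten bins and half-open bins `[a, b)`; the function name and binning choices are illustrative assumptions, and the paper's refinement score and RefCal loss are not reproduced here.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE.

    confidences: predicted confidence in [0, 1] for each sample.
    correct:     True/False, whether each prediction was right.
    Returns the sample-weighted mean of |accuracy - mean confidence|
    over the non-empty bins.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece
```

A model that is always fully confident but right only half the time gets an ECE of 0.5 under this estimator, while a model whose 90%-confidence predictions are right 90% of the time gets an ECE near zero.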

[AI-41] Are Frontier LLMs Ready for Cybersecurity? Evidence for Vertical Foundation Models from Dual-Mode Vulnerability Benchmarks

Link: https://arxiv.org/abs/2605.23243
Authors: Vivek Dahiya,Sunny Nehra,Vipul Dholariya,Bhavik Shangari,Chandra Khatri
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:We evaluate whether frontier LLMs are ready for cybersecurity through a dual-mode benchmark: white-box function-level vulnerability detection (VulnLLM-R, across C/Java/Python) and black-box web application security testing (five production-style applications with 118 ground-truth vulnerabilities across 20+ CWE families, which we will open-source). We test six frontier models (GPT-5.4, Codex 5.3, Claude Opus 4.6, Sonnet 4.6, Gemini 3.1 Pro and Gemini 3 Flash) and two domain-specialized models across four testing paradigms. Our findings are sobering: (1) every frontier model produces 10-50% false positive rates in white-box detection, systematically over-predicting vulnerabilities; (2) in black-box testing, frontier models achieve only 4-8% ground-truth coverage, improving to just 10-19% even with external security tools (Playwright MCP, Burp Suite MCP); (3) structured penetration-testing methodology encoded in domain-specialized agents raises per-family detection above 50%, demonstrating that methodology, not scale, is the primary lever; and (4) a domain-specialized defense model achieves the highest precision (0.904) and lowest false positive rate (9.7%) among all models, on a single GPU. We identify the absence of structured security testing traces (end-to-end request/response sequences, failure-heavy data, and multi-step attack chains) as the fundamental training data bottleneck, and propose self-play security testing as a data generation strategy. Our results make the case for vertical foundation models purpose-built for cybersecurity.

[AI-42] PaP-NF: Probabilistic Long-Term Time Series Forecasting via Prefix-as-Prompt Reprogramming and Normalizing Flows ICPR2026

Link: https://arxiv.org/abs/2605.23219
Authors: Minju Kim,Youngbum Hur
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to ICPR 2026

Abstract:Time series forecasting plays a central role in many real-world applications and has been extensively studied. Most existing approaches rely on deterministic models. However, real-world environments exhibit inherently uncertain and complex future behaviors, making single-point predictions insufficient. This highlights the need for probabilistic forecasting methods that can quantify and represent uncertainty. In this work, we propose PaP-NF, a probabilistic forecasting framework that aligns continuous time series representations with a frozen large language model (LLM) using a Prefix-as-Prompt mechanism, and conditions a normalizing flow decoder on the global context extracted by the LLM. The quality of the resulting predictive distributions is evaluated using the Continuous Ranked Probability Score (CRPS), a standard metric in probabilistic forecasting. Across a variety of long-term forecasting benchmarks, PaP-NF robustly captures multi-modal uncertainty while maintaining competitive point forecasting accuracy. The official implementation is available at: this https URL
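
For readers unfamiliar with CRPS, the sketch below shows the usual sample-based estimator, CRPS ≈ E|X − y| − ½·E|X − X′|, applied to draws from a predictive distribution. The function name is an illustrative assumption, and nothing here reflects how PaP-NF itself generates or scores its samples.

```python
def crps_from_samples(samples, observed):
    """Sample-based CRPS estimate for one scalar observation.

    samples:  draws from the predictive distribution.
    observed: the value that actually occurred.
    Lower is better; with a single sample it reduces to absolute error.
    """
    m = len(samples)
    # Mean distance between forecast samples and the observation.
    accuracy_term = sum(abs(x - observed) for x in samples) / m
    # Mean pairwise distance between samples (rewards sharper forecasts).
    spread_term = sum(abs(a - b) for a in samples for b in samples) / (m * m)
    return accuracy_term - 0.5 * spread_term
```

The pairwise term is quadratic in the number of samples, which is fine for a sketch but would normally be replaced by a sorted-sample formulation for large ensembles.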

[AI-43] Foundation Protocol: A Coordination Layer for Agentic Society

Link: https://arxiv.org/abs/2605.23218
Authors: Bang Liu,Yongfeng Gu,Jiayi Zhang,Zhaoyang Yu,Sirui Hong,Maojia Song,Xiaoqiang Wang,Mingyi Deng,Zijie Zhuang,Ronghao Wang,Mingzhe Cao,Yutong Zhu,Xingjian Li,Yifan Wu,Jianhao Ruan,Yiran Peng,Shuangrui Chen,Jinlin Wang,Yizhang Lin,Dongjie Zhang,Dekun Wu,Chen Ma,Lizi Liao,Han Yu,Jian Pei,Heng Ji,Qiang Yang,Yuyu Luo,Chenglin Wu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Autonomous agents are moving from tools into a layer of social infrastructure: they browse, purchase, deploy software, manage systems, and increasingly interact with one another. As these systems scale, the bottleneck shifts away from raw model capability toward coordination. Agents need to form reliable relationships, organize multi-agent work, exchange value, support an AI economy, and stay safe and accountable under real-world oversight. This paper introduces the Foundation Protocol (FP), a graph-first coordination layer for an emerging human-AI society. FP unifies heterogeneous entities, including agents, tools, resources, humans, institutions, and organizations, and supports native multi-party organization and event-based collaboration. It also provides economic primitives for metering, receipts, and settlement, and treats policy, provenance, and audit as first-class concerns. FP is designed to wrap and bridge existing protocols rather than replace them, enabling incremental adoption while reducing integration and governance overhead. The aim is to keep autonomous agency composable while keeping accountability non-negotiable, so that coordination itself can become shared infrastructure for a human-AI society that is open, pluralistic, and governable.

[AI-44] AutoResearch AI: Towards AI-Powered Research Automation for Scientific Discovery

Link: https://arxiv.org/abs/2605.23204
Authors: Guiyao Tie,Jiawen Shi,Dingjie Song,Yixiao Huang,Ziji Sheng,Xueyang Zhou,Daizong Liu,Pan Zhou,Yongchao Chen,Ran Xu,Lifang He,Qingsong Wen,Manling Li,Cong Lu,Shuai Li,Pengtao Xie,Yixuan Yuan,Rui Meng,Lei Xing,Lichao Sun,Caiming Xiong,Philip S. Yu,Jianfeng Gao
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 49 pages, 12 figures, 10 tables

Abstract:Scientific research is being reshaped by AI systems that move beyond isolated assistance toward longer-horizon workflows spanning literature grounding, hypothesis generation, experimentation, validation, reporting, and revision. This shift marks a transition from task-level AI for science to workflow-level research automation. Yet current systems remain fragmented, differing in autonomy, domain scope, execution environment, validation mechanism, and human oversight, while still struggling with evidence preservation, reproducibility, weak-direction rejection, provenance tracking, cross-domain robustness, and accountable scientific closure. This survey examines these developments through AutoResearch, defined as the developmental spectrum of AI-powered scientific workflow automation. Within it, Vibe Research denotes the human-steered region of prompt-based assistance and human-verified execution, whereas emerging AI-led systems coordinate larger portions of the discovery loop without achieving robust autonomy. We analyze how research systems redistribute control, evidence, execution, validation, and accountability across workflows and organize the field around five workflow conditions: literature and research grounding; hypothesis formation and planning; experimentation and tool use; feedback, validation, and review; and reporting and knowledge communication. We further synthesize AI scientist systems, mixed-initiative co-research frameworks, benchmarks, domain deployments, and open-source infrastructures. Finally, we propose five evaluation dimensions (novelty, validity, impact, reliability, and provenance) and show that AutoResearch autonomy is domain-conditioned, being more credible in structured, executable, and rapidly verifiable settings but limited in embodied, delayed, heterogeneous, ethical, or institutionally accountable contexts.

[AI-45] Adaptive Mass-Segmented KV Compression for Long-Context Reasoning

Link: https://arxiv.org/abs/2605.23200
Authors: Junzhe Yang,Xiaoyu Shen
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:The linear growth of the Key-Value (KV) cache is a critical bottleneck in long-form LLM inference. Existing KV compression methods mitigate this by evicting tokens based on importance scores. However, we show that their reliance on global Top-k selection triggers Region Wipe-out: the severe eviction of contiguous reasoning blocks that derails logical coherence. To address this, we propose Adaptive Mass-Segmented (AMS) KV Compression, a framework that shifts the paradigm from token-level competition to region-aware quota allocation. AMS adaptively partitions the KV cache based on the spatial distribution of attention mass, ensuring structurally vital reasoning segments receive guaranteed memory quotas. To ensure stability during iterative decoding, an EMA-based smoothing mechanism is incorporated to prevent jitter in segment boundaries. Crucially, AMS is a universal plug-and-play layer that is orthogonal to existing scorers. It can be seamlessly integrated into representative methods such as TOVA, Expected Attention, KeyDiff, R-KV and TriAttention. AMS is also system-compatible with modern paged-KV serving frameworks such as vLLM, supporting efficient gather-and-compact KV execution without introducing additional steady-state attention overhead. Extensive experiments across a diverse suite of tasks, including mathematical reasoning (MATH500, AIME, GSM8K), code completion, open-domain QA, and sparse retrieval, demonstrate that AMS consistently mitigates structural fragmentation and boosts model performance.
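
To make the contrast between global Top-k eviction and region-aware quotas concrete, the sketch below keeps `k` cache entries either globally or per segment, with quotas proportional to each segment's score mass. It is a toy sketch under several assumptions: segment boundaries are given rather than derived adaptively, there is no EMA smoothing, every segment holds at least its quota, `k` is at least the number of segments, and all names are hypothetical. It is not the AMS algorithm.

```python
def global_top_k(scores, k):
    """Keep the k highest-scoring positions, regardless of where they sit."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


def segment_quotas(masses, k, min_quota=1):
    """Split a budget of k slots across segments.

    Every segment first gets min_quota slots; the rest is shared in
    proportion to mass, with leftovers going to the largest remainders.
    """
    n = len(masses)
    quotas = [min_quota] * n
    remaining = k - min_quota * n
    total = sum(masses)
    shares = [remaining * m / total for m in masses]
    for i, share in enumerate(shares):
        quotas[i] += int(share)
    leftover = k - sum(quotas)
    by_remainder = sorted(range(n), key=lambda i: shares[i] - int(shares[i]),
                          reverse=True)
    for i in by_remainder[:leftover]:
        quotas[i] += 1
    return quotas


def segmented_top_k(scores, boundaries, k, min_quota=1):
    """Top-k inside each half-open (start, end) segment, using mass-based quotas."""
    masses = [sum(scores[s:e]) for s, e in boundaries]
    quotas = segment_quotas(masses, k, min_quota)
    kept = []
    for (start, end), quota in zip(boundaries, quotas):
        local = sorted(range(start, end), key=lambda i: scores[i], reverse=True)
        kept.extend(local[:quota])
    return sorted(kept)
```

With scores `[9, 8, 7, 6, 1, 1, 1, 1]` and `k = 4`, the global rule keeps only the first four positions and drops the second region entirely, whereas the segmented rule over `[(0, 4), (4, 8)]` still retains one entry from it.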

[AI-46] Scalable Heterogeneous Graph Foundation Models for Data-Driven Optimal Power Flow in Smart Grids

Link: https://arxiv.org/abs/2605.23194
Authors: Massimiliano Lupo Pasini,Yijiang Li,Kibaek Kim,Teja Kuruganti
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, 6 tables, 4 figures

Abstract:Fast and reliable optimal power flow (OPF) approximation is essential for reliable smart-grid operation, yet many learning-based surrogates either flatten the native heterogeneous structure of power networks, target a limited set of grid topologies, or lack scalable infrastructure for graph foundation model (GFM) training. This paper presents a scalable heterogeneous graph neural network (GNN) workflow, built on HydraGNN, for data-driven OPF surrogate modeling and OPF-GFM development. The workflow preserves the distinct node and edge types of power grids – buses, generators, loads, shunts, AC lines, transformers, and device-to-bus couplings – and supports distributed preprocessing, training, hyperparameter optimization (HPO), and downstream fine-tuning on leadership-class supercomputers. Using three million heterogeneous graph instances spanning ten PGLib-OPF cases, from 14 to 13,659 buses, we conduct DeepHyper-driven HPO on the ORNL Frontier supercomputer. The campaign identifies compact models (approximately 1.6–1.7M parameters) with the lowest validation losses. Downstream experiments on feasibility classification and N-1 contingency regression show that fine-tuning pretrained OPF GFM improves low-data accuracy, stabilizes training, accelerates convergence, and reduces adaptation cost when partial or head-only fine-tuning is used.

[AI-47] Redrawing the AI Map: A Theory of Accountability Boundaries in Agentic Ecosystems

Link: https://arxiv.org/abs/2605.23179
Authors: Muhammad Zia Hydari,Farooq Muzaffar
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Agentic AI orchestrators reduce the interface and assembly costs of composing information systems capabilities across organizational boundaries, seemingly accelerating modularization and organizational disaggregation. Yet AI-enabled capabilities whose outputs require evidence, review, signoff, or assignable responsibility may retain integrated accountability boundaries even when their technical interfaces become modular. We develop a capability-level theory of accountability-boundary placement in agentic ecosystems. We introduce accountability assets: complementary assets that make AI-supported outputs legitimate, auditable, reviewable, and assignable to a responsible party. We argue that verification cost and responsibility transferability determine whether the execution and accountability boundaries can move together. The theory identifies three boundary strategies: component, integrated, and dual-track. It also introduces rule debt, the governance burden that accrues when organizational decision rules migrate from formal information systems into ungoverned agentic execution environments. Integrating digital innovation, transaction cost, complementary-assets, digital platform governance, and IS control perspectives, we develop seven propositions linking agentic assembly-cost reductions, accountability assets, appropriability, orchestrator intent capture, and boundary misconfiguration to boundary strategy, value appropriation, and rule debt. The theory explains when digital modularization extends to organizational disaggregation and when accountability keeps capabilities integrated. Structured illustrations across document processing, legal services, audit, clinical decision support, and procurement discipline the boundary logic.

[AI-48] Understanding and Improving Noisy Embedding Techniques in Instruction Finetuning

Link: https://arxiv.org/abs/2605.23171
Authors: Abhay Yadav
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: arXiv admin note: substantial text overlap with arXiv:2312.01523

Abstract:Recent advancements in instructional fine-tuning have injected noise into embeddings, with NEFTune (Jain et al., 2024) setting benchmarks using uniform noise. Despite NEFTune’s empirical findings that uniform noise outperforms Gaussian noise, the reasons for this remain unclear. This paper aims to clarify this by offering a thorough analysis, both theoretical and empirical, indicating comparable performance among these noise types. Additionally, we introduce a new fine-tuning method for language models, utilizing symmetric noise in embeddings. This method aims to enhance the model’s function by more stringently regulating its local curvature, demonstrating superior performance over the current method, NEFTune. When fine-tuning the LLaMA-2-7B model using Alpaca, standard techniques yield a 29.79% score on AlpacaEval. However, our approach, SymNoise, increases this score significantly to 69.04%, using symmetric noisy embeddings. This is a 6.7% improvement over the state-of-the-art method, NEFTune (64.69%). Furthermore, when tested on various models and stronger baseline instruction datasets, such as Evol-Instruct, ShareGPT, OpenPlatypus, SymNoise consistently outperforms NEFTune. The current literature, including NEFTune, has underscored the importance of more in-depth research into the application of noise-based strategies in the fine-tuning of language models. Our approach, SymNoise, is another significant step towards this direction, showing notable improvement over the existing state-of-the-art method.
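
As background for the NEFTune baseline mentioned in the abstract, the sketch below adds uniform noise to a sequence of token embeddings, scaled by `alpha / sqrt(L * d)` for sequence length `L` and embedding dimension `d`, which is the scaling NEFTune is generally described as using. The function name and the plain-list representation are illustrative assumptions. The abstract does not specify SymNoise's symmetric noise distribution, so it is not reproduced here.

```python
import math
import random


def add_uniform_noise(embeddings, alpha, rng):
    """Return a noisy copy of a sequence of token embeddings.

    embeddings: list of L token vectors, each a list of d floats.
    alpha:      noise magnitude hyperparameter (0 disables the noise).
    rng:        a random.Random instance, passed in for reproducibility.
    """
    seq_len = len(embeddings)
    dim = len(embeddings[0])
    # Scale shrinks as the sequence or embedding grows.
    scale = alpha / math.sqrt(seq_len * dim)
    return [
        [x + scale * rng.uniform(-1.0, 1.0) for x in token]
        for token in embeddings
    ]
```

In this style of training the noise is applied only during fine-tuning; inference uses the clean embeddings.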

[AI-49] PoisonForge: Task-Level Targeted Poisoning Benchmark for Instruction-Tuned LLMs

Link: https://arxiv.org/abs/2605.23168
Authors: Luze Sun,Anshuman Suri,Harsh Chaudhari,Cristina Nita-Rotaru,Alina Oprea
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:When practitioners fine-tune LLMs on unvetted datasets, an adversary can exploit the data supply chain through task-level poisoning: inserting a small number of crafted instruction-response pairs that cause the model to embed attacker-specified entities, such as a country, in outputs for a targeted task family while behaving normally elsewhere. We introduce PoisonForge, a benchmark that parameterizes this threat along four dimensions (bias type, poisoning mode, appearance count, and target output length) and evaluates 12 open-weight models (from 2B to 32B parameters) across five families under a primarily 1% poison budget. With only 10 poisoned examples among 1,000 fine-tuning examples, 11 of 12 models exceed a 70% attack success rate (ASR) in their most vulnerable configuration. Meanwhile, unintended leakage to non-target tasks remains below 0.5%, and models perform well on standard benchmarks. We analyze in detail the factors contributing to attack success. We observe that multiple appearances of an entity increase the ASR, the optimal poisoning mode depends on the semantic structure of the target entity, and ASR drops monotonically with the task output length. A correlation analysis and risk prediction model confirm that poisoning design choices, rather than model scale, are the primary causes of attack success, and that these patterns generalize to predict attack success on new tasks. We release all configurations, pipelines, and analysis code to support reproducible comparisons.

[AI-50] Infra-Bayesian Reinforcement Learning Agents Outperform Classical RL For Worst-Case Robustness

Link: https://arxiv.org/abs/2605.23146
Authors: Manish Aryal,Faiyaz Azam,Agnivo Banerjee,Sai Sidhanth Manoharan Jayanthi,Allegra Laro,Clément Legentilhomme,Andrew Lin,Florian Lorkowski,Radman Rakhshandehroo,Patric Rommel,Emanuel Ruzak,Nathan Theng,Paul Yushin Rapoport
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Classical reinforcement learning assumes the agent interacts with a fixed environment whose behavior does not depend on the agent’s policy. This assumption breaks down in non-realizable settings where other actors might anticipate the agent’s behavior, including environments crucial to AI safety, where the agent interacts with predictors, humans, other AI agents, and institutions. In such settings, the agent’s model class fails to capture the world in which it operates. Under such misspecification, classical Bayesian methods can produce confidently wrong posteriors, unreliable decisions, and unbounded regret, as realizability fails to obtain. Infra-Bayesianism is a decision-theoretic framework that addresses these failures by distinguishing ordinary probabilistic uncertainty, where priors can be reasonably chosen, from Knightian uncertainty, where no grounds exist for the construction of such a prior. It does so by evaluating actions on their worst-case outcomes, rather than from posterior expectations or weighted averaging. We present the first proof-of-concept implementation of an infra-Bayesian reinforcement learning architecture for finite-outcome stateless decision problems. Our agent maintains a set of imprecise hypotheses, updates them using infra-Bayesian conditioning, and selects actions by maximizing worst-case expected value. We apply this implementation of the infra-Bayesian maximin decision process to an environment with Knightian uncertainty, and demonstrate a lower worst-case regret as compared to classical reinforcement learning agents. We also investigate Newcomb’s problem and show that the infra-Bayesian agent picks the optimal strategy, outperforming classical decision theory agents. Our results provide a step towards reinforcement learning agents that remain robust under model misspecification and policy-dependent uncertainty.
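
The decision rule the abstract describes, choosing the action with the best worst-case expected value over a set of hypotheses, can be illustrated for a stateless problem as follows. This is only a minimal sketch: the function name, the payoff table, and the hypothesis set are made-up assumptions, and infra-Bayesian conditioning (the update step) is not implemented.

```python
def maximin_action(payoffs, hypotheses):
    """Pick the action whose worst-case expected payoff is highest.

    payoffs:    dict mapping action -> list of payoffs, one per outcome.
    hypotheses: list of probability distributions over outcomes; the set
                stands in for uncertainty that no single prior captures.
    """
    def worst_case(action):
        return min(
            sum(p * u for p, u in zip(dist, payoffs[action]))
            for dist in hypotheses
        )
    return max(payoffs, key=worst_case)


# Hypothetical two-outcome problem: "risky" pays well under outcome 0 only.
example_payoffs = {"safe": [1, 1], "risky": [3, -2]}
optimistic = [0.9, 0.1]
pessimistic = [0.2, 0.8]
```

With only the optimistic distribution the rule reduces to ordinary expected-value maximization and selects `"risky"`; once the pessimistic distribution is added to the hypothesis set, the worst case for `"risky"` drops below the guaranteed payoff and the rule selects `"safe"`.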

[AI-51] CALAD: Channel-Aware contrastive Learning for multivariate time series Anomaly Detection ICPR2026

Link: https://arxiv.org/abs/2605.23139
Authors: Jaehyeop Hong,Youngbum Hur
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to ICPR 2026

Abstract:Multivariate time series anomaly detection has become increasingly important in real-world applications, where labeled data are often scarce. Many existing approaches rely on unsupervised learning to model normal patterns, but they often treat all channels equally. This design can dilute anomaly-relevant signals, since not all channels contribute equally to anomaly detection. In this paper, we propose CALAD, a channel-aware contrastive learning framework for multivariate time series anomaly detection. CALAD governs the construction of contrastive samples using estimated channel relevance, allowing the learning process to reflect anomaly semantics rather than generic similarity. Channel relevance is estimated from reconstruction errors of a transformer-based autoencoder and is used to distinguish channels that are more influential to anomalous behaviors. Using this information, we design a channel-wise augmentation strategy in which positive and negative samples are constructed based on whether anomaly-relevant channels are preserved or perturbed. This encourages invariance to changes in irrelevant channels while remaining sensitive to changes in anomaly-relevant channels. Furthermore, CALAD combines contrastive learning and an auxiliary reconstruction head, allowing the model to learn discriminative representations while retaining normal structures. Experiments on multiple real-world datasets show that CALAD consistently outperforms existing methods, particularly under distribution shift scenarios. We provide the code for reproducibility at this https URL

[AI-52] Inductive Deductive Synthesis: Enabling AI to Generate Formally Verified Systems

Link: https://arxiv.org/abs/2605.23109
Authors: Shubham Agarwal,Alexander Krentsel,Shu Liu,Mert Cemri,Audrey Cheng,Rui Meng,Tomas Pfister,Chun-Liang Li,Sylvia Ratnasamy,Aditya Parameswaran,Matei Zaharia,Ion Stoica,Mohsen Lesani
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Logic in Computer Science (cs.LO); Programming Languages (cs.PL)
Comments:

Abstract:AI agents increasingly excel at generating, testing, and refining code. However, they fall short on tasks requiring formal guarantees of full coverage that testing alone cannot provide. Distributed systems are a prime example: properties such as consistency between reads and writes must hold under every possible interleaving of events. Mechanized formal verification can guarantee such correctness, but typically demands months to years of expert effort. As evidence, even SOTA coding agents (Codex with GPT-5.4 and Claude Code with Opus 4.6) succeed on only 2/7 distributed key-value-store specifications. In this paper, we present the first effective approach to addressing this gap, Inductive Deductive Synthesis (IDS), which jointly and incrementally synthesizes implementation and proof, and learns from failed attempts to systematically try promising strategies. Built as an agentic LLM system, IDS achieves 7/7 in about 6.8 hours and $106 per spec on average, roughly 200x faster than expert effort and 17% cheaper than SOTA agents. IDS further incorporates performance feedback into the same loop, yielding implementations up to 3x faster than published verified systems.

[AI-53] Philosophical Dispositions as Behavioral Constraints for AI-Assisted Code Review: An Empirical Study

Link: https://arxiv.org/abs/2605.23108
Authors: Kaushal Bansal
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:AI-assisted code review tools typically operate as generic “expert reviewer” agents, producing homogeneous findings regardless of the analysis type needed. We present a system that constrains AI reviewer behavior through philosophical dispositions – coherent personality lenses grounded in specific epistemological traditions (Pyrrhonist Skepticism, Navya-Nyāya logic, Diogenes’ Cynicism, Confucian relational ethics) that direct attention to structurally different types of issues. Each disposition is defined apophatically (by what it refuses to do), equipped with a self-monitoring failure mode (hamartia), and orchestrated in sequence by role protocols. We evaluate this system on 50 merged pull requests across 7 repositories spanning 5 programming languages (Python, Go, C++, Java, Terraform), 5 organizations (2 enterprise, 3 open-source), and 2 temporal eras (pre-AI 2020, post-AI 2024–2026). The disposition system achieves 46% convergence with human reviewers (validating signal quality), identifies unique findings at a 75% rate, and produces no findings judged false-positive by the author across 601 total findings (inter-rater agreement was not assessed and remains a limitation). A controlled baseline comparison demonstrates that 51% of disposition findings are not produced by the same model using generic “expert reviewer” prompting, and these unique findings target structural, operational, and logical concerns rather than standard code-level issues. Preliminary cross-model validation (Claude Opus vs. GPT Codex 5.3-xhigh) on 3 PRs shows 100% framework-structure adherence with 39% finding-level agreement, suggesting the framework provides real behavioral constraint while preserving model-specific analytical perspective.

[AI-54] Security of LLM-generated Code: A Comparative Analysis

Link: https://arxiv.org/abs/2605.23091
Authors: Srivathsan G Morkonda,Mahmoud Selim,Hala Assal
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:

Abstract:The majority of software developers use or are planning to use Artificial Intelligence (AI) tools in their development processes. Their top reasons include improving productivity and faster learning. In fact, Large Language Model (LLM)-generated code is currently in production, including in major tech companies. However, concerns were raised about the risks associated with the use of AI tools to generate code. In this paper, we focus our attention on the risks to software security. We empirically evaluate the security of code generated by seven popular LLMs. We build upon previous work to mimic the behaviours of developers when using LLMs to generate code. Our results show that all seven LLMs that we have evaluated generate code that contains vulnerabilities, the majority of which are of critical or high severity.

[AI-55] Dreaming Smoothly and Sample Efficiently with Gradient Penalized Latent Dynamics

Link: https://arxiv.org/abs/2605.23089
Authors: Romil V. Sonigra(1),P. R. Kumar(1) ((1) Texas A&M University)
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 17 pages and 9 figures

Abstract:Model-based reinforcement learning improves sample efficiency by learning a world model. However, existing latent world models such as DreamerV3 do not explicitly enforce local smoothness in their learned transition dynamics, leaving a useful inductive bias for transition dynamics learning unexploited. We propose GPLD, a gradient-penalized latent dynamics regularizer for DreamerV3 that applies a row-wise Jacobian penalty to the posterior latent distribution to encourage locally smooth transition learning. We show that this penalty can be interpreted as the continuous-latent analog of finite-difference smoothing of transition laws in discrete embedded-state MDPs, and estimate it efficiently using Hutchinson-style stochastic probes. Empirically, across DeepMind Control proprioceptive tasks, GPLD improves aggregate sample efficiency, with particularly strong gains on higher-complexity locomotion environments. On more challenging quadruped tasks, GPLD reaches high-return behavior earlier and exhibits more consistent late-stage learning over longer horizons. Explicit local smoothness regularization is a simple and effective way to improve latent world models for smooth continuous control environments. Code for GPLD is available at this http URL .
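
The idea behind a Hutchinson-style probe can be shown on a small fixed matrix. The sketch below is a generic illustration, not GPLD itself: it assumes the quantity of interest is the squared Frobenius norm of a Jacobian `J`, and uses the identity that the average of ||Jv||² over random sign vectors `v` equals ||J||²_F. The matrix values and function names are hypothetical.

```python
import itertools
import random


def matvec(J, v):
    """Multiply matrix J (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in J]


def frobenius_sq(J):
    """Exact squared Frobenius norm: sum of squared entries."""
    return sum(x * x for row in J for x in row)


def hutchinson_estimate(J, num_probes, rng):
    """Monte Carlo estimate of ||J||_F^2 using random +/-1 probe vectors."""
    n = len(J[0])
    total = 0.0
    for _ in range(num_probes):
        v = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        total += sum(y * y for y in matvec(J, v))
    return total / num_probes


def exact_probe_average(J):
    """Average of ||Jv||^2 over every +/-1 vector (feasible only for tiny n)."""
    n = len(J[0])
    probes = list(itertools.product((-1.0, 1.0), repeat=n))
    return sum(sum(y * y for y in matvec(J, v)) for v in probes) / len(probes)


# A stand-in "Jacobian" with hypothetical values.
J = [[1.0, 2.0, 0.0],
     [0.5, -1.0, 3.0]]
estimate = hutchinson_estimate(J, 2000, random.Random(0))
```

Averaging over all sign vectors recovers the Frobenius norm exactly because the cross terms cancel; the sampled estimate only approximates it, with error shrinking as the number of probes grows. In practice the probes would be applied through automatic differentiation rather than an explicit matrix.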

[AI-56] PathCal: State-Aware Reflection-Marker Calibration for Efficient Reasoning

Link: https://arxiv.org/abs/2605.23074
Authors: Lingyu Jiang,Zirui Li,Shuo Xing,Peiran Li,Tsubasa Takahashi,Dengzhe Hou,Zhengzhong Tu,Kazunori Yamada,Fangzhou Lin
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 21 pages, 5 figures, 7 tables

Abstract:The emergence of Large Reasoning Language Models (LRMs) has paved the way for tackling complex reasoning tasks through test-time scaling by generating long-form Chain-of-Thought (CoT) trajectories during inference. Meanwhile, these trajectories often contain explicit reflection markers such as “wait”, “but”, and “alternatively”, signaling hesitation, revision, and the consideration of alternative explorations, respectively. Recent studies on test-time control leverage such markers as lightweight handles for steering reasoning, typically treating them as a single coarse-grained category rather than distinguishing their distinct functional roles. In this paper, we conduct type-wise suppression and fixed-prefix intervention, revealing that reflection markers differ not only in their functional roles but also in when they exert the greatest influence. Specifically, different marker classes affect accuracy and generation length in distinct ways, and marker choices are most consequential before the model settles into a stable reasoning trajectory. Motivated by these findings, we introduce PathCal, a novel training-free decoding controller that calibrates reasoning paths by distinguishing marker types and intervening only at locally uncertain states. At each decoding step, PathCal utilizes the distribution over reflection-markers to estimate local competition between maintaining the current reasoning trajectory and initiating a competing branch, and softly rebalances marker logits when competing-branch evidence becomes excessive. Experiments across six reasoning benchmarks demonstrate that PathCal achieves a better efficiency–performance trade-off, improving or preserving accuracy while reducing generation length, without relying on external verifiers or additional sampling.

[AI-57] Anytime Training with Schedule-Free Spectral Optimization

Link: https://arxiv.org/abs/2605.23061
Authors: Anuj Apte,Pranav Deshpande,Niraj Kumar,Shouvanik Chakrabarti,Junhyung Lyle Kim
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments:

Abstract:Standard neural network training relies on learning-rate schedules tied to a fixed horizon, leading to strong path dependence and costly re-tuning as data availability changes. Schedule-Free (SF) methods address this by removing explicit schedules, yet SF-AdamW, the current state-of-the-art anytime optimizer, consistently underperforms well-tuned AdamW baselines. We propose SF-NorMuon, a schedule-free spectral optimizer that closes this gap: with a single hyperparameter configuration, SF-NorMuon matches or exceeds tuned AdamW on 125M and 772M parameter language models across 1–8× Chinchilla horizons. On the theoretical side, we prove a stationarity guarantee for schedule-free spectral dynamics and identify weight decay at the fast iterate as essential for long-horizon stability. SF-NorMuon enables practitioners to obtain high-quality checkpoints at any point during training without committing to a horizon in advance. By closing the performance gap with tuned baselines, SF-NorMuon makes horizon-free optimization more practical, taking a step towards truly open-ended, continual learning.

[AI-58] A measurement substrate for agentic Kubernetes operations: Methodology and a case study in retrieval-compounding falsification

Link: https://arxiv.org/abs/2605.23058
Authors: Joshua Odmark,Gideon Rubin,Deon van der Vyver
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 22 pages. Code at this https URL tag v0.1.0 (Apache 2.0). Source repo at this https URL tag arxiv-v1

Abstract:Empirical claims about autonomous Kubernetes operations agents are largely unfalsifiable. Published work reports observational results without controlled comparisons against an agent-disabled baseline, selection bias is endemic, pre-registered decision matrices are absent, and samples are typically too small for the noise level of the underlying scoring system. The cause is the same gap that limits the agents themselves: code agents have a verification substrate that turns “did it work” into a fast, falsifiable, ground-truth signal, and operations has nothing equivalent. We present agent-breakage, a closed-loop measurement framework that injects faults into a target Kubernetes cluster, observes how an autonomous agent responds, scores the response on four axes against ground truth, and accumulates outcome-labeled (state, action, outcome) tuples. The framework distinguishes framework error from reasoning error, supports a true off-condition control via a deterministic-embedder mechanism, and enforces pre-registered decision matrices. We use it as a case study to test whether retrieval over past postmortems compounds an agent’s capability. The methodological payload is three confounds the substrate caught during that case study, each of which would have produced a wrong published claim on a less instrumented version of the same work: a pgvector index bug, a +19% selection-bias artifact, and small-sample estimates that overstated effects by roughly 3x. The retrieval result itself is a partial falsification: 1 of 3 dense-corpus scenarios significant at p < 0.05, pooled effect +3.9 percentage points, not significant at n=60. A within-scenario corpus-density sweep at 360 runs shows that mechanistic alignment of near-neighbors dominates raw count. The framework is released open source.

[AI-59] DRL-Driven Edge-Aware Utility Optimization for Multi-Slice 6G Networks

Link: https://arxiv.org/abs/2605.23056
Authors: Khaled M. Naguib,Soumaya Cherkaoui,Mahmoud M. Elmessalawy,Ahmed M. Abd El-Haleem,Ibrahim I. Ibrahim
Affiliation: Unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments: 5 pages

Abstract:Virtual Reality (VR) services delivered over 6G networks demand ultra-low latency and high bandwidth to ensure seamless user experiences. This paper presents an intelligent resource allocation and edge caching framework for 6G O-RAN networks, leveraging Deep Q-Network (DQN) learning for optimizing edge caching and dynamic resource provisioning across multiple network slices within an O-RAN-compliant architecture. By incorporating DRL agents into the network control plane, the proposed system enables proactive and adaptive content distribution as well as real-time computational resource allocation that meets the quality-of-service demands of eMBB, URLLC, and especially the emerging MBRLLC slices essential for VR. Simulation results demonstrate that the DQN-based framework consistently outperforms traditional methods in reducing latency and improving throughput, leading to more reliable and responsive support for immersive VR applications in 6G environments.

[AI-60] Uncovering the Latent Potential of Deep Intermediate Representations ICML2026

Link: https://arxiv.org/abs/2605.23033
Authors: Arnesh Batra,Arush Gumber,Aniket Khandelwal,Jashn Khemani,Anubha Gupta
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to ICML2026 as a Spotlight

Abstract:Foundational Models pretrained on huge amounts of data learn representations that evolve across depth, forming a hierarchy of embeddings with distinct semantic content and geometric structure. Contrary to the widespread practice of using only the final layer or shallow mixtures, we show that task-relevant information is distributed non-monotonically across layers and cannot be recovered by naïve aggregation. Through a geometric and empirical study across multiple modalities, we show that effective transfer depends on identifying which layers encode task-discriminative structure and how their embeddings are geometrically organized. We introduce Layer-wise Optimal Embedding Selection (LOES), a constructive spectral method that identifies task-discriminative subspaces by minimizing residual error under orthogonality and isotropy constraints. To align fine-tuning with this selection principle, we further propose Geometric Regularization Loss (GeoReg), which enforces a simplicial structure on class manifolds and stabilizes representation geometry during fine-tuning. Across a wide range of architectures, depths, modalities, and data regimes, LOES consistently outperforms standard baselines, with gains that grow as model depth increases. Beyond accuracy, our method reveals how semantic factors are distributed across layers, thereby enabling cross-lingual and cross-modal interpretability analyses. Together, our results provide strong evidence that layerwise embedding geometry is not incidental but central to how deep models represent and transfer knowledge.

[AI-61] Whose Good Whose Place? The Moral Geography of Agentic AI for Social Good

Link: https://arxiv.org/abs/2605.22995
Authors: Poli Nemkova,Haeshitha Indukuri,Jaedon Charles
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:

Abstract:Agentic AI systems are increasingly proposed for social-good domains, often invoking the United Nations Sustainable Development Goals (SDGs) as a vocabulary of global benefit. Yet claims of social good do not establish accountability to the communities a system claims to serve. We present a structured survey of 112 papers on agentic AI for social good published between 2015 and 2026. We find a moral-geographic asymmetry: papers are least likely to specify geographic context in precisely the domains where local political, legal, and cultural context matters most. Across the corpus, 82 of 112 papers (73%) specify no geographic context. Papers aligned with health or physical/ecological SDGs specify geography 37-40% of the time, while papers aligned with institutional and social-policy SDGs do so only 13%. SDG 16, peace, justice, and strong institutions, is both the most-covered goal in the corpus and the one with the lowest geographic-specification rate. We interpret this as moral abstraction: agentic AI for social good often treats institutional good as universal in ways it does not treat health or ecological good. A second finding compounds this: only 28 of 112 papers (25%) report any real-world deployment or small-scale test. We identify five accountability gaps and propose a minimal reporting standard for more context-specific, participatory, and accountable agentic AI for social good.

[AI-62] Test-Time Training Undermines Safety Guardrails

Link: https://arxiv.org/abs/2605.22984
Authors: Simone Antonelli,Sadegh Akhondzadeh,Aleksandar Bojchevski
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 30 pages, 4 figures. Project page: this https URL

Abstract:Test-Time Training (TTT) is an emerging paradigm that enables models to adapt their parameters during inference, improving performance on tasks such as few-shot learning, retrieval-augmented generation, and complex reasoning. However, this dynamic adaptation introduces new vulnerabilities that adversaries can exploit to jailbreak models. We identify three threat models for TTT and demonstrate how attackers can leverage them to bypass safety filters. Our results show that TTT can significantly increase the Attack Success Rate (ASR) and the ASR over 10 generation trials (ASR@10). For example, under LoRA, the few-shot and generation-phase threat models achieve an average ASR@10 of 95% and 93% respectively, across models from different families and scales. These vulnerabilities transfer to production fine-tuning APIs. We also show that TTT-induced overfitting can produce degenerate outputs that inflate ASR under standard judges, and propose a validity-aware evaluation to correct for this. Our findings suggest that TTT exposes a new attack surface, strengthens attacks, and undermines existing safety guardrails. As a first step toward defense, we propose a lightweight provider-side detector that flags TTT requests via the perplexity shift on a private harmful holdout, but robust deployment will ultimately require dynamic alignment.

[AI-63] LLM Code Smells: A Taxonomy and Detection Approach

Link: https://arxiv.org/abs/2605.22976
Authors: Zacharie Chenail-Larcher,Brahim Mahmoudi,Naouel Moha,Quentin Stiévenart,Florent Avellaneda
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) are increasingly integrated into software systems for diverse purposes, due to their versatility, flexibility, and ability to simulate human reasoning to some extent. However, poor integration of LLM inference in source code can undermine software system quality. Therefore, inadequate LLM integration coding practices must be documented to help developers mitigate such issues. Following our earlier work on LLM code smells, this paper consolidates and refines the concept by presenting a self-contained taxonomy and a catalog of nine LLM code smells. We also create SpecDetect4LLM, a static source code analysis tool for their detection, and conduct extensive empirical evaluations of its detection effectiveness (precision and recall) as well as the prevalence of LLM code smells across 692 open-source software projects (171,194 source files). Our results show that LLM code smells affect 73.5% of the analyzed systems, with a detection precision of 91.3% and a recall of 71.8%.

[AI-64] Worse than Random: The Importance of a Baseline for Unsupervised Feature Selection

Link: https://arxiv.org/abs/2605.22973
Authors: Muhammad Rajabinasab,Michael E. Houle,Oussama Chelly,Arthur Zimek
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Preprint submitted to Elsevier Pattern Recognition Letters

Abstract:Many novel unsupervised feature selection methods are proposed each year, yet their empirical evaluation is limited to supervised and unsupervised evaluation metrics computed on selected datasets, along with comparisons to existing methods. However, in the absence of an established evaluation baseline, it is difficult to determine the value added to the existing literature by each of these methods, and how effective their underlying approaches are. We propose using random feature selection as a baseline for evaluating unsupervised feature selection methods. We empirically show that many of the state-of-the-art methods in unsupervised feature selection are outperformed by random feature selection in both performance and efficiency. Accordingly, we emphasize the strict requirement of considering random feature selection as a baseline in the development process of novel unsupervised feature selection methods to ensure a consistent improvement over random feature selection.
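
A random-selection baseline of the kind this abstract argues for can be sketched in a few lines. This is an illustrative sketch only, not the paper's protocol: the scoring function `toy_score`, the "informative" feature set, and the choice to compare against the mean random score are all hypothetical assumptions.

```python
import random


def random_feature_subset(num_features, k, seed):
    """Draw k distinct feature indices uniformly at random."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_features), k))


def baseline_scores(score_fn, num_features, k, repeats):
    """Score several independent random subsets (one per seed)."""
    return [score_fn(random_feature_subset(num_features, k, seed))
            for seed in range(repeats)]


def beats_random(method_score, random_scores):
    """True if a method's score exceeds the mean random-baseline score."""
    return method_score > sum(random_scores) / len(random_scores)


# Hypothetical scoring function: fraction of selected features that fall in
# a made-up "informative" set. A real study would plug in a clustering or
# classification metric computed on the selected features instead.
INFORMATIVE = {0, 1, 2}


def toy_score(subset):
    return len(INFORMATIVE.intersection(subset)) / len(subset)


scores = baseline_scores(toy_score, num_features=10, k=3, repeats=20)
```

Repeating the random draw over several seeds gives a distribution of baseline scores, so a proposed method can be compared against the mean (or a quantile) rather than a single lucky or unlucky subset.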

[AI-65] A mathematical theory of balancing relational generalization and memorization

Link: https://arxiv.org/abs/2605.22972
Authors: Luke Cheng,Samuel Lippl
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Humans, animals, and modern machine learning models exhibit impressive abilities to learn complex behaviors and generalize these behaviors to unseen situations. This ability requires us to learn rules and regularities that allow for such generalizations. At the same time, in most complex environments, any rule will have its exceptions. How do learning systems balance between learning general regularities and memorizing exceptions? We argue that a lack of task paradigms has hindered the study of this essential ability. To address this gap, we introduce a novel task, transitive inference with exceptions, that tests for relational generalization and memorization of an exception to the relational rule. We then analytically characterize the behavior of a simple, theoretically tractable model of neural network learning (kernel ridge regression) across a broad family of representations and task parameters. We find that these models can balance between relational generalization and memorization, but unlike for transitive inference without an exception, successful generalization is sensitive to the specific representational geometry. We explain why this task is more challenging mechanistically by drawing on our analytical theory. Finally, we validate our theoretical insights in pretrained language models that are finetuned on ordered relations, finding that these models successfully generalize according to the transitive rule, but also make the kinds of systematic mistakes predicted by our theory. Overall, our theory shows how learning systems can balance between relational generalization and memorization, explains how this can go wrong, and emphasizes the need for new task paradigms designed to probe this ability.

[AI-66] Human-Centered Learning Mechanics: A Dynamical Framework for Entropy-Regulated Representation Learning

Link: https://arxiv.org/abs/2605.22940
Authors: Kim Phuc Tran
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: Submitted to JMLR

Abstract:Deep learning is increasingly viewed as a dynamical process in parameter space, yet many existing theories still treat training as a closed optimization system. This view is limited for real-world AI, where models operate under uncertainty, resource constraints, distribution shift, downstream decision risks, and human feedback. We propose Human-Centered Learning Mechanics (HCLM), a dynamical and information-theoretic framework for open and controlled learning systems. The central idea is that entropy regularization is useful only when the chosen entropy surrogate generates a non-degenerate information force along the optimization trajectory. Otherwise, entropy terms may produce weak, unstable, or misaligned gradients, causing the dynamics to collapse toward ordinary loss minimization. We introduce the notion of effective entropy and study tractable geometric entropy surrogates, including variance-based and log-determinant covariance proxies. The paper makes three contributions. First, it formalizes entropy regularization through effective information force and characterizes degenerate entropy regimes. Second, it derives convergence, entropy-flow, Wasserstein-gradient-flow, and noisy-representation generalization results under explicit assumptions. Third, it offers a conditional dynamical interpretation of scaling-law-like behavior as a balance between information injection, entropy dissipation, and residual risk, without claiming an unconditional derivation of empirical neural scaling laws. Controlled representation-learning experiments support the hypothesis that geometric entropy surrogates, especially log-determinant covariance entropy, induce stronger and more stable information forces than softmax-normalized entropy.

[AI-67] Mediative Fuzzy Logic: From Type-1 Foundations to Type-2 Type-3 and Quantum Extensions

Link: https://arxiv.org/abs/2605.22900
Authors: Oscar Montiel Ross
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Quantum Physics (quant-ph)
Comments: 30 pages, 1 figure

Abstract:Mediative Fuzzy Logic was conceived as a practical scheme for reconciling hesitant or conflicting assessments in fuzzy control and decision-making. However, its logical and semantic foundations remain underdeveloped, especially beyond operational type-1 settings. This article develops a unified account of the type-1 core together with interval type-2, granular type-3, and quantum extensions. We characterize the mediative operator as a convex aggregation controlled by hesitation and contradiction, model mediative truth values as independent truth-falsity pairs in a continuous bilattice-like structure, and introduce a propositional system extending a standard t-norm-based fuzzy logic with a mediative connective. We establish soundness, paraconsistency, and conservativity over the underlying fuzzy base for formulas without mediation, and formulate coherent semantic extensions to interval type-2 truth values, granule-indexed local evaluations, and effects and density operators on Hilbert spaces. An autonomous-braking sensor-fusion example illustrates how the framework supports transparent, conservative, and safety-first decisions under incomplete, heterogeneous, and mildly contradictory evidence. Under suitable assumptions, the higher-level formulations reduce to the type-1 case, clarifying coherence across levels and reliably supporting future work in intelligent decision systems.

[AI-68] Agentic-VLA: Efficient Online Adaptation for Vision-Language-Action Models

Link: https://arxiv.org/abs/2605.22896
Authors: Ruofan Jin,Zaixi Zhang
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Total 15 pages

Abstract:Vision-Language-Action (VLA) models have emerged as a promising paradigm for robotic manipulation by leveraging pre-trained vision-language representations. However, current VLA training methods suffer from two critical limitations: poor generalization to novel environments and low training efficiency requiring extensive demonstrations. We introduce Agentic-VLA, an agentic training framework that enables VLAs to efficiently adapt online through three key innovations: (1) Adaptive Reward Synthesis, which dynamically generates and adjusts reward functions based on the VLA’s current capabilities and task complexity, decomposing complex tasks into learnable sub-goals for curriculum learning; (2) Language-Guided Exploration, where a critic model provides structured guidance for systematic exploration rather than random sampling; and (3) Experience Memory, which stores and retrieves task-relevant policy weights for warm-starting adaptation to similar tasks. We evaluate Agentic-VLA on the LIBERO benchmark, achieving substantial improvements: +12.3% on long-horizon tasks, +28.5% in 1-shot learning, and enabling cross-task transfer from 0% to 31.2% without task-specific demonstrations. Our framework also demonstrates 2.4x faster convergence compared to existing online adaptation methods. Beyond LIBERO, Agentic-VLA retains its advantage on the dual-arm RoboTwin 2.0 benchmark, including under its randomized Hard setting. These results establish Agentic-VLA as a significant step toward truly adaptive VLA systems capable of continuous learning in deployment.

[AI-69] Tensor Cache: Eviction-conditioned Associative Memory for Transformers

Link: https://arxiv.org/abs/2605.22884
Authors: Kabir Swain,Sijie Han,Daniel Karl I. Weidele,Mauro Martino,Antonio Torralba
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Autoregressive Transformer KV caches grow linearly with context length; sliding-window caching bounds memory but discards evicted tokens entirely, so relevant evidence outside the window becomes inaccessible. We introduce Tensor Cache, a two-level cache that pairs sliding-window softmax attention as a first-level cache (L1) with a fixed-size outer-product fast-weight memory as a second-level cache (L2) fed by KV pairs evicted from the window. Recent tokens remain in exact local attention; evicted pairs are compressed into a per-layer matrix $A$ and read by future queries through a single matrix multiplication, exploiting the linear-attention identity $q_t (k_i \otimes v_i) = \langle q_t, k_i \rangle v_i$. A learned scalar gate fuses the L1 and L2 outputs, and per-head decay and write-rate parameters are trained end-to-end. The outer-product memory and the read identity are well-known; our contribution is their use as an L2 cache fed exclusively by sliding-window evictions, plus identifying that the common chunked-mean training shortcut $A \leftarrow \lambda A + \eta (\bar{k} \otimes \bar{v})$ silently introduces $C^2 - C$ spurious cross-token outer products per chunk, and closing the gap with a parallel weighted-sum scan equivalent to per-token writes within float32 epsilon. Across systems scaling, controlled associative recall, long-context language modeling, and memory-capacity diagnostics, Tensor Cache improves the memory–quality frontier over bounded-state baselines.
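
Both algebraic facts in this abstract can be checked on tiny vectors. The sketch below is a plain-Python illustration under stated assumptions, not the paper's code: it treats the query as a row vector, ignores decay and write-rate factors, and uses made-up values. It verifies the read identity for one key-value pair and shows that the outer product of chunk means expands into all C² key-value pairings, of which C² − C mix different tokens.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def outer(k, v):
    """Outer product k (x) v as a len(k)-by-len(v) matrix."""
    return [[ki * vj for vj in v] for ki in k]


def read(q, A):
    """Row vector q times matrix A."""
    return [sum(q[i] * A[i][j] for i in range(len(q))) for j in range(len(A[0]))]


def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def scale(A, s):
    return [[a * s for a in row] for row in A]


# Read identity: q (k (x) v) == <q, k> v   (hypothetical values)
q, k, v = [1.0, 2.0], [3.0, -1.0], [0.5, 4.0, 2.0]
lhs = read(q, outer(k, v))
rhs = [dot(q, k) * x for x in v]

# Chunked mean over C tokens: mean(k) (x) mean(v) is the average over ALL
# (i, j) pairings, including the C*C - C pairings where i != j.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[2.0, 0.0], [0.0, 4.0]]
C = len(keys)
chunk_write = outer(mean(keys), mean(values))

all_pairs = [[0.0, 0.0], [0.0, 0.0]]
cross_terms = 0
for i in range(C):
    for j in range(C):
        all_pairs = add(all_pairs, outer(keys[i], values[j]))
        if i != j:
            cross_terms += 1
all_pairs = scale(all_pairs, 1.0 / (C * C))
```

With these toy values the matched pairings alone would fill only the diagonal of the memory matrix, while `chunk_write` also has nonzero off-diagonal entries, which is where the cross-token terms show up.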

[AI-70] Energy per Successful Goal: Goal-Level Energy Accounting for Agentic AI Systems

Link: https://arxiv.org/abs/2605.22883
Authors: Deepak Panigrahy, Aakash Tyagi
Institution: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)
Comments: 34 pages, 16 figures, 10 tables

Abstract:Current AI energy benchmarks measure consumption at the granularity of a single model invocation or training run. For classical single-turn workloads this unit remains coherent. For agentic systems - where a single user goal may trigger multi-step orchestration, tool calls, retries, and failure-recovery cycles - the invocation count is an implementation artifact rather than a task property, and inference-level normalization misrepresents the energy cost of goal completion. We present A-LEMS (Agentic LLM Energy Measurement System), a cross-layer measurement framework that redefines the unit of AI energy accounting from energy per inference to Energy per Successful Goal (EpG). EpG aggregates total workflow energy across all execution attempts, including failures and retries, normalized by successfully completed goals. A-LEMS formalizes energy attribution through a temporal boundary model, a five-layer observation pipeline mapping RAPL signals to workflow-level energy, and a reproducibility protocol binding every measurement to hardware and runtime configuration. Building on EpG, we define the Orchestration Overhead Index (OOI), isolating the energy cost of orchestration relative to linear execution under identical task criteria. Across five reasoning and three tool-augmented task families, agentic workflows consume 4.33x higher mean energy per successful goal than linear baselines (888.1 J vs 205.3 J). This overhead is driven by orchestration structure, not inference compute. For tool-augmented tasks, OOI inverts below 1.0x: agentic execution is cheaper than linear, confirming the metric captures orchestration structure rather than a fixed upward bias. These findings establish that energy-per-inference is insufficient for agentic AI. EpG and OOI provide the measurement foundation for accurate benchmarking, where orchestration structure is the primary determinant of energy cost. 
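
The two metrics in this abstract reduce to simple ratios, sketched below. This is a minimal illustration under assumptions, not the A-LEMS code: the `Attempt` record, the function names, and the four sample attempts are hypothetical. Only the 888.1 J and 205.3 J figures come from the abstract, and the final ratio is computed from those reported means.

```python
from dataclasses import dataclass

@dataclass
class Attempt:
    goal_id: str          # which user goal this execution attempt served
    energy_joules: float  # energy attributed to the attempt
    succeeded: bool       # whether the attempt completed the goal

def energy_per_successful_goal(attempts):
    """Total energy over ALL attempts (failures and retries included),
    divided by the number of distinct goals that were completed."""
    total = sum(a.energy_joules for a in attempts)
    completed = {a.goal_id for a in attempts if a.succeeded}
    if not completed:
        raise ValueError("EpG is undefined when no goal succeeds")
    return total / len(completed)

def orchestration_overhead_index(epg_agentic, epg_linear):
    """Ratio of agentic EpG to linear EpG under the same task criteria."""
    return epg_agentic / epg_linear

# Hypothetical trace: g1 needs a retry, g2 succeeds at once, g3 never succeeds.
attempts = [
    Attempt("g1", 100.0, False),
    Attempt("g1", 150.0, True),
    Attempt("g2", 200.0, True),
    Attempt("g3", 50.0, False),
]
epg = energy_per_successful_goal(attempts)          # 500 J over 2 goals
ooi = orchestration_overhead_index(888.1, 205.3)    # means reported in the abstract

print(f"EpG = {epg:.1f} J per successful goal")
print(f"OOI = {ooi:.2f}x")
```

Note that the failed goal g3 still contributes its 50 J to the numerator, which is the point of goal-level rather than inference-level normalization.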

[AI-71] RMA: an Agentic System for Research-Level Mathematical Problems

Link: https://arxiv.org/abs/2605.22875
Authors: Zelin Zhao, Bo Yuan, Jaemoo Choi, Yongxin Chen
Institution: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:We present Research Math Agents (RMA), an agentic framework for automated reasoning on research-level mathematical problems. Unlike prior studies centered on competition mathematics or formal theorem proving, RMA targets research-level mathematical problems that require long-horizon reasoning, literature grounding, and iterative proof refinement. RMA decomposes research-level proof solving into specialized modules for problem analysis, literature search and understanding, fair comparison, knowledge-bank construction, and proof verification, all coordinated by initializer, proposer, and verifier agents through a shared structured memory. Within this unified framework, these agents operate in a multi-role, multi-round workflow, collaboratively generating, refining, and verifying candidate proofs through iterative feedback. We evaluate RMA on the First Proof benchmark, which consists of ten research-level problems contributed by expert mathematicians across diverse domains. Through comprehensive expert evaluation, RMA outperforms strong baselines on the First Proof benchmark, including GPT-5.2R and Aletheia, solving eight out of ten research problems and producing more logically sound and readable proofs. Our comprehensive ablation studies further show that performance gains arise from the interaction of structured reasoning modules, iterative refinement, and verifier-based feedback, rather than any single component. Our solutions and implementations will be made publicly available upon acceptance.

[AI-72] NeuroNL2LTL: A Neurosymbolic Framework for Natural Language Translation of Linear Temporal Logic

Link: https://arxiv.org/abs/2605.22874
Authors: Paapa Kwesi Quansah, Ernest Bonnah
Institution: Unknown
Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments:

Abstract:Effectively translating between natural language (NL) and formal logics like Linear Temporal Logic (LTL) requires expertise that limits formal verification’s reach in safety-critical development. Template-based approaches sacrifice expressiveness for reliability; neural methods achieve fluency but provide no correctness guarantees. We present NeuroNL2LTL, a neurosymbolic architecture unifying learned translation with formal verification. NeuroNL2LTL routes translation through an intermediate representation whose mapping to LTL is structure-preserving by construction. Generated specifications undergo satisfiability and non-triviality checking; a minimal-edit repair mechanism corrects near-miss outputs before they reach downstream tools. The central innovation is verifier-in-the-loop training: verification outcomes serve as reward signals for reinforcement learning, producing neural components that optimize directly for formal correctness. On 200,000+ requirements spanning aerospace, robotics, autonomous vehicles, and ten additional domains, NeuroNL2LTL achieves 28% semantic equivalence with reference specifications while ensuring 86% of outputs are verified satisfiable. The system also generates contextually grounded explanations from LTL, enabling domain experts to validate specifications without specialized training. This work demonstrates that formal verification can function as both training objective and runtime filter for neural specification systems, allowing us to build neural-based tools whose reliability derives from logical guarantees rather than statistical confidence.

[AI-73] Approximate Machine Unlearning through Manifold Representation Forgetting Guided by Self Mode Connectivity

Link: https://arxiv.org/abs/2605.22871
Authors: Weiqi Wang, Zhiyi Tian, Chenhan Zhang, Luoyu Chen, Shui Yu
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Abstract:Machine unlearning is a fundamental mechanism that enforces the right to be forgotten. Existing unlearning studies that rely on label manipulation or task-gradient reversal often deliver limited unlearning effectiveness. Moreover, they can undermine the original learning objective and typically do not guarantee equivalence to standard unlearning by retraining. In this paper, we propose ManiF-SMC (Manifold Forgetting with Self Mode Connectivity), motivated by the observation that a model retrained on the remaining data tends to classify erased samples by their semantic similarity to the retained data. We begin with systematically recasting the approximate unlearning as pushing each erased sample away from its original learned manifold representation centroid toward its nearest semantic neighbors in the retained data. This reformulation aligns unlearning with retraining behavior and operates purely in representation space, reducing reliance on labels and task-specific gradients. To tackle the manifold representation-based unlearning problem, ManiF-SMC encapsulates the unlearning and representation preservation goals in a margin-based triplet loss. Because finding a suitable margin for unlearning is challenging, we propose a self-mode-connectivity module that rapidly reconstructs the local manifold to guide the adaptive margins generation for each unlearning case. Extensive experiments on four representative datasets show that ManiF-SMC achieves unlearning effectiveness comparable to state-of-the-art approximate methods while operating solely within the model’s representation space.

[AI-74] BOHM: Zero-Cost Hierarchical Attribution for Compound AI Systems

Link: https://arxiv.org/abs/2605.22866
Authors: Joss Armstrong
Institution: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 35 pages, 10 figures, 20 tables

Abstract:Compound AI systems route tasks through hierarchies of specialised components. Attribution is dominated by Shapley-based methods (SHAP), which decompose a coalition value function into per-component marginal contributions and require evaluation of the system on arbitrary component subsets. That requirement fails for third-party APIs, opaque endpoints, and agentic orchestrators that concentrate routing on a few tools, leaving most coalitions un-evaluable from the deployed orchestrator. We introduce BOHM, which extracts a hierarchical attribution tree directly from the routing weights such systems already maintain: leaf attribution is the path product of root-to-leaf routing weights; level-k attribution is the induced distribution over depth-k nodes. The method has zero marginal cost, requires no access to component internals, and provides multi-resolution attribution at every level simultaneously, which flat methods cannot offer at any evaluation budget. BOHM and SHAP answer different questions and converge when the deployed router routes near-optimally. On 18 LLMs in a 3-level hierarchy over 880 LiveCodeBench problems, BOHM yields Kendall tau=0.928; SHAP reaches tau=0.980 at 9,000x more coalition evaluations per seed. On a 5-driver, 7-benchmark agentic study (35 cells, complete coverage), drivers concentrate routing on a single tool (top-share median 0.65), and cell-level tau(BOHM,SHAP) is predicted by whether the driver’s top pick is the empirically best tool (mean +0.22 vs ~+0.01). On a US Census hierarchy (475 leaves, 4 levels), BOHM recovers ground-truth rankings at every level (tau up to 0.722). BOHM satisfies efficiency, monotonicity, symmetry, and weak suppression but not Shapley’s additivity. It is best understood as a complementary primitive: a multi-resolution decomposition computable wherever routing state exists, whose disagreement with Shapley is itself diagnostic.
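
The path-product attribution described in this abstract is easy to make concrete. The sketch below is an illustration under assumptions, not BOHM itself: the three-level routing tree, its node names, and its weights are invented, and every leaf sits at the same depth for simplicity. It pushes the root's unit mass down the tree, so each depth-k dictionary is the induced distribution over depth-k nodes and the last one holds the leaf attributions.

```python
# Hypothetical routing state: each node maps to its children and the
# normalized routing weight the orchestrator assigns to each child.
ROUTING = {
    "root":   {"coder": 0.7, "writer": 0.3},
    "coder":  {"model_a": 0.6, "model_b": 0.4},
    "writer": {"model_c": 1.0},
}

def attribution_by_depth(routing, root="root"):
    """Return one dict per depth; levels[k][node] is the product of routing
    weights along the path from the root to that depth-k node."""
    levels = [{root: 1.0}]
    while True:
        next_level = {}
        for node, mass in levels[-1].items():
            for child, weight in routing.get(node, {}).items():
                next_level[child] = next_level.get(child, 0.0) + mass * weight
        if not next_level:          # no node at this depth has children
            return levels
        levels.append(next_level)

levels = attribution_by_depth(ROUTING)
for depth, dist in enumerate(levels):
    print(depth, {name: round(share, 3) for name, share in dist.items()})
```

Because only existing routing weights are read, no component is ever re-evaluated on a coalition, which is the sense in which the abstract calls the method zero-cost. Each level sums to one when the routing weights at every node are normalized.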

[AI-75] Expressive Power of Deep Homomorphism Networks over Relational Databases

Link: https://arxiv.org/abs/2605.22852
Authors: Moritz Schönherr, Balder ten Cate, Maurice Funk, Benny Kimelfeld, Carsten Lutz, Arie Soeteman
Institution: Unknown
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Comments:

Abstract:The expressive limitations of message-passing Graph Neural Networks (GNNs) have motivated a wide range of more powerful graph learning architectures. We advocate Deep Homomorphism Networks (DHNs) as a model particularly well-suited for learning over relational databases, due to their close connection to important fragments of SQL such as conjunctive queries. We study the precise expressive power of DHNs by relating them to various natural fragments and extensions of first-order logic (FO). For DHNs with max, sum, and mean aggregations, we establish connections to the unary negation fragment (UNFO) and to the extensions of UNFO with counting quantifiers and with ratio quantifiers. We further relate sum-aggregation DHNs to the unary quantifier alternation fragment of FO and to an extension of FO with expressive counting. Through the classical correspondence between FO and SQL, these results also illuminate the relation between DHNs and SQL. They also enable us to study the decidability of two fundamental static analysis problems for DHNs, the emptiness problem and the subsumption problem. Finally, we confirm through experiments that the established differences in expressive power are reflected in the performance on suitable prediction tasks.

[AI-76] ObjectCache: Layerwise Object-Storage Retrieval for KV Cache Reuse

Link: https://arxiv.org/abs/2605.22850
Authors: Yu Zhu, Aditya Dhakal, Yunming Xiao, Dejan Milojicic, Gustavo Alonso
Institution: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments:

Abstract:Prefix KV caching has become a key mechanism in LLM serving: it reduces time to first token (TTFT) by avoiding redundant computation across requests that share a prefix (i.e., the system prompt). However, the accumulated KV cache is often larger than what GPU memory and local DRAM can hold. To preserve latency, current systems keep the KV cache in remote DRAM pools, increasing serving-cluster size and cost. In this paper, we explore a different approach: storing the KV cache in S3-compatible object storage so that capacity is no longer the constraint, while minimizing the impact on TTFT. We propose ObjectCache, which co-designs the storage protocol and transfer schedule so that the storage server delivers KV cache data in the order the GPU consumes it, overlapping data transfer with compute across concurrent requests. We prototype ObjectCache on a 100 Gbps RoCE cluster with NIXL (an inference library that abstracts storage and memory), Ceph RGW (an Object Gateway for clusters), and DAOS (an open source storage system). For 64K contexts, common in today’s systems, ObjectCache adds only 5.6% latency over local DRAM; for 4K contexts, where less compute is available to mask transfer, ObjectCache adds 56–75 ms over the optimal local layerwise baseline. Under shared bandwidth caps, our scheduler reduces added TTFT by 1.2–1.8x compared with equal bandwidth sharing.

[AI-77] The Misattribution Gap: When Memory Poisoning Looks Like Model Failure in Agentic AI Systems

Link: https://arxiv.org/abs/2605.22842
Authors: Tanzim Ahad, Ismail Hossain, Md Jahangir Alam, Sai Puppala, Syed Bahauddin Alam, Sajedul Talukder
Institution: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: This paper is presently under review at a top-tier security venue

Abstract:Multi-agent AI pipelines typically assume that agent misconduct originates from model misalignment. We identify a structural failure in this assumption, the Misattribution Gap, where memory-layer attacks produce behaviors indistinguishable from model failure, causing defenders to apply the wrong remediation. We formalize Semantic Norm Drift (SND) as a third path to agent misconduct, distinct from emergent misalignment and collusion. In SND, a policy-formatted document enters a shared vector store through normal uploads and later reappears as trusted system context after provenance is lost through a Trust Laundering Chain. Across 64 documented failures, attribution systems consistently blamed the model. Four safety classifiers, including one trained on memory poisoning, produced zero detections across 510 checkpoints. In 59 of 65 valid cases, agents explicitly cited the injected document as normative authority before complying. The attack requires no trigger, model access, or repeated interaction, achieves full effect within five sessions, and persists indefinitely. We introduce Counterfactual Composition Testing, which identifies the causal entry with 87.5% accuracy and zero false positives, while a forensics baseline fails across all 25 scenarios. We further prove the Retrieval-Coverage Dilemma, showing that stronger evasion inherently weakens the attack, limiting adaptive bypass strategies. Finally, we propose Memory-Persistent Information-Flow Control, which blocks 97% of attacks at the cross-session boundary where prior defenses fail. We release the SND Corpus, the first adversarial memory benchmark with temporal persistence and multi-agent composition across financial and Health Care domains.

[AI-78] KPI2KVI: A Multi Agent Workflow for Calculating Key Value Indicators from Service Descriptions

Link: https://arxiv.org/abs/2605.22825
Authors: Masoud Shokrnezhad, Tarik Taleb, Yan Chen, Qize Guo
Institution: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Performance (cs.PF)
Comments:

Abstract:Key Value Indicators (KVIs) provide a decision oriented view of a service by summarizing how operational performance translates into stakeholder value, risk, and outcomes. However, in many domains KVIs are difficult to compute in practice because they require selecting relevant KVI categories, defining measurable Key Performance Indicators (KPIs), collecting KPI values, and applying consistent calculation logic, all of which is typically performed manually and inconsistently from unstructured service documentation. This paper presents KPI2KVI, a tool that transforms a natural language service description into computed KVI estimates by orchestrating a deterministic multi agent workflow powered by Large Language Models (LLMs) that (i) elicits missing service context, (ii) extracts and finalizes relevant KVI categories from a taxonomy, (iii) generates service specific KPIs with units and descriptions, (iv) collects KPI values through an interactive dialogue and also supports intelligent estimation for KPI values that are unavailable, and (v) computes interval valued KVI outputs (minimum, exact, maximum) with traceable explanations for each KVI code. Simulations with representative service descriptions demonstrate that KPI2KVI consistently produces a complete end to end mapping from description to KVI intervals and provides transparent calculation narratives that support post hoc auditing and interactive advisory queries.

[AI-79] An AI-Driven Framework for Energy-Efficient Environmental Monitoring in Smart Cities Using Edge Intelligence

Link: https://arxiv.org/abs/2605.22824
Authors: Yichen Liu, Imam Akintomiwa Akinlade, Xiaochong Jiang, Wenting Yang, Shiqi Yang
Institution: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: 6 pages, 2 figures, 3 tables

Abstract:Environmental monitoring is a crucial component of the smart city infrastructure. It enables informed decision making which enhances sustainability, public health and urban planning. However, the large-scale deployments of the smart sensors have raised concerns on excessive energy consumption and redundant data collection as well as limited sensor lifespan. To resolve these issues, we present an AI-driven framework for energy-efficient environmental monitoring in smart cities utilizing edge intelligence. Our proposed framework leverages TinyML-enabled edge devices and context-aware adaptive decision-making in order to dynamically activate the sensors based on the spatiotemporal conditions, environmental statistics and energy constraints. The sensors will be dynamically activated based on a utility function that takes in factors such as real-time environmental conditions, sensor location, and remaining battery lifespan. Our framework will reduce unnecessary sensing and communication while maintaining high coverage for monitoring. We introduce a hierarchical Edge Intelligence architecture to support deployments in city-wide scales. We conducted evaluation using a city-scale simulation driven by real multi-sensor environmental traces, which demonstrates that the proposed mechanism significantly reduces energy consumption and extends sensor lifespan when compared to static, periodic, and UCB-based adaptive sensing strategies. The results highlight the potential of edge intelligence and adaptive AI techniques for building sustainable and efficient smart city monitoring systems.

[AI-80] RA-DCA: A Randomized Active-Set DCA for Directional Stationarity in Max-Structured DC Programs

Link: https://arxiv.org/abs/2605.23550
Authors: Yi-Shuai Niu
Institution: Unknown
Categories: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
Comments: 40 pages, 7 figures

Abstract:We study nonsmooth difference-of-convex programs whose subtracted convex term is a finite maximum of smooth convex functions. In this setting, standard DCA iterations may converge to critical points that are not directionally stationary, whereas exact active-vertex screening can be expensive when active sets are large or combinatorial. We propose RA-DCA, a vertex-first randomized active-set DCA that projects active gradients onto sampled directions, checks a sampled vertex residual, and uses a small linear program only as a low-residual convex-combination fallback. The method preserves the descent structure of DCA and reduces the randomized screening layer to matrix multiplications. Under the stated regularity, numerical active-set consistency, and random-embedding assumptions, every accumulation point generated by the safeguarded method is directionally stationary with probability one. MATLAB experiments first test the theorem on degenerate max-affine, max-quadratic, and sparse support-function models, where the safeguard avoids nonstationary critical points and closely tracks a full active-vertex scan. Block top-k tests then show that the same screening idea remains useful when exact aggregate enumeration is combinatorial. Trimmed-regression, complementarity, and QUBO diagnostics separate cases where active-set selection helps from cases dominated by multistart search, the DC split, or other problem-specific features.

[AI-81] Generative AI and the Reorganization of Labor Demand

Link: https://arxiv.org/abs/2605.23159
Authors: Fangyan Wang, Zaiyan Wei, Yang Wang
Institution: Unknown
Categories: General Economics (econ.GN); Artificial Intelligence (cs.AI)
Comments:

Abstract:Generative artificial intelligence (AI) is expected to transform work, but less is known about how firms reorganize labor demand as the technology diffuses. Existing research has largely focused on which occupations are exposed to AI or whether exposed jobs decline. We extend this debate by examining whether firms adjust by changing where they hire, what jobs contain, or both. Using a nationwide dataset of job postings in the United States, covering all sectors of the economy, we construct a dynamic, posting-level measure of generative AI exposure with a two-stage large language model pipeline. The pipeline identifies the tasks described in each posting and classifies the extent to which generative AI can perform or assist them. We then decompose changes in aggregate exposure into two margins: reallocation of demand across jobs and redesign of tasks within jobs. We document three main findings. First, generative AI exposure is dynamic rather than fixed, changing substantially over time. Second, labor demand adjusts through both margins. Hiring reallocation explains the largest share of the aggregate decline in exposure, accounting for 52% on average, while within-job redesign becomes increasingly important, accounting for 39.5%. A complementary Oaxaca-Blinder decomposition shows that shifts in occupational composition account for about 90% of the exposure change attributable to observable job characteristics. Third, adjustment differs across the job ladder. Senior jobs adjust earlier and mainly through reallocation, whereas junior jobs adjust through a broader mix of reallocation, redesign, and their interaction. These findings suggest that labor-market adjustment to generative AI is a process of organizational reconfiguration, in which firms reshape both hiring demand and the task architecture of work.
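
The split into reallocation, redesign, and their interaction can be illustrated with a standard shift-share decomposition. This is a sketch under assumptions: the abstract does not say which decomposition formula the authors use, and the two job groups, their posting shares, and their exposure scores below are invented for illustration.

```python
def shift_share(shares_0, exposure_0, shares_1, exposure_1):
    """Decompose the change in aggregate exposure sum(share * exposure)
    between period 0 and period 1 into three additive terms."""
    rows = list(zip(shares_0, exposure_0, shares_1, exposure_1))
    # Reallocation: hiring shares move, exposure held at its initial level.
    reallocation = sum((s1 - s0) * e0 for s0, e0, s1, e1 in rows)
    # Redesign: exposure within each job changes, shares held at initial level.
    redesign = sum(s0 * (e1 - e0) for s0, e0, s1, e1 in rows)
    # Interaction: shares and exposure change together.
    interaction = sum((s1 - s0) * (e1 - e0) for s0, e0, s1, e1 in rows)
    total = sum(s1 * e1 for _, _, s1, e1 in rows) - sum(s0 * e0 for s0, e0, _, _ in rows)
    return {"reallocation": reallocation, "redesign": redesign,
            "interaction": interaction, "total": total}

# Hypothetical two-job economy: hiring shifts toward the low-exposure job
# while the high-exposure job is redesigned to contain less exposed work.
parts = shift_share(
    shares_0=[0.5, 0.5], exposure_0=[0.2, 0.8],
    shares_1=[0.6, 0.4], exposure_1=[0.2, 0.7],
)
for name, value in parts.items():
    print(f"{name:>12}: {value:+.3f}")
```

In this toy case aggregate exposure falls by 0.10, of which 0.06 comes from reallocation and 0.05 from redesign, offset by a 0.01 interaction term; the three terms always sum to the total change.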

[AI-82] Classical State Preparation for Variational Quantum Algorithms via Reinforcement Learning

Link: https://arxiv.org/abs/2605.23138
Authors: Gino Kwun, Dhanvi Bharadwaj, Gokul Subramanian Ravi
Institution: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments: 22 pages, 4 figures

Abstract:Variational Quantum Algorithms (VQAs) potentially offer a pathway to practical quantum advantage, but their optimization is heavily hindered by barren plateaus and numerous local minima. While classically simulable Clifford circuits can warm-start VQAs to accelerate convergence, existing heuristic-based initialization methods struggle to scale within vast combinatorial search spaces. To overcome this bottleneck, we propose CRiSP (a Clifford Reinforcement Learning agent for State Preparation), a framework that formulates discrete prefix selection as a sequential decision-making problem. CRiSP utilizes Neural-Guided Monte Carlo Tree Search, driven by a Transformer-based policy trained via self-play, to insert learned Clifford gates before fixed parameterized rotations. This enables the construction of high-quality initial states entirely through polynomial-time classical stabilizer simulation without altering the underlying circuit architecture. By integrating a curriculum learning strategy that progressively expands the search horizon, the agent efficiently scales to deep circuits. Evaluated on QAOA benchmarks of up to 22 qubits and 1,370 parameters, CRiSP outperforms state-of-the-art Clifford initialization methods by a mean of $3.17\times$ (max $45.02\times$) in average energy accuracy and $2.44\times$ (max $16.01\times$) in best-achieved energy accuracy. Assessments on VQE tasks further demonstrate the framework’s robustness and generalizability.

[AI-83] KAPLAN: Kolmogorov-Arnold Prognostic Learnable Activation Networks for Survival Analysis NEURIPS2026

Link: https://arxiv.org/abs/2605.23082
Authors: Stelios Boulitsakis Logothetis, Angela Wood, Pietro Liò
Institution: Unknown
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages, 3 figures, 13 supplementary pages. Submitted to NeurIPS 2026

Abstract:Survival analysis aims to model how covariates and time jointly shape the time-to-event distribution under right censoring. Classical methods such as the Cox model and generalised additive models (GAMs) require interactions and time-varying effects to be manually specified, which is increasingly impractical on rich clinical datasets. We introduce KAPLAN-HR, a B-spline Kolmogorov-Arnold Network (KAN) for nonparametric estimation of the conditional hazard as a joint function of covariates and time. A single-layer KAPLAN-HR model recovers a GAM, while deeper architectures capture interactions and time-varying effects through composition. We establish a convergence rate for the nonparametric KAN hazard estimator that depends only on the smoothness of the underlying KAN representation and not on the covariate dimension, thereby mitigating the curse of dimensionality for KAN-representable targets. In evaluations over six clinical benchmark datasets, KAPLAN-HR matches or exceeds the predictive performance of established statistical and deep learning survival methods.

[AI-84] MadEvolve: Evolutionary Optimization of Trading Systems with Large Language Models

Link: https://arxiv.org/abs/2605.23007
Authors: Yurii Kvasiuk, Tianyi Li, Owen Colegrove, Moritz Münchmeyer
Institution: Unknown
Categories: Trading and Market Microstructure (q-fin.TR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Portfolio Management (q-fin.PM)
Comments:

Abstract:We explore the application of LLM-driven algorithm optimization to several common tasks in quantitative finance. MadEvolve, a general-purpose algorithm optimization framework inspired by DeepMind’s Alpha-Evolve, was recently developed to optimize algorithms in computational cosmology. Here we demonstrate the utility of MadEvolve to optimize algorithmic trading strategies and alpha generation at the example of Bitcoin trading. On our simulation and backtesting setup, we achieve significant improvements on all tasks we considered, such as evolving feature sets for signal generation, optimizing separate components of the trading strategy, and jointly evolving the feature pipeline together with the execution strategy. Additionally, we compare our method to other agentic search approaches, specifically Claude Code, and carefully evaluate p-hacking probabilities on our simulation setup. Our findings strongly support the utility of AI-driven agentic and evolutionary algorithms for algorithmic trading and quantitative finance.

[AI-85] Staging by the Book: Automatic Sleep Stage Classification Using Scoring Rules

Link: https://arxiv.org/abs/2605.22859
Authors: Emil Hardarson, Konstantin Popov, Sigridur Sigurdardottir, Anna Sigridur Islind, Erna Sif Arnardóttir, María Óskarsdóttir
Institution: Unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Comments:

Abstract:Automated sleep staging is commonly approached as a supervised machine learning problem, with deep learning methods dominating recent research. While machine learning models achieve near-human level agreement with human-scored reference sleep stages, their decisions are typically opaque and not designed to follow clinical scoring rules. We propose a transparent alternative: a deterministic, rule-based sleep staging method that explicitly operationalizes the American Academy of Sleep Medicine’s (AASM) scoring logic as executable code, coupled with epoch-level natural-language justifications derived from an explanation trace. We evaluate the approach on 50 polysomnography recordings with a 10-scorer majority-vote consensus as reference. Across all recordings, the method agreed with the majority-vote reference in 60.5% of epochs ($\kappa=0.42$), with substantially higher agreement on a dataset used during development (77.1%, $\kappa=0.61$). Agreement with the reference was highest for sleep stage N2 (recall 83.5%) and moderate for sleep stage R (recall 68.7%), while Wake and N1 recall were low. Despite lower agreement with the reference than contemporary deep learning models, the method provides deterministic decisions and natural language explanations aligned with AASM scoring rules, making it a complementary tool for auditing, debugging, and governing deep learning-based sleep staging.

[AI-86] PilotWiMAE: Pilot-Native Representation Learning for Wireless Channels

Link: https://arxiv.org/abs/2605.22856
Authors: Berkay Guler, Giovanni Geraci, Hamid Jafarkhani
Institution: Unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Channel foundation models assume access to fully observed channels, an assumption that fails in deployment. We introduce PilotWiMAE, a self-supervised framework whose encoder ingests noisy pilot observations directly and whose attention factorizes along the axis separating temporal from joint space-frequency processing, an inductive bias inspired by the physics of the problem. Pilot input shrinks the observation space by up to two orders of magnitude and also removes the unrealistic assumption of full-CSI availability while incurring lower latency. The factorized design generates robust representations by exploiting the separable channel structure and allows a pretraining mask ratio of 99%. We pair patch-normalized reconstruction, which captures small-scale fading structure, with an auxiliary scale loss that recovers the large-scale fading features, and use an AWGN curriculum to match pilot noise at pretraining and deployment. Pretrained solely on 3.5 GHz and evaluated at 28 GHz across in-distribution and out-of-distribution settings, PilotWiMAE’s cross-frequency beam selection and channel characterization beat supervised baselines despite operating on a smaller observation space. To weaken the coupling between decoder capacity and representation quality, we further propose a decoder-centric pretraining stage following the encoder-decoder joint pretraining, which allows PilotWiMAE to demonstrate competitive channel estimation without sacrificing representation quality. To foster further work in this direction, we release the PilotWiMAE pretrained weights and training pipeline, together with CSIGen, our Sionna-based ray-tracing channel-generation tool, and the channel datasets used in this work.

[AI-87] The Cognitive Kardashev Scale: Quantifying the Material Envelope of Civilisational Computation

Link: https://arxiv.org/abs/2605.22840
Authors: Sachin Sharma
Affiliation: Unknown
Categories: Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Abstract: How much thinking can a civilisation do? Kardashev’s (1964) typology ranks civilisations by total power: planetary (Type I, $\sim 10^{16}$ W), stellar (Type II, $\sim 10^{26}$ W), galactic (Type III). This paper builds an analogous Cognitive Kardashev Scale: how much sustained AI-grade computation each tier could support. Four ingredients enter the calculation: total power $P$ (watts), the share $f$ of it devoted to cognition, the efficiency $\eta$ at which energy becomes compute (operations per joule), and the brain’s own processing rate $C_{\mathrm{brain}}$ as a reference unit. Anchoring on 2024-2026 hardware (El Capitan, NVIDIA Blackwell, Vera Rubin) gives $\eta_{2026} = 10^{12}$ FLOP/J. Contemporary humanity sits at $K \approx 0.73$, three-quarters of the way to Type I. At Type I and $f = 1\%$, available compute is, within an order of magnitude, one personal AI’s worth of cognition per human inhabitant; at Type II it is essentially incomprehensible. Three trajectories for frontier compute through 2035 are reported as conditional projections, not predictions. Whether the long-run binding constraint is energy or efficiency depends on engineering choices not yet made; the political economy of who has access may matter more than either.
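
The arithmetic behind the scale can be sketched in a few lines. The abstract gives the ingredients ($P$, $f$, $\eta$) but not the paper's exact formulas, so the sketch below makes two assumptions: sustained compute is taken as the product $f \cdot P \cdot \eta$, and the index $K$ uses Sagan's well-known interpolation $K = (\log_{10} P - 6)/10$, which matches the Type I and Type II anchors quoted above. The population figure and all function names are illustrative, not from the paper.

```python
import math

ETA_2026 = 1e12        # FLOP per joule, value quoted in the abstract
TYPE_I_POWER = 1e16    # watts, Type I anchor quoted in the abstract
POPULATION = 8e9       # assumed round figure for world population

def sagan_kardashev_index(power_watts: float) -> float:
    """Sagan's interpolation K = (log10(P) - 6) / 10 (assumed, not confirmed by the abstract)."""
    return (math.log10(power_watts) - 6.0) / 10.0

def cognitive_compute(power_watts: float, cognition_share: float, ops_per_joule: float) -> float:
    """Sustained compute in operations per second, assuming compute = f * P * eta."""
    return cognition_share * power_watts * ops_per_joule

# Power implied by K = 0.73 under the assumed Sagan formula.
implied_power = 10 ** (10 * 0.73 + 6)

# Type I civilisation devoting 1% of its power to cognition.
type_i_compute = cognitive_compute(TYPE_I_POWER, 0.01, ETA_2026)

print(f"K(Type I power)          = {sagan_kardashev_index(TYPE_I_POWER):.2f}")
print(f"Power implied by K=0.73  = {implied_power:.3e} W")
print(f"Type I compute at f=1%   = {type_i_compute:.3e} FLOP/s")
print(f"Per-capita compute       = {type_i_compute / POPULATION:.3e} FLOP/s")
```

Dividing the per-capita figure by a chosen value of $C_{\mathrm{brain}}$ would give the "brain equivalents per inhabitant" that the abstract alludes to; that reference value is left out here because the abstract does not state it.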

Machine Learning

[LG-0] Complete-muE: Optimal Hyperparameter Transfer and Scaling for MoE Models

Link: https://arxiv.org/abs/2605.23893
Authors: Hongwu Peng, Ohiremen Dibua, Yuanjun Xiong, Yifan Gong, Jianming Zhang, Yan Kang
Categories: Machine Learning (cs.LG)
Comments: 27 pages

Abstract: We propose Complete-muE, a framework that targets hyperparameter transfer across dense FFN and arbitrary Mixture-of-Experts (MoE) setups in transformer blocks. Existing tools such as $\mu$P (which requires a fixed architecture) or SDE (which requires a fixed per-step token count) cannot directly solve the hyperparameter transfer problem in MoE setups, because Dense-to-MoE transfer or scaling the total number of MoE experts changes both the architecture and the tokens per expert. Complete-muE solves this challenge with a two-bridge system: Bridge I maps between dense FFN and Dense MoE by active-width $\mu$P with a normalized router scale. Bridge II maps between Dense MoE and sparse MoE by activated-expert scaling, where the first-order SDE LR/WD correction cancels while a bounded residual $\sigma_0$ shift remains. The resulting transfer rule, which we term Complete-muE, covers changes in activated experts, total capacity, granularity, and shared/group-balanced hybrids for MoE models as well as network width/depth, batch size, and duration changes for general Transformer models. Extensive language model and diffusion model pretraining experiments confirm that Complete-muE yields relatively stable hyperparameter optima across model architectures and parameter counts, with only minor drift consistent with the non-strict SDE behavior of Bridge II. In practice this drift is small enough that hyperparameters tuned on a single dense reference transfer near-optimally to all MoE configurations: "tune dense once, transfer to all" is the practical recipe at the core of Complete-muE. This enables MoE models to converge faster than dense models when scaling model capacity, without costly hyperparameter search.

[LG-1] Training-Free Looped Transformers

Link: https://arxiv.org/abs/2605.23872
Authors: Lizhang Chen, Jonathan Li, Chen Liang, Ni Lao, Qiang Liu
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)

Abstract:We introduce training-free looped transformers, in which a lightweight inference-time wrapper loops a contiguous mid-stack block of layers of a frozen checkpoint without additional fine-tuning, continued training, or architectural changes. Unlike prior looped transformer methods that train with the looped structure end-to-end, we retrofit recurrence onto pretrained models at test time. We show that naive block reapplication usually degrades performance, highlighting the importance of the loop application strategy. Motivated by viewing a pre-norm transformer block as a forward Euler step on an ODE, we instead treat looping as a refinement of the same approximation, replacing one large update with smaller damped sub-steps. Across seven dense, sparse MoE, and MLA+MoE model families, our method improves Qwen3-4B-Instruct by +2.64 pp on MMLU-Pro, Qwen3-30B-A3B-Instruct by +1.14 pp on CommonsenseQA, and Moonlight-16B-A3B-Instruct by +1.20 pp on OpenBookQA.
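
The forward-Euler view described above can be illustrated with a toy residual function. This is a minimal sketch, not the authors' wrapper: it assumes a pre-norm block acts as `x + f(x)` and that "smaller damped sub-steps" means applying the same `f` several times with step size `1 / n_substeps`. The paper's actual damping schedule and block selection are not given in the abstract, and `residual_fn` is a hypothetical stand-in for a frozen transformer block.

```python
import numpy as np

def euler_block(x, residual_fn):
    """One pre-norm residual block viewed as a unit forward-Euler step: x + f(x)."""
    return x + residual_fn(x)

def looped_block(x, residual_fn, n_substeps):
    """Apply the same block n_substeps times, each damped by 1 / n_substeps.

    With n_substeps == 1 this reduces to euler_block; larger values refine the
    same ODE approximation instead of taking several full-size steps.
    """
    step = 1.0 / n_substeps
    for _ in range(n_substeps):
        x = x + step * residual_fn(x)
    return x

# Toy residual function standing in for a frozen block (illustrative only).
toy_residual = lambda x: -0.5 * x

x0 = np.array([1.0, -2.0])
print(euler_block(x0, toy_residual))      # single full step
print(looped_block(x0, toy_residual, 2))  # two damped half steps
```

Naive reapplication corresponds to calling `euler_block` repeatedly, which takes several full-size steps and moves the hidden state much further than the pretrained model expects; the damped variant keeps the total step length fixed.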

[LG-2] Entrywise Error Bounds for Spectral Ranking with Semi-Random Adversaries KDD

Link: https://arxiv.org/abs/2605.23854
Authors: Dongmin Lee, Anuran Makur, Japneet Singh
Categories: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments: 17 pages, 2 figures, 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2

Abstract:Bradley-Terry-Luce (BTL) model estimation is a well-established strategy to rank a collection of items given a dataset of pairwise comparisons. Although the theoretical performance of BTL estimation methods, such as spectral and maximum likelihood estimation, is well studied in the regime of uniformly sampled graphs, generalizing such results to a wider class of random graphs has proved challenging. In this work, we investigate the entry-wise error of spectral algorithms against a semi-random adversary that can arbitrarily boost the sampling probabilities of certain edges. We find that the performance of the unweighted spectral method is heavily dependent on the spectral properties of the generated graph. Furthermore, we show that asymptotic performance approaching that of uniformly sampled graphs can be recovered by appropriately reweighting the observed edges to counteract the adversary and restore the spectral gap. Finally, we provide numerical simulations that support our theoretical findings.
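
For readers unfamiliar with spectral ranking, the unweighted method can be sketched as a random walk over items whose stationary distribution recovers the BTL scores. The sketch below follows the standard Rank Centrality construction and assumes every pair has been compared; it does not implement the reweighting scheme the paper proposes, whose details are not in the abstract. Function and variable names are illustrative.

```python
import numpy as np

def spectral_ranking(win_prob):
    """Estimate BTL scores from pairwise outcomes via a Markov chain.

    win_prob[i, j] is the empirical fraction of i-vs-j comparisons won by j
    (the diagonal is ignored). The walk moves from i to j with probability
    win_prob[i, j] / n, so probability mass flows toward stronger items.
    """
    n = win_prob.shape[0]
    P = win_prob / n                       # off-diagonal transition probabilities
    np.fill_diagonal(P, 0.0)
    np.fill_diagonal(P, 1.0 - P.sum(axis=1))  # self-loops make each row sum to 1
    vals, vecs = np.linalg.eig(P.T)        # stationary distribution = left eigenvector
    lead = np.argmax(vals.real)            # eigenvalue 1 is the largest
    pi = vecs[:, lead].real
    return pi / pi.sum()

# Noise-free example: true BTL weights 1 : 2 : 4.
w = np.array([1.0, 2.0, 4.0])
win_prob = w[None, :] / (w[:, None] + w[None, :])
print(spectral_ranking(win_prob))
```

With exact BTL win probabilities the chain is reversible with respect to the true weights, so the stationary distribution equals the normalized weights; with sampled comparisons the estimate is only approximate, and its entrywise error is what the paper analyzes.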

[LG-3] Advanced AI Service Provisioning in O-RAN through LLM Engine Integration

Link: https://arxiv.org/abs/2605.23809
Authors: Seyed Bagher Hashemi Natanzi, Pranshav Gajja, Bo Tang, Vijay K. Shah
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG)

Abstract: The Open Radio Access Network (O-RAN) architecture allows AI to be embedded directly into the RAN through modular xApps and rApps, yet creating these applications (collecting data, training models, writing code, and deploying them safely) remains slow and largely manual. Large Language Models (LLMs) offer strong reasoning and code-generation capabilities but are unsuited for the fast, deterministic inference required in real-time RAN control. We present a proof-of-concept Dual-Brain architecture that combines both strengths: an LLM-based orchestrator translates operator intents into data-collection policies and deployment code, while an automated ML engine, NeuralSmith, trains lightweight classifiers on demand via an API. We describe the architecture and provisioning workflow, share practical insights from a containerized O-RAN 5G SA testbed, and discuss open research directions.

[LG-4] LLM-driven design of physics-constrained constitutive models: two agents are better than one

Link: https://arxiv.org/abs/2605.23754
Authors: Marius Tacke, Matthias Busch, Kian Abdolazizi, Jonas Eichinger, Kevin Linka, Roland Aydin, Christian Cyron
Categories: Machine Learning (cs.LG)

Abstract:Developing constitutive models that capture how materials deform under load traditionally requires years of specialized expertise in continuum mechanics, machine learning, and scientific programming. Large language models (LLMs) have recently been shown to lower this barrier by generating constitutive models on demand, but existing single-agent pipelines lack systematic checks that the resulting models respect fundamental physical laws. To close this gap, we introduce the first multi-agent LLM-driven approach for constitutive model generation: a Creator agent proposes a model tailored to the data, while an Inspector agent critically audits each proposal against nine physical constraints and returns it for refinement whenever a violation is detected. We demonstrate this concept with constitutive artificial neural networks (CANNs) and benchmark it on brain tissue, experimental rubber, and synthetic rubber, using two different LLM backbones (Claude Opus 4.7 and Kimi K2.5). Adding the Inspector raises the share of exported models that truly satisfy all physical constraints from 91% to a perfect 100% for Opus and from 37% to 56% for Kimi, while preserving near-baseline accuracy and remarkable generalization to unseen loading paths. In combination, the generated models are physically valid, highly accurate, and extrapolate reliably beyond the training data - properties that together make them directly usable in practice. Separating generation from inspection thus turns LLM-driven constitutive modeling into a genuinely trustworthy process. The paradigm is deliberately technique-agnostic and scales automatically with advances in LLM capability, opening a promising path toward automated, physics-aware model discovery.

[LG-5] SeedER: Seed-and-Expand Retrieval from Knowledge Graphs

Link: https://arxiv.org/abs/2605.23753
Authors: Hamed Shirzad, Frederik Wenkel, Dominique Beaini, Danica J. Sutherland, Emmanuel Noutahi
Categories: Machine Learning (cs.LG)

Abstract:Knowledge graphs (KGs) offer a rich representation for relational knowledge, but their irregular structure makes retrieval challenging: ego-graph expansion grows rapidly, and dense embedding methods struggle with multi-hop compositional queries. Existing agent-based graph exploration approaches, while expressive, are often too expensive for large-scale retrieval. We introduce SeedER (Seed-and-Expand Retrieval), a retrieval framework that explicitly leverages KG structure through iterative, low-cost expansion. SeedER first seeds a compact set of core nodes using lightweight dense and entity-based retrieval, then selectively expands this set via a learned graph-aware policy trained with reinforcement learning. This design decomposes global reasoning into reusable local decisions, enabling efficient discovery of query-relevant nodes while tightly controlling expansion cost. We show theoretical limitations of dense retrieval on compositional graph queries, and establish advantages of SeedER from both compositional generalization and graph-constrained submodular optimization perspectives. Empirically, SeedER substantially improves recall with compact candidate sets over strong dense and graph-augmented baselines, making it an effective first-stage retriever for knowledge-intensive reasoning systems.

[LG-6] Approaching I/O-optimality for Approximate Attention

Link: https://arxiv.org/abs/2605.23751
Authors: Pál András Papp, Aleksandros Sobczyk, Anastasios Zouzias
Categories: Machine Learning (cs.LG)

Abstract: We revisit the I/O complexity of attention in large language models. Given query-key-value matrices $Q, K, V \in \mathbb{R}^{n \times d}$ and a machine with fast memory size $M$, the goal is to compute the “attention matrix” $A = \mathrm{softmax}(QK^\top/\sqrt{d})\,V$ with the minimal number of data transfers between fast and slow memory. Existing methods in the literature, most notably FlashAttention and its variants, incur an I/O cost that depends quadratically on $n$, while a trivial lower bound only requires $\Omega(nd)$ I/Os to read the inputs and write the output. In this work, we present a technique for computing attention where the I/O cost only depends almost-linearly on $n$ in most parameter regimes. This is achieved by developing I/O-efficient algorithms inspired by the recent approximate attention framework of Alman and Song. We also prove corresponding lower bounds in each parameter regime to show that our algorithms are indeed close to I/O-optimal.
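
To make the baseline concrete, the sketch below computes the same attention output two ways in NumPy: directly, and block by block over the keys with a running (online) softmax, which is the idea behind FlashAttention-style tiling. It illustrates the exact tiled baseline that the abstract says has quadratic I/O cost, not the paper's approximate algorithm. The block size is an illustrative stand-in for what fits in fast memory.

```python
import numpy as np

def attention_naive(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V, materializing the full n x n score matrix."""
    d = Q.shape[1]
    S = Q @ K.T / np.sqrt(d)
    S = S - S.max(axis=1, keepdims=True)   # subtract row max for numerical stability
    P = np.exp(S)
    return (P / P.sum(axis=1, keepdims=True)) @ V

def attention_tiled(Q, K, V, block):
    """Same result, visiting K and V one block of rows at a time."""
    n, d = Q.shape
    out = np.zeros((n, V.shape[1]))
    row_max = np.full(n, -np.inf)          # running max of scores per query
    row_sum = np.zeros(n)                  # running softmax denominator per query
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = Q @ Kb.T / np.sqrt(d)
        new_max = np.maximum(row_max, S.max(axis=1))
        rescale = np.exp(row_max - new_max)  # correct earlier partial results
        P = np.exp(S - new_max[:, None])
        row_sum = row_sum * rescale + P.sum(axis=1)
        out = out * rescale[:, None] + P @ Vb
        row_max = new_max
    return out / row_sum[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(16, 8)) for _ in range(3))
print(np.allclose(attention_tiled(Q, K, V, block=5), attention_naive(Q, K, V)))
```

The tiled version never stores the full score matrix, but every query still meets every key block, which is where the quadratic dependence on $n$ comes from.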

[LG-7] Contrast to Detect: Dynamic Graph Contrastive Regularization for Unsupervised Anomaly Detection in Multivariate Time Series

Link: https://arxiv.org/abs/2605.23744
Authors: Yunhua Pei, Zixing Song, Jin Zheng, John Cartlidge
Categories: Machine Learning (cs.LG)
Comments: 12 pages, 5 figures. Preprint. Code and demo data available online

Abstract:Anomaly detection in multivariate time series (MTS) is hindered by dynamic inter-variable dependencies and feature entanglement under spectral noise, and in practice, is further complicated by the absence of anomaly labels. Existing reconstruction-based detectors tend to recover anomalies as faithfully as normal patterns, while prevailing graph contrastive methods enforce invariance across views and thus assume a stationary relational structure, an assumption that breaks under structural drift in real systems. We propose ContrastAD, an unsupervised framework that turns structural evolution itself into a learning signal rather than suppressing it. A Multi-Perspective Embedder encodes inputs from temporal, attribute, and structural perspectives. A Frequency-Aware Attention Mixer then performs spectral top-K filtering before attention, preventing noise from leaking into query-key similarities. The core component, a Dynamic Graph Contrastive Learner, builds power-law-inspired sparse graph snapshots from batch-level DTW distances and contrasts the most divergent pair against a stable anchor, regularizing the latent space without imposing rigid invariance. Across five real-world benchmarks, ContrastAD attains the highest mean F1 on all five datasets and the highest AUC on three (SWaT 93.60, SMD 98.66, PSM 97.79), with statistically significant F1 and AUC margins over the strongest baseline on SWaT and PSM. On MSL and SMAP, it trails the AUC leader by under 0.7 points while still leading on F1. Ablation and sensitivity studies further confirm that the contrastive objective works best as a soft regularizer, supporting our claim that strict invariance is suboptimal under non-stationary dynamics.
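
The "spectral top-K filtering before attention" step can be pictured as keeping only the strongest frequency components of each series. The sketch below is one plausible reading using NumPy's real FFT; the abstract does not specify how ContrastAD selects or applies the filter, so the function name and selection rule are assumptions.

```python
import numpy as np

def spectral_top_k(x, k):
    """Keep the k largest-magnitude rFFT coefficients of each row and zero the rest.

    x has shape (n_series, n_timesteps); the output has the same shape.
    """
    spec = np.fft.rfft(x, axis=-1)
    mags = np.abs(spec)
    keep = np.argsort(mags, axis=-1)[..., -k:]     # indices of the k strongest bins
    mask = np.zeros(mags.shape, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=-1)
    filtered = np.where(mask, spec, 0.0)
    return np.fft.irfft(filtered, n=x.shape[-1], axis=-1)

# A clean sinusoid plus a weak higher-frequency component.
t = np.arange(64)
clean = np.sin(2 * np.pi * 3 * t / 64)[None, :]
noisy = clean + 0.01 * np.cos(2 * np.pi * 10 * t / 64)
denoised = spectral_top_k(noisy, k=1)
print(np.abs(denoised - clean).max())
```

Filtering before computing query-key similarities means that weak spectral components never enter the attention scores, which is the leakage the abstract says the mixer is designed to prevent.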

[LG-8] Optimal Dimension-Free Sampling for Regularized Classification

Link: https://arxiv.org/abs/2605.23726
Authors: Meysam Alishahi, Alexander Munteanu, Simon Omlor, Jeff M. Phillips
Categories: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)

Abstract: We prove optimal sampling bounds achieving $(1\pm\varepsilon)$-relative error for a broad class of Lipschitz continuous classification loss functions under various regularization terms. This includes important functions such as logistic and sigmoid loss, hinge loss, and ReLU loss, as prominent and popular representative examples. In particular, we prove $k^2/\varepsilon^2$ upper and lower bounds for $\|\cdot\|_2/k$ regularization, and $k/\varepsilon^2$ upper and lower bounds for $\|\cdot\|_1/k$ regularization. For $\|\cdot\|_2^2/k$ regularization, the sampling complexity depends mainly on a bounded derivative property: if $|g'(x)| \leq g(x)$, $g(0) > 0$, and $g$ is monotonic or convex, then it admits linear in $k$ sampling complexity; otherwise the general bound is $k^2/\varepsilon^2$. However, if $g(0) = 0$, our results indicate that no dimension-free bounds are possible, and even sublinear bounds are ruled out. All upper bounds are complemented by matching lower bounds up to polylogarithmic terms. Moreover, our work relies conceptually and algorithmically on simple uniform or (squared) norm sampling and thereby improves over recent cubic $k^3/\varepsilon^2$ sensitivity sampling bounds of (Alishahi and Phillips, ICML’24). This is achieved by refined arguments involving higher moment bounds and empirical process analyses to avoid overcounting that appears in the de-facto standard VC-dimension and sensitivity framework.

[LG-9] Operator Learning for Reconstructing Flow Fields from Sparse Measurements: a Language Model Approach

Link: https://arxiv.org/abs/2605.23712
Authors: Qian Zhang, George Em Karniadakis
Categories: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)

Abstract:Reconstructing flow fields from sparse measurements is a fundamental problem in fluid mechanics with broad implications for modeling, control, and design. In this work, we propose a novel operator learning framework that leverages the architecture of language models to perform flow reconstruction in a mesh-free manner. We reformulate flow field reconstruction as a sequence-to-sequence learning task, where sparse measurements are treated as context and unobserved locations as queries. Our model learns to reconstruct the full flow field from sparse inputs, effectively capturing spatial correlations and long-range dependencies. We evaluate the proposed approach on four benchmark datasets: (1) two-dimensional vortex street simulations, (2) daily average temperature data across the contiguous United States, (3) three-dimensional blood flow simulations based on dissipative particle dynamics, and (4) three-dimensional turbulent jet flow measurements obtained via particle tracking velocimetry. Across all cases, our method demonstrates competitive reconstruction accuracy, even with highly incomplete data (less than 10% observed), and achieves efficient performance. The results highlight the potential of language models as robust and scalable tools for scientific data reconstruction, and suggest a promising direction toward the development of foundation models for scientific and engineering applications.

[LG-10] Learning Dynamic Stability Landscapes in Synchronization Networks

Link: https://arxiv.org/abs/2605.23708
Authors: Christian Nauck, Junyou Zhu, Michael Lindner, Frank Hellmann
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY); Adaptation and Self-Organizing Systems (nlin.AO)
Comments: 22 pages, 12 figures

Abstract:The robustness of synchronization is typically characterized by scalar, per-node stability indices whose dependence on topology is studied via network science or graph neural networks (GNNs). We propose a novel upstream task, learning stability landscapes, which provide deeper insights into synchronization behavior and from which many such scalar indices can be derived. Crucially, we pioneer a graph-to-image prediction paradigm: learning image-like landscapes as per-node targets directly from graph topology, a formulation we are not aware of having been established elsewhere in the literature. To support this task, we release two datasets of 10,000 graphs each at 20 and 100 nodes with per-node landscape labels, based on a conceptual oscillator model, capturing power grid synchronization behavior. A GNN encodes topology and a CNN decoder renders per-node images, learned end-to-end with good in-distribution accuracy, generalizing across graph sizes and to realistic power grid topologies. This demonstrates that stability landscapes, while beyond the reach of conventional network science, are learnable from topology and open new avenues for moving beyond scalar stability indices in biology, neuroscience, and power grids.

[LG-11] Graph-based Complexity Forecasts in UK En Route Airspace Using Relevant Aircraft Interactions

Link: https://arxiv.org/abs/2605.23696
Authors: Edward Henderson, George De Ath, Nick Pepper
Categories: Machine Learning (cs.LG)
Comments: Accepted paper at the US-Europe Air Transportation Research Development Symposium (ATRD) 2026

Abstract: Effectively managing Air Traffic Control Officer (ATCO) workload is crucial in maintaining operational safety. Group supervisors use tools that estimate upcoming traffic load to aid decision-making. However, industry-standard models can fail to capture the nuances of upcoming air traffic complexity. This study presents a probabilistic approach to forecast the complexity of an airspace sector using the number of relevant aircraft pairs, i.e., those that require monitoring or deconfliction by a controller, as a proxy measure for ATCO workload. We adapted an existing filter algorithm to make it suitable for use in London Middle Sector (LMS), a complex airspace sector with multiple flows of traffic above some of the busiest airports in Europe. Through iterative feedback with ATCOs, the algorithm was refined and extended to handle specific geometric and operational considerations. The updated algorithm outperformed the original, with an F1-score of 0.84 compared to 0.69 on a labelled set of 50 traffic scenarios. To produce forecasts of future numbers of relevant aircraft pairs in the sector, a graph representation of the LMS route network was constructed, standardising the spatial fidelity of route legs. The forecasting method accounts for uncertainty in aircraft arrival times by modelling the probability of each aircraft occupying route segments at future query times. When combined with historic distributions of relevant interactions and a live operational data stream, predictions of upcoming ATCO workload could be made up to 45 minutes in advance. The proposed method to forecast upcoming workload showed a significantly stronger correlation with actual relevant interactions (Spearman’s $\rho = 0.68$) than a standard traffic volume prediction ($\rho = 0.55$). The resulting data-driven tool shows promise for use by group supervisors to inform sector configuration and ATCO rostering decisions.

[LG-12] Optimization of randomized neural networks for transfer operator approximation

Link: https://arxiv.org/abs/2605.23689
Authors: Mohammad Tabish, Stefan Klus
Categories: Machine Learning (cs.LG); Dynamical Systems (math.DS)

Abstract: RaNNDy is a randomized neural network architecture for the data-driven approximation of transfer operators associated with complex dynamical systems. The weights and biases of the hidden layers of the network are randomly initialized and kept fixed; only the output layer is trained. This has several advantages over fully optimized neural networks, notably a closed-form solution for the output layer and significantly lower training costs. Despite these advantages, RaNNDy is limited by the initial selection of weights and biases that parametrize the basis functions required for the operator approximation. Since the basis functions are determined by the activation function, choosing an appropriate activation function for the hidden layers is crucial. In this work, we propose an algorithm that optimizes the activation function itself, while keeping the weights and biases in the randomized neural network fixed, providing a more suitable dictionary. We illustrate the efficacy of the approach using various benchmark problems, including stochastic differential equations and random walks on graphons.
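
The "closed-form solution for the output layer" is the property that makes randomized networks cheap to train. The sketch below shows the generic idea on a plain regression task: hidden weights are drawn once and frozen, and the output weights come from a single ridge-regularized least-squares solve. It is not the RaNNDy operator approximation itself; the `tanh` activation, ridge parameter, and function names are assumptions chosen for illustration.

```python
import numpy as np

def fit_random_network(X, Y, n_hidden=50, ridge=0.1, seed=0):
    """Train only the output layer of a one-hidden-layer network in closed form."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))   # random hidden weights, never updated
    b = rng.normal(size=n_hidden)                 # random hidden biases, never updated
    H = np.tanh(X @ W + b)                        # fixed random basis functions
    # Ridge regression normal equations: (H^T H + ridge * I) W_out = H^T Y
    A = H.T @ H + ridge * np.eye(n_hidden)
    W_out = np.linalg.solve(A, H.T @ Y)
    return W, b, W_out

def predict(X, W, b, W_out):
    """Evaluate the network with frozen hidden layer and fitted output layer."""
    return np.tanh(X @ W + b) @ W_out

X = np.linspace(-1.0, 1.0, 200)[:, None]
Y = np.sin(3.0 * X)
W, b, W_out = fit_random_network(X, Y)
print(float(np.mean((predict(X, W, b, W_out) - Y) ** 2)))
```

Because the basis functions are `activation(X @ W + b)` with `W` and `b` frozen, the activation is the only remaining handle on the dictionary, which is why the paper targets it for optimization.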

[LG-13] Relevant Walk Search for Explaining Graph Neural Networks ICML2023

Link: https://arxiv.org/abs/2605.23673
Authors: Ping Xiong, Thomas Schnake, Michael Gastegger, Grégoire Montavon, Klaus-Robert Müller, Shinichi Nakajima
Categories: Machine Learning (cs.LG)
Comments: Published in ICML 2023

Abstract: Graph Neural Networks (GNNs) have become important machine learning tools for graph analysis, and their explainability is crucial for safety, fairness, and robustness. Layer-wise relevance propagation for GNNs (GNN-LRP) evaluates the relevance of walks to reveal important information flows in the network, and provides higher-order explanations, which have been shown to be superior to the lower-order, i.e., node-/edge-level, explanations. However, identifying relevant walks by GNN-LRP requires exponential computational complexity with respect to the network depth, which we will remedy in this paper. Specifically, we propose polynomial-time algorithms for finding top-$K$ relevant walks, which drastically reduces the computation and thus increases the applicability of GNN-LRP to large-scale problems. Our proposed algorithms are based on the max-product algorithm – a common tool for finding the maximum likelihood configurations in probabilistic graphical models – and can find the most relevant walks exactly at the neuron level and approximately at the node level. Our experiments demonstrate the performance of our algorithms at scale and their utility across application domains, i.e., on epidemiology, molecular, and natural language benchmarks. We provide our code at this https URL.
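
The reason max-product avoids the exponential blow-up is the same as in Viterbi decoding: the best walk into each node at layer $l+1$ only depends on the best walks into nodes at layer $l$. The toy sketch below assumes a walk's score is a product of nonnegative per-layer step scores, which is a simplification; real GNN-LRP relevances are signed and defined at the neuron level, and the paper's algorithms handle that setting. All names are hypothetical.

```python
import itertools
import numpy as np

def best_walk(layer_scores):
    """Find the walk maximizing the product of step scores by dynamic programming.

    layer_scores is a list of (n x n) nonnegative arrays; layer_scores[l][i, j]
    scores the step from node i at layer l to node j at layer l + 1.
    Returns (best score, list of node indices along the walk).
    """
    n = layer_scores[0].shape[0]
    best = np.ones(n)            # best score of any walk ending at each node so far
    back = []                    # back-pointers, one array per layer
    for S in layer_scores:
        cand = best[:, None] * S             # cand[i, j]: extend best walk at i to j
        back.append(cand.argmax(axis=0))
        best = cand.max(axis=0)
    node = int(best.argmax())
    score = float(best[node])
    walk = [node]
    for ptr in reversed(back):               # follow back-pointers to the start
        node = int(ptr[node])
        walk.append(node)
    return score, walk[::-1]

rng = np.random.default_rng(0)
scores = [rng.random((4, 4)) for _ in range(3)]
print(best_walk(scores))
```

The dynamic program costs $O(L n^2)$ for $L$ layers and $n$ nodes, whereas enumerating all walks costs $O(n^{L+1})$.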

[LG-14] Less Effort, Shorter Proofs: Reinforcement Learning for Security Protocol Analysis in Tamarin

Link: https://arxiv.org/abs/2605.23643
Authors: Matthias Cosler, Cas Cremers, Bernd Finkbeiner, Mohamed Ghanem, Niklas Medinger
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:Tools like Tamarin and ProVerif have achieved notable success in analyzing and verifying complex real-world protocols such as EMV, 5G, and WPA2, even detecting zero-day exploits. Despite these successes, verifying such protocols remains a time-consuming, challenging task, often requiring significant human effort and expertise. In this paper, we present a reinforcement learning (RL) framework inspired by AlphaZero and AlphaProof that implements a new style of proof search for Tamarin. We have developed a stateless API for Tamarin that acts as a classical RL environment. We guide a Monte Carlo Tree Search (MCTS) by a neural heuristic that learns from completed subproofs. We evaluate our framework on 16 case studies, ranging from classical protocol models to challenging state-of-the-art protocol models from recent publications. Our method finds more proofs automatically than Tamarin’s standard search and produces shorter proofs than both the standard and human-engineered heuristics. Our pipeline is applicable out of the box to assist Tamarin users in active research, reducing the human effort required. Moreover, our standardized interface provides a programmatic way for users to interact with Tamarin. Finally, our work demonstrates the promising potential of adapting RL-based methods to the Tamarin domain.

[LG-15] Valid and Expressive Copulas for Irregular Multivariate Time Series

Link: https://arxiv.org/abs/2605.23632
Authors: Christian Klötergens, Tom Hanika, Lars Schmidt-Thieme, Vijaya Krishna Yalavarthi
Categories: Machine Learning (cs.LG)

Abstract:We introduce CopFITi, a copula model for probabilistic forecasting of irregular multivariate time series (IMTS). Our model combines the expressivity of normalizing flows for univariate marginals with the consistency and flexibility of a Gaussian Mixture Copula for the joint dependency structure. Our experiments show that copula-based approaches, which decouple the marginals from the joint, yield better marginal models than architectures that directly fit the full joint. With CopFITi, we propose the first IMTS copula that is marginalization-consistent by construction and establish a new state of the art in joint IMTS density modeling.

[LG-16] How Hard is it to Rig a Benchmark? A Social Choice Analysis of Leaderboard Robustness

Link: https://arxiv.org/abs/2605.23628
Authors: Polina Gordienko, Georg Schollmeyer, Frauke Kreuter, Christoph Jansen
Categories: Machine Learning (cs.LG)

Abstract:Multi-task benchmarks have become a central pillar of machine learning research, yet their growing influence has incentivised benchmark gaming – strategic actions taken to improve the leaderboard rank of a specific model. Treating datasets as voters and models as candidates, we consider benchmark-specific training – the inclusion of benchmark data in training – as a form of election manipulation. For any ordinal benchmark, the problem of choosing datasets to train on so that a target model becomes top-ranked corresponds to shift bribery, a class of manipulation problems from computational social choice. Leveraging this identification, we show that the benchmark-specific training problem is NP-hard under Borda count and mean win rate. Complementing this worst-case perspective, we introduce the instance-level robustness, the minimum number of datasets a model developer must include in training to top a given leaderboard, and derive expressions for it under arithmetic mean, median, mean win rate and pairwise majority. We evaluate these expressions on MMLU under HELM and on BIG-Bench Hard (BBH) under the Open LLM Leaderboard. Across both suites, mean win rate is hardest to manipulate: this gap is clear on BBH (24 tasks, 4507 models), where its median robustness is 22 tasks (92%), compared with 13 (54%) under arithmetic mean and 12 (50%) under median and pairwise majority.
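
Two of the aggregation rules named above can be compared on a small score table. The sketch below assumes the usual definition of mean win rate (the fraction of other models a model strictly beats on a dataset, averaged over datasets); the paper's robustness expressions themselves are not given in the abstract and are not reproduced here. The scores are made up for illustration.

```python
import numpy as np

def arithmetic_mean_scores(scores):
    """Average raw score per model; scores has shape (n_models, n_datasets)."""
    return scores.mean(axis=1)

def mean_win_rate(scores):
    """Fraction of other models strictly beaten on each dataset, averaged over datasets."""
    n_models = scores.shape[0]
    # wins[m, d]: number of models that model m strictly beats on dataset d
    wins = (scores[:, None, :] > scores[None, :, :]).sum(axis=1)
    return wins.mean(axis=1) / (n_models - 1)

# Model A has one outlier score; model B is consistently second-best or better.
scores = np.array([
    [0.99, 0.40, 0.40],   # model A
    [0.50, 0.55, 0.55],   # model B
    [0.45, 0.50, 0.50],   # model C
])
print("arithmetic mean:", arithmetic_mean_scores(scores))
print("mean win rate:  ", mean_win_rate(scores))
```

In this example a single very high score is enough to top the arithmetic-mean ranking, while the mean win rate rewards the model that wins on most datasets. This is consistent with the abstract's finding that more datasets must be manipulated to top a win-rate leaderboard.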

[LG-17] How Many Training Samples Are Needed for the Inverse Kinematics Solutions by Artificial Neural Networks

Link: https://arxiv.org/abs/2605.23583
Authors: Dong-Won Lim
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 14 pages, 5 figures

Abstract: Inverse Kinematics (IK) plays a critical role in robotic motion planning and control. The IK problem for a robot manipulator can be solved by conventional geometric, algebraic, or Jacobian methods, each of which has drawbacks. Artificial Neural Networks (ANNs) have become a promising alternative for approximating IK solutions due to their generalization ability and computational efficiency. This approach trains on only a small number of recorded end-effector samples to solve the IK problem. However, a fundamental question remains: how many training samples are sufficient to achieve reliable and accurate IK predictions? This study investigates a mathematical framework relating the size of the training dataset to the accuracy of ANN-based IK solvers. Using an articulated robotic manipulator, we generate varying numbers of joint-position pairs to train feedforward neural networks and assess their accuracy, convergence, and generalization capability. The results reveal that using more than 125 training samples did not improve model efficiency, a comparable measure relating approximation accuracy to sample size, offering valuable insight into data efficiency. This work provides practical guidance for optimizing the data sizing of ANN solutions, balancing computational cost and model accuracy for real-world robotic applications.

[LG-18] Push Your Agent: Measuring and Enforcing Quantitative Goal Persistence in Long-Horizon LLM Agents

Link: https://arxiv.org/abs/2605.23574
Authors: Yuandao Cai, Yuzhang Zhu, Liyou Gao, Wensheng Tang, Shengchao Qin
Categories: Machine Learning (cs.LG); Software Engineering (cs.SE)

Abstract:Long-horizon language agents can make many plausible local tool calls yet fail to persist until a requested count is actually complete. We study this gap as Quantitative Goal Persistence (QGP): whether an agent keeps working until an external verifier confirms enough distinct valid items. PushBench turns this into a benchmark for repository-artifact collection and verifier-backed work units, so repeated work, duplicate submissions, false completion, and progress drift are measured directly rather than hidden behind a final success flag. In matched controller comparisons, a state-tracking retrieval controller reaches 69-78% success while eliminating duplicate submissions, and a backlog-tracking work-unit controller reaches 25-50% success in settings where standard and completion-gated controllers complete no task instances. Black-box frontier-agent evaluations with Claude Code (Sonnet 4.6) and Codex CLI (gpt-5.4) solve many 50-artifact tasks but drop to 3 out of 9 successes per condition at 100 artifacts. The results show that quantitative goals stress a different reliability requirement from local task competence: agents must maintain verified progress and stop only when the requested work is complete.
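
The idea of a state-tracking controller can be sketched as a loop that keeps explicit state about what has been verified and refuses to stop early. This is a minimal illustration of the concept, not PushBench's controller: the `verify` callback stands in for the external verifier, and all names are hypothetical.

```python
from typing import Callable, Iterable, List, Set

def collect_until_verified(
    candidates: Iterable[str],
    verify: Callable[[str], bool],
    target: int,
) -> List[str]:
    """Accept distinct, verifier-approved items until `target` of them are collected."""
    accepted: List[str] = []
    seen: Set[str] = set()
    for item in candidates:
        if item in seen:          # never submit the same item twice
            continue
        seen.add(item)
        if verify(item):          # only externally verified items count as progress
            accepted.append(item)
        if len(accepted) >= target:
            break                 # stop only when the requested count is reached
    return accepted

def goal_met(accepted: List[str], target: int) -> bool:
    """Completion check based on distinct verified items, not on the agent's own claim."""
    return len(set(accepted)) >= target

stream = ["a", "a", "bad", "b", "c", "d"]
result = collect_until_verified(stream, lambda s: s != "bad", target=3)
print(result, goal_met(result, 3))
```

Tracking `seen` separately from `accepted` is what prevents duplicate submissions, and tying `goal_met` to the verified count rather than to a final success flag addresses the false-completion failure the abstract describes.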

[LG-19] MARS: Magnitude-Aware Rank Statistics

Link: https://arxiv.org/abs/2605.23563
Authors: Muhammad Rajabinasab, Afsaneh M. Nejad, Arthur Zimek
Categories: Machine Learning (cs.LG)
Comments: Preprint submitted to Elsevier Pattern Recognition Letters

Abstract: Comprehensive evaluation of machine learning models is key to ensuring that they perform as robustly and consistently as desired. Critical Difference (CD) diagrams are commonly used to summarize experimental results and pick a winner. Standard CD diagrams rely on discrete ranks, discarding the magnitude of performance gaps between models, an issue we call magnitude-blindness. To address this issue, we propose Magnitude-Aware Rank Statistics (MARS), which incorporates a relative margin coefficient as a weight for the discrete ranks. This coefficient scales ranks based on the distance between the best and worst performers, with a dynamic projection to handle boundary cases. Combined with the calculation of a CD value, MARS yields a more realistic statistical representation of differences in model performance and more insight into how methods actually perform across extensive experimental settings.

[LG-20] When One Point Is Not Enough: Addressing Ambiguous Instances in Dimensionality Reduction by Splitting

Link: https://arxiv.org/abs/2605.23540
Authors: Diede P.M. van der Hoorn,Alessio Arleo,Fernando V. Paulovich
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Dimensionality Reduction (DR) methods are widely used to visualize high-dimensional data. One key task in DR-based analysis is discovering neighborhoods, which relies on analyzing the fine-grained local structure of a projection. However, DR is an inherently lossy process; no technique can perfectly preserve the high-dimensional relationships, and projections therefore contain visual artifacts. In this paper, we highlight a typically overlooked source of visual artifacts: ambiguous instances. These are instances that are highly similar to multiple mutually dissimilar neighborhoods in the high-dimensional space. Standard DR methods cannot faithfully project such instances, since each data instance is mapped to a single point in the visual space. As a result, such an instance is placed in only one of its neighborhoods (or in none at all), so only part of its neighborhood structure is represented. We call this distortion partial neighborhood embedding. We introduce a graph-based approach that identifies ambiguous instances and replicates them as multiple points in the projection, placing each copy within its respective neighborhood. We use UMAP for our results, but our approach also generalizes to other local graph-based DR techniques. Across multiple examples, we show that it reveals previously hidden neighborhood memberships in projections and reduces partial neighborhood embedding, and quantitative analyses further support these findings.

[LG-21] Learning partially observed systems with neural Hamiltonian ordinary differential equations

Link: https://arxiv.org/abs/2605.23510
Authors: Sunniva Meltzer,Sølve Eidnes,Alexander Johannes Stasik
Categories: Machine Learning (cs.LG)
Comments:

Abstract:When learning dynamical systems from data, embedding physical structure can constrain the solution space and improve generalization, but many physics-informed models assume access to the full system state. This limits their use in partially observed settings, where some state variables are completely unobserved and must be inferred without direct supervision. Here, we present neural Hamiltonian ordinary differential equations (NHODE), a framework that combines Hamiltonian neural networks (HNNs) with neural ordinary differential equations (neural ODEs) to learn partially observed dynamical systems from data. The Hamiltonian structure enforces energy conservation by construction, while the neural ODE framework enables a flexible training procedure that allows the loss to be defined only on observed variables. We also incorporate additional physical constraints through symmetry-aware coordinate transformations and separable energy formulations. The framework is evaluated on systems of increasing complexity, from linear and nonlinear mass-spring systems to the chaotic three-body problem. Across all examples, increasing the amount of embedded physical structure improves the accuracy and long-horizon stability of the predictions. Even in the most challenging regimes, the NHODE framework captures both observed and latent dynamics, whereas purely data-driven baselines become unstable.
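
As a rough illustration of the structure described in this abstract, and not the authors' implementation, the sketch below integrates Hamilton's equations dq/dt = ∂H/∂p, dp/dt = -∂H/∂q for a linear mass-spring system and evaluates a loss on the position only, treating the momentum as unobserved. The analytic Hamiltonian stands in for a learned network; the function names, the leapfrog integrator, and all parameter values are assumptions made for this example.

```python
import math

def hamiltonian(q, p, m=1.0, k=1.0):
    """Total energy of a mass-spring system: kinetic plus potential."""
    return p * p / (2.0 * m) + 0.5 * k * q * q

def leapfrog(q, p, dt, steps, m=1.0, k=1.0):
    """Integrate Hamilton's equations with the symplectic leapfrog scheme."""
    trajectory = [(q, p)]
    for _ in range(steps):
        p_half = p - 0.5 * dt * k * q      # dp/dt = -dH/dq = -k q
        q = q + dt * p_half / m            # dq/dt =  dH/dp = p / m
        p = p_half - 0.5 * dt * k * q
        trajectory.append((q, p))
    return trajectory

def observed_loss(trajectory, q_observed):
    """Mean squared error on positions only; momenta are never compared."""
    errors = [(q - y) ** 2 for (q, _), y in zip(trajectory, q_observed)]
    return sum(errors) / len(errors)

dt, steps = 0.01, 1000
trajectory = leapfrog(q=1.0, p=0.0, dt=dt, steps=steps)
energies = [hamiltonian(q, p) for q, p in trajectory]
energy_drift = max(abs(e - energies[0]) for e in energies)
q_exact = [math.cos(i * dt) for i in range(steps + 1)]  # exact solution for m = k = 1
loss = observed_loss(trajectory, q_exact)
```

In this toy setting the energy stays close to its initial value over the whole rollout, which is the kind of long-horizon stability the Hamiltonian structure is meant to provide. In the setting of the abstract, the hand-written `hamiltonian` would be replaced by a trainable model and the loss would be backpropagated through the integrator.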

[LG-22] Non-normal spectral signatures of instability in neural network training dynamics

Link: https://arxiv.org/abs/2605.23476
Authors: Souvik Ghosh
Categories: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Materials Science (cond-mat.mtrl-sci); Optimization and Control (math.OC)
Comments: 9 pages, 3 figures

Abstract:Training instabilities in deep networks (loss spikes, oscillatory convergence, and gradient pathologies) are empirically prevalent but lack a rigorous operator-theoretic explanation. We show that the linearized update operators for practically used optimizers are generically non-normal: for Adam, non-normality is controlled by the commutator [H, M] between the Hessian and the diagonal adaptive preconditioner, while for SGD with momentum it arises from the augmented state-space structure of the update map. Applying non-normal stability theory to these operators, we derive a conservative pseudospectral precursor bound in which \kappa(V) serves as an early-warning indicator of transient amplification even when the spectral radius remains below one, and we establish that exceptional points of the update operator appear as the \kappa(V) \to \infty limiting case of this framework. Numerical experiments on two-layer networks confirm that the spectral radius \rho(J) provides no separation between stable and unstable training phases while \kappa(V) separates them by approximately one order of magnitude, complementing the classical sharpness criterion with a continuous severity measure of non-normal amplification. These results establish non-Hermitian operator theory as a useful and underexplored framework for neural network optimization stability, offering a diagnostic language and proof-of-concept benchmark for understanding adaptive optimization stability.
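
To make the distinction between spectral radius and eigenvector conditioning concrete, here is a minimal sketch on a generic 2x2 matrix. It is not the paper's update operator: the matrix J = [[a, c], [0, b]], the unit-norm eigenvector convention used for \kappa(V), and the helper names are all assumptions chosen for illustration.

```python
import math

def eigvec_condition_number(a, b, c):
    """kappa(V) for J = [[a, c], [0, b]] with a != b, using unit-norm eigenvectors."""
    norm = math.hypot(c, b - a)
    s = c / norm  # cosine of the angle between eigenvectors (1, 0) and (c, b - a) / norm
    # V^T V = [[1, s], [s, 1]] has eigenvalues 1 +/- |s|; kappa(V) is the ratio
    # of the largest to the smallest singular value of V.
    return math.sqrt((1.0 + abs(s)) / (1.0 - abs(s)))

def peak_growth(a, b, c, steps=200):
    """Largest norm reached by J^k x, starting from the unit vector x = (0, 1)."""
    x, y = 0.0, 1.0
    peak = 1.0
    for _ in range(steps):
        x, y = a * x + c * y, b * y
        peak = max(peak, math.hypot(x, y))
    return peak

# Both matrices have eigenvalues 0.9 and 0.8, hence the same spectral radius 0.9.
kappa_normal = eigvec_condition_number(0.9, 0.8, 0.0)
kappa_non_normal = eigvec_condition_number(0.9, 0.8, 5.0)
growth_normal = peak_growth(0.9, 0.8, 0.0)
growth_non_normal = peak_growth(0.9, 0.8, 5.0)
```

With identical eigenvalues, the diagonal matrix never amplifies the input, while the strongly non-normal one amplifies it by more than a factor of ten before decaying. Its \kappa(V) is about 100 versus 1 for the diagonal case, so the conditioning of the eigenvectors flags the transient that the spectral radius misses.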

[LG-23] S3GNN: Efficient Global Mixing and Local Message Passing for Long-Range Graph Learning

Link: https://arxiv.org/abs/2605.23467
Authors: Dai Shi,Luke Thompson,Linhan Luo,Lequan Lin,Andi Han,Junbin Gao,José Miguel Hernández Lobato
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Message-passing neural networks (MPNNs) often suffer from an information bottleneck when capturing long-range dependencies, leading to the oversquashing (OSQ) phenomenon. Alongside spatial connectivity enrichment (e.g., rewiring), recent studies have shown that spectral filtering can yield strong long-range learning outcomes, as spectral operators enable global information mixing that alleviates OSQ. These approaches achieve this either by stabilizing the Jacobian energies in deep propagation or by guaranteeing OSQ mitigation under strong theoretical assumptions. We revisit these conclusions and show that the associated Jacobian sensitivity lower bound is generally difficult to achieve in practice. We then propose S^3GNN, which mitigates OSQ without such restrictive assumptions by reintroducing omitted components in a lightweight way, with substantially lower computational complexity, while standard stability constraints on feature transformations remain effective under our new dynamics. Extensive experiments across diverse domains (e.g., long-range benchmarks, KGQA, and mesh-based fluid dynamics) demonstrate that S^3GNN achieves up to an order-of-magnitude error reduction with up to 50% fewer parameters. Our code can be found at this https URL.

[LG-24] Unextractable Protocol Models: Collaborative Training and Inference without Weight Materialization NEURIPS2025

Link: https://arxiv.org/abs/2605.23464
Authors: Alexander Long,Chamin Hewa Koneputugodage,Thalaiyasingam Ajanthan,Yan Zuo,Gil Avraham,Violetta Shevchenko,Hadi Mohaghegh Dolatabadi,Sameera Ramasinghe
Categories: Machine Learning (cs.LG)
Comments: Accepted at NeurIPS 2025. 34 pages, 6 figures (5 in main body, 1 in appendix). Alexander Long and Chamin Hewa Koneputugodage contributed equally

Abstract:We consider a decentralized setup in which the participants collaboratively train and serve a large neural network, and where each participant only processes a subset of the model. In this setup, we explore the possibility of unmaterializable weights, where a full weight set is never available to any one participant. We introduce Unextractable Protocol Models (UPMs): a training and inference framework that leverages the sharded model setup to ensure model shards (i.e., subsets) held by participants are incompatible at different time steps. UPMs periodically inject time-varying, random, invertible transforms at participant boundaries, preserving the overall network function yet rendering cross-time assemblies incoherent. On Qwen-2.5-0.5B and Llama-3.2-1B, 10,000 transforms leave FP32 perplexity unchanged (\Delta PPL 0.01; Jensen-Shannon drift 4 \times 10^{-5}), and we show how to control growth for lower precision datatypes. Applying a transform every 30s adds 3% latency, 0.1% bandwidth, and 10% GPU-memory overhead at inference, while training overhead falls to 1.6% time and 1% memory. We consider several attacks, showing that the requirements of direct attacks are impractical and easy to defend against, and that gradient-based fine-tuning of stitched partitions consumes \geq 60% of the tokens required to train from scratch. By enabling models to be collaboratively trained yet not extracted, UPMs make it practical to embed programmatic incentive mechanisms in community-driven decentralized training.
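
The core idea of a function-preserving boundary transform can be sketched with two linear shards. This is a simplified illustration under stated assumptions, not the UPM protocol itself: the shards are plain 2x2 matrices with no nonlinearity between them, the transforms are fixed rather than random, and all names are hypothetical.

```python
def matmul(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def inverse(m):
    """Inverse of a 2x2 matrix; raises if the matrix is singular."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    if det == 0:
        raise ValueError("transform must be invertible")
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

def transform_shards(w1, w2, t):
    """Return (T W1, W2 T^-1): each shard changes, their product does not."""
    return matmul(t, w1), matmul(w2, inverse(t))

w1 = [[1.0, 2.0], [3.0, 4.0]]   # shard held by participant A
w2 = [[0.0, 1.0], [1.0, 1.0]]   # shard held by participant B
reference = matmul(w2, w1)      # function computed by the full model

t_a = [[2.0, 1.0], [1.0, 1.0]]  # transform applied at time step a
t_b = [[1.0, 3.0], [0.0, 1.0]]  # transform applied at time step b
w1_a, w2_a = transform_shards(w1, w2, t_a)
w1_b, w2_b = transform_shards(w1, w2, t_b)

same_time = matmul(w2_a, w1_a)  # shards from the same step compose correctly
stitched = matmul(w2_b, w1_a)   # shards from different steps do not
```

Shards taken at the same time step reproduce the original product, while the stitched pair computes a different function, which is the incoherence of cross-time assemblies that the abstract describes.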

[LG-25] Class-Dependent Hybrid Data Augmentation for Multiclass Migraine Classification under Severe Class Imbalance

Link: https://arxiv.org/abs/2605.23453
Authors: Elvin Somón,Miguel A. Gutiérrez-Naranjo
Categories: Machine Learning (cs.LG)
Comments:

Abstract:We conducted a reproducibility-oriented re-evaluation of prior migraine classification studies, correcting for data leakage and metric bias. We then introduced (i) a clinically motivated aggregation of two hemiplegic subtypes following ICHD-3 §1.2.3, (ii) a class-dependent hybrid augmentation strategy that assigns generation methods based on per-class sample size, and (iii) the concept of fidelity asymmetry, motivating proportionally constrained growth as an alternative to full class balance. Experiments were performed on a dataset of 400 patients across seven migraine subtypes under a two-stage protocol, including the six-class configuration described above. Models were evaluated using stratified 5-fold cross-validation with macro-averaged F1 as the primary metric. Correcting methodological flaws reduces previously inflated performance estimates, with the corrected macro-F1 baseline standing at 0.71. The proposed framework consistently outperformed individual augmenters in macro-F1 averaged across the eight evaluated classifiers (0.862 vs. 0.836 for Gaussian Copula, 0.815 for CTGAN, and 0.801 for the no-augmentation baseline), and achieved its peak result of 0.914 with FT-Transformer under proportional augmentation. The no-augmentation FT-Transformer baseline (0.896) shows that, at the per-classifier ceiling, clinically motivated class aggregation accounts for most of the absolute improvement; the framework’s principal measurable contribution is the gain in average robustness across classifiers, highlighting the dominant role of problem formulation.

[LG-26] Weisfeiler-Leman Is Incomplete on Simple Spectrum Graphs so Canonicalize Them

Link: https://arxiv.org/abs/2605.23446
Authors: Snir Hordan,Nadav Dym,Tim Seppelt
Categories: Machine Learning (cs.LG); Combinatorics (math.CO)
Comments:

Abstract:Graphs with a simple spectrum admit cubic-time isomorphism testing, yet we prove that for every natural number k , the k -Weisfeiler-Leman ( k -WL) test cannot distinguish all non-isomorphic graphs with a simple spectrum. As the WL hierarchy upper-bounds the distinguishing power of widely-used Graph Neural Networks (GNNs), this incompleteness applies to all such GNNs, ruling out completeness for every k -WL-aligned GNN family. To close this gap, we introduce PRiSM (Partition, Refine, Solve, Match), the first provably complete canonicalization of simple-spectrum eigendecompositions. PRiSM obtains the completeness guarantee that prior canonicalizations provably lack, and resolves the open problem of achieving complete expressivity on simple-spectrum graphs. When composed with DeepSets or a Transformer, PRiSM achieves universal approximation on simple-spectrum graphs, justifying the use of canonicalized Laplacian positional encodings. Empirically, PRiSM performs comparably to or outperforms existing spectral canonicalizations on graph regression, classification, and expressivity

[LG-27] Onsager-Machlup Posterior Transport for Deep Gaussian Processes

Link: https://arxiv.org/abs/2605.23434
Authors: Jian Xu,Delu Zeng,John Paisley,Qibin Zhao
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Approximate inference over inducing variables is the central computational bottleneck of Deep Gaussian Processes (DGPs). Existing methods either fit an explicit density q_\phi(\mathbf{U}) by an ELBO (DSVI, IPVI, DDVI, DBVI) or sample by MCMC (SGHMC). We instead frame DGP inference as posterior transport: learn a deterministic sampler that maps a tractable reference measure to posterior-relevant inducing variables, regularised by a path prior derived from the Doob-bridged reference diffusion. Our realisation, OM-Path (formally FBVI-bridge-Path), uses Song’s probability-flow ODE applied to DBVI’s Doob-bridged forward SDE; the reference drift is closed-form from the bridge marginal coefficients (no score matching) and the path regulariser is the Onsager–Machlup action. At the finite-\epsilon value used at training, the objective is the negative log unnormalised density of a tempered Doob-bridge path posterior, and Theorem 1 identifies it with the same posterior’s small-noise MAP path via the Freidlin–Wentzell LDP. Two strict path-space ELBO variants on the same bridge backbone (FFJORD log-det; OM-regularised CNF) are derived as ablations. Under a matched-seed paired Wilcoxon test against DBVI on seven UCI regression benchmarks, OM-Path delivers statistically significant wins on the two largest datasets (power: p = 0.014, NLL 0.012 matching the DSVI baseline of 0.017; protein: p = 0.002, RMSE 0.716 vs. 0.764, NLL 1.086 vs. 1.149), statistical ties on yacht / qsar, and concedes boston / energy / concrete to DBVI on small-N noisy data. The strict-ELBO variants do not clear DBVI on any UCI metric: in this regime, reducing the variance of the path objective dominates exact-density tracking.

[LG-28] Sparse In-Network Learning via Shortest-Path Backpropagation and Finite-Rate Gating

Link: https://arxiv.org/abs/2605.23424
Authors: Mohammad Reza Deylam Salehi
Categories: Information Theory (cs.IT); Machine Learning (cs.LG)
Comments:

Abstract:In-network learning (INL) trains distributed neural modules by exchanging latent activations and backpropagated errors over a communication graph. This letter proposes Dijkstra-pruned INL (D-INL), which removes non-tree links by retaining a capacity-aware shortest-path tree rooted at the fusion node. To balance sparsity and predictive information, local routing (or aggregation) is modeled as a finite-rate stochastic gate with rate R_g=I(Z; T) . We derive a rate-distortion-generalization bound and validate the method on a reproducible distributed-classification experiment, where D-INL reduces training exchange by 70.4% while preserving accuracy within the standard deviation of dense INL. Adding finite-rate regularization further reduces the estimated latent rate by 45.7% relative to unregularized Dijkstra INL.
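
The pruning step can be sketched as an ordinary Dijkstra search from the fusion node followed by dropping every link that is not a tree edge. The abstract does not say how capacity enters the path cost, so the cost 1 / capacity used below is an assumption, as are the function names and the toy topology.

```python
import heapq

def shortest_path_tree(links, root):
    """Dijkstra from `root`; returns {node: parent} for the shortest-path tree.

    `links` maps an undirected pair (u, v) to a positive capacity. The link
    cost 1 / capacity is an assumption made for this sketch.
    """
    adjacency = {}
    for (u, v), capacity in links.items():
        cost = 1.0 / capacity
        adjacency.setdefault(u, []).append((v, cost))
        adjacency.setdefault(v, []).append((u, cost))

    dist = {root: 0.0}
    parent = {}
    heap = [(0.0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry, a shorter path was already found
        for v, cost in adjacency.get(u, []):
            candidate = d + cost
            if candidate < dist.get(v, float("inf")):
                dist[v] = candidate
                parent[v] = u
                heapq.heappush(heap, (candidate, v))
    return parent

def prune_to_tree(links, parent):
    """Keep only the links that appear as (child, parent) edges of the tree."""
    tree_edges = {frozenset((child, par)) for child, par in parent.items()}
    return {edge: cap for edge, cap in links.items() if frozenset(edge) in tree_edges}

# Toy topology: node 0 is the fusion node, values are link capacities.
links = {(0, 1): 10.0, (0, 2): 1.0, (1, 2): 10.0, (2, 3): 5.0, (1, 3): 1.0}
parent = shortest_path_tree(links, root=0)
kept = prune_to_tree(links, parent)
```

On this five-link graph the tree keeps the three high-capacity links and removes the two low-capacity ones, so activations and backpropagated errors would be exchanged only along the retained edges.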

[LG-29] Hinge Regression Trees and HRT-Boost: Newton-Optimized Oblique Learning for Compact Tabular Models

Link: https://arxiv.org/abs/2605.23422
Authors: Hongyi Li,Jun Xu,Hong Yan
Categories: Machine Learning (cs.LG)
Comments: arXiv admin note: substantial text overlap with arXiv:2602.05371

Abstract:Learning high-quality oblique decision trees remains a significant challenge due to the discrete and non-convex nature of split optimization. We present the Hinge Regression Tree (HRT) framework, which reframes each oblique split as a nonlinear least-squares problem over two linear predictors whose max/min envelope induces ReLU-like representation capacity. We show that the resulting node-level optimization can be interpreted as a damped Newton method, and we establish the monotonic decrease of the node objective for its backtracking line-search variant. We establish, theoretically, that HRT is a universal approximator with an explicit O(\delta^2) approximation rate. Building upon this base learner, we propose HRT-Boost, a mathematically synergistic ensemble extension that couples node-level Newton updates with stage-wise functional gradient descent. We show that this ensemble construction admits a stage-wise empirical risk reduction guarantee under the squared loss. Empirical evaluations on synthetic and real-world benchmarks show that HRT is highly competitive with established single-tree baselines, and HRT-Boost compares favorably with strong ensemble baselines and often yields substantially more compact models. The code is publicly available at this https URL.
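
A small sketch may help show why the max envelope of two linear predictors behaves like a ReLU unit and induces an oblique split. It only evaluates a fixed hinge node; the Newton-style fitting procedure from the abstract is not reproduced, and the function names and weights are assumptions for illustration.

```python
def linear(w, b, x):
    """Affine predictor w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def hinge_predict(x, w1, b1, w2, b2):
    """Max envelope of two linear predictors."""
    return max(linear(w1, b1, x), linear(w2, b2, x))

def hinge_side(x, w1, b1, w2, b2):
    """Index of the active predictor.

    The boundary (w1 - w2) . x + (b1 - b2) = 0 is a hyperplane that is in
    general not axis-aligned, i.e. an oblique split.
    """
    return 0 if linear(w1, b1, x) >= linear(w2, b2, x) else 1

def relu(z):
    return max(z, 0.0)

w1, b1 = [1.0, 1.0], 0.0   # hypothetical predictor for one side of the split
w2, b2 = [0.0, 0.0], 1.0   # hypothetical predictor for the other side
points = [[2.0, 1.0], [0.0, 0.0], [0.5, 0.5]]
predictions = [hinge_predict(x, w1, b1, w2, b2) for x in points]
sides = [hinge_side(x, w1, b1, w2, b2) for x in points]
# Identity behind the ReLU-like capacity: max(f1, f2) = f2 + relu(f1 - f2).
via_relu = [linear(w2, b2, x) + relu(linear(w1, b1, x) - linear(w2, b2, x))
            for x in points]
```

Here the split boundary is the line x0 + x1 = 1, which no single-feature threshold can express.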

[LG-30] An Open-Source Training Dataset for Foundation Models for Black-box Optimization

Link: https://arxiv.org/abs/2605.23417
Authors: Aaron Klein,Herilalaina Rakotoarison,Luca Thale-Bombien,David Salinas
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Most black-box optimization methods require extensive hyperparameter tuning, often limiting their ability to generalize across different optimization domains. Foundation models for black-box optimization that learn optimization principles from a large collection of optimization trajectories offer a promising alternative, with the potential to outperform manually designed methods across diverse problem classes. However, prior work has either relied on non-public datasets or on purely synthetic data, limiting reproducibility and generalization to real-world problems. As a result, progress in this area has been constrained by the lack of large-scale, real-world, publicly available pre-training data. We introduce BBO-Pile, the first open-source dataset comprising over 500K optimization trajectories evaluated across 3095 different black-boxes for different optimizers, which represents by far the largest public dataset for this task. Using this dataset, we train a family of foundation models at multiple scales, ranging from 2M to 80M parameters and from 200M to 2B training tokens, and study their scaling behavior with respect to compute. Our results demonstrate that large-scale pre-training is a viable and effective approach to imitate black-box optimization methods, paving the way for future research in this direction.

[LG-31] Hybrid Quantum-Classical Corrective Diffusion Modeling for Meteorological Downscaling

Link: https://arxiv.org/abs/2605.23403
Authors: Rui Wang,Edoardo Pasetto,Amer Delilbasic,Morris Riedel,Kristel Michielsen,Gabriele Cavallaro
Categories: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph); Quantum Physics (quant-ph)
Comments: 11 pages, 9 figures. Submitted to IEEE QCE 2026

Abstract:Statistical downscaling is a crucial component of the weather modeling field, where high-resolution outputs must be reconstructed from coarse-resolution inputs without the full cost of dynamical refinement. In this work, we investigate a hybrid quantum-classical corrective diffusion model for probabilistic statistical downscaling of weather fields. The proposed model inserts variational quantum circuit layers into the most compressed bottleneck of the diffusion UNet while leaving the regression branch fully classical. This placement tests whether quantum circuits can act as compact nonlinear feature maps for latent-channel mixing. We evaluate intra-channel and cross-channel ansätze on 10m wind components. On the 2020 validation set, the hybrid models remain stable, preserve the large-scale spatial organization of the generated wind fields, and improve both MAE and CRPS relative to a classical corrective diffusion model in several configurations. Structural diagnostics further show that the hybrid variants preserve kinetic-energy spectra and windspeed distributions similar to those of their classical counterpart while producing controlled changes in tail behavior, extreme-windspeed localization, and the joint structure of the wind field components. Backend studies on the 2020 validation set show negligible impact from simulated device noise at the tested circuit scale, whereas real-hardware deployment remains limited by qubit availability and execution fidelity. The 2021 out-of-distribution test shows that these in-distribution gains do not transfer uniformly under temporal shift, revealing a generalization gap that motivates future mitigation through stabilization and regularization. These results show that bottleneck-level quantum hybridization can make a nontrivial contribution to weather statistical downscaling, while also highlighting that circuit scale and hardware deployment remain key limiting factors.

[LG-32] Convex Compositional Reasoning Models

Link: https://arxiv.org/abs/2605.23395
Authors: Meir Roketlishvili,Semyon Semenov,Maksim Bobrin,Viktor Kovalchuk,Albert Baichorov,Abduragim Shtanchaev,Fakhri Karray,Dmitry V. Dylov,Martin Takáč,Arip Asadulaev
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Compositional energy-based models can generalize to larger combinatorial reasoning problems by reusing a learned factor energy across many local constraints. In our paper, we show that a key bottleneck in compositional reasoning is not composition itself, but the non-convex geometry of the learned energy landscape. To solve this problem, we introduce Convex Compositional Energy Minimization (CCEM), a framework that parameterizes each factor with an input-convex neural network and optimizes the composed energy over a tight convex relaxation of the feasible set. Because convexity is preserved under summation, the global relaxed objective remains convex, enabling deterministic projected first-order optimization. CCEM is trained in two stages: factor-level contrastive learning to shape local energy basins, followed by end-to-end refinement through an unrolled projected solver. Our experiments show that our models trained on small subproblems or a single problem size transfer to larger instances without retraining.
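
The claim that a sum of convex factors stays convex, and can therefore be minimized by deterministic projected first-order steps, can be illustrated with a toy problem. The sketch below is not CCEM: the two quadratic factors stand in for input-convex networks, the box [0, 1]^2 is an assumed relaxation of a binary feasible set, and the step size and names are chosen for the example.

```python
def composed_energy_grad(x):
    """Gradient of E(x) = (x0 - 2)^2 + (x0 + x1 - 1)^2, a sum of two convex factors."""
    g_first = 2.0 * (x[0] - 2.0)          # factor 1 depends on x0 only
    g_second = 2.0 * (x[0] + x[1] - 1.0)  # factor 2 couples x0 and x1
    return [g_first + g_second, g_second]

def project_to_box(x, low=0.0, high=1.0):
    """Euclidean projection onto the box [low, high]^n."""
    return [min(max(xi, low), high) for xi in x]

def projected_gradient_descent(x, step=0.1, iterations=500):
    """Gradient step on the composed energy, then projection back onto the box."""
    for _ in range(iterations):
        grad = composed_energy_grad(x)
        x = project_to_box([xi - step * gi for xi, gi in zip(x, grad)])
    return x

solution = projected_gradient_descent([0.5, 0.5])
```

Because the composed energy is convex and the box is a convex set, the iterates converge to the constrained minimizer, which for this toy energy sits at the corner (1, 0) of the box.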

[LG-33] Coupling-Robust Accuracy in Multiphysics Physics Informed Neural Networks via Kronecker-Preconditioned Optimization ICML2026

Link: https://arxiv.org/abs/2605.23391
Authors: Youngjae Park,Jaemin Kim,Junghwa Hong
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments: 20 pages, 10 figures. Extended version of AI4Physics Workshop submission (ICML 2026)

Abstract:Physics-informed neural networks (PINNs) for coupled multiphysics systems suffer systematic accuracy degradation as inter-equation coupling strengthens. We provide a theoretical explanation for this phenomenon through neural tangent kernel (NTK) analysis: for linearly coupled systems, we prove that the standard NTK’s spectral radius grows as \Omega(\gamma^2) with coupling strength \gamma, shrinking the stable learning rate, while block-diagonal Gauss–Newton (GN) preconditioning yields a preconditioned NTK K_P = J H^+ J^\top (where H is the block-diagonal GN Hessian) whose spectral radius is bounded by S (S = number of networks), independent of \gamma. We verify the \Omega(\gamma^2) growth numerically across symmetric, asymmetric, and nonlinear coupled PDE systems, and confirm \lambda_\max(K_P) = S with equality in all cases. Combining the Kronecker-preconditioned optimizer SOAP with inverse-gradient-norm loss balancing (SOAP+GN) yields coupling-robust accuracy: across 234 experiments spanning three 1D systems of increasing nonlinearity and a 2D electroosmotic flow benchmark, SOAP+GN maintains final-epoch L_2 degradation \leq 1.1\times (ratio of strong- to weak-coupling error) even as coupling parameters vary over one to two orders of magnitude, compared with 10^2\times for Adam+GN. SOAP+GN further scales to a 2D, 6-PDE electroosmotic flow system at EDL-resolved conditions – a regime that all prior PINN electrokinetics studies have avoided through simplified physics – where Adam+GN fails entirely (L_2 0.9).

[LG-34] Instance-Optimal Estimation with Multiple LLM Judges on a Budget

Link: https://arxiv.org/abs/2605.23362
Authors: Junghyun Lee,Sanghwa Kim,Yassir Jedra,Alexandre Proutière,Se-Young Yun
Categories: Machine Learning (cs.LG); Information Theory (cs.IT); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments: 53 pages, 4 figures; the first two authors contributed equally

Abstract:Evaluating large language models increasingly relies on LLM-as-a-judge protocols, but such evaluations remain costly: different judges have different prices and reliabilities, and the difficulty of each prompt-response pair can vary substantially. This raises a basic allocation question: under a fixed budget, how should one distribute evaluation queries across heterogeneous judges and instances to obtain the most accurate score estimates? We formalize this question as budgeted heteroskedastic multi-judge estimation. Given K prompt-response pairs, J judges with known costs, and unknown query-judge variances, the goal is to estimate a bounded score vector while minimizing an \ell_p -error. Our first contribution is to analyze the inverse-variance weighted estimator (IVWE) and to derive the oracle allocation that minimizes its error rate. Since this allocation depends on the unknown variances, we then address the practical unknown-variance setting by proposing EST-IVWE, an adaptive algorithm that constructs and leverages optimistically biased variance estimates to stabilize the empirical allocation. We prove that EST-IVWE matches the oracle IVWE rate up to lower-order terms in the budget. Our second and central theoretical contribution is a matching local minimax lower bound, which establishes the instance-optimality of the proposed algorithms. A key technical insight is that Fano-type high-probability arguments are too coarse for this problem: their packing construction loses the local variance structure that governs the optimal allocation. We instead use an Assouad-type in-expectation argument, based on local perturbations, which preserves this structure and yields the sharp allocation-dependent lower bound. Finally, we numerically validate the superiority of our approach over naïve uniform allocation on synthetic and HelpSteer2 datasets.
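
For readers unfamiliar with inverse-variance weighting, the following minimal sketch shows the textbook estimator for a single prompt-response pair when the judge variances are known. It does not implement EST-IVWE or the oracle budget allocation from the abstract; the function name and the example numbers are assumptions.

```python
def ivw_estimate(judge_means, judge_variances, judge_counts):
    """Inverse-variance weighted estimate of one score from several judges.

    Judge j contributes the mean of judge_counts[j] independent queries, each
    with variance judge_variances[j], so its precision is n_j / sigma_j^2.
    Returns the estimate and its variance.
    """
    weighted_sum, total_precision = 0.0, 0.0
    for mean, var, n in zip(judge_means, judge_variances, judge_counts):
        if n == 0:
            continue  # a judge that was never queried carries no information
        precision = n / var
        weighted_sum += precision * mean
        total_precision += precision
    if total_precision == 0.0:
        raise ValueError("at least one judge must be queried")
    return weighted_sum / total_precision, 1.0 / total_precision

# Two hypothetical judges: the first is four times less noisy than the second.
estimate, variance = ivw_estimate(
    judge_means=[0.6, 0.8],
    judge_variances=[0.04, 0.16],
    judge_counts=[4, 4],
)
# Variance of the plain average of the two judge means, for comparison.
uniform_variance = (0.04 / 4 + 0.16 / 4) / 4
```

The weighted estimate leans towards the more reliable judge and has lower variance than the unweighted average of the two judge means. The allocation question studied in the abstract is how to choose the per-judge query counts under a cost budget.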

[LG-35] Prudent-Banker: No Extra Fees for Baseline Safety in Adversarial Bandits With and Without Delays

Link: https://arxiv.org/abs/2605.23351
Authors: Ting Hu,Luanda Cai,Emmanouil-Vasileios Vlatakis-Gkaragkounis
Categories: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
Comments:

Abstract:We study adversarial multi-armed bandits with and without delayed feedback under a safety-aware goal: achieving minimax-optimal worst-case regret while keeping nearly constant regret relative to a designated “safe” baseline policy. Existing approaches can balance this trade-off with immediate feedback for smooth comparators, but arbitrary delays can mistime transitions between conservatism and exploration, endangering the safety guarantee. To bridge this gap, we propose Prudent-Banker, a novel algorithm that combines a delay-adapted variant of Online Mirror Descent with a modified phased-aggression mechanism. Its key technical contribution is a delay-calibrated restart threshold that rigorously accounts for the worst-case distortion induced by unobserved feedback and reliably detects comparator suboptimality. We also establish new lower bounds for safety-constrained adversarial delayed bandits, showing that the regret guarantees of Prudent-Banker are unimprovable, up to logarithmic factors, under the baseline-safety requirement. To the best of our knowledge, Prudent-Banker is the first algorithm to achieve the optimal safety–robustness trade-off: pseudo-regret \widetilde{O}(\sqrt{T}+\sqrt{D}) together with \widetilde{O}(1) regret against the safe comparator, both with and without delays. Experiments across diverse delay distributions show that, unlike standard delay-robust baselines, Prudent-Banker effectively balances safety and learning.

[LG-36] Contrastive Distribution Matching for Amortized Sequential Monte Carlo in Discrete Diffusion

Link: https://arxiv.org/abs/2605.23346
Authors: Jaihoon Kim,Taehoon Yoon,Prin Phunyaphibarn,Seungjun Kim,Morteza Mardani,Minhyuk Sung
Categories: Machine Learning (cs.LG)
Comments: Project Page: this https URL

Abstract:Discrete diffusion models have emerged as powerful frameworks for generating structured categorical data. However, efficiently sampling from reward-tilted distributions remains a fundamental challenge. While Twisted Sequential Monte Carlo (SMC) offers asymptotic exactness for this task, estimating the optimal twist function in discrete state spaces necessitates costly Monte Carlo approximations, resulting in a severe computational bottleneck at inference. To overcome this limitation, we introduce Contrastive Distribution Matching (CDM), a novel framework that amortizes the cost of SMC inference by learning a parameterized twist function via positive and negative samples. For efficient training, we reformulate the gradient estimator to leverage the closed-form forward kernels of discrete diffusion models. In practice, evaluating our learned twist function incurs less than 5% additional computational overhead compared to a single forward pass of the base model. Through extensive empirical evaluations, we demonstrate that CDM consistently outperforms existing baselines under matched wall-clock time. We validate the effectiveness and versatility of our approach across a diverse range of applications, including toxic text generation, regulatory DNA sequence design, protein designability, and diffusion large language model alignment.

[LG-37] Diffusion Domain Expansion: Learning to Coordinate Pre-trained Diffusion Models ICML2024

Link: https://arxiv.org/abs/2605.23275
Authors: Egor Lifar,Semyon Savkin,Timur Garipov,Shangyuan Tong,Tommi Jaakkola
Categories: Machine Learning (cs.LG)
Comments: Accepted as poster at ICML 2024 Workshop on Structured Probabilistic Inference and Generative Modeling (SPIGM)

Abstract:In this paper, we propose Diffusion Domain Expansion (DDE), a method that efficiently extends pre-trained diffusion models to generate larger objects and handle more complex conditioning beyond their original capabilities. Our method employs a compact trainable network designed to coordinate the denoised outputs of pre-trained diffusion models. We demonstrate that the coordinator can be universally simple while being capable of generalizing to domains larger than those observed during its training time. We evaluate DDE on long audio track generation and conditional image generation, demonstrating its applicability across domains. DDE outperforms other approaches to coordinated generation with diffusion models in qualitative and quantitative evaluations.

[LG-38] A Simple Plug-in for Improving Eviction-Based KV Cache Compression

Link: https://arxiv.org/abs/2605.23258
Authors: Yuping Lin,Jiayuan Ding,Yue Xing,Pengfei He,Jiliang Tang,Subhabrata Mukherjee
Categories: Machine Learning (cs.LG)
Comments:

Abstract:KV cache growth is a major bottleneck for long-context inference in large language models. Existing methods are often dominated by binary eviction or representation approximation, which may underutilize tokens that are not critical for exact retention but are still reconstructable. We present VECTOR, a plug-and-play augmentation for eviction-based pipelines that introduces three-way token routing: retention, approximation, and eviction. VECTOR combines an importance signal from the base scorer with a reconstructability signal from an offline-calibrated regression-based value estimation. By leveraging reconstructability, VECTOR recovers useful value information that would otherwise be irreversibly lost under binary eviction, while preserving key vectors for attention routing stability. Experimental results show that VECTOR improves quality-memory trade-offs under medium-to-high compression, with especially clear gains in stricter budget regimes.

[LG-39] Learning-Augmented Online Scheduling with Parsimonious Preemption

Link: https://arxiv.org/abs/2605.23255
Authors: Mugen Blue, Sungjin Im, Alexander Lindermayr
Categories: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*Comments:

Abstract:Learning-augmented algorithms have emerged as a powerful paradigm to surpass traditional worst-case lower bounds by integrating potentially noisy predictions. While this framework has seen success in online scheduling, existing work primarily optimizes job latency while relying on frequent, "blind" preemptions. This ignores the fundamental trade-off between algorithmic performance and preemption complexity. We provide the first systematic study of learning-augmented scheduling that curbs preemption while optimizing latency. We establish that the gap between theoretical latency bounds and preemption overhead can be bridged with solid analytical foundations. Our results include O(1)-competitive algorithms for single and unrelated parallel machines with only O(1) preemptions per job under accurate predictions, with overhead scaling logarithmically with the prediction error. By providing the first bounded-preemption guarantees for unrelated and malleable machines, we extend the theoretical reach of the learning-augmented framework to more constrained and realistic settings. Finally, our algorithms are validated through experiments.

[LG-40] Accelerating Divisible Load Processing Through Machine Learning: A Practical Framework for Large-Scale Workloads

Link: https://arxiv.org/abs/2605.23247
Authors: Bharadwaj Veeravalli
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:In this paper, we introduce the first machine learning framework for predicting optimal processing times in Single-Level Tree Network (SLTN) architectures for the Divisible Load Theory (DLT) paradigm. Using a feedforward neural network (FNN) with 16 engineered features, we train a model on 100,000 synthetically generated configurations to predict optimal processing times without explicit formulation of DLT equations. The model achieves 97-99% accuracy (R-square factor) with mean absolute percentage error of 1-5%, demonstrating that neural networks can effectively learn complex load distribution relationships. Feature importance analysis reveals that the model implicitly captures DLT mathematical structure, including load conservation and simultaneous finishing constraints. With inference times under 1 millisecond, the approach provides 10-100x speedup over traditional DLT computation, enabling applications in real-time scheduling, design space exploration, and cloud resource allocation. The method generalizes well across diverse system configurations (n = 3 to 20, load size = 1 to 100 GB) with consistent accuracy, though performance degrades slightly for very large or highly heterogeneous systems. This work demonstrates the feasibility of using machine learning to accelerate distributed computing optimization while maintaining near-optimal accuracy.

[LG-41] Convex Optimization for Alignment and Preference Learning on a Single GPU

Link: https://arxiv.org/abs/2605.23244
Authors: Miria Feng, Mert Pilanci
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:Fine-tuning large language models (LLMs) to align with human preferences has driven the success of systems such as Gemini and ChatGPT. However, approaches like Reinforcement Learning from Human Feedback (RLHF) remain computationally expensive and complex. Direct Preference Optimization (DPO) offers a simpler alternative but has limitations such as inconsistent ranking accuracy, high dependence on GPU resources, and expensive hyperparameter tuning. We propose the Convex Optimization for Alignment and Preference Learning Algorithm (COALA): a novel lightweight strategy with strong theoretical guarantees. By leveraging the convex optimization reformulation of neural networks, COALA eliminates the need for a reference model and obtains significant reduction in both training time and VRAM consumption, thus enabling efficient training on a single GPU. Experiments across four datasets (including a 26621-sample synthetic Educational Feedback dataset) and six models (including Llama-3.1-8B) demonstrate COALA’s competitive performance and efficiency while utilizing as little as ~17.6% of DPO’s total TFLOPs. COALA exhibits stable, monotonically increasing rewards and reaches peak margins in significantly shorter time in comparison to traditional methods such as DPO and ORPO. To the best of our knowledge, this is the first time convex optimization has been effectively applied to preference fine-tuning of LLMs.

[LG-42] RelPrism: A Multi-Faceted Pre-training Framework with Self-Generated Tasks for Relational Databases

Link: https://arxiv.org/abs/2605.23241
Authors: Jinyu Yang, Cheng Yang, Junze Chen, Zedi Liu, Muhan Zhang, Hanyang Peng, Chuan Shi
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:Relational databases (RDBs) remain the cornerstone of modern data systems and support diverse predictive tasks. Recent relational deep learning (RDL) methods enable end-to-end prediction by converting RDBs into graphs, where rows are represented as nodes and inter-table interactions are represented as edges, and then applying graph-based models for representation learning. Despite the strong capability of RDL, effective self-supervised pre-training for RDBs remains non-trivial. RDB tasks often require multi-faceted information across different perspectives and granularities. For example, user churn classification may rely more on interaction patterns, whereas consumption value prediction requires both user-item behaviors and intrinsic user attributes for fine-grained regression. Such heterogeneous needs challenge RDB representation learning, as pre-training objectives should cover comprehensive information for downstream adaptation. However, existing SSL methods typically derive supervision from a single facet, such as node-level intrinsic attributes or subgraph-level relational structures, providing limited adaptability. To this end, we propose RelPrism, a multi-faceted self-supervised learning framework for RDBs. RelPrism constructs intrinsic, relational, and hybrid attributes from distinct perspectives, and applies multi-granularity clustering to each perspective to form corresponding pseudo-task pools. Pre-training over these pools exposes representations to broader perspectives and granularity levels, yielding a stronger basis for downstream adaptation. Experiments on 14 tasks across 5 real-world datasets show that RelPrism improves ROC-AUC by 4.15% for classification and reduces MAE by 10.75% for regression over state-of-the-art baselines. Our code is available at this https URL.

[LG-43] Self-supervised Adversarial Purification for Graph Neural Networks

Link: https://arxiv.org/abs/2605.23239
Authors: Woohyun Lee, Hogun Park
Categories: Machine Learning (cs.LG)
*Comments: 21 pages

Abstract:Defending Graph Neural Networks (GNNs) against adversarial attacks requires balancing accuracy and robustness, a trade-off often mishandled by traditional methods like adversarial training that intertwine these conflicting objectives within a single classifier. To overcome this limitation, we propose a self-supervised adversarial purification framework. We separate robustness from the classifier by introducing a dedicated purifier, which cleanses the input data before classification. In contrast to prior adversarial purification methods, we propose GPR-GAE, a novel graph auto-encoder (GAE), as a specialized purifier trained with a self-supervised strategy, adapting to diverse graph structures in a data-driven manner. Utilizing multiple Generalized PageRank (GPR) filters, GPR-GAE captures diverse structural representations for robust and effective purification. Our multi-step purification process further facilitates GPR-GAE to achieve precise graph recovery and robust defense against structural perturbations. Experiments across diverse datasets and attack scenarios demonstrate the state-of-the-art robustness of GPR-GAE, showcasing it as an independent plug-and-play purifier for GNN classifiers. Our code can be found at this https URL.

[LG-44] Convex Low-resource Accent-Robust Language Detection in Speech Recognition

Link: https://arxiv.org/abs/2605.23235
Authors: Miria Feng, William Tan, Mert Pilanci
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:Globalization and multiculturalism continue to produce increasingly diverse speech varieties. Yet current spoken dialogue systems frequently fail on under-represented dialects and accents, often misidentifying the input language and causing cascading failures in downstream dialogue tasks. Addressing this dialectal variance under low-resource constraints remains an open challenge, as standard fine-tuning is computationally expensive and prone to overfitting on high-dimensional speech data. We propose Convex Language Detection (CLD), a novel framework that integrates theoretically grounded convex optimization techniques into the spoken dialogue systems pipeline. Our method is efficiently implemented via multi-GPU Alternating Direction Method of Multipliers (ADMM) in JAX, thus providing global optimality guarantees and fast training in polynomial time. Theoretically, we prove that our convex objective induces certified margin stability and provide guarantees against feature perturbations. Empirically, we demonstrate sample efficiency and robustness to input dialectal variation, achieving 97-98% accuracy in challenging low-resource regimes. Our open-source package is available at this https URL.

[LG-45] Assessing Predictive Models for Fairness Based on Movement Patterns

Link: https://arxiv.org/abs/2605.23234
Authors: Francesco Lettich, Mario A. Nascimento, Chiara Pugliese, Chiara Renso
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY)
*Comments: 33 pages, 10 figures, 7 tables

Abstract:Assessing the spatial fairness of predictive models involves establishing whether they are statistically penalizing (favoring) individuals associated with certain geographical locations. Literature on this topic makes the fundamental assumption that each individual is assigned to a single geographical location (e.g., place of residence). However, fairness with respect to the set of locations where one has been, i.e., their movement patterns over different regions, also matters when fairness is considered. Consequently, we argue that it is necessary to generalize the notion of spatial fairness to also include movement patterns, leading to the novel problem of assessing predictive models for fairness relative to the movements of individuals. To deal with this problem, we propose an approach that first associates the movements of individuals to certain geographic regions, considering multiple spatial partitions with different resolutions and alignments, and then employs a suitable spatial scan statistic to assess whether a predictive model is fair based on movement patterns. In the experimental evaluation, we study the performance of our approach over thousands of synthetic unfair datasets, showing that it is effective at detecting this new type of unfairness and at retrieving the set of objects treated unfairly, while localization performance exhibits a consistent multi-resolution trade-off.

[LG-46] Entropy Equivalence Testing

Link: https://arxiv.org/abs/2605.23225
Authors: Clément L. Canonne, Yash Pote, Jonathan Scarlett, Joy Qiping Yang
Categories: Data Structures and Algorithms (cs.DS); Discrete Mathematics (cs.DM); Information Theory (cs.IT); Machine Learning (cs.LG); Statistics Theory (math.ST)
*Comments:

Abstract:We introduce the problem of entropy equivalence testing for probability distributions, a relaxation of the well-studied closeness testing problem, where the distribution testing algorithm is now only required to distinguish, given samples from two unknown distributions p,q and a parameter \varepsilon \in (0,1/2], between p=q and |H(p)-H(q)| \geq \varepsilon (where H denotes the Shannon entropy). We provide a time- and sample-efficient algorithm for this task, showing that the optimal sample complexity for this task can be significantly lower than that of closeness testing. As an application, we leverage this result to provide the first non-trivial testing algorithm for (standard) closeness of low-degree Bayesian networks, which significantly improves on either the sample or time complexity of a baseline based on full learning.
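
To make the relaxation concrete, here is a minimal sketch of the two quantities involved, computed on fully known toy distributions. It is not the paper's tester, which works from samples; the function names and the example distributions are assumptions made for illustration. It shows that two distributions can be far apart in total variation yet have identical Shannon entropy, so an entropy gap of at least \varepsilon is a strictly stronger "far" condition than p \neq q.

```python
import math


def shannon_entropy(p):
    """Shannon entropy (in nats) of a discrete distribution given as a list."""
    return -sum(x * math.log(x) for x in p if x > 0)


def entropy_gap(p, q):
    """The quantity |H(p) - H(q)| that the 'far' case lower-bounds by epsilon."""
    return abs(shannon_entropy(p) - shannon_entropy(q))


def total_variation(p, q):
    """Total variation distance, the usual notion of distance in closeness testing."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical toy distributions on three outcomes.
p = [0.7, 0.2, 0.1]
q = [0.1, 0.2, 0.7]        # a permutation of p: far in TV, same entropy
u = [1 / 3, 1 / 3, 1 / 3]  # uniform: different entropy from p
```

Here `p` and `q` are distinguishable by a closeness tester, but an entropy equivalence tester is allowed to answer either way on them, since they satisfy neither p = q nor the entropy-gap condition.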

[LG-47] WMAttack: Automated Attack Search for Adversarial Evaluation of World-Model Agents

Link: https://arxiv.org/abs/2605.23220
Authors: Zhixiang Guo, Siyuan Liang, Shi Fu, Cheng Guo, Andras Balogh, Mark Jelasity, Dacheng Tao
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:Despite the growing use of world models as decision-making agents, their adversarial robustness remains underexplored due to the lack of dedicated automated evaluation methods. A key obstacle is that attack evaluation must be both accurate and efficient: weak manually tuned attacks can overestimate robustness, while exhaustive hyperparameter search is prohibitively expensive because each candidate requires closed-loop rollouts through learned latent dynamics. We introduce WMAttack, an automated attack-search framework for adversarial evaluation of world-model agents. WMAttack formulates robustness evaluation as a finite-budget search over attack configurations, including attack families, perturbation budgets, optimization steps, restarts, and allocation rules. To improve search accuracy, Self-Correcting Attack Search (SCAS) refines the attack proposal distribution using feedback from reward degradation, action instability, runtime cost, and rollout variability. To improve search efficiency, Representation-Guided Attack Retrieval (RGAR) retrieves effective historical configurations from representation-similar tasks, providing a warm start for unseen environments. We provide a theoretical explanation showing that proposal refinement improves finite-budget search when it shifts probability mass toward high-utility attacks. Across Atari and DeepMind Control tasks, WMAttack consistently discovers stronger attacks than the evaluated baselines, improving normalized reward drop from 0.497 to 1.034 on DreamerV3 Atari and from 0.319 to 0.682 on DMC. Ablations further show that RGAR improves initial candidate quality and SCAS improves final attack utility under fixed evaluation budgets.

[LG-48] Label-Efficient Dataset Pruning via Semi-Supervised Pseudo-Labeling

Link: https://arxiv.org/abs/2605.23198
Authors: Yeseul Cho, Baekrok Shin, Changmin Kang, Chulhee Yun
Categories: Machine Learning (cs.LG)
*Comments: 10 pages

Abstract:Dataset pruning reduces the storage and training costs of deep learning by selecting an informative subset from a large dataset. However, most existing pruning methods require fully labeled data, which limits their applicability in realistic settings where unlabeled data are abundant and annotation is costly. Recent label-free pruning methods address this issue, but they rely on features from pretrained models to estimate example difficulty. This dependence can be unreliable when the target dataset differs substantially from the pretraining distribution. We propose SemiPrune, a label-efficient dataset pruning framework that requires only a small randomly labeled subset. It uses semi-supervised learning to generate pseudo-labels for unlabeled data, so that existing supervised pruning methods that require label information can be applied directly to the resulting pseudo-labeled training pool. We then estimate example difficulty from pseudo-label-induced training dynamics and select a coreset. By learning directly from the target dataset, our method better captures the target distribution and provides more reliable signals for difficulty estimation and coreset selection. We validate our approach on domain-specific, image-corrupted, and long-tailed datasets, where it achieves state-of-the-art performance among label-free and label-efficient baselines, while also demonstrating competitive performance on standard benchmarks.

[LG-49] Empirical Bayes Conformal Prediction for Vision and Language Models

Link: https://arxiv.org/abs/2605.23189
Authors: Jiapeng Zeng, Yogesh Prabhu, Zhanpeng Zeng, Michael A. Newton, Vikas Singh
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:Conformal prediction (CP) gives distribution-free coverage for modern vision and language models, but it is often forced to make a ranking decision from a single unstable nonconformity score. Standard CP uses one realization, while average-then-calibrate variants smooth multiple realizations into a point estimate. Both options discard the inconsistency that can help identify whether a candidate is indeed stable. A weak answer can enter the conformal set even if the evidence is not strong, simply because one posterior sample or prompt phrasing made it look strong. But variability can help distinguish a stable signal from noise-driven fluctuations. We describe an empirical Bayes conformal prediction framework that uses r-values to convert score variability into an uncertainty-informed nonconformity score. The resulting r-value estimates how likely it is that a candidate’s latent score belongs to the top-ranked group after accounting for both its mean score and its uncertainty. It admits both a closed-form Normal-Normal empirical Bayes estimator and a nonparametric posterior-sampling estimator. Using the r-value as the nonconformity score preserves the target conformal coverage while provably reducing the inclusion of high-variance false candidates under mild regularity conditions. Across image classification, CLIP-based VLM benchmarks, and LLMs, we show that r-value conformal prediction preserves target coverage while improving ranking stability and reducing set size when variability is informative, and reverting to CP-like behavior when variability vanishes.
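
For context, the baselines the abstract contrasts with can be sketched as standard split conformal prediction, optionally preceded by averaging several score realizations. This is a minimal illustration of those baselines only, not of the r-value method; the function names and the convention that lower scores mean better candidates are assumptions made here.

```python
import math


def mean_score(realizations):
    """Average-then-calibrate: collapse several score realizations to one number."""
    return sum(realizations) / len(realizations)


def conformal_threshold(cal_scores, alpha):
    """Split-conformal threshold: the ceil((n+1)(1-alpha))-th smallest calibration score.

    Returns infinity when the calibration set is too small for the requested level.
    """
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(cal_scores)[k - 1]


def prediction_set(candidate_scores, threshold):
    """Keep every candidate whose nonconformity score does not exceed the threshold."""
    return [label for label, s in candidate_scores.items() if s <= threshold]
```

Note that `mean_score` discards the spread of the realizations, which is exactly the information the abstract argues should be kept.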

[LG-50] Pure Exploration for a Good Policy in Reinforcement Learning with Bandit Feedback

Link: https://arxiv.org/abs/2605.23182
Authors: Zitian Li, Wang Chi Cheung
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:Pure exploration in episodic Reinforcement Learning has primarily focused on Best Policy Identification (BPI), which seeks to identify a (near)-optimal policy with high confidence. Motivated by practical settings where a "good enough" policy suffices, we study an alternate objective of Good Policy Identification (GPI). For a given reward threshold \mu_0, GPI only requires identifying a policy with expected reward in an episode at least \mu_0 if such a policy exists (positive instance), or declaring None if no such policy exists (negative instance). We formalize GPI under the fixed-confidence setting. We require the output to be correct with probability \geq 1-\delta, and seek to minimize the expected sample complexity, which is the expected number of episodes explored for the output. We propose a novel algorithm BEE-GPI, and derive theoretically-grounded upper bounds on its sample complexity for positive and negative instances. Notably, for positive instances, the coefficient of \log 1/\delta in our upper bound is O(H^2/(V^* - \mu_0)^2), where H is the episode length and V^* is the optimal expected reward in an episode. The coefficient does not depend on the action and state space sizes otherwise, in sharp contrast to the sample complexity in BPI. We further establish lower bound results to show the near-optimality of BEE-GPI and the necessity of the 1/(V^* - \mu_0)^2 term. Numerical experiments further validate the efficiency of our approach.

[LG-51] Any-Dimensional Invariant Universality

Link: https://arxiv.org/abs/2605.23156
Authors: Shengtai Yao, Eitan Levin, Mateo Díaz
Categories: Machine Learning (cs.LG); Functional Analysis (math.FA); Representation Theory (math.RT); Machine Learning (stat.ML)
*Comments:

Abstract:Several machine learning models are defined for inputs of any size, such as graphs with different numbers of nodes and point clouds containing varying numbers of points. The universality properties of such any-dimensional models remain poorly understood, as universality is traditionally studied for models accepting inputs of a fixed size, defined on a compact subset of their domain. In sharp contrast, any-dimensional models can be viewed as sequences of functions defined on growing-sized inputs, and it is not clear in which sense they can be universal. We develop a systematic approach to establish any-dimensional universality, by identifying any-dimensional functions with a unique function taking inputs in a suitable infinite-dimensional limit space containing inputs of all finite sizes as well as their limits. Using the symmetries of these inputs and relations between inputs of different sizes, we show that this limit space admits a natural topology with rich families of compact sets on which any-dimensional universality can be established. We illustrate our approach by showing that several existing architectures fail to be universal, and we propose simple modifications that restore universality.

[LG-52] Archimedean Copula Inference via Taylor-Mode AD

Link: https://arxiv.org/abs/2605.23134
Authors: Cambridge Yang, Dongdong Li
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:No existing nested Archimedean copula tool handles all three of (a) arbitrary per-variable (right-)censoring in survival analysis, (b) arbitrary nesting trees, and (c) exact parameter gradients. Existing implementations handle only bivariate problems, low-dimensional (i.e., d \leq 10) cases, two layers of nesting, or only hand-derived copula nestings. We present acopula, a JAX-native framework that, given any Archimedean generator – classical or neural – evaluates exact nested-copula likelihoods and parameter gradients under arbitrary censoring masks in polynomial time. The mechanism is polynomial powering of Taylor-mode automatic differentiation output, which replaces per-family hand-derived partial Bell polynomial tables with a single differentiable computation that any user-defined generator can drive. We conduct extensive simulations to verify the correctness of acopula. We then demonstrate (a) per-variable censoring on 85,229 MIMIC-IV ICU admissions in high dimensions with d=53, fit by both classical Archimedean families and nested neural Archimedean copulas; (b) an 11-sector hierarchical model on S&P 500 daily returns at d=98; (c) family-agnostic censored MLE across ten families, five of them with no prior implementation, on a retinopathy study; and (d) a ~650× per-density speedup over R’s nacLL at d=35, scaling quadratically to d=8,000.

[LG-53] When Determinants Are Not Enough: Private Rare Switching

Link: https://arxiv.org/abs/2605.23131
Authors: Xingyu Zhou
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:In this note, I would like to share a small research moment where Codex helped me find the right way to adapt rare switching to the private setting. The standard determinant-based update rule in linear bandits and RL works beautifully because the design matrix grows monotonically. But once Gaussian noise is added for privacy, this monotonicity can fail, and the usual analysis no longer goes through. The key reason is that determinant growth controls volume, while regret analysis needs control of the worst direction. To address this, Codex comes up with a different rare-switching rule based on the generalized Rayleigh quotient, which restores logarithmic policy updates and the desired confidence-width comparison up to a constant factor. I present my manually cleaned-up version of the proof here as well as some personal reflections on this example.

[LG-54] Robust OT-Guided Generative Residual Domain Adaptation for Bike-Sharing Demand Prediction under Temporal Domain Shift

Link: https://arxiv.org/abs/2605.23115
Authors: Yiming Ma
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments:

Abstract:Bike-sharing models trained on historical station-hour data may degrade when deployed in later years because travel patterns change over time. This paper studies March Citi Bike demand prediction from 2021 to 2026 as a temporal domain adaptation problem and proposes Gen-ROTDA, a robust optimal transport-guided residual domain adaptation framework. The method fits a target-domain station-time anchor with a small labeled target subset, transfers residual rather than raw demand, applies a deterministic label-preserving residual feature generator, and trims high-cost transport matches before training the final residual predictor. Experiments compare Gen-ROTDA with anchor-only, source-only, target-only, fine-tuning, MMD adaptation, Sinkhorn OTDA, ROTDA, and Gen-OTDA. Gen-ROTDA achieves the lowest MAE on the main 2025 to 2026 task and is the best OT-family method on average across multi-year tasks, although fine-tuning and MMD adaptation remain strong overall baselines. Under abnormal target-unlabeled records, Gen-ROTDA is much more stable than non-robust OT variants, suggesting that robust transport is useful for noisy temporal transfer in bike-sharing demand prediction.

[LG-55] Encrypted Neural Networks without Overflows

Link: https://arxiv.org/abs/2605.23096
Authors: Philipp Kern, Lorenzo Rovida, Samuel Teuber, Edoardo Manino, Carsten Sinz, Alberto Leporati
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*Comments: Preprint

Abstract:Fully homomorphic encryption (FHE) enables private inference by evaluating neural networks on encrypted data. In this way, we can delegate the computation to a third party server without ever revealing the user’s data. Currently, the CKKS scheme is the backbone of most efficient FHE implementations, but it only supports addition, multiplication, and array rotation operations, thus requiring all activation functions of the neural network to be approximated by polynomials within a certain interval, imposing strict design tolerances. In this paper, we demonstrate for the first time that this scheme is vulnerable to overflow attacks, i.e., seemingly benign inputs that can exceed such tolerances of the FHE circuit, thereby causing corrupt and unusable outputs. To avoid them, we propose a formal verification technique that computes certified bounds on the ranges of all neurons in the network. By construction, our method eliminates overflows and, in our experiments, removed observed overflows on all benchmarks, reducing failure rates from up to 47% to 0%. Moreover, our overflow-free solution is compatible with most CKKS-based frameworks, as it allows to simply substitute standard polynomials by polynomials with rigorously designed ranges.
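
One simple way to obtain certified ranges for neurons is interval arithmetic through each affine layer. The sketch below is an assumption for illustration and is not claimed to be the verification technique used in the paper. It propagates elementwise input bounds through y = Wx + b and checks whether the resulting pre-activations stay inside the interval on which a polynomial activation approximation is valid. All names and numbers are hypothetical.

```python
def affine_bounds(W, b, lo, hi):
    """Sound elementwise bounds on y = W x + b for any x with lo <= x <= hi.

    W is a list of rows, b a list of biases, lo and hi lists of input bounds.
    """
    out_lo, out_hi = [], []
    for row, bias in zip(W, b):
        low, high = bias, bias
        for w, a, c in zip(row, lo, hi):
            if w >= 0:
                # A non-negative weight maps the lower input to the lower output.
                low += w * a
                high += w * c
            else:
                # A negative weight swaps the roles of the input bounds.
                low += w * c
                high += w * a
        out_lo.append(low)
        out_hi.append(high)
    return out_lo, out_hi


def fits_interval(lo, hi, a, b):
    """True if every certified range [lo_i, hi_i] lies inside [a, b]."""
    return all(a <= l and h <= b for l, h in zip(lo, hi))
```

If `fits_interval` fails for the interval on which the polynomial was fitted, some admissible input could push a neuron outside that interval, which is the overflow condition the abstract describes.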

[LG-56] The Implicit Bias of Depth: From Neural Collapse to Softmax Codes

Link: https://arxiv.org/abs/2605.23087
Authors: Connall Garrod, Jonathan P. Keating, Christos Thrampoulidis
Categories: Machine Learning (cs.LG)
*Comments: 46 pages, 11 figures, accepted at the International Conference on Machine Learning 2026

Abstract:Neural collapse (NC) describes the structured geometry that emerges in the features and weights of trained classifiers. Recent theory suggests NC can be suboptimal in deep architectures, attributing this to an explicit low-rank bias from L2 regularization. We study the deep unconstrained feature model (UFM), equivalent to a deep linear network with orthogonal inputs, trained without regularization, to isolate how gradient descent and depth alone shape NC. We show that depth induces an implicit low-rank bias: low-rank matrices propagate norm more efficiently through successive multiplications, promoting low-rank alternatives to NC. These alternatives, we argue, correspond to softmax codes: max-margin solutions previously found in width-bottlenecked networks. Analyzing training dynamics under spectral initialization, we identify an early-time repulsion among singular values that drives low-rank emergence, and characterize how depth shrinks NC’s basin of attraction. Finally, we show that some effects act in the opposite direction: for randomly initialized networks, increasing width biases training toward higher-rank solutions. Our results provide the first asymptotic and dynamic characterization of implicit bias in deep UFMs trained with unregularized multiclass cross-entropy.

[LG-57] ThriftAttention: Selective Mixed Precision for Long-Context FP4 Attention

Link: https://arxiv.org/abs/2605.23081
Authors: Joe Sharratt
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:Efficient attention algorithms are critical to mitigate the quadratic cost of attention in long-context workloads. Prior work utilises block-scaled quantisation techniques on Blackwell GPUs to move attention computation to 4-bit precision to accelerate inference. However, these techniques result in significant quality degradation in long-context settings. We show that the output impact of quantisation error is highly non-uniform and increases with the importance of each query-key interaction, concentrating functionally relevant error in a small number of attention blocks that contain the most important tokens. We propose ThriftAttention, a low-bit attention variant that delivers near-FP16 long-context quality at FP4 inference efficiency. This approach proceeds in two stages. First, a heuristic rapidly selects a small number of important query-key block pairs for FP16 precision. Second, the selected blocks are computed in FP16 and the remaining blocks in FP4, with both paths merged via online softmax into a single output. We demonstrate across long-context benchmarks and model families that by computing only 5% of query-key blocks in FP16, ThriftAttention recovers on average 89.1% of the FP4-to-FP16 performance gap. We show ThriftAttention’s advantage grows with sequence length, mitigating the systematic FP4 quality degradation observed at longer contexts. The code is available at this https URL.
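
The merge step mentioned in the abstract relies on a standard property of online softmax: attention computed separately over two disjoint sets of keys can be combined exactly from each part's running maximum, normalizer, and unnormalized output. The sketch below illustrates only that property for a single query in ordinary double precision; it does not model FP4 or FP16 arithmetic or the block-selection heuristic, and all names are hypothetical.

```python
import math


def partial_attention(scores, values):
    """Attention statistics over one subset of keys for a single query.

    Returns (max score, sum of exp(score - max), unnormalized weighted value sum).
    """
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    acc = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return m, denom, acc


def merge(part_a, part_b):
    """Combine two partial results by rescaling both to a shared maximum."""
    m_a, l_a, acc_a = part_a
    m_b, l_b, acc_b = part_b
    m = max(m_a, m_b)
    scale_a, scale_b = math.exp(m_a - m), math.exp(m_b - m)
    denom = scale_a * l_a + scale_b * l_b
    acc = [scale_a * x + scale_b * y for x, y in zip(acc_a, acc_b)]
    return m, denom, acc


def finalize(part):
    """Normalize the accumulated output by the softmax denominator."""
    _, denom, acc = part
    return [x / denom for x in acc]
```

In a mixed-precision setting, one call to `partial_attention` would stand in for the high-precision blocks and the other for the low-precision blocks. The merge itself introduces no approximation beyond whatever error each path already carries.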

[LG-58] The Attribution Contract: Feature Attribution for Generative Language Models

Link: https://arxiv.org/abs/2605.23080
Authors: Giang Nguyen
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:Feature attribution methods promise to identify which input features matter for a model output. In generative language models, however, it is often unclear what should count as a feature in the first place. In autoregressive language models, earlier generated tokens are both outputs of the model and inputs to later predictions. In diffusion language models, generation proceeds through iterative denoising or unmasking rather than fixed left-to-right prediction, so local explanation may target a state of diffusion rather than a next token. We argue that this ambiguity is not merely an implementation detail, but a conceptual limitation of carrying classifier-era feature attribution directly into generative language modeling. We introduce the Attribution Contract, a specification for feature-attribution claims that names what output is being explained, which features are eligible to receive attribution, what generative process is assumed, what is held fixed, and what model score is being attributed. The contract clarifies why the same attribution method can answer different questions depending on how it is instantiated. We argue that many disagreements about feature attribution in generative language models are not disagreements about attribution algorithms, but about unstated explanatory contracts. Using autoregressive and diffusion language models as case studies, we show when attribution to earlier generated tokens, intermediate states, or denoising stages is informative, when it is misleading, and why feature-attribution methods in generative language models should be evaluated as method-contract pairs.

[LG-59] Orbax: Distributed Checkpointing with JAX

Link: https://arxiv.org/abs/2605.23066
Authors: Colin Gaffney, Shutong Li, Daniel Ng, Anastasia Petrushkina, Niket Kumar, Adam Cogdell, Mridul Sahu, Yaning Liang, Nikhil Bansal, Justin Pan, Angel Mau, Abhishek Agrawal, Marco Berlot, Ruoxin Sang, Kiranbir Sodhia, Rakesh Iyer
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*Comments: 18 pages, 5 tables, 6 figures

Abstract:In a landscape of high-performance distributed ML systems, JAX has emerged as a framework of choice. However, JAX’s modular design philosophy leaves it without a standardized checkpointing solution. In this paper, we introduce Orbax, a modular, JAX-native checkpointing library that abstracts the complexities of distributed accelerator systems while also providing flexibility for user-friendly checkpoint manipulations throughout the ML model lifecycle. We demonstrate performance exceeding comparable PyTorch competitors by up to 3.5× for saving and 2× for loading. The library is available at this https URL.

[LG-60] Steered Generation via Gradient-Based Optimization on Sparse Query Features

Link: https://arxiv.org/abs/2605.23040
Authors: Sumanta Bhattacharyya, Pedram Rooshenas
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Latent steering exploits internal representations of Large Language Models (LLMs) to guide generation, yet interventions on dense states can entangle distinct semantic features. In this paper, we investigate attention query activations as a high-fidelity site for precise control, hypothesizing that manipulating the attention mechanism itself offers sharper steerability than general state interventions. We introduce Prototype-Based Sparse Steering, a framework that applies Sparse Autoencoders (SAEs) to query activations to decompose them into interpretable features, then applies gradient-based optimization during inference to align the sparse representation with class prototypes of target behaviors. To validate this architectural insight, we first analyze the mechanism in Textualized Gridworld, a controlled environment for verifiable planning constraints. We demonstrate that optimizing sparse query features enables effective navigation of rigid planning requirements (i.e., safe vs. short paths), confirming the method’s ability to satisfy objective rules. We then demonstrate the framework’s versatility by training SAEs on a high-dimensional educational domain, where the framework steers the cognitive complexity of feedback (i.e., Bloom’s Taxonomy). Our experiments establish that sparse query representations provide the necessary disentanglement for unified, interpretable control over both logical planning and stylistic nuance.

[LG-61] Open Multimodal Datasets and Open-Source Software for Data-Driven Modeling of Multiphase Transport and Thermal Systems

Link: https://arxiv.org/abs/2605.23037
Authors: Christy Dunlap, Hari Pandey, Stephen Pierson, Daniel Curl, Braden Stevens, Mohammad Ishraq Hossain, Annapurna Parjuli, Chinmaya Joshi, Han Hu
Categories: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
Comments: 23 pages, 7 figures

Abstract:Data-driven modeling is becoming central to multiphase transport, electronics cooling, acoustic diagnostics, and thermal-fluid digital twins, but progress is limited by fragmented datasets and raw instrument files that are difficult to decode, reuse, or benchmark. This paper presents an open ecosystem of multimodal datasets and open-source software packages developed by the Nano Energy and Data-Driven Discovery (NED3) Laboratory for reproducible AI-enabled thermal-fluid research. We introduce a spatial-plus-temporal dimensionality framework, denoted S+TD, to classify datasets by the dimensionality of measured or simulated fields, including 0+0D point values, 0+1D time series, 1+0D profiles, 2+0D images, 2+1D videos, 3+0D volumetric fields, and multimodal combinations. We organize public NED3 datasets spanning boiling images, acoustic and thermal measurements, high-speed videos, infrared thermography, thermal-resistance measurements, CFD-generated fields, design files, and acoustic-emission data. We also describe complementary software packages, including BubbleID, SeqReg, CFDTwin, IRISApp, decode-wfs, AELab, and FlowLab, which support computer vision, sequence regression, surrogate modeling, infrared analysis, waveform decoding, acoustic-emission analysis, and multimodal diagnostics. Particular emphasis is placed on SeqReg, a general sequence-regression library for 0+1D, 1+1D, and 2+1D data, with applications such as nonintrusive heat-flux estimation. Finally, we discuss future community efforts to build interoperable thermal-fluid databanks and curated AI/ML tool libraries that connect datasets, metadata, decoders, baselines, benchmarks, and physically interpretable models.
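
The S+TD naming scheme described in the abstract can be illustrated with a small helper. This is only a sketch: the function names `std_label` and `describe` and the dictionary `STD_EXAMPLES` are hypothetical and are not taken from the NED3 software packages; only the label format and the example modalities come from the abstract.

```python
# Example modalities named in the abstract, keyed by (spatial dims, temporal dims).
STD_EXAMPLES = {
    (0, 0): "point values",
    (0, 1): "time series",
    (1, 0): "profiles",
    (2, 0): "images",
    (2, 1): "videos",
    (3, 0): "volumetric fields",
}


def std_label(spatial_dims: int, temporal_dims: int) -> str:
    """Build an S+TD label such as '2+1D' (hypothetical helper)."""
    if not 0 <= spatial_dims <= 3 or temporal_dims not in (0, 1):
        raise ValueError("expected 0-3 spatial dims and 0 or 1 temporal dims")
    return f"{spatial_dims}+{temporal_dims}D"


def describe(spatial_dims: int, temporal_dims: int) -> str:
    """Attach the abstract's example modality to the label, if one is listed."""
    label = std_label(spatial_dims, temporal_dims)
    example = STD_EXAMPLES.get(
        (spatial_dims, temporal_dims), "no example listed in the abstract"
    )
    return f"{label} ({example})"


print(describe(2, 1))  # a high-speed video: two spatial dims plus time
print(describe(0, 1))  # a single-sensor time series
```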

[LG-62] World Machine: Towards Generative World Modeling for Time-Series

Link: https://arxiv.org/abs/2605.23025
Authors: Elton Cardoso do Nascimento, Alexandre da Silva Simões, Esther Luna Colombini, Ricardo Ribeiro Gudwin, Paula Dornhofer Paro Costa
Categories: Machine Learning (cs.LG)
Comments:

Abstract:World models represent a paradigm shift in generative AI, pursuing predictive understanding and controllable simulation of environments in a structured and generalizable way. We present World Machine, a generative world-modeling architecture for time series. It is a transformer-based architecture with latent states that enables adaptation to different amounts of observed data and contexts. This shows an improvement over traditional transformers, which have a computational and memory cost that scales quadratically with the context. Experiments on a proposed synthetic dataset, Toy1D, validate the approach’s feasibility, demonstrate capabilities not found in conventional transformers, and highlight the contributions of each component of the training protocol.

[LG-63] PACE: Two-Timescale Self-Evolution for Small Language Model Agents

Link: https://arxiv.org/abs/2605.23019
Authors: Chen Ling, Pei Chen, Albert Guan, Jiaming Qu, Shayan Ali Akbar, Madhu Gopinathan, Erwin Cornejo
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Deploying language-model agents in production often requires substantial compute and human effort to tune prompts, parsers, validators, and other components of the agent pipeline. Self-evolution offers a promising alternative, but most existing frameworks assume access to frontier models that can reliably diagnose failures, propose revisions, and judge their own updates. We study whether frozen small language models (SLMs) can serve as effective self-evolving agents under resource constraints. We propose PACE (Prompt And Control Logic Evolution), a two-timescale framework that coordinates low-risk prompt refinement with higher-risk control-logic updates. PACE evolves prompts under fixed control logic until prompt-level gains saturate, then considers constrained control-logic updates that are accepted through held-out validation. Across three frozen SLM backbones ranging from 4B to 14B parameters and four controlled benchmarks, PACE achieves the best performance on all 12 backbone–benchmark combinations, improving over vanilla SLM agents by up to 9.2% (relative) and over the stronger single-mode evolution baseline by up to 5.4% (relative). A tau-bench case study further shows that PACE improves multi-turn tool-use success over vanilla and prompt-only evolution. These results suggest that reliable SLM agent self-evolution is possible without updating model weights or relying on frontier-model teachers, and that the key benefit is not any single final solver pattern but autonomous, validated discovery of task-appropriate inference strategies.
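
The two-timescale schedule described in the abstract (refine prompts until gains saturate, then try a control-logic update gated by held-out validation) might be sketched as below. Everything here is an illustrative assumption rather than the authors' implementation: the function `two_timescale_evolve`, its callable arguments, the `patience` saturation rule, and the toy integer "prompt" and "logic" states are all hypothetical.

```python
def two_timescale_evolve(prompt, logic, propose_prompt, propose_logic,
                         train_score, heldout_score, rounds=3, patience=2):
    """Hypothetical sketch: a fast prompt loop nested inside a slow control-logic loop."""
    for _ in range(rounds):
        # Fast timescale: refine the prompt while the control logic stays fixed.
        # "Saturation" is approximated as `patience` consecutive non-improving proposals.
        stale = 0
        while stale < patience:
            candidate = propose_prompt(prompt, logic)
            if train_score(candidate, logic) > train_score(prompt, logic):
                prompt, stale = candidate, 0
            else:
                stale += 1
        # Slow timescale: accept a control-logic change only if held-out score improves.
        new_logic = propose_logic(prompt, logic)
        if heldout_score(prompt, new_logic) > heldout_score(prompt, logic):
            logic = new_logic
    return prompt, logic


# Toy stand-ins: integers play the role of prompts and control logic.
best_prompt, best_logic = two_timescale_evolve(
    prompt=0,
    logic=0,
    propose_prompt=lambda p, l: p + 1,
    propose_logic=lambda p, l: l + 1,
    train_score=lambda p, l: -abs(p - 3),    # prompt quality peaks at 3
    heldout_score=lambda p, l: -abs(l - 2),  # logic quality peaks at 2
)
print(best_prompt, best_logic)
```

In this toy run the third proposed logic update is rejected because it lowers the held-out score, which is the gating behavior the abstract attributes to the slow timescale.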

[LG-64] Smoothed Elicitation Complexity for Approximate Γ-calibration of Discrete Classification Tasks

Link: https://arxiv.org/abs/2605.23017
Authors: Jessica Finocchiaro, Victor Ganson, Drona Khurana
Categories: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
Comments: Working paper

Abstract:One prominent method of evaluating machine learning model trustworthiness is the notion of calibration. In the binary outcome setting, a probabilistic predictor is calibrated if outcomes are realized according to a model’s distributional prediction, conditioned on this prediction. Straightforward extensions of binary calibration definitions to probabilistic multiclass classifiers suffer from an exponential complexity blowup as the space of predictions grows exponentially in the number of classes n. As a remedy, Noarov and Roth (2023) propose multiclass calibration with predictions that are properties of the outcome distribution, reducing complexity from growing in the number of classes n to the dimension d of the property, called its elicitation complexity. Previous work on approximate property calibration is generally limited to continuous scalar properties, despite many relevant properties of interest being discrete, like the mode or rankings. We characterize the approximate property calibration of discrete properties which are strongly orderable by using Lipschitz continuous properties as an intermediary. This work is the first to our knowledge to provide approximate calibration results for discrete properties. Along the way, we characterize the Lipschitz elicitation complexity of strongly orderable discrete properties by constructing algorithms for designing these Lipschitz properties, which we prove can be post-processed to obtain the original discrete property.
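
The binary notion of calibration mentioned in the abstract (outcomes occur with the predicted frequency, conditioned on the prediction) can be checked empirically as in the sketch below. This covers only the standard binary case; the function name `binary_calibration_error` and the L1 weighting are illustrative choices, not the paper's property-calibration definition.

```python
from collections import defaultdict


def binary_calibration_error(predictions, outcomes):
    """Weighted average of |empirical frequency - predicted probability|,
    grouping samples by their exact predicted value."""
    groups = defaultdict(list)
    for p, y in zip(predictions, outcomes):
        groups[p].append(y)
    n = len(predictions)
    return sum(
        len(ys) / n * abs(sum(ys) / len(ys) - p) for p, ys in groups.items()
    )


# A predictor that says 0.5 and is right half the time is calibrated on this sample.
calibrated = binary_calibration_error([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])
# A predictor that says 0.9 but never sees a positive outcome is badly miscalibrated.
miscalibrated = binary_calibration_error([0.9, 0.9], [0, 0])
print(calibrated, miscalibrated)
```

With n classes, the number of distinct probability vectors a predictor could output grows exponentially in n, which is the blowup that motivates calibrating low-dimensional properties instead.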

[LG-65] Learned Relay Representations for Forward-Thinking Discrete Diffusion Models

Link: https://arxiv.org/abs/2605.22967
Authors: Benjamin Rozonoyer, Jacopo Minniti, Dhruvesh Patel, Neil Band, Avishek Joey Bose, Tim G. J. Rudner, Andrew McCallum
Categories: Machine Learning (cs.LG)
Comments: 16 pages, 3 figures. Equal contribution: Benjamin Rozonoyer, Jacopo Minniti, and Dhruvesh Patel. Code: this https URL

Abstract:When Masked Diffusion Models (MDMs) generate sequences through iterative refinement, the rich internal computation over masked positions is discarded, forcing every subsequent refinement step to recompute the valuable internal information stored as model representations. To avoid a hard reset between denoising rounds, we propose Learned Relay Representations (Relay), a method that allows MDMs to be forward-thinking when denoising by explicitly learning how to propagate latent information for the benefit of future denoising steps. Relay introduces a differentiable per-token channel that passes information between forward passes and is trained via truncated backpropagation through time (BPTT). We show that this framework can be scaled to state-of-the-art Diffusion Language Models (DLMs), and is seamlessly compatible with techniques like block diffusion and KV caching. We first provide a thorough justification of the design choices in Relay on a challenging Sudoku-based planning task. We then scale Relay to Fast-dLLM v2, a state-of-the-art DLM, outperforming standard supervised finetuning on coding tasks while reducing inference latency by up to 32%. Our empirical results demonstrate that state-of-the-art DLMs can be explicitly trained to relay latent information forward across decoding steps, advancing the performance-latency Pareto frontier. We provide code for all our experiments.

[LG-66] Certification from Examples is Hard for Circuits and Transformers under Minimal Overparametrization

Link: https://arxiv.org/abs/2605.22964
Authors: Artur Back de Luca, Kimon Fountoulakis
Categories: Machine Learning (cs.LG)
Comments: 38 pages, 5 figures

Abstract:As state-of-the-art neural networks are deployed on reasoning and algorithmic tasks, exactness guarantees become increasingly important. However, high average-case accuracy can still mask inconsistent behaviors. This motivates exact certification, which asks for the smallest set of labeled examples needed to certify that a learned hypothesis equals the target. We show that while some hypotheses are easy to certify, even minimal overparametrization can make certification exponentially hard across several hypothesis classes. For threshold circuits of depth ≥ 2, adding a single extra gate can force certificate sizes exponential in the input dimension. We show an analogous hardness result for log-precision Transformers with only constant architectural overhead. We also characterize approximate certification, showing that allowing only polynomially many mistakes still requires exponentially large certificates, whereas constant relative-error guarantees can hide exponentially many mistakes. Empirically, we study certification for constructed circuits and trained Transformers for recognizing binary addition. While the constructed circuits instantiate the exponential barrier for certification, the trained Transformer analysis shows that imperfect models can evade detection by large uniformly sampled certificate candidates.

[LG-67] FederatedRSF: Federated Random Survival Forests for Partially Overlapping Medical Data

Link: https://arxiv.org/abs/2605.22954
Authors: Maryam Moradpour, Jonas Harriehausen, Amirreza Aleyasin, Lion Philipp Wolf, Youngjun Park, Anne-Christin Hauschild
Categories: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments: 4 pages, 2 figures. Maryam Moradpour, Jonas Harriehausen, and Amirreza Aleyasin contributed equally to this work. Includes supplementary material

Abstract:Multi-center survival prediction can improve robustness and generalizability, yet privacy regulations and institutional governance often prevent pooling patient-level clinical and genomic data across institutions. In practice, deployment is further complicated by feature-space heterogeneity, in which sites collect different covariates or use different sequencing panels, resulting in only partially overlapping feature sets. We present FederatedRSF, a Python package that implements federated random survival forests, aggregating locally trained survival trees and redistributing only feature-compatible trees to each site, enabling inference with partial overlap without sharing raw data. We evaluate FederatedRSF on the GBSG2 breast cancer cohort distributed with the scikit-survival package, simulating feature heterogeneity across clients by withholding subsets of features, and assessing discrimination using Harrell’s concordance index (C-Index) under repeated cross-validation and site-splits. The results demonstrated that the federated model can achieve performance comparable to that of the centralized training setting.
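
For readers unfamiliar with the evaluation metric named in the abstract, the following is a minimal pure-Python sketch of Harrell's concordance index. It is not the FederatedRSF or scikit-survival implementation; the function name `harrell_c_index` is hypothetical, and pairs with tied event times are simply skipped, which is a simplifying assumption.

```python
from itertools import combinations


def harrell_c_index(times, events, risk_scores):
    """Fraction of comparable pairs in which the subject with the shorter
    observed time has the higher predicted risk (risk ties count as 0.5)."""
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Let `a` be the subject with the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # A pair is comparable only if the earlier time is an observed event
        # (not censored) and the two times differ.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5
    return concordant / comparable


times = [1, 2, 3, 4]
events = [1, 1, 1, 1]                                    # 1 = event observed, 0 = censored
perfect = harrell_c_index(times, events, [4, 3, 2, 1])   # higher risk fails earlier
inverted = harrell_c_index(times, events, [1, 2, 3, 4])  # ranking exactly reversed
print(perfect, inverted)
```

A value near 0.5 corresponds to random ranking, so comparing federated and centralized models on this scale shows how much discrimination is lost by not pooling data.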

[LG-68] FIRMA: FIbonacci Ring Model Aggregation for Privacy-preserving Federated Learning

Link: https://arxiv.org/abs/2605.22898
Authors: Rachid Hedjam
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Federated learning protocols face a structural trilemma: canonical server-based aggregation (McMahan et al., 2017) creates a single point of failure and gradient inversion risk; decentralised ring-gossip alternatives (Hu et al., 2019) expose classification heads to semi-honest peers via uninformed uniform weights; and personalised methods (Collins et al., 2021) reintroduce central aggregation. No existing protocol simultaneously achieves server-free operation, permanently private heads, ring topology, and principled asymmetric neighbour weighting. We propose FIRMA (FIbonacci Ring Model Aggregation), a family of three progressively enhanced federated learning protocols: 1) FibFL establishes the foundation: server-free ring aggregation with Fibonacci-weighted neighbour blending and permanently private classification heads. 2) FibFL+ augments this with accuracy-gated neighbour suppression, selectively down-weighting poorly-converged peers while preserving the Fibonacci directional bias. 3) FibFL++, the full system, completes the family with a 2-opt ring permutation that maximises adjacent-client class diversity, global ring coverage via K_g = ⌈N/2⌉ gossip passes, and cosine-annealed self-retention calibration. We establish a convergence rate bound and three supporting propositions governing normalisation, coverage, retention, and diversity optimality. Systematic experiments across 28 configurations (four benchmarks crossed with seven heterogeneity regimes) demonstrate that FibFL++ surpasses FedAvg in all 12 label-skew configurations, with a peak advantage of +20.7 pp on CIFAR-10 at K = 1. Under Dirichlet heterogeneity, FibFL++ is the Pareto-dominant method among all server-free protocols, achieving the highest accuracy in 17 of 28 configurations.

[LG-69] From Residuals to Reasons: LLM-Guided Mechanism Inference from Tabular Data

Link: https://arxiv.org/abs/2605.22897
Authors: Mohammad R. Rezaei, Rahul G. Krishnan
Categories: Machine Learning (cs.LG)
Comments:

Abstract:A persistent challenge in machine learning for scientific applications is jointly achieving prediction and understanding. Statistical models excel on structured data but operate as black boxes, while existing interpretability methods are largely inspective: they answer “which features matter?” but do not articulate how features interact or refine explanations iteratively alongside human understanding. Asking an LLM to predict the target directly forces it to search the entire output space; we instead anchor predictions with a base model and ask the LLM the narrower question of what that model is missing. We introduce Multi-Agent Residual In-Context Learning (MARICL), an agentic framework in which LLM agents analyze where a base model fails, hypothesize missing structure from high-residual examples provided in context, and produce explicit correction terms refined through multi-turn textual gradient optimization. Across nine benchmarks spanning scientific, biomedical, socioeconomic, and synthetic settings, MARICL improves consistently over its base model on all datasets. To test whether these corrections reflect real structure or batch-specific noise, we freeze formulas learned on one experimental batch of the Cell-Free Protein dataset and apply them (with no retraining and no further LLM calls) to held-out batches. Within the same reagent protocol, the frozen formulas improve predictions in over 92% of cases; across a different protocol, they fail systematically. The success boundary aligns with the biochemistry, not the batch count, which is direct evidence of mechanistic generalization.

[LG-70] SCRIPT: Scalable Diffusion Policy with Multi-stage Training for Language-driven Physics-Based Humanoid Control

Link: https://arxiv.org/abs/2605.22894
Authors: Jingyan Zhang, Han Liang, Ruichi Zhang, Bin Li, Juze Zhang, Xin Chen, Jingya Wang, Lan Xu, Jingyi Yu
Categories: Graphics (cs.GR); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Project page: this https URL

Abstract:Controlling physics-based humanoids from natural-language instructions is a critical step toward general-purpose embodied agents. However, existing methods remain constrained by a tension between semantic expressiveness and physical feasibility, often failing to jointly achieve faithful instruction following, high-quality motion, and stable long-horizon control. We propose SCRIPT, a scalable diffusion policy with a multi-stage training framework for language-driven physics-based humanoid control. The core of SCRIPT is a Joint Action-State-Text Diffusion Transformer (JAST-DiT), which represents actions, physical states, and text as dedicated token streams and couples them through joint attention, enabling direct interaction between language semantics and control dynamics. To stabilize autoregressive control, we introduce a nonlinear history conditioning mechanism, which preserves the dense recent context and samples increasingly sparse cues from long-term history. Beyond supervised imitation pre-training, we propose a post-training stage, further improving the performance using Reinforcement Learning with Hybrid Rewards (RLHR). By injecting learnable noise into the flow-sampling process, RLHR effectively improves motion quality and instruction following within closed-loop simulations using hybrid physical feedback and text rewards. Quantitative evaluations demonstrate that SCRIPT outperforms prior state-of-the-art methods, with gains across text alignment, motion quality, and physical realism metrics. Furthermore, scaling studies on the 1200-hour MotionMillion dataset demonstrate consistent performance gains with model scaling, highlighting SCRIPT’s robust scalability for large-scale pre-training. Our code will be publicly available for future research.

[LG-71] Pointwise Metrics Mislead: An Evaluation Protocol for Multimodal Inverse Problems

Link: https://arxiv.org/abs/2605.22891
Authors: Mads H. Baattrup, Jörn Bach, Laurids Jeppe, Finn Labe, Alexander Grohsjean, Christian Schwanenberger, Peer Stelldinger
Categories: Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)
Comments: 29 pages, 9 figures, and 8 tables (including appendix)

Abstract:Evaluation in scientific reconstruction is dominated by pointwise metrics - RMSE, MAE, per-event resolution - under the implicit assumption that lower error means better reconstruction. We show that this assumption fails structurally for inverse problems with multimodal posteriors. By the law of total variance, point estimators trained to minimize MSE or MAE produce a marginal spectrum strictly narrower than the truth whenever the posterior has nonzero width. The resulting bias is independent of architecture, training, and dataset size, and it compresses precisely the spectral features - tails, modes, shapes - that downstream scientific measurements rely on. We propose a three-part evaluation protocol where each step targets a failure mode the others miss: per-event distributional accuracy via CRPS, population-level marginal accuracy via a spectrum-fidelity diagnostic, and uncertainty trustworthiness via coverage-based calibration. On a synthetic benchmark with an analytic posterior and on a realistic many-to-one inverse problem from particle physics, model rankings reverse between pointwise and distributional metrics, and calibration further separates architectures indistinguishable under CRPS. The evaluation protocol, not the model, determines the scientific conclusion.
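
The variance-compression argument can be reproduced numerically. The sketch below uses a made-up toy inverse problem (the observation x reveals the magnitude of the target y but not its sign, with an assumed 70/30 split between modes); it is not the paper's benchmark. It compares the spread of the true targets with the spread of the MSE-optimal point estimate E[y | x], and checks the law of total variance, Var(y) = Var(E[y | x]) + E[Var(y | x)].

```python
import random
import statistics

rng = random.Random(0)
P_POS = 0.7  # assumed probability of the positive mode (toy choice)

xs, ys = [], []
for _ in range(20000):
    magnitude = rng.uniform(1.0, 2.0)
    sign = 1.0 if rng.random() < P_POS else -1.0
    xs.append(magnitude)          # observation: magnitude only
    ys.append(sign * magnitude)   # truth: bimodal given x

# Analytic posterior for this toy: y = +x with prob P_POS, -x otherwise.
mean_factor = 2 * P_POS - 1
posterior_mean = [mean_factor * x for x in xs]               # MSE-optimal point estimate
posterior_var = [(1 - mean_factor ** 2) * x * x for x in xs]

var_truth = statistics.pvariance(ys)
var_point = statistics.pvariance(posterior_mean)
mean_posterior_var = statistics.fmean(posterior_var)

print(f"Var(truth)           = {var_truth:.3f}")
print(f"Var(point estimates) = {var_point:.3f}")
print(f"E[Var(y|x)]          = {mean_posterior_var:.3f}")
```

The point-estimate spectrum is far narrower than the truth, and the missing variance is exactly the average posterior width, which no choice of architecture can recover from a single-valued prediction.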

[LG-72] Resilience Characterization of AI-Native Wireless Receivers via Persistent Homology

Link: https://arxiv.org/abs/2605.22886
Authors: Christo Kurisummoottil Thomas, Emilio Calvanese Strinati
Categories: Information Theory (cs.IT); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments:

Abstract:AI-native wireless receivers based on deep learning exhibit remarkable performance under stationary channel conditions, yet their resilience to distributional shifts remains poorly characterized by conventional metrics such as bit error rate (BER). To overcome these limitations, this paper proposes a novel real-time metric, the Topological Resilience Index (TRI), grounded in persistent homology and persistence exponents. TRI quantifies the structural stability of a neural network receiver’s parameter space during online adaptation to non-stationary channels. Specifically, TRI captures resilience through three complementary dimensions: (i) validation-loss resilience measuring model-channel mismatch, grounded in the topological persistence of loss-landscape sublevel sets; (ii) channel impulse response (CIR) distribution shift, tracking geometric drift of CIR vectors from the calibration reference distribution; and (iii) channel manifold topology, quantified by the spectral gap of the Gaussian kernel matrix normalized by the Ollivier-Ricci curvature norm. We establish theoretical guarantees showing that TRI is bounded, monotonic under performance degradation, and Lipschitz-stable with respect to perturbations in channel distributions measured in Wasserstein distance. Simulation results for an OFDM deep-learning receiver adapting across ten ITU-R inter-environment transitions at three shift rates demonstrate that TRI provides a consistent mean warning lead of more than one OFDM symbol over gradient-norm and validation-loss baselines, whereas the gradient-norm baseline achieves zero lead in every scenario. Furthermore, the proposed TRI-guided burst re-adaptation reduces post-shift BER by 80% relative to no adaptation within 200 OFDM symbols.

[LG-73] WeCon: An Efficient Weight-Conditioned Neural Solver for Multi-Objective Combinatorial Optimization Problems

Link: https://arxiv.org/abs/2605.22876
Authors: Xuan Wu, Jinbiao Chen, Yang Li, Lijie Wen, Chunguo Wu, Yuanshu Li, Yubin Xiao, Chunyan Miao, You Zhou, Di Wang
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Existing neural solvers for Multi-Objective Combinatorial Optimization Problems (MOCOPs) commonly adopt decomposition-based strategies that scalarize an MOCOP into multiple subproblems associated with distinct weight vectors. However, they either inject weights only once during decoding, limiting weight-conditioned context modeling, or primarily during encoding, causing weight-signal dilution during decoding. Moreover, preference optimization methods rely on purely random sampling to construct solution pairs for training solvers, which often produces less informative pairs and thus leads to low training effectiveness. To better address these limitations, we propose an efficient Weight-Conditioned neural solver (WeCon). Specifically, we design an encoder layer with three attention blocks and our proposed Gated Residual Fusion (GRF) block to facilitate harmonious interaction between instance features and weights, thereby generating informative weight-conditioned context. We further introduce a plug-and-play Residual Fusion (RF) block in the decoder to alleviate weight-signal dilution. Finally, we propose Efficient Preference Optimization (EPO), which constructs high-quality solutions, thereby generating more informative pairs to improve training effectiveness. Experiments on four MOCOP variants across different problem scales and distribution patterns demonstrate that WeCon achieves HyperVolume (HV) values comparable to SOTA solver POCCO-W, while reducing inference time by 40%. Ablation studies validate the contributions of all designs.
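
The decomposition idea mentioned at the start of the abstract, in which one multi-objective problem becomes several single-objective subproblems indexed by weight vectors, can be sketched as follows. Weighted-sum scalarization is assumed here purely for illustration (the abstract does not say which scalarization WeCon uses), and the brute-force `min` over a fixed candidate list stands in for a neural solver.

```python
def weighted_sum(objectives, weights):
    """Scalarize a vector of objective values (all minimized) with a weight vector."""
    return sum(w * f for w, f in zip(weights, objectives))


def solve_subproblems(candidates, weight_vectors):
    """For each weight vector, pick the candidate with the lowest scalarized cost."""
    return [
        min(candidates, key=lambda f: weighted_sum(f, w)) for w in weight_vectors
    ]


# Toy bi-objective instance: each tuple holds the two objective values of one solution.
candidates = [(1.0, 9.0), (3.0, 4.0), (8.0, 1.0), (7.0, 7.0)]
weight_vectors = [(1.0, 0.0), (0.5, 0.5), (0.0, 1.0)]

front = solve_subproblems(candidates, weight_vectors)
print(front)
```

Each weight vector selects a different trade-off, and the dominated solution (7.0, 7.0) is never chosen; a weight-conditioned solver has to produce this kind of weight-dependent output from a single model, which is why where the weights are injected matters.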

[LG-74] FuRA: Full-Rank Parameter-Efficient Fine-Tuning with Spectral Preconditioning

Link: https://arxiv.org/abs/2605.22869
Authors: Yequan Zhao, Ruijie Zhang, Liyan Tan, Niall Moran, Tong Qin, Zheng Zhang
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Both full fine-tuning (Full FT) and parameter-efficient fine-tuning methods such as LoRA introduce weight updates without accounting for the spectral structure established during pretraining. As a result, noisy gradients from limited fine-tuning data can perturb robust pretrained features. We identify spectral preconditioning as the missing ingredient: reparameterizing each weight matrix through its full-rank singular value decomposition (SVD) and freezing one singular basis constrains updates to the pretrained column space, yielding a preconditioned optimization scheme that outperforms unconstrained Full FT at the same trainable parameter count. Building on this insight, we propose FuRA (Full-Rank Adaptation), an efficient full-rank adaptation framework based on a block tensor-train factorization W = LSR, where the large core L is fixed to the pretrained block-wise SVD basis, while only the compact core R and the block-wise singular values S are optimized. This design simultaneously provides full-rank spectral preconditioning, preserves full-rank update expressivity, and achieves parameter, memory, and step-time efficiency comparable to LoRA. FuRA consistently outperforms Full FT across multiple settings, including LLM fine-tuning (+1.37 on LLaMA-3-8B commonsense reasoning), LLM reinforcement learning for mathematical reasoning, and visual instruction tuning for VLMs. Furthermore, the 4-bit quantized variant, QFuRA, also surpasses QLoRA. Code is available at this https URL
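
The claim that freezing one factor confines updates to the pretrained column space follows in one step from the factorization W = LSR stated in the abstract. In the sketch below, S' and R' denote the fine-tuned values of the trainable factors; this primed notation is introduced here for illustration and is not taken from the paper.

```latex
\begin{aligned}
W &= L\,S\,R, \qquad W' = L\,S'\,R' \quad (L \text{ frozen}) \\
\Delta W &= W' - W = L\,\bigl(S'R' - S R\bigr) \\
&\Rightarrow\ \operatorname{col}(\Delta W) \subseteq \operatorname{col}(L)
\end{aligned}
```

Every column of the update is a linear combination of the columns of L, so the update acts only through the directions stored in the frozen pretrained basis.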

[LG-75] FusionSense: Tri-Stage Near-Sensor Learning for Runtime-Adaptive Multimodal Edge Intelligence

Link: https://arxiv.org/abs/2605.22868
Authors: Sanggeon Yun, Ryozo Masukawa, Minhyoung Na, Hyunwoo Oh, Yoshiki Yamaguchi, Wenjun Huang, SungHeon Jeong, Mohsen Imani
Categories: Machine Learning (cs.LG)
Comments: Accepted to ISLPED 2026

Abstract:Autonomous systems and smart-industry deployments increasingly split computation across near-sensor, edge, and cloud resources, where tight energy, latency, and reliability budgets demand run-time adaptivity. In practice, deciding what to compute and transmit at each point is pivotal; yet as multimodal sensor suites (cameras, LiDAR/depth, etc.) proliferate at the edge, most prior approaches either (i) fuse modalities on powerful servers or (ii) apply uni-modal near-sensor filters that ignore cross-modal dependencies, leading to redundant transmissions or missed events. We present FusionSense, a fusion-aware intelligent sensing framework for energy-constrained autonomous edge systems. Lightweight near-sensor classifiers are trained via a three-step procedure: (i) a server-side fusion model learns the downstream task, (ii) filter-out-safe (FoS) labels quantify each modality’s necessity relative to the fused decision, and (iii) an edge-side fusion model is compacted by injecting near-sensor predictions as auxiliary signals. The result is a run-time decision layer that jointly reduces compute and communication while scaling linearly with sensor count. On a dual-modality (RGB+Depth/LiDAR) setup with SynDrone, FusionSense sustains task quality at substantially higher data-reduction rates than uni-modal filters and delivers large end-to-end gains: up to 33x lower energy at 1% FoI prevalence, 11x at 10%, a 92.3% reduction in quality loss at a fixed 30% data reduction, and roughly 1.5x higher energy savings than the best prior filtering baseline.

[LG-76] Reading Calibrated Uncertainty from Language Model Trajectories

Link: https://arxiv.org/abs/2605.22864
Authors: Aliai Eusebi, Alexander Herzog, Xiaoyu Liang, Marie Vasek, Enrico Mariconti, Lorenzo Cavallaro
Categories: Machine Learning (cs.LG)
Comments:

Abstract:The maximum softmax probability (MSP) represents a default approach when evaluating uncertainty quantification for language model generation with structured output. Although cheap, it is often miscalibrated. Methods that probe the model’s internal activations feed raw hidden states into opaque classifiers, reading activations as static snapshots and leaving implicit the layer-wise trajectory by which a representation is formed. Yet, similar endpoints can arise from very different paths, and how evidence accumulates, reinforces, or reverses across depth might reveal uncertainty that final probabilities obscure. We extract eleven scale-invariant geometric features, tracing the cumulative path of per-layer MLP updates, and feed them to a sparse linear probe. The probe outperforms MSP under selective abstention, with gains scaling with baseline miscalibration up to 21 AURC points. Because every feature has a closed-form geometric meaning, the probe’s coefficients trace how and where along depth errors take shape – which layers commit prematurely, which contradict the running state, where trajectories drift away from their endpoint.
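
As a point of reference for the baseline and the metric named in the abstract, the sketch below computes the maximum softmax probability from raw logits and a simple discrete version of the area under the risk-coverage curve (AURC). The function names and the toy confidence values are illustrative assumptions; the paper's exact AURC convention may differ.

```python
import math


def max_softmax_probability(logits):
    """Confidence score: the largest softmax probability (numerically stabilized)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    return max(exps) / sum(exps)


def aurc(confidences, correct):
    """Average, over coverage levels k/n, of the error rate among the k most
    confident predictions. Lower is better."""
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    errors = 0
    total_risk = 0.0
    for k, i in enumerate(order, start=1):
        errors += 0 if correct[i] else 1
        total_risk += errors / k
    return total_risk / len(order)


correct = [True, True, False, False]
# A useful score ranks the correct answers above the wrong ones ...
informative = aurc([0.9, 0.8, 0.6, 0.5], correct)
# ... while a misleading score is most confident exactly where the model is wrong.
misleading = aurc([0.5, 0.6, 0.8, 0.9], correct)
print(max_softmax_probability([2.0, 0.0, 0.0]), informative, misleading)
```

Under selective abstention the model answers only its most confident cases, so a confidence score is judged by how well it orders errors last; this is the quantity on which the abstract reports the probe improving over MSP.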

[LG-77] Latent Cache Flow: Model-to-Model Communication Without Text

Link: https://arxiv.org/abs/2605.22863
Authors: Maximillian Rossi, Prajwal Raghunath, Eugene Wu
Categories: Machine Learning (cs.LG)
Comments: 6 pages, 5 figures

Abstract:LLM agents today communicate via text, which incurs considerable latency and information loss due to the need to autoregressively decode the sharer model’s state and encode at the receiver model. Recent work such as Cache-to-Cache (C2C; Fu et al., 2026) seeks to exchange KV caches by learning adapters that translate sharer KV matrices to the receiver model. However, the adapters are large and expensive to train, and translate individual tokens, which requires the target context to be identical. This is unsuitable for agent communication, where the LLMs have differing context. We introduce Latent Cache Flow (LCF). To address efficiency, we observe that keys and values can be jointly translated and compressed, reducing the adapter to about 4% of C2C’s size. To address differing context, we design the adapter to transmit a summary of new information that the target model does not have. Our early experiments show that a 13 MB LCF adapter can be more accurate than a 956 MB C2C adapter in shared-context settings; for different contexts, LCF is 23% more accurate and 8.5x faster than text-based communication.

[LG-78] From Simulation to Discovery: AI Enabled Probabilistic Emulation of Mechanistic Crop Systems

Link: https://arxiv.org/abs/2605.22848
Authors: Mojdeh Saadati, Juan Panelo, Gustavo Visentini, Soumik Sarkar, Carlos Messina, Baskar Ganapathysubramanian
Categories: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Other Quantitative Biology (q-bio.OT)

Abstract:Global food security depends on predicting crop responses to climate variability, yet process based crop models remain too computationally expensive for large scale exploration of genotype and environment interactions. Here we develop a probabilistic neural emulator of APSIM that reproduces key maize growth processes across 13 outputs with high fidelity (with R^2 of 0.93) while reducing simulation time by several orders of magnitude. Trained on two million simulations spanning diverse genetic, soil, and management conditions, and augmented with a convolutional synthetic weather generator that produces physically consistent climate sequences, the framework enables scalable exploration of crop responses under realistic and diverse environmental inputs while providing calibrated predictive uncertainty without costly Bayesian inference. Applying this framework across 100,000 trait configurations, six soil environments in Iowa and Illinois, and climate projections through the year 2100 under two emissions scenarios, we identify 181 maize trait combinations that consistently maintain high yield across all tested conditions, an analysis infeasible with the mechanistic model alone. We further show that radiation use efficiency and temperature driven root dynamics are dominant drivers of yield resilience. Notably, projected yield distributions vary substantially across locations, with some lower productivity sites exhibiting yield increases under future climate scenarios, indicating that climate change may reshape regional yield potential in nonintuitive ways. These results demonstrate how uncertainty aware emulation transforms mechanistic crop simulation from a computational bottleneck into an on demand discovery engine, one capable of interrogating the full genotype, environment and management space at a scale no process-based model can match.

[LG-79] Cross-attention-based bipartite graph neural network for coupled nodal and elemental field prediction in large-deformation sheet material forming

Link: https://arxiv.org/abs/2605.22845
Authors: Yingxue Zhao, Haoran Li, Haosu Zhou, Tobias Pfaff, Nan Li
Categories: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)

Abstract:Finite element simulations of large-deformation sheet material forming involve node-element coupling between nodal kinematics and element-level deformation measures. Machine-learning surrogates can accelerate such simulations, but most graph-based models use node-centred representations. This representation is indirect for element-level quantities, which are often recovered from nodal predictions by interpolation or post-processing. It may also obscure the node-element coupling structure that underlies the finite element update. This work proposes a cross-attention-based bipartite graph neural network (CAtt-BiGNN) for coupled prediction of nodal displacement increments and elemental thinning. The graph represents mesh nodes and elements as distinct but connected entities, linked by directed node-element edges, so that nodal and elemental fields are predicted on their native discretisation domains. An edge-aware cross-attention processor conditions adaptive node-element coupling weights on geometric edge features, enabling bidirectional message passing between nodal kinematic states and elemental deformation states. A hierarchical extension, CAtt-BiUGNN, combines the CAtt-BiGNN with graph downsampling-upsampling to improve information propagation on larger meshes. Adaptive Gaussian noise is further evaluated as an optional rollout-stabilisation strategy. The models are tested on two representative forming cases with different graph sizes. CAtt-BiGNN improves the balance between displacement and thinning prediction relative to node-centred baselines and bipartite ablation variants, while CAtt-BiUGNN gives the strongest overall performance in the larger-graph setting. The results indicate that the proposed model provides an effective surrogate framework for large-deformation sheet material forming.

[LG-80] MELT: A Behavioral Trace Dataset for High-Risk Memecoin Launch Detection

Link: https://arxiv.org/abs/2602.13480
Authors: Sihao Hu, Selim Furkan Tekin, Yichang Xu, Ling Liu
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:Launchpads have become the dominant mechanism for issuing memecoins, exposing investors to a new class of high-risk launches that existing rug-pull detection methods cannot capture. We argue that detecting these threats requires structured behavioral traces that underlie raw heterogeneous blockchain data, i.e., how insiders accumulate, coordinate, and unwind positions. To enable such analysis, we introduce MELT (MEmecoin Launch Trace), the first behavioral trace dataset for analyzing and detecting high-risk memecoin launches on Solana. MELT covers 41k+ memecoin launches with 200M+ transactions parsed into typed behavioral records that distinguish swaps, wash trades, transfers, and mints. Beyond per-account behaviors, MELT contributes bundle-trace data that links accounts controlled by the same entity, revealing that, on average, 36.5% of token supply is held by coordinated accounts, a concealment strategy that disguises the true ownership concentration from unsuspecting buyers. On top of these traces, MELT provides 122 behavioral features and risk-level annotations, enabling supervised learning at a population scale. We benchmark representative ML models on the high-risk launch detection task. Integrating their predictions into a simple memecoin selection strategy reduces investment loss significantly, demonstrating that behavioral traces can be translated into risk mitigation. Our dataset and code are available at this https URL.

[LG-81] On the Stability of Spherical Hellinger-Kantorovich Flows and Their Implications for Differential Privacy

Link: https://arxiv.org/abs/2605.23879
Authors: Aratrika Mustafi, Soumya Mukherjee
Categories: Machine Learning (stat.ML); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Statistics Theory (math.ST)

Abstract:Gradient-flow sampling interprets a Gibbs distribution as the minimizer of an energy functional over probability measures and generates dynamics converging to this target. Under spherical Hellinger-Kantorovich (SHK) geometry, the flow couples transport and reaction and coincides with birth-death Langevin dynamics. In this work, we develop a perturbation theory for SHK gradient flows. For two potentials $V$ and $V'$, we compare the associated flows from a common initialization and quantify how potential discrepancies propagate over time. A uniform perturbation bound yields dimension-free, pointwise control of the log-likelihood ratio and Rényi divergence, while additional structure allows us to derive bounds for the KL divergence as well. We apply these results to approximate sampling for the exponential mechanism in differential privacy. The likelihood-ratio control provides explicit time-dependent Pure-DP guarantees for SHK-based samplers, while the KL bound yields Approximate-DP certificates via hockey-stick divergence. We also derive a utility bound separating intrinsic exponential-mechanism suboptimality from finite-time sampling error.

[LG-82] Move on Muon: A Hamiltonian probability gradient flow perspective of Muon optimizer

Link: https://arxiv.org/abs/2605.23871
Authors: Aratrika Mustafi, Soumya Mukherjee, Bharath K. Sriperumbudur
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)

Abstract:We develop a gradient flow on the space of probability measures defined on matrix-valued parameters induced by regularized Muon, an analytically smoothed version of the idealized Muon optimizer. The key observation is that the regularized orthogonalization map is the gradient of a smooth Fenchel-dual smoothing of the nuclear norm. This identifies the (regularized) Muon update as a mirror/prox step in the update variable, with momentum acting as the dual coordinate. We use this structure to lift Muon from a single matrix parameter to finite-particle probability objectives of the form $J(\rho) = R\left(\int F \, d\rho\right)$, a setting motivated by mean-field descriptions of neural-network training, and derive the inertial continuous-time limit. Using this structure, we derive the finite-particle continuous-time limit under the inertial scaling of step size and momentum, and then pass to a phase-space mean-field equation over probability laws on parameter-momentum pairs. The resulting flow can be shown to be a damped Hamiltonian probability dynamics whose kinetic energy is induced by the regularized Muon mirror potential. We prove an exact Hamiltonian dissipation identity, showing that the Hamiltonian energy decreases monotonically. While the target objective itself need not be monotone along the inertial Muon dynamics, under additional gradient-dominance, bounded-momentum, and curvature/alignment assumptions, we obtain continuous and discrete-time exponential convergence rates for the objective gap. We also study the well-posedness of the mean-field limit equation and establish propagation of chaos guarantees for the interacting particle system. Finally, we extend the formulation to Hilbert-valued feature maps on product matrix spaces, yielding a blockwise Muon probability flow applicable to smooth transformer mixture-of-experts models.

[LG-83] The physics of AI weather models

Link: https://arxiv.org/abs/2605.23778
Authors: George Craig, Tobias Selz, Matthias Beylich, Kirsten I. Tempest
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)

Abstract:Could it be that AI weather models are solving physical equations, although they may not be the equations used by conventional NWP models? We compute correlations of forecast skill and Centered Kernel Alignment, providing evidence that different AI weather models represent the atmosphere in similar ways, despite differences in architecture and capacity. We argue that the architecture and training of the AI models constrain the form of the physical laws that they might simulate. In particular, we propose that the models implement a particle description of the atmosphere, where the latent variables at each mesh point correspond to the position of a particle in the high dimensional latent space. We hypothesize that the movement of the particles follows a gradient flow in the latent space towards a minimum of a learned free energy functional. Analysis of the GraphCast and Aurora models shows that they make changes on large spatial scales in the early processor layers and move to smaller scale with increasing layer depth, consistent with the gradient flow hypothesis.
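For readers unfamiliar with Centered Kernel Alignment, the following is a minimal sketch of the linear variant applied to two feature matrices with matching rows. Whether the paper uses the linear or a kernel form is not stated in the abstract, so the choice here is an assumption, and the random "features" are purely illustrative stand-ins for model activations.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between feature matrices x (n, d1) and y (n, d2).

    Rows must refer to the same n samples. The value lies in [0, 1] and is
    invariant to orthogonal transforms and isotropic scaling of the features.
    """
    x = x - x.mean(axis=0, keepdims=True)  # centre each feature column
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)

rng = np.random.default_rng(0)
feats_a = rng.normal(size=(200, 16))
# A rotated and rescaled copy of feats_a: same representation up to a basis change.
q, _ = np.linalg.qr(rng.normal(size=(16, 16)))
feats_b = 3.0 * feats_a @ q
# Independent features: an unrelated representation.
feats_c = rng.normal(size=(200, 16))

same = linear_cka(feats_a, feats_b)
different = linear_cka(feats_a, feats_c)
```

In this toy setup `same` comes out at essentially 1 while `different` is much smaller, which is the kind of contrast a similarity analysis across models relies on.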

[LG-84] Learning Kernel-Based MDPs from Episodic Preferential Feedback

Link: https://arxiv.org/abs/2605.23650
Authors: Nikola Pavlovic, Sattar Vakili, Qing Zhao
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Human feedback often arrives as preferences rather than calibrated numeric rewards, motivating reinforcement learning from preferential feedback, also referred to as reinforcement learning from human feedback (RLHF). We present a rigorous theoretical study of preference-only learning in episodic kernel MDPs. In each episode, the learner deploys two policies from a common start state and receives a single binary label indicating which trajectory is preferred, modeled by a Bradley–Terry–Luce link on the difference of cumulative (unobserved) rewards. Under kernel-based assumptions on the reward and transition functions (one of the most general models amenable to theoretical analysis) we develop preference-based value estimation and confidence sets tailored to end-of-episode this http URL prove high-probability regret bounds that scale sublinearly in the number of episodes, implying that the value of the learned policy converges to that of the optimal policy.
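The feedback model in this abstract can be sketched in a few lines: under a Bradley-Terry-Luce link, the probability that one trajectory is preferred is the logistic function of the difference in cumulative rewards. The function names and reward values below are hypothetical; in the setting described, the learner would only ever observe the binary label, never the rewards.

```python
import numpy as np

def btl_preference_prob(rewards_a, rewards_b):
    """P(trajectory A is preferred to B) under a Bradley-Terry-Luce link.

    Equals exp(R_A) / (exp(R_A) + exp(R_B)), where R is the cumulative
    reward of a trajectory, i.e. a sigmoid of R_A - R_B.
    """
    diff = np.sum(rewards_a) - np.sum(rewards_b)
    return 1.0 / (1.0 + np.exp(-diff))

def sample_preference(rewards_a, rewards_b, rng):
    """Draw the single binary label revealed at the end of an episode."""
    return int(rng.random() < btl_preference_prob(rewards_a, rewards_b))

# Hypothetical per-step rewards of two trajectories from a common start state.
traj_a = np.array([1.0, 0.5, 0.5])   # cumulative reward 2.0
traj_b = np.array([0.5, 0.5, 0.0])   # cumulative reward 1.0
p_a = btl_preference_prob(traj_a, traj_b)
p_b = btl_preference_prob(traj_b, traj_a)
label = sample_preference(traj_a, traj_b, np.random.default_rng(0))
```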

[LG-85] Dirichlet-Based Monte Carlo Dropout for Uncertainty Estimation in Neural Networks

Link: https://arxiv.org/abs/2605.23635
Authors: Rouaa Hoblos (FEMTO-ST), Noura Dridi (FEMTO-ST), Noureddine Zerhouni (FEMTO-ST), Zeina Al Masry (FEMTO-ST)
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Traditional neural networks provide deterministic predictions without inherent uncertainty estimates. While Bayesian Neural Networks (BNNs) offer a principled approach to uncertainty quantification, their computational complexity limits scalability. Monte Carlo (MC) Dropout, initially introduced as a regularization technique, has been shown to approximate Bayesian inference by enabling probabilistic modeling through multiple stochastic forward passes. In this work, we enhance uncertainty estimation in deep learning by integrating a Dirichlet-based framework within MC Dropout. Specifically, we leverage the formulation proposed by Sensoy et al. (2018), where class probabilities are modeled using a Dirichlet distribution, allowing for a more informative uncertainty representation. The proposed approach maintains the computational efficiency of MC Dropout while improving the quality of uncertainty estimates. We discuss the theoretical foundations of our method and compare it with existing uncertainty quantification techniques. The results highlight the effectiveness of the proposed method in producing well-calibrated uncertainty estimates, offering a practical solution for uncertainty-aware deep learning models.
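To make the two ingredients concrete, here is a small NumPy sketch: dropout is kept active at inference to obtain several stochastic forward passes, and the network output is read as non-negative evidence for a Dirichlet distribution in the style of Sensoy et al. (2018), where alpha = evidence + 1 and the uncertainty mass is K / sum(alpha). How the paper actually combines the two is not described in the abstract; averaging the evidence over passes, the toy two-layer network, and all names below are assumptions made for illustration only.

```python
import numpy as np

def mc_dropout_evidence(x, w1, w2, rng, n_passes=50, drop_prob=0.5):
    """Run stochastic forward passes with dropout left on at test time.

    Returns an array of shape (n_passes, n_classes) holding non-negative
    per-class evidence (softplus of the output layer).
    """
    outputs = []
    for _ in range(n_passes):
        hidden = np.maximum(x @ w1, 0.0)               # ReLU hidden layer
        mask = rng.random(hidden.shape) >= drop_prob   # fresh dropout mask per pass
        hidden = hidden * mask / (1.0 - drop_prob)     # inverted-dropout rescaling
        logits = hidden @ w2
        outputs.append(np.logaddexp(0.0, logits))      # softplus keeps evidence > 0
    return np.stack(outputs)

def dirichlet_summary(evidence):
    """Expected class probabilities and uncertainty mass of Dirichlet(evidence + 1)."""
    alpha = evidence + 1.0
    strength = alpha.sum(axis=-1, keepdims=True)
    probs = alpha / strength
    uncertainty = alpha.shape[-1] / strength[..., 0]   # K / S, in (0, 1)
    return probs, uncertainty

rng = np.random.default_rng(0)
x = rng.normal(size=8)             # one input with 8 features (hypothetical)
w1 = rng.normal(size=(8, 32))      # untrained random weights, for shape only
w2 = rng.normal(size=(32, 3))
evidence = mc_dropout_evidence(x, w1, w2, rng)
# Assumption: pool the passes by averaging evidence before forming the Dirichlet.
probs, uncertainty = dirichlet_summary(evidence.mean(axis=0))
spread = evidence.std(axis=0)      # disagreement across dropout masks
```

The Dirichlet uncertainty mass and the spread across passes capture different things: the former is low whenever total evidence is high, while the latter reflects how much the prediction changes between dropout masks.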

[LG-86] Asymmetric Scaling Laws from Sparse Features

Link: https://arxiv.org/abs/2605.23591
Authors: John Sous, Michael Winer
Categories: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Statistics Theory (math.ST)

Abstract:We introduce a model for neural scaling laws under sparse activations. In the model, test loss is often dominated by rare coordinates that are never observed in the training input. This mechanism induces a novel bottleneck absent from dense models. We derive the asymptotic population loss in both the underparameterized and overparameterized regimes, and show that the loss exhibits a double-descent peak near the interpolation threshold – where the number of parameters is just sufficient to fit the training data – resulting in a loss curve governed by two distinct scaling exponents – one for the overparameterized regime and one for the underparameterized regime – with a gap determined by the degree of sparsity. Additionally, we derive a compute-optimal frontier that favors increasing dataset size over model capacity under fixed compute budgets. We also analyze gradient-descent dynamics and identify a scaling law for the probability that fixed-step gradient descent becomes unstable. We further show that the sparsity-induced effect persists under nonlinear activations.

[LG-87] Selective Ambulance Dispatch Under Contextual Travel-Time Uncertainty

Link: https://arxiv.org/abs/2605.23378
Authors: Zikun Lin, Daniel Zhuoyu Long, Viet Anh Nguyen
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)

Abstract:Ambulance response is time-critical in out-of-hospital cardiac arrest (OHCA), where dispatchers must balance timely arrivals with limited fleet capacity. Static territories and deterministic travel-time estimates are vulnerable to dynamic congestion, while always-dual dispatch adds redundancy but consumes fleet capacity. We propose IDEAL (Intelligent Dual dispatch of Emergency AmbuLances), a selective dual-dispatch framework that sends a second ambulance only when the optimistic gap between primary and secondary paths exceeds a threshold. IDEAL learns context-specific edge travel times from trip-level dispatch records, including unobserved routes, using a weakly supervised bilevel representation network. We train the nonsmooth model with mini-batch conservative gradients and prove an asymptotic convergence guarantee. IDEAL models uncertainty via Burg-divergence perturbations to a shared metric in the learned representation space, thereby inducing correlated changes in edge travel times and learning context-specific radii from historical underprediction errors. For real-time decisions, IDEAL casts optimistic-gap computation as a difference-of-convex program and derives an efficient oracle with complexity guarantees. In collaboration with the Hong Kong Fire Services Department, we evaluate IDEAL using historical OHCA records and real-time adaptive simulations. The results achieve a stronger response-time/resource trade-off relative to all region-based and Google-based baselines.

[LG-88] SpinFlow: A Physics-Informed Spin Field Framework for Traffic Phase Inference and Transition Detection ITSC2026

Link: https://arxiv.org/abs/2605.23306
Authors: Haopeng Deng, Fucheng Zheng, Xinhai Xia
Categories: Physics and Society (physics.soc-ph); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 11 pages, 8 figures, accepted to ITSC 2026

Abstract:Active traffic management (ATM) is frequently hindered by traditional macroscopic models and rigid empirical thresholds that fail to capture metastable phase precursors, resulting in delayed, reactive interventions. To address this, we propose SpinFlow, a physics-informed spin-field framework unifying Kerner’s three-phase theory with statistical physics for continuous macroscopic traffic phase inference. Inspired by the Heisenberg model, SpinFlow parametrizes spatially varying phase weights via a latent spin vector and a competitive-equilibrium mapping, allowing synchronized flow to emerge naturally. A physics-regularized Expectation-Maximization algorithm inverts this latent structure from high-resolution trajectories, jointly optimizing the spin field while softly enforcing mass conservation and spatial smoothness. We introduce the Phase Equilibrium Degree (PED) to quantify structural alignment and topologically localize phase-transition points. Across four real-world trajectory datasets, SpinFlow achieves R_q^2 up to 0.940, PED drops of 94.9-100%, and interpretable phase maps that outperform three heterogeneous baselines on forward accuracy, physics consistency, and bottleneck localization. SpinFlow pinpoints congestion nucleation without prior network topology, yielding a data-driven, physics-consistent trigger for ATM.

[LG-89] Accelerating ground state search of spatial photonic Ising machines with genetic-simulated annealing hybrid algorithm

Link: https://arxiv.org/abs/2605.23295
Authors: Ze Zheng, Ruhui Ni, Jingyi Zhao, Xiaojian Hu, Wen Jiang, Yuegang Li, Hang Xu, Tailong Xiao, Guihua Zeng
Categories: Optics (physics.optics); Machine Learning (cs.LG); Applied Physics (physics.app-ph)
Comments: 12 pages, 6 figures

Abstract:Spatial photonic Ising machines (SPIMs) based on spatial light modulators (SLMs) have emerged as highly effective solvers for many tasks, including combinatorial optimization problems and spin-glass simulations. However, traditional SPIMs relying solely on the simulated annealing algorithm require a large number of measurement-feedback iterations to find a relatively optimal solution in complex energy landscapes, suffering from slow convergence and high time cost. Here, we propose an optical genetic-simulated annealing hybrid algorithm to accelerate the ground-state search of SPIMs. GA conducts a global coarse-grained search in the early iteration stage, while SA performs fine-grained local refinement in the late stage. Numerical simulations show that our method enables a higher solution quality of full-rank Max-Cut problems than pure GA or SA at different scales. We also experimentally demonstrate its superiority over conventional algorithms on a gauge-transformation time-division multiplexing SPIM for high-rank optimization problems under the same iteration budget. Our approach can be further developed with other advanced metaheuristic algorithms toward intelligent optical Ising computing systems.
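The two-stage idea (a genetic algorithm for coarse global search, then simulated annealing for local refinement) can be sketched numerically as follows. This is a plain software simulation on a small random Max-Cut instance: the optical measurement-feedback loop of an SPIM is replaced by a direct energy computation, and the population size, mutation rate, cooling schedule, and the point at which the stages switch are all hypothetical choices, not the authors' settings.

```python
import numpy as np

def ising_energy(spins, coupling):
    """H(s) = 1/2 * s^T J s for a symmetric J with zero diagonal."""
    return 0.5 * spins @ coupling @ spins

def genetic_stage(coupling, rng, pop_size=20, generations=30, mutation=0.05):
    """Coarse global search: keep the best half, refill by crossover and mutation."""
    n = coupling.shape[0]
    pop = rng.choice([-1, 1], size=(pop_size, n))
    for _ in range(generations):
        energies = np.array([ising_energy(s, coupling) for s in pop])
        parents = pop[np.argsort(energies)[: pop_size // 2]]
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = parents[rng.integers(len(parents), size=2)]
            child = np.where(rng.random(n) < 0.5, a, b)   # uniform crossover
            flips = rng.random(n) < mutation              # random spin flips
            children.append(np.where(flips, -child, child))
        pop = np.vstack([parents, np.array(children)])
    energies = np.array([ising_energy(s, coupling) for s in pop])
    return pop[np.argmin(energies)].copy()

def annealing_stage(spins, coupling, rng, steps=2000, t_start=1.0, t_end=0.01):
    """Fine local refinement: single-spin Metropolis flips with geometric cooling."""
    spins = spins.copy()
    energy = ising_energy(spins, coupling)
    best, best_energy = spins.copy(), energy
    for step in range(steps):
        temp = t_start * (t_end / t_start) ** (step / (steps - 1))
        i = rng.integers(len(spins))
        delta = -2.0 * spins[i] * (coupling[i] @ spins)   # energy change if spin i flips
        if delta <= 0 or rng.random() < np.exp(-delta / temp):
            spins[i] = -spins[i]
            energy += delta
            if energy < best_energy:
                best, best_energy = spins.copy(), energy
    return best, best_energy

rng = np.random.default_rng(0)
n = 12
upper = np.triu(rng.integers(0, 2, size=(n, n)), k=1).astype(float)
coupling = upper + upper.T   # adjacency of a random graph; minimising H maximises the cut
ga_spins = genetic_stage(coupling, rng)
best_spins, best_energy = annealing_stage(ga_spins, coupling, rng)
cut_size = (coupling.sum() / 2 - best_energy) / 2   # H = (#edges) - 2 * (cut size)
```

Because the annealing stage starts from the genetic stage's best configuration and tracks the best state seen, its result is never worse than the state it was handed.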

[LG-90] Coupled Training with Privileged Information and Unlabeled Data ICML2026

Link: https://arxiv.org/abs/2605.23268
Authors: Jiahao Shi, Omar Hagrass, Jason M. Klusowski
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 37 pages, 6 figures. Accepted to ICML 2026

Abstract:In many prediction problems, we have extra information during training (for example, measurements that are expensive or slow to collect) that will not be available when the model is deployed. A common strategy is to first train a model that uses all training information, then use its predictions on unlabeled examples to train a second model that only uses the inputs available at test time. However, when the extra training-only information is weak or noisy, this Two-Stage approach can mislead the deployment model and even hurt accuracy. We propose a joint training method that learns the two models together, so the deployment model can benefit from the extra information only when it actually helps, instead of inheriting its mistakes. We provide guarantees that describe when joint training improves prediction accuracy and analyze a simple alternating training algorithm for large, high-dimensional models. Experiments on synthetic data and real-world prediction tasks show that our approach avoids these failures and robustly outperforms standard Two-Stage baselines.

[LG-91] Operationalizing Individual Fairness via Gradient Descent and Bradley-Terry Models

Link: https://arxiv.org/abs/2605.23145
Authors: Conlan Olson, Linjun Zhang, Zhun Deng, Pragya Sur
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
Comments: 60 pages, 2 figures

Abstract:Individual fairness, the notion that “similar individuals should be treated similarly,” provides a strong and flexible fairness guarantee for algorithmic decision makers. However, a barrier to implementing individual fairness in practice is the difficulty of learning the similarity metric over individuals. In this work, we present an algorithm for learning a Mahalanobis similarity metric from triplet queries of the form “is individual i more similar to individual j or k ?” We work in the standard Bradley-Terry model for pairwise comparisons. Our algorithm consists of a spectral initialization step followed by gradient descent. We provide extensive theoretical guarantees on our algorithm, showing that it converges quickly to the ground truth metric despite the non-convexity of the loss in our model. Because our focus is on fairness, we also show that individual fairness with respect to an estimated metric is sufficient to achieve similar fairness with respect to the true metric. We also discuss potential applications of our work to AI model tuning. Finally, we present experimental results that demonstrate the convergence of our algorithm and the fairness performance of downstream fair predictors trained on our estimated metric.

[LG-92] LLM Sparsity Prior for Robust Feature Selection

Link: https://arxiv.org/abs/2605.23102
Authors: Caleb Skinner, Yihan Guo, Meng Li
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)

Abstract:Large language models (LLMs) offer a scalable mechanism to elicit domain-informed prior information for high-dimensional variable selection. However, existing methods such as LLM-Lasso are sensitive to weight quality, with performance degrading substantially when LLM-generated weights are inaccurate. To address this challenge, we first introduce a framework for quantifying the quality of LLM-generated weights, enabling rigorous evaluation of LLM-informed methods across varying weight regimes. We then propose the LLM Sparsity Prior (LSP), which integrates LLM-generated weights into the prior inclusion probabilities of Spike-and-Slab and Spike-and-Slab Lasso models via two interpretable hyperparameters governing global sparsity and weight concentration. Hierarchical hyperpriors on these parameters allow the model to dynamically discount uninformative or misleading weights, improving robustness without sacrificing gains when weights are accurate. Finally, we develop principled prompt engineering strategies and validate the method on a private medical dataset studying Acute Kidney Injury. LSP improves prediction accuracy and identifies clinically relevant features missed by the baselines, with robustness to prompt variation and particular effectiveness in low-data regimes.

[LG-93] Active Sensing Subserves Task-Level Control

Link: https://arxiv.org/abs/2605.22988
Authors: Andrew Lamperski, Debojyoti Biswas, Eric S. Fortune, John Guckenheimer, Kathleen Hoffman, Noah J. Cowan
Categories: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)

Abstract:Active sensing is traditionally defined as the expenditure of energy, typically in the form of movement, for obtaining information. Here, we propose that the combination of reliance on adaptive sensors, the linkage between movement and sensing, and task-level control inevitably gives rise to the emergence of active sensing movements. In this way, active sensing is not driven by sensory goals, such as minimizing uncertainty about the state, but rather is necessary for task-level control. This hypothesis, that active sensing subserves control, is supported by both empirical data from organisms and mathematical theory. Interestingly, active sensing behaviors often occur in discrete epochs, interspersed with goal-oriented behavior. This suggests that animals switch between two behavioral modes with distinct control policies, an ‘explore’ mode in which animals produce dynamic movements to shape sensory feedback, and an ‘exploit’ mode in which animals produce slower compensatory movements that are directly related to achieving task goals. This strategy for feedback control that relies on adaptive sensors, active sensing, and mode switching is not commonly used in engineered systems despite being ubiquitous in biology. Engineered systems comprising state-of-the-art sensors, actuators, and mechanical designs can outperform animals with respect to “cost functions” such as maximum force generation, precision, and speed. Nevertheless, animals routinely achieve robust, graceful behaviors that are currently unmatched by engineered systems, suggesting that current control systems are insufficient. These insights, expressed in the language of control theory, may be critical for improving robotic sensing and control.

[LG-94] Uncertainty-aware classification and triage of structural heart disease using electrocardiography and echocardiography metrics

Link: https://arxiv.org/abs/2605.22968
Authors: Mitchel J. Colebank
Categories: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 15 pages, 5 figures

Abstract:Machine learning methods provide a methodological innovation that can help screen for cardiovascular disease through noninvasive and readily available measurement modalities. Recent investments in using electrocardiogram (ECG) data to screen for structural heart disease (SHD) are one example, where ECGs provide a low-cost, available modality for screening. This has led to the EchoNext dataset, a paired ECG-echocardiogram data repository for testing new methods of SHD detection. However, relatively few studies have investigated how more probabilistic classification through Bayesian inference may improve uncertainty quantification in this setting. Moreover, few studies have considered how triage systems can be developed to alleviate healthcare bottlenecks, such as the review of data from underserved, rural clinics by expert sonographers for SHD assessment. In this study, we leverage existing ECG-echocardiogram data to compare frequentist and Bayesian neural network classifiers. We show that the Bayesian approach is comparable to or better than frequentist methods in SHD classification and provides more robust uncertainty quantification. We provide an example of how this uncertainty-aware classification scheme can be used for screening SHD, providing a proof-of-concept for how machine learning can help with triage in getting individuals expert sonographer input when SHD is highly likely or measurements are highly uncertain.

[LG-95] Diffusion-based Denoising Beats Vanilla Score Matching in Parameter Estimation: A Theoretical Explanation

Link: https://arxiv.org/abs/2605.22950
Authors: Benedikt Lütke Schwienhorst, Nadja Klein, Johannes Lederer
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)

Abstract:Score matching is an alternative to maximum likelihood estimation when the normalizing constant is unknown or too costly to evaluate. However, vanilla score matching has been shown to be inefficient relative to maximum likelihood estimation for multimodal distributions with well-separated modes, which are commonly encountered in practical applications. We compare a novel diffusion-based denoising score matching estimator (DDSME) to the vanilla score matching estimator (SME) in this scenario. In particular, we prove statistical guarantees for both estimators, showing that the error bound for the vanilla SME worsens when the separation between the modes increases, which can be avoided in case of the DDSME with suitable hyperparameter tuning. This provides a novel theoretical explanation for the superior behavior of diffusion-based score matching over the vanilla version.
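For orientation, the two objectives being contrasted can be written in their standard textbook forms. The notation ($s_\theta$ for the model score, $\sigma$ for the noise level) is assumed here, and the paper's DDSME may differ in detail, for instance in how noise levels are chosen or combined.

```latex
% Vanilla (Hyvarinen) score matching. The model score
% s_\theta(x) = \nabla_x \log p_\theta(x) does not involve the
% normalizing constant, which is why the objective avoids it.
J_{\mathrm{SM}}(\theta)
  = \mathbb{E}_{x \sim p}\!\left[ \tfrac{1}{2}\,\lVert s_\theta(x) \rVert^2
    + \operatorname{tr}\!\big( \nabla_x s_\theta(x) \big) \right]

% Denoising score matching at noise level \sigma, with
% \tilde{x} = x + \sigma \varepsilon and \varepsilon \sim \mathcal{N}(0, I).
% The regression target -(\tilde{x} - x)/\sigma^2 is the score of the
% Gaussian perturbation kernel.
J_{\mathrm{DSM}}(\theta; \sigma)
  = \mathbb{E}_{x \sim p,\ \varepsilon}\!\left[ \tfrac{1}{2}\,
    \Big\lVert s_\theta(\tilde{x}; \sigma)
    + \frac{\tilde{x} - x}{\sigma^2} \Big\rVert^2 \right]
```

Intuitively, the added noise fills the low-density region between well-separated modes, so the denoising objective can compare the relative weight of the modes, information that the purely local vanilla objective barely sees.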

[LG-96] L-FAME: Longitudinal Focused Attention Meditation EEG Dataset and Benchmark

Link: https://arxiv.org/abs/2605.22893
Authors: Angqi Li, Ab Basit Rafi Syed, Hamzeh Alzweri, Taosheng Liu, Barry H. Cohen, Saiprasad Ravishankar
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: Code and dataset available at: this https URL

Abstract:We introduce a novel Longitudinal Focused Attention Meditation Electroencephalography (L-FAME) dataset and an accompanying benchmark, designed to foster research into the neural effects of various meditation practices and the evolution of these effects over a six-week training period. The dataset contains EEG recordings and psychological assessments from 74 healthy college participants, collected at two distinct time points: pre-intervention and post-intervention. Participants were randomly assigned to one of three distinct meditation groups: two mantra-based techniques (SA-TA-NA-MA and Hare Krishna) and one Breath Focus practice. Leveraging this unique longitudinal and comparative dataset, we propose a benchmark suite comprising three distinct classification tasks: (1) cognitive state decoding to distinguish between resting and meditation states, (2) fine-grained classification of the specific meditation techniques, and (3) cross-session adaptation to evaluate model generalization across the longitudinal time gap. We provide comprehensive baseline results for these tasks utilizing a range of classical machine learning algorithms and deep learning architectures. The complete dataset, preprocessing pipelines, and benchmark evaluation code will be publicly released, offering a valuable resource and a standardized framework for the development and comparison of new analytical methods in computational meditation research and EEG-based machine learning. The dataset is available at this https URL

[LG-97] Is TabPFN the Silver Bullet for Insurance Pricing?

Link: https://arxiv.org/abs/2605.22892
Authors: Bruno Deprez, Wouter Verbeke, Tim Verdonck
Categories: Risk Management (q-fin.RM); Machine Learning (cs.LG)
Comments:

Abstract:Modelling claim frequency and severity for non-life insurance pricing predominantly relies on generalised linear models, with gradient-boosted machines as the leading machine learning alternative. Tabular foundation models (TFMs) offer a fundamentally different paradigm. By pre-training on large collections of synthetic datasets, TFMs enable inference on new data through in-context learning, without any dataset-specific fitting or hyperparameter tuning. This paper presents a first empirical evaluation of TabPFN for motor insurance pricing, benchmarking it against GLM and XGBoost on two publicly available MTPL datasets. Our results show that TabPFN does not consistently outperform established baselines, exhibits substantially longer inference times, and is sensitive to the size of the in-context training set. While tabular foundation models represent a promising direction, particularly in data-scarce settings, their current formulation does not offer a viable replacement for established actuarial methods.

[LG-98] Classification of IED-free EEG Responses for Assisted Epilepsy Diagnosis

Link: https://arxiv.org/abs/2605.22858
Authors: Giacomo Zanardini, Ryan Moesman, Paul van der Kleij, Robert van den Berg, Justin Dauwels
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: Accepted at IEEE EMBC2026

Abstract:Diagnosing epilepsy is challenging when routine EEGs lack interictal epileptiform discharges (IEDs). Intermittent photic stimulation (IPS) and hyperventilation (HV) can increase diagnostic yield, but their interpretation is subjective. We propose a reproducible pipeline that classifies EEG recordings acquired during stimulation procedures, using machine-learning features spanning temporal, spectral, wavelet, and connectivity domains, and a stacked ensemble to combine complementary feature sets. Performance is evaluated with leave-one-subject-out (LOSO) cross-validation on the TUH Epilepsy Corpus and a clinical Erasmus MC (EMC) cohort, including IED-free analyses on TUH. On TUH, ensembles achieve up to 97.8% AUC / 93.1% BAC on IED-free resting-state EEG and 94.1% AUC / 86.8% BAC on IED-free IPS. On EMC, IPS provides the strongest discrimination (79.4% AUC / 73.9% BAC), while HV performance benefits from stratifying subjects by responsiveness. These results indicate that stimulation-evoked activity, particularly IPS, contains meaningful discriminative information for IED-free epilepsy classification and that multi-domain ensembling improves robustness.
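
As a rough illustration of the leave-one-subject-out (LOSO) protocol mentioned in the abstract, the sketch below builds one fold per subject so that no subject's recordings appear in both the training and test sets. It is not the authors' pipeline; the function name `loso_splits` and the subject labels are hypothetical.

```python
from collections import defaultdict


def loso_splits(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices), one fold per subject."""
    # Group recording indices by the subject they belong to.
    by_subject = defaultdict(list)
    for idx, sid in enumerate(subject_ids):
        by_subject[sid].append(idx)

    # Hold out each subject in turn; train on all recordings from the others.
    for sid in sorted(by_subject):
        test_idx = by_subject[sid]
        train_idx = sorted(
            i for other, idxs in by_subject.items() if other != sid for i in idxs
        )
        yield sid, train_idx, test_idx


# Hypothetical example: five recordings from three subjects.
subjects = ["s1", "s1", "s2", "s3", "s3"]
folds = list(loso_splits(subjects))
for held_out, train_idx, test_idx in folds:
    print(held_out, train_idx, test_idx)
```

Splitting by subject rather than by recording is what keeps subject-specific EEG characteristics from leaking between the training and test sets.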

[LG-99] JointHRRP-Net: A Statistically Constrained Decoupling Network for Joint Target and Jamming Recognition in Composite Jamming

Link: https://arxiv.org/abs/2605.22857
Authors: Yunfei Zhao, Mei Liu, Shuowei Liu, Xunzhang Gao, Yujie Zhou
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: Submitted to IEEE Transactions on Geoscience and Remote Sensing (TGRS). 15 pages, 12 figures

Abstract:High-resolution range profile (HRRP)-based radar automatic target recognition suffers from severe performance degradation in composite jamming environments. Active jamming introduces suppression- and deception-related components into the received range profile. After pulse compression, these components are coupled with target echoes in the HRRP domain, making target-related scattering peaks difficult to distinguish and weakening feature separability. To address this problem, this paper proposes JointHRRP-Net, a unified framework for joint target-jamming recognition. A statistically constrained decoupling module is first developed to generate target-dominant and jamming-dominant latent branches from the mixed HRRP representation. Correlation-guided statistical constraints are imposed to suppress redundant cross-branch information and alleviate target-jamming feature entanglement. A multi-scale temporal encoding module is then designed to model local scattering structures and long-range range-cell dependencies, followed by a dual-expert decision module for single-label target classification and multi-label jamming classification. Experiments under diverse signal-to-jamming ratio (SJR) and signal-to-noise ratio (SNR) levels demonstrate that JointHRRP-Net outperforms representative baseline methods in both target recognition and composite jamming recognition. Open-set evaluation further shows that the learned target representation remains discriminative for unknown-target rejection. These results demonstrate the effectiveness and robustness of JointHRRP-Net in composite jamming scenarios.

[LG-100] Topological Signal Processing: An Application-Oriented Tutorial

Link: https://arxiv.org/abs/2605.22853
Authors: Flavia Petruso, Maria Giulia Preti, Dimitri Van De Ville
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments:

Abstract:Many modern datasets are large and carry complex structural relationships. Graph-based methods have traditionally been used to represent networked data, modeling individual elements as nodes and pairwise interactions as edges. Furthermore, Graph Signal Processing (GSP) has been developed to analyze signals on graph nodes, such as temperature measurements (node signals) across different regions of a country represented as a graph. Topological Signal Processing (TSP) is an emerging field that generalizes GSP, enabling the analysis of signals defined not only on nodes but also on edges, triangles, and higher-dimensional network elements, modeled as simplicial complexes and related topological structures. This makes TSP naturally well-suited for studying higher-order interactions in complex systems by extending classical signal processing concepts, such as filtering and Fourier transforms, to the topological level. Despite its versatility, TSP remains challenging for many practitioners. Therefore, we present an accessible overview of TSP foundations while drawing connections with application-oriented settings. We focus on processing techniques based on the combinatorial Hodge Laplacian, which generalizes the graph Laplacian to simplicial complexes. In particular, we review key TSP concepts, relate them to real-world examples, and discuss how higher-order structures and signals can be derived from datasets. For instance, we introduce an edge-level signal capturing lagged interactions between nodal signals, and demonstrate its use in a case study on TSP-based analysis of brain imaging data, revealing nontrivial interactions between sets of brain regions. Overall, we aim to promote a broader adoption of TSP by bridging methodological developments with applications, fostering its use among a wide community of theoretical and applied researchers.
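
For readers unfamiliar with the combinatorial Hodge Laplacian named in the abstract, its standard textbook definition is given below. It is not quoted from the paper, and the incidence-matrix notation $B_k$ is an assumption made here: $B_k$ denotes the boundary matrix mapping $k$-simplices to $(k-1)$-simplices.

```latex
% Order-k combinatorial Hodge Laplacian (standard definition; notation assumed).
\[
  L_k \;=\; B_k^{\top} B_k \;+\; B_{k+1} B_{k+1}^{\top}
\]
% Special case k = 0: there is no lower-order term, so
\[
  L_0 \;=\; B_1 B_1^{\top}
\]
% which is the ordinary graph Laplacian acting on node signals.
```

The first term couples $k$-simplices through shared lower-order faces and the second through shared higher-order cofaces, which is how $L_1$ lets edge signals be filtered in the same way the graph Laplacian filters node signals.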

[LG-101] VAMP-Diff: VampPrior Latent Diffusion for Photoplethysmography Modeling

Link: https://arxiv.org/abs/2605.22851
Authors: Fatemeh Ghasemi Balouei, Nathan Willemsen, Mahesh Banavar, Bahman Moraffah
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: Submitted to the 2026 Asilomar Conference on Signals, Systems, and Computers. 12 pages, 6 figures

Abstract:Photoplethysmography (PPG) has become a ubiquitous physiological signal; however, current generative models still struggle to preserve realistic waveform morphology and learn a latent structure that captures cardiac and respiratory physiology. PPG generators trained with adversarial losses can produce plausible waveforms, but provide no inference path from a real signal to a latent representation. Variational autoencoders, on the other hand, map the PPG data to latent codes, although their decoders often blur systolic upstrokes and dampen amplitude and spectral details. Diffusion models improve waveform fidelity, but typically lack an inference path for reconstruction and physiological analysis. We propose VampPrior Latent Diffusion (VAMP-Diff), a jointly trained variational diffusion model that combines a temporal PPG encoder, a conditional one-dimensional diffusion decoder, and VampPrior regularization on a compact pooled latent. The model uses full temporal latent during diffusion reconstruction, giving the decoder access to beat timing and morphology while generating samples from learned VampPrior components instead of a fixed Gaussian prior. We demonstrate on the CapnoBase dataset that VAMP-Diff produces realistic PPG signals, reconstructs sharper physiological waveforms than Gaussian-prior baselines, preserves heart-rate information, maintains respiratory-rate consistency, and is sensitive to waveform corruptions through reconstruction error.

[LG-102] Evaluating PhaseNet on Teleseismic Data with MsPASS

Link: https://arxiv.org/abs/2605.22837
Authors: Jinxin Ma, Yinzhi Wang, Gary L. Pavlis, Chenbo Yin
Categories: Geophysics (physics.geo-ph); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments:

Abstract:Numerous studies have shown that the machine-learning picker PhaseNet produces accurate P and S picks on local earthquake signals, but its performance can degrade sharply on teleseismic signals. To address this limitation, we present a reproducible MsPASS workflow that (i) enables scalable data preparation and management for large seismic archives and (ii) supports standardized PhaseNet training and inference. We assembled a control dataset of 1.6 million waveforms linked to teleseismic P-wave picks made by analysts at the USArray Array Network Facility (ANF). The control dataset confirms that the PhaseNet model trained on regional signals performs poorly on these data. We then trained PhaseNet from scratch on the training split of the ANF control dataset and evaluated it on a non-overlapping held-out test split, increasing P-pick recall by 741.5% and yielding 683.9% more picks within a 0.1s residual window. We also evaluated PhaseNet across different model sizes on both CPUs and GPUs. Increasing the model size by about 120 times improved precision and recall by 15.6% and 23.2%, respectively. However, the scaled model reduced inference throughput by 87.2% on an NVIDIA A100 GPU and by 97.3% on a 128-core high-performance CPU node. These results indicate that scaling PhaseNet is more practical on GPUs than on CPUs, and that simply enlarging the model is not an efficient way to achieve large accuracy gains.
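
The percentages in this abstract are easier to compare as multiplicative factors. The short sketch below does that conversion, assuming each figure is a relative change against the corresponding baseline (the abstract does not spell this out); the helper names are hypothetical.

```python
def increase_to_factor(pct_increase):
    """A relative increase of p% means the new value is (1 + p/100) times the old one."""
    return 1.0 + pct_increase / 100.0


def reduction_to_slowdown(pct_reduction):
    """A throughput reduction of p% leaves (1 - p/100) of the original rate,
    so the same workload takes 1 / (1 - p/100) times as long."""
    return 1.0 / (1.0 - pct_reduction / 100.0)


recall_factor = increase_to_factor(741.5)    # P-pick recall after retraining
gpu_slowdown = reduction_to_slowdown(87.2)   # scaled model on the A100 GPU
cpu_slowdown = reduction_to_slowdown(97.3)   # scaled model on the 128-core CPU node

print(f"recall: {recall_factor:.3f}x the baseline")
print(f"GPU: about {gpu_slowdown:.1f}x slower")
print(f"CPU: about {cpu_slowdown:.1f}x slower")
```

Under that reading, retraining raises recall to roughly 8.4 times the baseline, while the roughly 120 times larger model runs about 8 times slower on the GPU and about 37 times slower on the CPU node.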

[LG-103] Real-Time Earthquake Magnitude Classification from Initial P-Waves: Models Dataset and Comparative Analysis for South Asia

Link: https://arxiv.org/abs/2605.22836
Authors: Md Nasiat Hasan Fahim, Md. Abid Ullah Muhib, Rayhanul Amin Tanvir, Abdullah Al Noman
Categories: Geophysics (physics.geo-ph); Machine Learning (cs.LG)
Comments: Accepted for publication in 2025 28th International Conference on Computer and Information Technology (ICCIT). © 2025 IEEE

Abstract:Rapid earthquake magnitude estimation is crucial for effective early warning systems that can save lives and reduce economic damage. In this paper, we present a comprehensive study of magnitude classification using only the vertical component of the initial 7-second P-wave window from a single station. We compare six machine learning approaches that range from traditional models to state-of-the-art deep learning architectures. We also curated a novel dataset of 7,318 earthquake events in South Asia. The dataset was categorized into five Richter-scale classes: slight (3.0-3.9), light (4.0-4.9), moderate (5.0-5.9), strong (6.0-6.9) and severe (≥ 7.0). Our experiments show that deep learning models substantially outperform traditional approaches. Our Transformer based architecture achieved 76.23% standard accuracy and 81.56% adaptive accuracy with 4.8 ms inference latency. The adaptive-accuracy metric is introduced for the inherent uncertainty in magnitude estimation of near class boundaries. These results indicate that the attention mechanisms in Transformers combined with adaptive classification effectively capture the temporal dynamics of seismic signals. The architectural advantage facilitates promising generalization to rare high-magnitude events, despite the inherent data scarcity characteristic of seismic catalogs. The adaptive accuracy provides a more realistic assessment of model performance, and the result suggests viability for real-time deployment.
