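arXiv Papers Today | 2026-06-08

This post lists the latest papers retrieved from arXiv.org on 2026-06-08. It is updated automatically and organised into six main areas: NLP, CV, ML, AI, IR, and MA.

Note: The daily paper data is retrieved from arXiv.org and updated automatically at around 12:30 each day.

Tip: If the list has not been updated on a given day, either arXiv published no new papers that day or the script failed. Failures will be fixed the same day where possible.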

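Natural Language Processing: 88 papers (Computation and Language (cs.CL))
Artificial Intelligence: 166 papers (Artificial Intelligence (cs.AI))
Computer Vision: 113 papers (Computer Vision and Pattern Recognition (cs.CV))
Machine Learning: 174 papers (Machine Learning (cs.LG))
Multiagent Systems: 10 papers (Multiagent Systems (cs.MA))
Information Retrieval: 17 papers (Information Retrieval (cs.IR))
Human-Computer Interaction: 17 papers (Human-Computer Interaction (cs.HC))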

Multiagent Systems

[MA-0] Modelling Opinion Dynamics at Scale with Deep MARL

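[Quick Read]: This paper addresses a limitation of traditional opinion-dynamics modelling, which relies on hand-crafted local interaction rules and therefore struggles to capture macroscopic phenomena, such as consensus and polarisation, that emerge from individual behaviour in complex social environments. Its core solution is a framework based on multi-agent reinforcement learning (MARL), in which agents learn interaction strategies on their own under simple rewards. The key contributions are a GPU-accelerated consensus and truth-finding game that scales to 1000 agents, and an extension of other-play to general-sum social interactions to prevent unrealistic conventions. Validated on a subset of the Bluesky social network, the model uses a learned attention layer to recover agent importance structures from graph topology alone, and the results show that highly conforming populations most closely match human data. In large social media networks, however, such high conformity significantly reduces collective accuracy and promotes dishonest agents that lie to fit in, whereas in small, dynamic hunter-gatherer networks conformity can even improve collective agreement. This suggests that a mismatch between evolved human conformity heuristics and modern social media environments may be a contributor to misinformation.

Link: https://arxiv.org/abs/2606.07487
Authors: Lukas Seier, Brandon Kaplowitz, Sebastian Towers, Richard Bailey, Jakob Foerster
Affiliation: Unknown
Categories: Multiagent Systems (cs.MA); Computer Science and Game Theory (cs.GT); Social and Information Networks (cs.SI)
Comments: 35 pages, 28 figures, preprint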

Abstract:Modelling opinion dynamics typically relies on hand-crafted local interaction rules to study emergent macroscopic phenomena such as consensus and polarisation. In contrast, multi-agent reinforcement learning (MARL) enables agents to learn such behaviours directly by optimising simple rewards. To explore the potential of MARL for opinion dynamics, we introduce a GPU-accelerated consensus and truth-finding game that scales to populations of up to 1000 agents, comparable to many real-world social sub-networks. To prevent unrealistic conventions, we extend other-play to general-sum social interactions. We next validate our model on a subset of the Bluesky network by recovering agent importance structures from graph topology alone via a learned attention layer, finding that highly conforming populations most closely match human data. In large social media networks such high levels of conformity significantly reduce collective accuracy and promote dishonest agents that lie to fit in. By contrast, small, dynamic hunter-gatherer networks are less affected; here, conformity can even improve collective agreement. This suggests a mismatch between evolved human conformity heuristics and modern social media environments as a potential contributor to misinformation.

[MA-1] Hierarchical Certified Semantic Commitment for Byzantine-Resilient LLM-Agent Collaboration

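[Quick Read]: This paper addresses the lack of a verifiable finality-control mechanism for large-language-model (LLM) agents collaborating under Byzantine conditions. When multiple LLM agents produce stochastic, structured natural-language proposals, the system must decide in each round whether the round supports a commit, what kind of commit, or a typed safe abort. Naive aggregation hides this choice behind a single verdict, while classical Byzantine fault tolerance (BFT) relies on byte-identity, which LLM proposals do not satisfy. The paper proposes Hierarchical Certified Semantic Commitment (H-CSC), a BFT-inspired protocol that converts embedding-derived finality signals over verdict-conditioned proposal groups into one of three typed outcomes: a semantic_commit (a 2f+1 within-verdict semantic core backs the verdict, emitting a parameter-bound digest over the quantised aggregate), a verdict_commit (strong verdict margin but dispersed semantic rationale, emitting only a verdict-level certificate without claiming a semantic aggregate), or an explicit abort with a typed reason. The core contribution is typed finality rather than raw commit accuracy. On a controlled semantic-poisoning diagnostic (BCS_v1, 120 episodes), H-CSC commits with low angular deviation on BFT-feasible buckets (0.31 to 2.04 degrees) and aborts 100% of beyond-BFT rounds (n < 3f+1). On a real LLM-agent claim-verification benchmark (MVR-50, 50 tasks) under paired static and rushing Byzantine attacks, it commits at rates of 0.90/0.92 with honest-reference-invalid rates of 0.02/0.00, statistically matching a strong certificate-emitting verdict-only baseline. In addition, H-CSC emits an embedding-backed semantic_commit digest on 74%/72% of rounds, supplying typed provenance. A strict-semantic ablation commits only 0.54/0.48, showing that the verdict-level fallback is necessary for coverage (+0.36/+0.44) at the same 0.04 safety floor; a 100-task cross-model check across four LLMs keeps invalid_hmaj within 0.00 to 0.03.

Link: https://arxiv.org/abs/2606.07316
Authors: Haoran Xu, Lei Zhang, Iadh Ounis, Xianbin Wang
Affiliations: University of Glasgow; University of Western Ontario
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 27 pages, 3 figures, 8 tables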

Abstract:Byzantine collaboration among large-language-model agents requires a finality-control primitive: given delivered stochastic, structured natural-language proposals, the protocol must decide whether the round supports a commit, what kind of commit, or a typed safe abort. Naive aggregation hides this choice behind a single verdict; classical Byzantine fault tolerance hides it behind byte-identity that LLM proposals do not satisfy. We introduce Hierarchical Certified Semantic Commitment (H-CSC), a BFT-inspired protocol that converts embedding-derived finality signals over verdict-conditioned proposal groups into one of three typed outcomes: a semantic_commit (a 2f+1 within-verdict semantic core backs the verdict, emitting a parameter-bound digest over the quantised aggregate), a verdict_commit (strong verdict margin but dispersed semantic rationale, emitting a verdict-level certificate without claiming a semantic aggregate), or an explicit abort with a typed reason. The contribution is typed finality, not raw commit accuracy. On a controlled semantic-poisoning diagnostic (BCS_v1, 120 episodes), H-CSC commits with low angular deviation on BFT-feasible buckets (0.31 to 2.04 degrees) and aborts 100% of beyond-BFT rounds (n < 3f+1) as intended. On a real LLM-agent claim-verification benchmark (MVR-50, 50 tasks) under paired static and rushing Byzantine attacks, H-CSC commits 0.90/0.92 with honest-reference-invalid rates of 0.02/0.00, statistically matching a strong certificate-emitting verdict-only baseline. Unlike that baseline, H-CSC also emits an embedding-backed semantic_commit digest on 74%/72% of rounds, supplying typed provenance. A strict-semantic ablation commits only 0.54/0.48, showing the verdict-level fallback is necessary for coverage (+0.36/+0.44) at the same 0.04 safety floor; a 100-task cross-model check across four LLMs preserves invalid_hmaj within 0.00 to 0.03.
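
To make the three typed outcomes concrete, here is a minimal Python sketch of a round decision. It is not the H-CSC protocol itself: the function name `decide_round`, the cosine-similarity test for a "semantic core", and the thresholds `sim_threshold` and `margin_threshold` are all illustrative assumptions. Only the 2f+1 quorum, the n >= 3f+1 feasibility condition, and the three outcome names come from the abstract.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def decide_round(proposals, f, sim_threshold=0.9, margin_threshold=2):
    """Return a typed outcome for one round.

    proposals: list of (verdict, embedding) pairs, one per agent.
    f: assumed upper bound on Byzantine agents.
    sim_threshold, margin_threshold: hypothetical tuning knobs.
    """
    n = len(proposals)
    if n < 3 * f + 1:
        # Classical BFT needs n >= 3f+1; otherwise abort with a typed reason.
        return ("abort", "beyond_bft")

    quorum = 2 * f + 1
    ranked = Counter(verdict for verdict, _ in proposals).most_common()
    top_verdict, top_count = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0
    if top_count < quorum:
        return ("abort", "no_verdict_quorum")

    # Verdict-conditioned group: embeddings of proposals backing the top verdict.
    group = [emb for verdict, emb in proposals if verdict == top_verdict]
    for centre in group:
        close = sum(1 for emb in group if cosine(centre, emb) >= sim_threshold)
        if close >= quorum:
            # A 2f+1 semantic core backs the verdict.
            return ("semantic_commit", top_verdict)

    if top_count - runner_up >= margin_threshold:
        # Strong verdict margin, but the rationales are semantically dispersed.
        return ("verdict_commit", top_verdict)
    return ("abort", "weak_margin")


agreeing = [("true", [1.0, 0.0]), ("true", [0.99, 0.1]),
            ("true", [1.0, 0.05]), ("false", [0.0, 1.0])]
print(decide_round(agreeing, f=1))
```

The ordering of the checks mirrors the hierarchy described in the abstract: the stronger claim (a semantic aggregate) is attempted first, and the verdict-level certificate is only a fallback.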

[MA-2] Learning Multi-Agent Communication Protocol: Study on Information Entropy Efficiency in MARL

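[Quick Read]: This paper addresses inefficient communication protocols in multi-agent reinforcement learning (MARL): existing methods pursue performance gains through increasingly complex architectures and growing communication overhead, but lack principled metrics for evaluating and optimising the efficiency of information exchange. The key to its solution is a new metric, the Information Entropy Efficiency Index (IEI), which quantifies the ratio between message entropy and task performance to measure how compact and efficient a communication protocol is; a lower IEI indicates more compact, information-dense messages. Incorporating IEI into the training loss encourages agents to learn more efficient communication while matching or exceeding baseline performance. Experiments across a range of MARL algorithms show task performance equal or superior to the baselines together with improved communication efficiency, challenging the assumption that performance gains require complex architectures or increased communication overhead and pointing towards scalable multi-agent systems.

Link: https://arxiv.org/abs/2606.07200
Authors: Xinren Zhang, Zixin Zhong, Jiadong Yu
Affiliation: Unknown
Categories: Multiagent Systems (cs.MA)
Comments: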

Abstract:Multi-Agent Systems (MAS) have emerged as a fundamental paradigm for distributed problem-solving, where autonomous agents collaborate to achieve complex objectives. Within this framework, Multi-Agent Reinforcement Learning (MARL) with communication has demonstrated remarkable success in cooperative tasks. However, existing approaches predominantly pursue performance gains through increasingly complex architectures and expanding communication overhead, lacking principled metrics to evaluate the efficiency of information exchange. In this paper, we focus on enabling agents to learn efficient multi-agent communication protocols that balance performance and information compactness. We propose the Information Entropy Efficiency Index (IEI), a novel metric that quantifies the ratio between message entropy and task performance in learned communication protocols. A lower IEI indicates more compact and efficient message representations. By incorporating IEI into training loss functions, we encourage agents to develop communication protocols that achieve high performance with improved communication efficiency. Extensive experiments across diverse MARL algorithms demonstrate that our approach achieves equivalent or superior task performance compared to baseline methods while improving communication efficiency. These findings challenge the prevailing assumption that performance improvements require complex architectures or increased communication overhead and highlight the potential of improving both task success and communication efficiency to enable scalable MAS.
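
The abstract describes IEI only as "the ratio between message entropy and task performance". The following sketch shows one way such a ratio could be computed for discrete messages. The helper names, the use of Shannon entropy over the empirical message distribution, and the choice of performance measure are assumptions, not the paper's definition.

```python
import math
from collections import Counter


def message_entropy(messages):
    """Shannon entropy (in bits) of the empirical distribution of discrete messages."""
    total = len(messages)
    counts = Counter(messages)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def iei(messages, task_performance):
    """Hypothetical IEI: message entropy divided by a positive performance score.

    Lower values mean fewer bits of communication per unit of performance.
    """
    if task_performance <= 0:
        raise ValueError("task_performance must be positive")
    return message_entropy(messages) / task_performance


# Two protocols reaching the same success rate with different message variety.
compact = ["go", "go", "stop", "stop"]
verbose = ["go-1", "go-2", "stop-1", "stop-2"]
print(iei(compact, 0.8), iei(verbose, 0.8))
```

Under this reading, adding a term proportional to the index to the training loss would penalise protocols that spend more entropy for the same task performance.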

[MA-3] From Privacy to Workflow Integrity: Communication-Graph Metadata in Autonomous Agent Interoperability

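[Quick Read]: This paper addresses metadata leakage from the communication graph in autonomous agent systems. Agent-interoperability protocols such as A2A and MCP assume address-based HTTP(S) transport, which protects message content (increasingly with end-to-end encryption) but leaves the communication graph in the clear: which agent contacts which, when, and how often. This metadata reveals more than past relationships. Because it is semantic, prospective, and coupled to actuation, it exposes the pending workflow, the task being assembled, and the action likely to follow. At machine speed an attacker can act on such inferences before the workflow completes, so the threat is to workflow integrity rather than privacy alone. The paper gives a threat model for the agent communication graph, identifies what makes agent metadata distinctively revealing, and defines transport- and bootstrap-layer privacy properties. It weighs candidate transports (SimpleX/SMP, Tor, mixnets) against these properties and, in a case study on a generative model anchored to a real A2A capture, shows that a classifier can recover a task's class well above chance from passive metadata alone, with no access to payloads; applying the proposed properties together drives that recovery sharply back toward chance. The experiments also quantify the leverage of acting on the leak: under a fixed budget, an adversary choosing which workflows to act on realizes most of a clairvoyant attacker's advantage over a metadata-blind one, and the same properties suppress it.

Link: https://arxiv.org/abs/2606.07150
Authors: Bijaya Dangol
Affiliation: Independent Researcher
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Networking and Internet Architecture (cs.NI)
Comments: 12 pages, 6 figures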

Abstract:Agent-interoperability protocols such as A2A and MCP standardize what agents say to one another, but assume address-based transport over HTTP(S). Such transports protect message content, increasingly with end-to-end encryption. What they leave in the clear is the communication graph: which agent contacts which, when, and how often. In agent systems this graph is more consequential than a privacy framing suggests. Endpoints are often capability-labeled, workflows are structured and chained, and interactions are coupled to real actions, so an observer recovers more than past relationships. It can infer the pending workflow, the task being assembled and the action likely to follow. At machine speed, it can act on that inference before the workflow completes. The threat is therefore one of workflow integrity, not privacy alone: predictive leverage over autonomous action. We give a threat model for the agent communication graph; identify what makes agent metadata distinctively revealing (semanticity, prospectivity, actuation); define transport- and bootstrap-layer privacy properties and weigh candidate transports (SimpleX/SMP, Tor, mixnets) against them; and present an A2A case study in which a metadata-protecting binding is expressible but surfaces the protocol’s identity assumptions. We test these on a generative model anchored to a real A2A capture. From passive metadata alone, with no payloads, a classifier recovers a task’s class well above chance, from only the workflow’s opening; applied together, the properties drive that recovery sharply back toward chance. Beyond what an observer can recover, we measure the leverage of acting on the leak: from a workflow’s opening and under a fixed budget, an adversary choosing which workflows to act on realizes in this model most of a clairvoyant attacker’s advantage over a metadata-blind one, and the same properties suppress it.

[MA-4] The Three-Ring Architecture: Governing Agents in the Era of On-Platform Organisations

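[Quick Read]: This paper addresses a structural failure in enterprise AI deployment: organisations are acquiring agentic capability without the infrastructure to govern it, which is expected to reproduce the error of the first wave of AI deployment, namely decentralised intelligence without a federation layer, leading to a 95% project failure rate. Its solution is to formalise the Three-Ring Architecture as the governing infrastructure of the on-platform organisation. Ring 1 is the existing production architecture; Ring 2 is the M2 federation layer built on strategies-based agentic AI, which acts as the operating system of the agentic enterprise by performing at the organisational level what a computing OS performs at the device level: resource abstraction, process coordination, permission enforcement, and a stable platform for compounding intelligence; Ring 3 is the LLM-based frontier intelligence layer. A central contribution is the formal distinction between the risk profiles of Ring 2 and Ring 3. Strategies-based agents operate within a deterministic framework: their consequences are traceable, their permissions enforceable, and their deviations recoverable. LLM-based agents are non-deterministic actors whose deviations propagate through complex organisational systems without retrospective traceability, a categorically distinct risk. Ring 2 is therefore not an optional addition but a necessary condition of control and compliance. Moreover, more capable LLMs produce larger consequences when they deviate, so the governance requirement scales with capability and every improvement in LLM capability is a structural tailwind for the architecture. The paper reports that the architecture has been validated across a decade of deployment in financial services, government, procurement, and compliance, among other sectors.

Link: https://arxiv.org/abs/2606.07119
Authors: Sergio Alvarez-Telena, Marta Diez-Fernandez
Affiliation: Unknown
Categories: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 28 pages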

Abstract:The current phase of enterprise AI deployment faces a structural failure: organisations are acquiring agentic capability without the infrastructure to govern it. The result is expected to reproduce the error of the first wave of AI deployment: decentralised intelligence without a federation layer leading to a 95% project failure rate. This paper formalises the Three-Ring Architecture as the governing infrastructure of the on-platform organisation. Ring 1 is the existing production architecture; Ring 2 is the M2 federation layer built on strategies-based agentic AI; Ring 3 is the LLM-based frontier intelligence layer. Ring 2 constitutes, in the technically exact sense, the operating system of the agentic enterprise - performing at the organisational level what a computing OS performs at the device level: resource abstraction, process coordination, permission enforcement, and a stable platform for compounding intelligence. A central contribution is the formal distinction between Ring 2 and Ring 3 risk profiles. Strategies-based agents operate within a deterministic framework: their consequences are traceable, their permissions enforceable, their deviations recoverable. LLM-based agents introduce a categorically distinct risk: a non-deterministic actor whose deviations propagate through complex organisational systems without retrospective traceability. Ring 2 is not a useful addition - it is a necessary condition of control and compliance. A further consequence: every improvement in LLM capability is a structural tailwind for this architecture. More capable non-deterministic actors produce larger consequences when they deviate. The governance requirement scales with capability. The architecture has been validated across a decade of deployment in financial services, government, procurement, and compliance among other sectors.

[MA-5] Modeling U.S. Attitudes Toward China via an Event-Steered Multi-Agent Simulator

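[Quick Read]: This paper addresses a key limitation of existing LLM-based multi-agent simulators: they rely on static rules and fixed datasets, and so struggle to capture the dynamic, event-driven nature of macro-level opinion shifts in the real world. Its core solution is the Event-Steered Multi-Agent Simulator (ES-MAS), in which significant events and daily news continuously drive opinion evolution through dynamic interactions among agents. The key contributions are: the China-U.S. Relation Evolution (CURE) dataset, covering 20 quarters from 2021 to 2025 and providing a temporal foundation for modelling opinion dynamics; a Dual-Stream Data Integration Engine (DSDIE), which aligns simulations with historical timelines via macro-level events while enabling personalised information exposure; and a News-Driven Dynamic Interaction (NDDI) module, which adaptively groups agents with shared news interests into localised interaction contexts, facilitating bottom-up consensus formation while mitigating the risk of isolated information cocoons. Experiments show that ES-MAS substantially outperforms existing simulators in reproducing real-world historical trends, offering a scalable and effective framework for modelling dynamic opinion evolution.

Link: https://arxiv.org/abs/2606.06971
Authors: Chenxu Zhu, Hantao Yao, Wu Liu, Junbo Guo, Yongdong Zhang
Affiliations: University of Science and Technology of China; People.cn
Categories: Multiagent Systems (cs.MA); Social and Information Networks (cs.SI)
Comments: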

Abstract:Understanding the dynamic evolution of opinions, such as U.S. public attitudes toward China, is essential for assessing geopolitical risks. However, existing LLM-based multiagent simulators predominantly rely on static rules and fixed datasets, limiting their ability to capture the dynamic, event-driven nature of macro-level opinion shifts in real-world settings. To address this limitation, we propose an Event-Steered Multi-Agent Simulator (ES-MAS), in which significant events and daily news continuously drive opinion evolution through dynamic interactions among agents. We first construct the China-U.S. Relation Evolution (CURE) dataset, covering 20 quarters from 2021 to 2025, including 258 major events and over 14,000 daily news articles, and providing a comprehensive temporal foundation for modeling opinion dynamics. Building upon the CURE dataset, we propose a Dual-Stream Data Integration Engine (DSDIE) that aligns simulations with historical timelines via macro-level events while enabling personalized information exposure based on individual agent profiles and contextual signals. Furthermore, we design a News-Driven Dynamic Interaction (NDDI) module, which adaptively groups agents with shared news interests into localized interaction contexts, facilitating bottom-up consensus formation while mitigating the risk of isolated information cocoons. Experimental results on the CURE dataset demonstrate that ES-MAS substantially outperforms existing simulators in reproducing real-world historical trends, offering a scalable and effective framework for modeling dynamic opinion evolution.

[MA-6] MADRAG: Multi-Agent Debate with Retrieval-Augmented Generation for Training-Free Analytic Essay Scoring

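[Quick Read]: This paper addresses bias and unstable scoring when large language models are used for analytic essay scoring. Standard LLM-as-judge approaches are typically prompt-based, prone to bias, and inconsistent in their evaluations. The paper proposes MADRAG, a training-free scoring framework that decomposes evaluation into a multi-agent interaction: an Advocate identifies the essay's strengths, a Skeptic critiques its weaknesses, and a Judge aggregates their arguments into a final score. Crucially, the Judge is augmented with rubric-aligned exemplar retrieval, which calibrates scores through comparison with scored examples. Experiments show that MADRAG significantly outperforms prompt-based baselines and approaches the performance of supervised systems, while ablations show that retrieval drives the calibration gains and debate improves reasoning on higher-level traits. The results highlight the complementary roles of structured interaction and external memory in reliable LLM-based evaluation.

Link: https://arxiv.org/abs/2606.06754
Authors: Ali Keramati, Shiyuan Zhou, Sharad Mehrotra, Mark Warschauer
Affiliation: Unknown
Categories: Multiagent Systems (cs.MA); Computation and Language (cs.CL)
Comments: 21 pages, 7 figures, 14 tables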

Abstract:We present MADRAG, a training-free framework for analytic essay scoring that combines multi-agent reasoning with retrieval-augmented grounding. Unlike standard LLM-as-judge approaches, which are prone to bias and unstable scoring, MADRAG decomposes evaluation into an interactive process: an Advocate identifies strengths, a Skeptic critiques weaknesses, and a Judge aggregates their arguments into a final score. Crucially, the Judge is augmented with rubric-aligned exemplar retrieval, enabling calibration through comparison with scored examples. Our results show that MADRAG significantly outperforms prompt-based baselines while approaching the performance of supervised systems without requiring task-specific training. Ablation studies demonstrate that retrieval drives calibration gains, while debate improves reasoning on higher-level traits. Our findings highlight the complementary roles of structured interaction and external memory in reliable LLM-based evaluation.

[MA-7] Learn to Match: Two-Sided Matching with Temporally Extended Feedback

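[Quick Read]: This paper addresses a modelling gap in traditional two-sided matching: existing models typically reduce the matching process to immediate sub-Gaussian feedback about fixed preferences, missing settings where payoff-relevant information is revealed gradually and changes future matching decisions. The key to its solution is a framework with temporally extended feedback that formulates two-sided matching as a partially observable Markov game (POMG) with costly pre-match screening, noisy post-match observations, evolving latent profiles, and endogenous continuation or dissolution of matches. On this basis the authors build Learn2Match, a multi-agent reinforcement learning (MARL) benchmark that supports decentralised decisions over whom to interview, whom to match with, and when to dissolve a match, and evaluates policies using regret, social welfare, and an information-friction loss. Experiments show that independent PPO achieves higher cumulative social welfare and lower cumulative regret than the bandit-style CA-ETC baseline, demonstrating the promise of MARL for dynamic matching; however, PPO still incurs higher information-friction loss, indicating that end-to-end MARL does not yet provide the coordinated exploration structure of matching-bandit methods. The paper therefore positions Learn2Match as a benchmark for developing the next generation of matching-market algorithms: adaptive like RL agents, statistically disciplined like bandit algorithms, and structurally aware like stable-matching mechanisms.

Link: https://arxiv.org/abs/2606.06744
Authors: Haijing Zong, Yancheng Liang, Boyang Zhou, Natasha Jaques
Affiliations: University of Washington; Paul G. Allen School of Computer Science and Engineering
Categories: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA); Theoretical Economics (econ.TH)
Comments: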

Abstract:Two-sided matching markets often involve information that unfolds over time through interviews, repeated interaction, learning, and separation. Existing matching models typically reduce this process to immediate sub-Gaussian feedback about fixed preferences, missing settings where payoff-relevant information is revealed gradually and changes future matching decisions. We introduce a framework with temporally extended feedback, that formulates two-sided matching as a partially observable Markov game with costly pre-match screening, noisy post-match observations, evolving latent profiles, and endogenous continuation or dissolution. We instantiate this framework in Learn2Match, a multi-agent reinforcement-learning benchmark for dynamic matching markets. Learn2Match supports decentralized decision making over whom to interview, whom to match with, and when to dissolve a match, while evaluating policies using regret, social welfare, and an information-friction loss that measures the welfare gap caused by incomplete revelation of latent preferences. We find that independent PPO achieves higher cumulative social welfare and lower cumulative regret than the bandit-style CA-ETC baseline under temporally extended feedback, demonstrating the promise of MARL for dynamic matching markets. However, PPO still incurs higher information-friction loss, revealing that end-to-end MARL does not yet provide the coordinated exploration structure of matching-bandit methods. These results position Learn2Match as a benchmark for developing the next generation of matching-market algorithms: methods that are adaptive like RL agents, statistically disciplined like bandit algorithms, and structurally aware like stable-matching mechanisms.

[MA-8] Comparing Sentiment Contagion in AI-Agent and Human Social Networks: Evidence from MOLTBOOK

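[Quick Read]: This paper investigates how sentiment evolves in a social network made up solely of AI agents, and in particular whether negativity spreads. It studies MOLTBOOK, a social network of autonomous language-model agents, using almost 2.9 million posts and 1.5 million comments. Negative posts receive more replies than neutral or positive ones, so negativity still attracts attention, but replies to negative content usually do not stay negative: they most often become neutral, and there is little evidence that negative sentiment spreads across days. The main pattern is therefore not a cycle of negativity but negative attention followed by neutralisation. The findings suggest that AI-agent networks may behave differently from human social networks: they may dampen emotional extremes while still depending strongly on how interactions are organised.

Link: https://arxiv.org/abs/2606.06665
Authors: Elyes Ben chaabane, Savindu Herath, Yash Raj Shrestha
Affiliations: UNIL (University of Lausanne); ETH Zurich
Categories: Social and Information Networks (cs.SI); Multiagent Systems (cs.MA)
Comments: 8 pages without appendix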

Abstract:AI agents are beginning to interact not only with people, but also with one another. We investigate what happens to sentiment in such an AI-only social network: does negativity spread, or do replies calm it down? We study MOLTBOOK, a social network made up of autonomous language-model agents, using almost 2.9 million posts and 1.5 million comments. Negative posts receive many more replies than neutral or positive posts, so negativity still attracts attention. However, replies to negative content usually do not stay negative. They most often become neutral, and there is meager evidence that negative sentiment spreads across days. The main pattern is therefore not a cycle of negativity, but negative attention followed by neutralisation. These findings suggest that AI-agent networks may behave differently from human social networks: they may dampen emotional extremes, while still depending strongly on how interactions are organised.

[MA-9] When Does Multi-Agent Collaboration Help? An Entropy Perspective

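[Quick Read]: This paper addresses the poorly understood mechanisms that govern the effectiveness of multi-agent systems (MAS) built on publicly available large language models (LLMs), in particular the underlying reasons for their success or failure. The key to its approach is to use entropy as an analytical lens, considering both intra-agent and inter-agent dynamics by examining entropy transitions during problem-solving across various topologies, six reasoning benchmarks, and two agentic tasks. It finds that a single agent outperforms MAS in approximately 43.3% of cases and that entropy dynamics are largely determined during the first round of interaction. It further reports three observations: 1) Certainty Preference: peak entropy harms and stable entropy benefits MAS correctness; 2) Base Entropy: base models with lower entropy during problem-solving causally drive MAS performance; and 3) Task Awareness: entropy dynamics play varying roles across different tasks. Building on these findings, the authors propose a simple yet effective algorithm, the Entropy Judger, which selects solutions from the MAS's pass@k results and yields consistent accuracy improvements across all MAS configurations and tasks.

Link: https://arxiv.org/abs/2602.04234
Authors: Yuxuan Zhao, Sijia Chen, Ningxin Su
Affiliation: The Hong Kong University of Science and Technology (Guangzhou)
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: Project page: this https URL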

Abstract:Multi-agent systems (MAS) have emerged as a prominent paradigm for leveraging large language models (LLMs) to tackle complex tasks. However, the mechanisms governing the effectiveness of MAS built upon publicly available LLMs, specifically the underlying rationales for their success or failure, remain largely unexplored. In this paper, we revisit MAS through the perspective of entropy, considering both intra- and inter-agent dynamics by investigating entropy transitions during problem-solving across various topologies, six reasoning benchmarks, and two agentic tasks. By analyzing 245 features spanning token-, agent-, and round-level entropy, we counterintuitively find that a single agent outperforms MAS in approximately 43.3% of cases, and that entropy dynamics are largely determined during the first round of interaction. Furthermore, we provide three key observations: 1) Certainty Preference: peak entropy directly harms and stable entropy directly benefits MAS correctness; 2) Base Entropy: base models with lower entropy during problem-solving causally drive MAS performance; and 3) Task Awareness: entropy dynamics of MAS play varying roles across different tasks. Building on these insights, we introduce a simple yet effective algorithm, the Entropy Judger, to select solutions from MAS’s pass@k results, leading to consistent accuracy improvements across all MAS configurations and tasks. Our source code is available at this https URL.
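
The abstract does not say how the Entropy Judger scores candidates. As a rough illustration of entropy-based selection among k candidate solutions, the sketch below picks the candidate with the lowest peak token entropy, breaking ties by mean entropy. This rule is an assumption motivated by the "Certainty Preference" observation. The function names and the input format (one next-token probability distribution per generated token) are hypothetical.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_by_entropy(candidates):
    """Pick one answer from k candidates using an assumed entropy rule.

    candidates: list of (answer, distributions), where distributions holds one
    probability distribution per generated token.
    """
    def score(candidate):
        entropies = [token_entropy(dist) for dist in candidate[1]]
        # Lower peak entropy first, then lower mean entropy as a tie-breaker.
        return (max(entropies), sum(entropies) / len(entropies))

    return min(candidates, key=score)[0]


candidates = [
    ("answer A", [[0.9, 0.1], [0.5, 0.5]]),  # one highly uncertain token
    ("answer B", [[0.9, 0.1], [0.8, 0.2]]),  # consistently confident
]
print(select_by_entropy(candidates))
```

A learned judge over many entropy features, as the 245-feature analysis suggests, could replace the hand-written `score` function without changing the selection step.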

Natural Language Processing

[NLP-0] Agentopia: Long-Term Life Simulation and Learning in Agent Societies

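[Quick Read]: This paper addresses a limitation in developing social intelligence in large language models (LLMs): how long-term, large-scale simulated social interaction can improve a model's ability to understand and replicate human behaviour. Prior multi-agent society simulations typically operate at the scale of days, which limits the depth of social interactions and long-term growth. The paper presents Agentopia, a framework for long-term life simulation in which 100 agents autonomously pursue personal growth, develop social relationships, and fulfil their needs and goals over 10 simulated years. Its core solution is a life reward that mirrors human well-being, which is used to train the underlying LLM via rejection sampling. Experiments show that agents exhibit rich emergent social behaviours, and that life-reward training enhances the underlying LLM, improving agent well-being in simulation and generalising to downstream role-playing benchmarks with a +15.6% improvement.

Link: https://arxiv.org/abs/2606.07513
Authors: Xintao Wang, Sirui Zheng, Hongqiu Wu, Weiyuan Li, Jen-tse Huang, Minghao Zhu, Can Zu, Qi Deng, Jiawei Wang, Qianyu He, Heng Wang, Xiaojian Wu, Yunzhe Tao
Affiliations: Fudan University; Johns Hopkins University; University of Science and Technology of China
Categories: Computation and Language (cs.CL)
Comments: 79 pages, 19 figures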

Abstract:Humans learn from social life. Simulating this process with LLM-powered agents represents a promising research direction, raising a natural question: whether LLMs can learn from such simulated social experience to better understand and replicate human behavior. However, prior agent society simulations typically operate at the scale of days, limiting the depth of social interactions and long-term growth. In this paper, we study long-term life simulation and LLM learning in agent societies, with two goals: (1) investigating social behaviors that emerge from life-long simulation, and (2) developing anthropomorphic capabilities in LLMs, particularly intelligence in social life, through years of simulated social experience. Specifically, we present Agentopia, a comprehensive framework for long-term life simulation in multi-agent societies, where 100 agents autonomously pursue personal growth, develop social relationships, and fulfill their needs and goals over 10 simulated years. We define life reward to mirror human well-being, and leverage this reward to train LLMs via rejection sampling. Extensive experiments show that agents exhibit rich emergent social behaviors. Furthermore, life reward training effectively enhances the underlying LLM, which leads to improved agent well-being in simulation, and generalizes to downstream role-playing benchmarks with +15.6% improvement.
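
Rejection sampling for training data generally means sampling many candidates, scoring them with a reward, and keeping only the best for fine-tuning. The sketch below shows that filtering step under stated assumptions: the `life_reward` components (`needs`, `social`, `goals`), their equal weighting, and the top-fraction cutoff are hypothetical stand-ins, since the abstract does not define the reward or the acceptance rule.

```python
def life_reward(trajectory, weights=None):
    """Hypothetical well-being score: a weighted sum of per-trajectory components."""
    weights = weights or {"needs": 1.0, "social": 1.0, "goals": 1.0}
    return sum(weight * trajectory[key] for key, weight in weights.items())


def rejection_sample(trajectories, reward_fn, keep_fraction=0.5):
    """Keep the highest-reward fraction of sampled trajectories (at least one)."""
    ranked = sorted(trajectories, key=reward_fn, reverse=True)
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]


sampled = [
    {"id": 1, "needs": 0.9, "social": 0.8, "goals": 0.7},
    {"id": 2, "needs": 0.2, "social": 0.3, "goals": 0.1},
    {"id": 3, "needs": 0.6, "social": 0.6, "goals": 0.6},
    {"id": 4, "needs": 0.1, "social": 0.1, "goals": 0.1},
]
accepted = rejection_sample(sampled, life_reward)
print([t["id"] for t in accepted])
```

The accepted trajectories would then serve as supervised fine-tuning data for the underlying LLM, which is the usual role of the filtered set in rejection-sampling pipelines.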

[NLP-1] MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism

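[Quick Read]: This paper addresses the difficulty current vision-language models (VLMs) have with hours-long videos, where processing full-length visual sequences causes token explosion and attention dilution. Its core solution is MemDreamer, a framework that decouples perception from reasoning and turns long-video understanding into an agentic exploration process. The key innovation is a Hierarchical Graph Memory, a top-down three-tier architecture for semantic abstraction anchored by a foundational graph that captures spatiotemporal and causal relations. During inference, the reasoning model uses agentic tool-augmented retrieval, navigating hierarchies, searching nodes, and traversing logical edges via an Observation-Reason-Action loop. Experiments show that MemDreamer achieves state-of-the-art (SOTA) results on four mainstream benchmarks, narrowing the gap with human experts to only 3.7 points, while constraining the reasoning context window to 2% of full-context ingestion and delivering a 12.5-point absolute accuracy gain. Statistical analysis further uncovers a strong positive linear correlation between a VLM's performance on logic reasoning and on long-video understanding benchmarks, establishing agentic capability scaling as a new paradigm for multimodal comprehension.

Link: https://arxiv.org/abs/2606.07512
Authors: Cong Chen, Guo Gan, Kaixiang Ji, ChaoYang Zhang, Zhen Yang, Guangming Yao, Hao Chen, Jingdong Chen, Yi Yuan, Chunhua Shen
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: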

Abstract:Current Vision-Language Models struggle with hours-long videos because processing full-length visual sequences induces prohibitive token explosion and attention dilution. To overcome this, we introduce MemDreamer to decouple perception and reasoning, shifting long-video understanding into an agentic exploration process. As a plug-and-play framework, it incrementally streams videos to construct a Hierarchical Graph Memory, a top-down three-tier architecture for semantic abstraction, anchored by a foundational graph capturing spatiotemporal and causal relations. During inference, the reasoning model employs agentic tool-augmented retrieval, navigating hierarchies, searching nodes, and traversing logical edges via an Observation-Reason-Action loop. Experiments show MemDreamer achieves SOTA results across four mainstream benchmarks, narrowing the gap with human experts to only 3.7 points. It constrains the reasoning context window to merely 2% of full-context ingestion while delivering a 12.5 point absolute accuracy gain. Furthermore, statistical analysis uncovers a strong positive linear correlation between a VLM’s performance on logic reasoning and long-video understanding benchmarks, establishing agentic capability scaling as a new paradigm for multimodal comprehension.

[NLP-2] Supervision versus Demonstration-Based In-Context Learning for Multiword Expression Classification ACL

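[Quick Read]: This paper addresses the difficulty of identifying Turkish idiomatic light verb constructions (LVCs), which often share the same surface form as fully literal verb-object combinations while being partially idiomatic, and so pose a challenge for multiword expression processing. It frames Turkish LVC detection as a binary classification task (literal vs. idiomatic meaning) and compares a supervised Turkish encoder baseline (BERTurk with a classifier head) with three instruction-tuned LLMs under zero-shot, one-shot, and few-shot prompting, analysing how demonstrations shift error profiles. In zero-shot, the LLMs perform well on negatives but show very low LVC recall; one-shot prompting sharply improves LVC detection but can induce strong model-specific biases; and a richer few-shot prompt improves calibration and yields robust overall performance for GPT-OSS-20B and Qwen 2.5-14B. Overall, the results highlight substantial prompt sensitivity in Turkish metalinguistic classification: the supervised baseline remains competitive, while prompted LLMs can match or exceed it on LVCs with carefully constructed demonstrations.

Link: https://arxiv.org/abs/2606.07479
Authors: Sercan Karakaş, Yusuf Şimşek
Affiliations: University of Chicago; Fırat University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to ACL SRW 2026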

Abstract:Turkish idiomatic light verb constructions (LVCs) are challenging for multiword expression processing because they often share the same surface form as fully literal verb-object combinations while functioning as a single, partially idiomatic predicate. We frame Turkish LVC detection as a binary classification task (literal meaning vs. idiomatic meaning) and evaluate on a manually created controlled set (N=147) with matched negatives: out-of-domain random sentences and in-domain literal controls (NLVC), alongside LVC positives. We compare a supervised Turkish encoder baseline (BERTurk with a classifier head) to three instruction-tuned LLMs from different families under zero-shot, one-shot, and few-shot prompting, and analyze how demonstrations shift error profiles. In zero-shot, LLMs perform well on negatives but show very low LVC recall. One-shot prompting sharply improves LVC detection but can induce strong, model-specific biases, leading models to overpredict or underpredict LVCs. A richer few-shot prompt improves calibration and yields robust overall performance for GPT-OSS-20B and Qwen 2.5-14B. Overall, the results highlight substantial prompt sensitivity in Turkish metalinguistic classification: the supervised baseline remains competitive, while prompted LLMs can match or exceed it on LVCs with carefully constructed demonstrations.

[NLP-3] TEVI: Text-Conditioned Editing of Visual Representations via Sparse Autoencoders for Improved Vision-Language Alignment

【速读】：该论文旨在解决视觉-语言模型（Vision-Language Model）中图像与文本嵌入对齐不佳的问题，其核心挑战在于图像所包含的信息量远超其对应描述性文字（即caption）所传达的内容，导致信息不对称。为应对这一问题，论文提出TEVI框架，其关键创新在于利用文本描述作为信号，指导从图像嵌入中保留与之相关的关键特征。具体而言，通过引入稀疏自编码器（sparse autoencoders）实现图像嵌入的解耦，并设计一个掩码模块，根据给定文本内容有选择性地重构图像嵌入，从而在嵌入空间中强化与文本一致的语义表征。在合成文本的受控实验中，验证了TEVI能有效保留由文本指定的属性并抑制无关信息；在自然图像训练的CLIP模型上应用后，TEVI显著提升了跨粗粒度短文本（如MS COCO、Flickr）和细粒度长文本（如IIW、DOCCI）基准下的检索性能，尤其在信息丰富的文本上表现更优，并增强了在RoCOCO鲁棒性测试中的表现。

Link: https://arxiv.org/abs/2606.07451
Authors: Sweta Mahajan, Sukrut Rao, Jiahao Xie, Alexander Koller, Bernt Schiele
Affiliations: Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, Germany; Saarland University, Saarbrücken, Germany
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 20 pages, 13 figures, 14 tables

Abstract:Vision-language models such as CLIP are highly useful for diverse tasks due to their shared image-text embedding space. Despite this, the image and text embeddings are often poorly aligned, affecting downstream performance. Recent work has shown that this can be attributed to an information imbalance: images contain more information than their captions describe. In this work, we propose TEVI, a framework that uses captions as a signal for what to retain from image embeddings. Specifically, we use sparse autoencoders to disentangle image embeddings and train a masking module to selectively reconstruct the embedding based on a given caption. In a controlled setup with synthetic captions, we show that TEVI is effective at preserving caption-described attributes while discarding others. By applying TEVI to CLIP models trained on natural images, we further achieve improved retrieval performance across coarse-grained short-caption (MS COCO, Flickr) and fine-grained long-caption (IIW, DOCCI) benchmarks, with stronger gains on richer captions, and improved robustness on the RoCOCO benchmark.
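
The masking step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the ReLU encoder, the toy identity weights, and the hand-written mask are all assumptions. In TEVI the mask would come from a trained masking module conditioned on the caption.

```python
def relu(values):
    return [max(0.0, v) for v in values]

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def masked_reconstruct(image_emb, w_enc, w_dec, mask):
    """Encode to SAE latents, keep only mask-selected latents, decode."""
    latents = relu(matvec(w_enc, image_emb))        # disentangled (sparse) code
    kept = [z * m for z, m in zip(latents, mask)]   # caption-selected latents
    return matvec(w_dec, kept)                      # edited image embedding

# Toy setup (hypothetical): 2-d embedding, 2 latents, identity weights.
W_ENC = [[1.0, 0.0], [0.0, 1.0]]
W_DEC = [[1.0, 0.0], [0.0, 1.0]]

# Pretend the caption only mentions the attribute carried by latent 0.
edited = masked_reconstruct([0.8, 0.5], W_ENC, W_DEC, mask=[1.0, 0.0])
```

With this toy mask the second latent is zeroed out, so the reconstructed embedding keeps only the component the caption is assumed to describe.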

[NLP-4] Sycophantic Praise: Evaluating Excessive Praise in Language Models

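[Quick Read]: This paper addresses sycophantic praise in language models: praise given to users beyond what the evidence supports, which is especially pronounced in social and interpretive tasks. Prior work mostly studies excessive agreement with or validation of user views, and overlooks explicit praise and flattery as a separate behavior with its own potential harms. The paper proposes a parameterized evaluation framework that measures how well the degree of praise matches the quality of the user's contribution and the user's expected ability, and so judges whether the praise is excessive. The key is an adjustable reference baseline that lets the evaluation reflect what is reasonable in context and separates appropriate praise from unwarranted flattery. Experiments show that the framework agrees with human annotations substantially better than generic LLM judges, and that sycophantic praise is far more frequent in subjective domains than in objective reasoning settings, which establishes praise calibration as a distinct alignment challenge.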

Link: https://arxiv.org/abs/2606.07441
Authors: Daniel Vennemeyer, Phan Anh Duong, Meryl Ye, Ruihong Huang, Tianyu Jiang
Affiliations: University of Cincinnati; Carnegie Mellon University; Texas A&M University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Sycophancy in language models is typically studied as excessive agreement or validation, while explicit praise and flattery have received comparatively little attention. We argue that sycophantic praise is a distinct alignment problem that cannot be reliably measured using current methods. We introduce a parameterized framework that measures whether praise is excessive relative to contribution quality and expected user ability. We show that our framework substantially outperforms generic LLM judges in agreement with human annotations, and that sycophantic praise occurs far more frequently in social and interpretive domains than in objective reasoning settings. Together, these findings position praise calibration as a distinct alignment challenge.

[NLP-5] The Lipreading Gap: Do VSR Models Perceive Visual Speech Like Human Lipreaders? INTERSPEECH2026

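[Quick Read]: This paper asks whether visual speech recognition (VSR) models, which now outperform human lipreaders, actually perceive visual speech in a human-like way. Although VSR models achieve higher overall word-level accuracy than humans, their error patterns differ markedly: model errors are explained mainly by word frequency in the training corpus rather than by how visually informative a word is. Models also gain most on the visemes that humans find hardest and depend far less on visual clarity than humans do. Notably, a text-only n-gram baseline given only a few initial phonemes comes close to human performance. The paper's contribution is therefore diagnostic: current VSR systems rely primarily on linguistic priors learned from training data rather than on visual perception, and fail to bind visual features into meaningful words.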

Link: https://arxiv.org/abs/2606.07435
Authors: Rishabh Jain, Naomi Harte
Affiliations: Sigmedia Group, School of Engineering, Trinity College Dublin, Ireland
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Accepted at INTERSPEECH 2026

Abstract:Visual speech recognition (VSR) models now surpass human lipreaders on benchmarks, but do such gains establish human-like visual speech perception? To explore this, we compare three VSR systems with human baselines on the MaFI word-level lipreading dataset using word, character, phoneme, and viseme-level metrics. Although models achieve higher overall accuracy, they succeed and fail on different words than humans. A text-only n-gram baseline given only a few initial phonemes rivals human lipreading. VSR word-level errors are consistently better explained by training word frequency than by the visual informativeness of words. Viseme accuracies, confusion matrices and human-model correlations further show that models gain most on visemes humans find hardest, and show much weaker dependence on visual clarity. Our work demonstrates that VSR systems rely primarily on language cues from training data rather than visual perception, failing to bind visual features into meaningful words.

[NLP-6] The Masked Advantage: Uncovering Local-Language Access to Cultural Knowledge in LLMs

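[Quick Read]: This paper asks whether large language models access local cultural knowledge more readily through English or through the local language when answering cultural questions across languages. Existing evaluations have two limitations: many rely on questions generated from parallel templates, which may not reflect how cultural knowledge naturally appears, and raw accuracy conflates general language proficiency with language-specific knowledge access. The paper proposes a controlled evaluation framework built on real-world cultural questions collected from regional benchmarks and local sources. It crosses question type (culture-agnostic vs. culture-specific) with query language (English vs. local language) and uses a shared one-parameter logistic (1PL) item response theory model to separate language proficiency from cultural knowledge access. Across 13 locales and roughly 80 models, English is consistently stronger on culture-agnostic questions, reflecting stronger English proficiency. After correcting for this proficiency gap, however, the local language shows a positive knowledge-access advantage in nearly all locale-model settings. The advantage is often masked in raw accuracy but is more visible for frontier, regionally aligned, or language-adapted models. Weaker local-language performance therefore does not imply weaker cultural knowledge; local cultural knowledge may be more accessible through the local language but hidden by limited language proficiency.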

Link: https://arxiv.org/abs/2606.07422
Authors: Yang Zhang, Xiao Fei, Amr Mohamed, Sarah Almeida Carneiro, Mersin Konomi, Mingmeng Geng, Ahmed Asaad, Guokan Shang, Michalis Vazirgiannis
Affiliations: Ecole Polytechnique; MBZUAI (Mohamed Bin Zayed University of Artificial Intelligence); ENS-PSL (École Normale Supérieure - PSL); Durham University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models are increasingly used to answer culturally grounded questions across languages, yet it remains unclear whether local cultural knowledge is better accessed through English or the local language. Existing evaluations face two key limitations: many rely on parallel template-based questions that may not reflect how cultural knowledge naturally appears, and raw accuracy conflates general language proficiency with language-conditioned knowledge access. We address these issues with a controlled framework built on real-world cultural questions collected from regional benchmarks and local sources. By crossing question type (culture-agnostic vs. culture-specific) with query language (English vs. local language), and estimating ability with a shared 1PL item response theory model, we separate proficiency from localized knowledge access. Across 13 locales and roughly 80 models, we find a consistent English advantage on culture-agnostic questions, indicating stronger English proficiency. However, after accounting for this proficiency gap, local languages show a positive knowledge-access advantage in nearly all locale-model settings. This advantage is often masked in raw accuracy but becomes more visible for frontier, regionally aligned, or language-adapted models. Our results suggest that weaker local-language performance does not necessarily imply weaker cultural knowledge; rather, local cultural knowledge may be more accessible through the local language but hidden by limited language proficiency.
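
For readers unfamiliar with the 1PL (Rasch) model mentioned above, the sketch below shows its response probability and a crude grid-search ability estimate. It only illustrates the standard 1PL form; how the paper shares item parameters across conditions is not stated in the abstract, and the function names and grid are assumptions.

```python
import math

def p_correct(ability, difficulty):
    """1PL: probability of a correct answer depends only on ability - difficulty."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def estimate_ability(responses, difficulties, grid=None):
    """Maximum-likelihood ability on a coarse grid, with item difficulties held fixed."""
    grid = grid or [x / 10 for x in range(-40, 41)]

    def log_likelihood(theta):
        total = 0.0
        for correct, b in zip(responses, difficulties):
            p = p_correct(theta, b)
            total += math.log(p if correct else 1.0 - p)
        return total

    return max(grid, key=log_likelihood)

# Hypothetical usage: the same items answered in two query languages.
difficulties = [0.0, 0.0, 0.0]
ability_english = estimate_ability([1, 0, 0], difficulties)
ability_local = estimate_ability([1, 1, 0], difficulties)
```

Because the item difficulties are shared, a difference between the two ability estimates can be read on one scale, which is the property a shared 1PL model is used for here.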

[NLP-7] M³Exam: Benchmarking Multimodal Memory for Realistic User-Agent Interactions

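[Quick Read]: This paper addresses the fact that existing benchmarks for language agents cannot measure cross-modal grounding, implicit information inference, or cross-session reasoning over the multimodal information that accumulates in real user-agent interaction. Existing benchmarks are mostly modeled on human-human conversation and do not adequately evaluate authentic multimodal file interaction or concealed user information. The paper introduces M³Exam, a query-centric benchmark for multimodal conversational memory built on realistic user-agent interaction, with multi-dimensional evaluation covering cross-modal grounding and implicit information inference. Experiments show that mainstream multimodal LLMs (MLLMs) and memory systems still have clear gaps in cross-modal grounding, cross-session reasoning, and the efficiency of accumulating multimodal context. The paper further proposes M³Proctor, a multimodal memory method that detects query modality bias and consumes raw visual sources only on demand, improving accuracy by 13% while cutting index-construction time and retrieved tokens by over 70%.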

Link: https://arxiv.org/abs/2606.07402
Authors: Zhengjun Huang, Wenxuan Liu, Zhoujin Tian, Wei Chen, Junle Chen, Yuqian Wu, Fangyuan Zhang, Qintian Guo, Xiaofang Zhou
Affiliations: The Hong Kong University of Science and Technology; Beijing University of Chemical Technology; The Hong Kong University of Science and Technology (Guangzhou); Harbin Institute of Technology (Shenzhen); Beijing Institute of Technology (Zhuhai); Tencent Hy; Peng Cheng Laboratory
Categories: Computation and Language (cs.CL)
Comments:

Abstract: Language agents are increasingly deployed over accumulating multimodal information, yet existing benchmarks assume a human-human form with sparse visuals and straightforward content, evaluating neither reasoning over authentic multimodal file interaction nor the interpretation of concealed user information. We therefore introduce M³Exam, a query-centric multimodal conversational memory benchmark built on realistic user-agent interaction, with multi-dimensional evaluation spanning cross-modal grounding and implicit information inference. Benchmarking MLLMs and memory systems reveals persistent gaps in cross-modal grounding, cross-session reasoning, and the efficiency cost of accumulating multimodal context. We further propose M³Proctor, a multimodal memory method that detects query modality bias and consumes raw visual sources only on demand, improving accuracy by 13% while cutting index-construction time and retrieved tokens by over 70%.

[NLP-8] Do Coding Agents Deceive Us? Detecting and Preventing Cheating via Capped Evaluation with Randomized Tests

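[Quick Read]: This paper addresses shortcut exploitation in agent evaluation and training: models obtain high evaluation scores through unintended shortcuts rather than by solving the intended task, so scores stop reflecting true task-solving ability. The proposed framework, CapCode, constructs coding datasets with randomized test cases and deliberately caps the best performance achievable without cheating at a value below 1. This gives scores a clearer interpretation: a score substantially above the cap is evidence of cheating. Building on the same principle, the paper proposes CapReward, a reward design that discourages optimization beyond the cap. Experiments show that CapCode detects cheating while preserving the performance ranking of models, and that CapReward reduces cheating behavior, yielding models that follow the task specification more faithfully.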

Link: https://arxiv.org/abs/2606.07379
Authors: Thanawat Lodkaew, Johannes Ackermann, Soichiro Nishimori, Nontawat Charoenphakdee, Masashi Sugiyama, Takashi Ishida
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Methodology (stat.ME)
Comments:

Abstract:A growing failure mode in agent evaluation and training is that models can achieve high evaluation scores by exploiting shortcuts instead of solving the intended task, producing deceptive performance. This makes evaluation scores unreliable as measures of true task-solving ability. We propose CapCode, a framework for constructing coding datasets with randomized tests whose best achievable non-cheating performance is deliberately capped below one. This capped-performance design gives evaluation scores a clearer interpretation: scores substantially above the cap are implausible and therefore provide evidence of cheating. To prevent cheating, we propose CapReward, a reward design based on the CapCode principle to discourage optimization beyond the cap. Experiments across multiple datasets show that CapCode detects cheating while preserving performance ranking of models, and CapReward reduces cheating behavior, yielding models that better follow the intended task specification.
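
One way to read "scores substantially above the cap are implausible" is as a one-sided statistical test. The sketch below uses a binomial tail probability for that purpose. This is an assumption made for illustration: the abstract does not say which criterion CapCode applies, and `cap`, `alpha`, and both function names are hypothetical.

```python
from math import comb

def tail_prob(passed, total, cap):
    """P(X >= passed) when each test passes independently with probability cap."""
    return sum(
        comb(total, k) * cap**k * (1 - cap) ** (total - k)
        for k in range(passed, total + 1)
    )

def looks_like_cheating(passed, total, cap, alpha=0.01):
    """Flag a score that an honest solver capped at `cap` would rarely reach."""
    return tail_prob(passed, total, cap) < alpha

# Hypothetical numbers: 200 randomized tests, honest ceiling of 0.8.
suspicious = looks_like_cheating(passed=198, total=200, cap=0.8)
plausible = looks_like_cheating(passed=165, total=200, cap=0.8)
```

A score near the cap is left alone, while a near-perfect score on tests that an honest solver cannot fully pass is flagged.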

[NLP-9] DirectAudioEdit: Inversion-Free Text-Guided Audio Editing via Diffusion Prediction Contrast

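[Quick Read]: This paper addresses how to perform efficient and accurate text-guided audio editing without model training and without inversion. Existing training-free methods mostly rely on inversion-based editing, which preserves edit-irrelevant components well but incurs extra computational overhead and reconstruction error. The paper proposes DirectAudioEdit, presented as the first training-free and inversion-free audio editing method. Its core idea is to construct a direct source-to-target editing path through diffusion denoising dynamics, which avoids the iterative inversion process. On music and event-level audio benchmarks, it reduces macro-averaged FAD and KL by 15.9% and 15.8% respectively compared with DDPM inversion, with an editing speedup of up to 64.5%.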

Link: https://arxiv.org/abs/2606.07356
Authors: Zhengkun Ge, Xiaoqian Liu, Haoran Zhang, Yuan Ge, Junxiang Zhang, Zhengtao Yu, Jingbo Zhu, Tong Xiao
Affiliations: Northeastern University (China); Kunming University of Science and Technology; NiuTrans Research
Categories: Sound (cs.SD); Computation and Language (cs.CL)
Comments:

Abstract:Text-guided audio editing aims to modify the language-specified acoustic content while preserving edit-irrelevant source components. Existing training-free methods typically rely on inversion-based editing. While inversion-free editing is appealing as it decreases computational overhead and reconstruction errors, it remains largely unexplored for audio editing. The key challenge is to construct a source-to-target editing path through diffusion denoising dynamics. In this paper, we introduce DirectAudioEdit, the first attempt to develop a training-free and inversion-free method for audio editing. Experiments on music and event-level benchmarks across two backbones show that DirectAudioEdit reduces macro-averaged FAD and KL by 15.9% and 15.8% compared with DDPM inversion, while achieving up to 64.5% editing speedup.

[NLP-10] LLM-Guided Evolution for Medical Decision Pipelines

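[Quick Read]: This paper addresses the high cost of adapting large language models (LLMs) to clinical workflows, which usually requires expensive fine-tuning or manual prompt and pipeline engineering. The challenge is to discover executable, interpretable, and well-performing medical decision strategies without parameter updates. The proposed solution is LLM-guided MAP-Elites evolution, applied at inference time to search for and optimize medical decision programs automatically. Urgency triage, interactive consultation, and medical image classification are each formulated as an evolutionary search over executable artifacts guided by a task-specific fitness function. In all three settings the evolved artifacts improve over manually designed baselines. In triage, Semigran accuracy rises from 77.3% to 87.1% and emergency recall from 0.60 to 0.97. In interactive consultation, evolved policies improve the accuracy-cost trade-off across several LLMs and transfer to a held-out dataset. In pneumonia image classification, prompt-only evolution improves frozen MedGemma vision-language models (VLMs) while preserving strict JSON outputs. Qualitative analysis attributes the gains to interpretable program-level mechanisms, including calibrated triage boundaries, targeted evidence acquisition, selective commitment, and finding-oriented visual decision rules, rather than superficial prompt rewording.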

Link: https://arxiv.org/abs/2606.07342
Authors: Ivan Sviridov, Artem Oskin, Ivan Panin, Iaroslav Bespalov, Dmitry Dylov, Ivan Oseledets, Aleksandr Nesterov
Affiliations: Sber AI Lab; AIRI
Categories: Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)
Comments:

Abstract: Adapting large language models (LLMs) to clinical workflows often requires costly fine-tuning or manual prompt and pipeline engineering. We study LLM-guided MAP-Elites evolution as an inference-time alternative for discovering medical decision strategies and provide an implementation repository at this https URL. We formulate urgency triage, interactive consultation, and medical image classification as evolutionary searches over executable artifacts optimized by task-specific fitness functions. Across all three settings, evolution improves over manually designed baselines under practical constraints. In triage, evolved programs increase Semigran accuracy from 77.3% to 87.1% and emergency recall from 0.60 to 0.97, while improving safety-weighted held-out MIMIC-ESI performance. In interactive consultation, evolved policies improve the accuracy–cost frontier across Llama-3, Qwen-3.5, and Gemma-4 and transfer to held-out iCRAFTMD. In PneumoniaMNIST, prompt-only evolution improves frozen MedGemma VLMs while preserving strict JSON outputs. Qualitative analysis shows that the gains come from interpretable program-level mechanisms, calibrated triage boundaries, targeted evidence acquisition, selective commitment, and finding-oriented visual decision rules, rather than superficial prompt rewording alone.
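
The MAP-Elites loop that the paper builds on can be sketched generically as below. This is a toy under stated assumptions, not the paper's pipeline: the candidate is a single hypothetical triage threshold, the mutation is random noise standing in for an LLM-proposed edit, and the fitness and descriptor functions are invented for illustration.

```python
import random

def map_elites(seed, mutate, fitness, descriptor, iterations, rng):
    """Keep the best candidate found so far in each behavior cell."""
    archive = {}  # cell -> (fitness, candidate)

    def consider(candidate):
        cell, score = descriptor(candidate), fitness(candidate)
        if cell not in archive or score > archive[cell][0]:
            archive[cell] = (score, candidate)

    consider(seed)
    for _ in range(iterations):
        _, parent = rng.choice(list(archive.values()))  # pick any current elite
        consider(mutate(parent, rng))                   # LLM edit in the paper; noise here
    return archive

def fitness(threshold):       # hypothetical: the best threshold is 0.3
    return -(threshold - 0.3) ** 2

def descriptor(threshold):    # hypothetical: 5 behavior cells over [0, 1]
    return min(int(threshold * 5), 4)

def mutate(threshold, rng):
    return min(1.0, max(0.0, threshold + rng.gauss(0.0, 0.1)))

archive = map_elites(0.9, mutate, fitness, descriptor, 200, random.Random(0))
best_score, best_threshold = max(archive.values())
```

The archive holds one elite per cell rather than a single global best, which is what lets MAP-Elites return a diverse set of strategies instead of one optimum.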

[NLP-11] SV-Detect: AI-generated Text Detection with Steering Vectors

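[Quick Read]: This paper addresses the sharp drop in machine-generated text detection performance under distribution shift, in particular across domains, across source models, and under machine-editing attacks such as polishing and rewriting. The proposed detector is built on steering vectors extracted from the hidden representations of a frozen language model. At each layer it constructs a direction that separates human-written from machine-generated text, and represents each input by its layer-wise alignment (projection) with these directions. A lightweight classifier trained on these projection features outputs the final detection score. The method performs strongly both in-distribution and in several out-of-distribution settings. Analysis shows that the learned directions align with recognizable stylistic cues while also capturing substantial signal beyond surface features. The work thus reframes fake-text detection as a representation-space probing problem for which steering vectors are a simple and effective solution.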

Link: https://arxiv.org/abs/2606.07313
Authors: Mikhail Vishnyakov, Tatiana Gaintseva
Affiliations: Queen Mary University of London
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Detecting machine-generated text is especially difficult under distribution shift, such as transfer across domains, source models, and editing attacks. We propose a fake-text detector based on steering vectors extracted from the hidden representations of a frozen language model. At each layer, we construct a direction that separates human-written from machine-generated text, and represent each input by its layer-wise alignment with these directions. A lightweight classifier trained on these projection features yields the final detection score. Our method achieves strong performance both in-distribution and under distribution shift, including across domains, source models, and machine-editing transformations such as polishing and rewriting. Interpretation analyses show that the learned directions align with recognizable stylistic cues while capturing substantial additional signal beyond surface features. These results position fake-text detection as a representation-space probing problem and show that steering vectors provide a simple and effective solution.
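
The layer-wise projection features can be sketched as follows. The abstract does not say how each direction is constructed; the difference of class means used here is a common choice for steering vectors and is an assumption, as are the function names and the toy hidden states. The lightweight classifier on top is omitted.

```python
import math

def mean_vec(vectors):
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def steering_direction(human_states, machine_states):
    """Unit vector pointing from the human mean to the machine mean (assumed construction)."""
    diff = [m - h for m, h in zip(mean_vec(machine_states), mean_vec(human_states))]
    norm = math.sqrt(sum(d * d for d in diff)) or 1.0
    return [d / norm for d in diff]

def projection_features(hidden_by_layer, directions):
    """One feature per layer: the hidden state's projection onto that layer's direction."""
    return [
        sum(h * d for h, d in zip(hidden, direction))
        for hidden, direction in zip(hidden_by_layer, directions)
    ]

# Toy hidden states for a single layer (hypothetical values).
human = [[0.0, 0.0], [0.0, 2.0]]
machine = [[2.0, 0.0], [2.0, 2.0]]
direction = steering_direction(human, machine)
features = projection_features([[3.0, 5.0]], [direction])
```

In a full detector, `features` would contain one value per model layer and would be passed to a small classifier such as logistic regression.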

[NLP-12] Acoustic Cue Alignment in Audio Language Models for Speech Emotion Recognition

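[Quick Read]: This paper asks whether instruction-following audio language models (ALMs) use explicit acoustic cues in a grounded way for speech emotion recognition when the raw audio is already available. The approach derives six interpretable acoustic concept tokens from the standardised eGeMAPS paralinguistic feature set, summarising energy, pitch, dynamics, brightness, formants, and voice quality, and appends them to the textual prompt as additional symbolic cues while keeping the audio input unchanged. On the FAU-Aibo and IEMOCAP benchmarks, tokens aligned with the audio improve unweighted average recall (UAR), whereas shuffled, conflicting, or corrupted tokens reduce performance and shift confusions toward the neutral class. Predictions do not collapse even under strong token perturbations, which suggests that the models are sensitive to the symbolic cue channel but remain partly anchored to the audio signal. The authors argue that token-only interventions are a practical way to probe audio-grounded cue use, robustness, and interpretability in ALM-based affective computing.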

Link: https://arxiv.org/abs/2606.07309
Authors: Iosif Tsangko, Andreas Triantafyllopoulos, Björn W. Schuller
Affiliations: University of Augsburg; University of Manchester
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 6 pages, 3 figures, 3 tables

Abstract:Instruction-following audio language models (ALMs) can be augmented with explicit acoustic cues, yet it remains unclear whether such cues are used in a grounded way when the raw audio is already available. We study this question in speech emotion recognition (SER) by deriving six interpretable acoustic concept tokens from the standardised eGeMAPS paralinguistic feature set. These tokens summarise energy, pitch, dynamics, brightness, formants, and voice quality, and are appended to the textual prompt while the audio input is kept unchanged. Across the widely used FAU-Aibo and IEMOCAP benchmarks, aligned tokens improve unweighted average recall (UAR), whereas shuffled, conflicting, or corrupted tokens reduce performance relative to aligned tokens and shift confusions toward neutral. Importantly, predictions do not collapse under strong token perturbations, suggesting that the models are sensitive to the symbolic cue channel but remain partly anchored to the audio signal. We argue that token-only interventions provide a practical way to probe audio-grounded cue use, robustness, and interpretability in ALM-based affective computing.
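
Unweighted average recall (UAR), the metric reported above, is the mean of per-class recalls, so every emotion class counts equally regardless of its frequency. The sketch below shows that definition on made-up labels; the function name and the example data are illustrative only.

```python
def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recall; each class contributes equally."""
    recalls = []
    for label in sorted(set(y_true)):
        indices = [i for i, y in enumerate(y_true) if y == label]
        hits = sum(1 for i in indices if y_pred[i] == label)
        recalls.append(hits / len(indices))
    return sum(recalls) / len(recalls)

# A classifier that always answers "neutral" on an imbalanced toy set.
y_true = ["neutral", "neutral", "neutral", "neutral", "angry"]
y_pred = ["neutral"] * 5

uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

On this toy set accuracy is 0.8 while UAR is 0.5, which is why a shift of confusions toward the neutral class shows up clearly in UAR.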

[NLP-13] Phun-Bench: Evaluating LLMs on Phonological Understanding in Chinese ACL2026

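[Quick Read]: This paper addresses the lack of adequate evaluation of phonological understanding in large language models (LLMs). Most research focuses on meaning (semantics) and symbols (spelling) and overlooks sound, and existing benchmarks either can be solved by rote memorization or are entangled with other abilities, so they do not measure genuine phonological understanding. The paper presents Phun-Bench, a purpose-built Chinese benchmark covering three dimensions (Homophony, Rhyme, and Phonetic Similarity) with diverse tasks and settings, designed to evaluate LLMs' phonological understanding systematically. The results show that models recall correct pronunciations well but struggle to use phonological knowledge in the flexible, intuitive way human speakers do. Based on detailed analyses, the authors propose a hypothesis about the mechanism underlying LLMs' phonological understanding and "perception", pointing to an underexplored research frontier.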

Link: https://arxiv.org/abs/2606.07300
Authors: Xing Yue, Yongliang Shen, Weiming Lu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted to ACL 2026 Main Conference

Abstract:Language is a vehicle for thought, intricately tied to sounds, symbols, and meaning. However, most large language model (LLM) research focuses on meaning (semantics) and symbols (spelling) while largely overlooking sounds. Existing benchmarks on LLMs’ phonological abilities are either solvable through rote memorization or intertwined with other abilities, making them inadequate to measure LLMs’ genuine ability in phonological understanding. Here, we present Phun-Bench, a purpose-built Chinese benchmark with diverse tasks and settings across three dimensions (Homophony, Rhyme, and Phonetic Similarity), designed to systematically evaluate LLMs’ phonological understanding. Our results show that while LLMs excel at recalling correct pronunciations, they generally struggle to leverage phonological knowledge in the flexible and intuitive way that human speakers do. Moreover, through detailed analyses, we propose a hypothesis regarding the underlying mechanism of LLMs’ phonological understanding and “perception”, highlighting an underexplored frontier for future research.

[NLP-14] SWE-Explore: Benchmarking How Coding Agents Explore Repositories

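[Quick Read]: This paper addresses the fact that existing coding-agent benchmarks such as SWE-bench reduce coding tasks to a binary outcome (resolved or unresolved) and so neglect fine-grained capabilities such as repository understanding, context retrieval, code localization, and bug diagnosis. It introduces SWE-Explore, a benchmark that isolates repository exploration. Given a repository and an issue, an explorer must return a ranked list of relevant code regions under a fixed line budget. Line-level ground truth is derived from the trajectories of independent agents that successfully solved the same issue, capturing the code regions their solution paths actually consulted. Exploration is evaluated along three dimensions (coverage, ranking quality, and context efficiency), and these metrics track downstream repair behavior. Agentic explorers form a clear tier above classical retrieval methods. File-level localization is already strong, but line-level coverage and efficient ranking remain the key factors that separate state-of-the-art explorers.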

Link: https://arxiv.org/abs/2606.07297
Authors: Shaoqiu Zhang, Yuhang Wang, Jialiang Liang, Yuling Shi, Wenhao Zeng, Maoquan Wang, Shilin He, Ningyuan Xu, Siyu Ye, Kai Cai, Xiaodong Gu
Affiliations: Shanghai Jiao Tong University; Xinjiang University; University of Illinois at Urbana-Champaign; The Chinese University of Hong Kong
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments: 20 pages, 5 figures

Abstract:Repository-level coding benchmarks such as SWE-bench have driven a rapid surge in the capabilities of coding agents. Yet they usually treat coding tasks as a holistic, binary prediction problem (e.g., resolved or unresolved), neglecting fine-grained agent capabilities such as repository understanding, context retrieval, code localization, and bug diagnosis. In this paper, we introduce SWE-Explore, a benchmark that isolates the evaluation of repository exploration, a critical capability of coding agents. Given a repository and an issue, SWE-Explore asks an explorer to return a ranked list of relevant code regions under a fixed line budget. SWE-Explore covers 848 issues across 10 programming languages and 203 open-source repositories. For each instance, we derive line-level ground truth from independent agent trajectories that successfully solved the same issue, distilling the specific code regions their solution paths actually consulted. We evaluate exploration along coverage, ranking, and context-efficiency dimensions, showing that these metrics strongly track downstream repair behavior. Across a broad set of retrieval methods, general coding agents, and specialized localizers, we find that agentic explorers form a clear tier above classical retrieval. While file-level localization is already strong for modern methods, line-level coverage and efficient ranking remain the key axes differentiating state-of-the-art explorers.
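
A line-budgeted coverage score of the kind described above could be computed as in this sketch. The abstract does not give the benchmark's exact metric definitions, so the truncation rule, the `(path, start, end)` region format, and the function name are assumptions made for illustration.

```python
def coverage_under_budget(ranked_regions, gold_lines, budget):
    """Fraction of ground-truth lines covered by the top-ranked regions,
    reading regions in rank order until `budget` distinct lines are used."""
    seen = set()
    for path, start, end in ranked_regions:
        for line in range(start, end + 1):
            if len(seen) >= budget:
                break  # budget exhausted; later lines and regions add nothing
            seen.add((path, line))
    return len(seen & gold_lines) / len(gold_lines)

# Hypothetical explorer output and ground truth.
ranked = [("a.py", 1, 3), ("b.py", 10, 12)]
gold = {("a.py", 2), ("b.py", 11)}

tight = coverage_under_budget(ranked, gold, budget=4)
loose = coverage_under_budget(ranked, gold, budget=6)
```

With a budget of 4 lines only the first region and one line of the second fit, so coverage is 0.5; with 6 lines both ground-truth lines are reached. This shows why ranking order matters under a fixed budget.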

[NLP-15] KIT's Submission to Cross-Lingual Voice Cloning in IWSLT 2026

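[Quick Read]: This paper addresses the core challenge of cross-lingual voice cloning: generating target-language speech that preserves the speaker identity of a source-language reference while remaining intelligible and natural, in particular under accent variation and domain-specific vocabulary. The solution has three parts. First, language tag prompting improves control over the target language and reduces accent leakage. Second, reinforcement learning (RL) fine-tuning adapts the model to the task and improves intelligibility. Third, a reference-conditioned lexical matching method improves the pronunciation of domain-specific terms when lexical overlap is present. Results show that language tag prompting yields the largest gains, while lexical matching gives consistent improvements on matched subsets.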

Link: https://arxiv.org/abs/2606.07240
Authors: Seymanur Akti, Alexander Waibel
Affiliations: Karlsruhe Institute of Technology (KIT); Carnegie Mellon University (CMU); KIT Campus Transfer (KCT)
Categories: Computation and Language (cs.CL); Sound (cs.SD)
Comments:

Abstract:Cross-lingual voice cloning aims to generate speech in a target language while preserving speaker identity from a source-language reference. This task is central to speech translation and is the focus of the IWSLT 2026 Cross-Lingual Voice Cloning track. A key challenge is maintaining intelligibility and naturalness in the presence of accent variation and domain-specific vocabulary. We build on a multilingual text-to-speech model, FishAudio-S2-Pro, and introduce language tag prompting to improve language control and reduce accent leakage. We further apply reinforcement learning (RL) fine-tuning for task adaptation and observe improvements in intelligibility. Finally, we propose a reference-conditioned lexical matching method that improves pronunciation of domain-specific terms when lexical overlap is present. Results show that language prompting provides the largest gains, while lexical matching yields consistent improvements on matched subsets.

[NLP-16] When Large Language Models Fail in Healthcare: Evaluating Sensitivity to Prompt Variations

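[Quick Read]: This paper addresses the output instability that small prompt perturbations cause in LLMs used in healthcare. In high-risk settings such as clinical decision support, models are highly sensitive to subtle lexical or syntactic changes, which can lead to diagnostic errors, incorrect medication recommendations, or omitted critical findings, and so threaten patient safety. The study systematically evaluates the robustness of general-purpose models (e.g., GPT-3.5, Llama3) and medical-specific models (e.g., ClinicalBERT, BioLlama3, BioBERT) on the MedMCQA benchmark. By separating natural from adversarial perturbations, it exposes shortcomings in consistency, accuracy, and reliability on clinical reasoning tasks. It stresses that even domain-adapted medical LLMs are not intrinsically safe and readily produce harmful outputs under syntactic reordering or misleading contextual cues, so robustness to prompt perturbations must improve before such models can be trusted in real clinical deployment.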

Link: https://arxiv.org/abs/2606.07237
Authors: Mahdi Alkaeed
Affiliations: Qatar University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 12 pages

Abstract: Large Language Models (LLMs) are increasingly used in healthcare for tasks such as clinical question answering, diagnosis support, and report summarization. Despite their promise, these models remain highly sensitive to subtle prompt perturbations, both lexical and syntactic, posing serious risks in safety-critical clinical applications. In this study, we conduct a systematic sensitivity analysis to evaluate the robustness of both general-purpose (e.g., GPT-3.5, Llama3) and medical-specific LLMs (e.g., ClinicalBERT, BioLlama3, BioBERT) using the MedMCQA benchmark. We categorize perturbations into natural and adversarial types and examine their effect on model consistency, accuracy, and reliability in clinical reasoning tasks. Our findings reveal that medical LLMs are not intrinsically safe. Even minor variations in phrasing can alter clinical advice, and targeted adversarial prompts can provoke harmful outputs. In high-stakes settings like healthcare, such unpredictability is unacceptable: models that change diagnoses due to reworded inputs or hallucinate medications when slightly rephrased cannot be reliably trusted by clinicians. While models tend to show resilience to simple lexical substitutions or paraphrasing, they often break down under syntactic reordering or misleading contextual cues. This fragility is evident across both general-purpose and domain-specific LLMs. Notably, adversarial manipulations can lead to clinically dangerous outputs, such as recommending incorrect dosages or omitting critical findings.

[NLP-17] MMAE: A Massive Multitask Audio Editing Benchmark

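[Quick Read]: This paper addresses the absence of a unified, comprehensive evaluation benchmark for audio editing. Current evaluation is highly fragmented and restricted to specific subdomains or basic operations, so it cannot support systematic assessment of general-purpose, instruction-based editing. The paper proposes MMAE (Massive Multitask Audio Editing), presented as the first comprehensive benchmark for general-purpose instruction-based audio editing. Its taxonomy covers 7 real-world audio modalities (including sound, speech, music, and their mixtures), 6 levels of task complexity (from basic modifications to multi-hop reasoning and multi-round editing), 2 levels of granularity, and 8 operation types, with 2,000 high-fidelity samples curated through human-agent collaboration. The core novelty is a rubric-based evaluation paradigm that decomposes free-form tasks into 17,741 verifiable criteria, enabling multi-dimensional assessment of instruction following and context consistency. Experiments show that the Exact Match Rate (EMR) of leading models stays below 5% and falls to 0% on complex mixed-modality tasks, exposing bottlenecks in precise execution and structural robustness. MMAE is intended to give the intelligent-creation community a clear diagnostic roadmap and a standardized, long-lasting evaluation paradigm for next-generation audio editing systems.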

Link: https://arxiv.org/abs/2606.07229
Authors: Ziyang Ma, Ruiqi Yan, Ruiyang Xu, Jie Fang, Zhikang Niu, Yi-Wen Chao, Wenming Tu, Tianrui Wang, Auden, Qi Chen, Wenxi Chen, Jiaying Chi, Yanru Huo, Zixuan Jiang, Xiquan Li, Yalin Li, Junxi Liu, Minghao Liu, Binghao Qiang, Yijia Shan, Zheshu Song, Tian Tan, Zixiang Wang, Zeyu Xie, Zhifei Xie, Xiaoyu Xing, Qixiang Xu, Chen Yang, Guanrou Yang, Shan Yang, Yifan Yang, Steve Yves, Haotian Zhang, Haina Zhu, Kai Yu, Liefeng Bo, Eng-Siong Chng, Xie Chen
Affiliations: Unknown
Categories: Sound (cs.SD); Computation and Language (cs.CL); Multimedia (cs.MM)
Comments: Open-source at this https URL

Abstract:We introduce MMAE, a Massive Multitask Audio Editing benchmark, serving as the first comprehensive evaluation testbed designed for general-purpose instruction-based audio editing. Spurred by the shift toward intelligent creation, interactive editing has rapidly expanded from visual domains, pioneered by models like Nano-banana 2 for images and Gemini-Omni for video, into audio. However, the current evaluation infrastructure lags severely, remaining highly fragmented and restricted to specific subdomains or basic operations. Unlike existing benchmarks that are limited in scope, MMAE extends to a broad spectrum of real-world scenarios, encompassing 7 distinct audio modalities, including sound, speech, music, and their mixtures. Furthermore, we establish a comprehensive taxonomy spanning 6 levels of task complexity, from basic modifications to multi-hop reasoning and multi-round editing, 2 levels of granularity, and 8 distinct operation types. Meticulously curated through human-agent collaboration, MMAE comprises 2,000 high-fidelity samples paired with a pioneering rubric-based evaluation framework. By decomposing free-form tasks into 17,741 verifiable criteria, this robust rubric-based paradigm enables a precise, multi-dimensional assessment of both instruction following and context consistency. Our extensive evaluation of leading models reveals that current systems remain far from achieving reliable edits. Strikingly, the Exact Match Rate (EMR) consistently falls below 5% and plummets to an absolute 0% in complex, mixed-modality tasks, exposing critical bottlenecks in precise execution and structural robustness. We hope MMAE will serve as a catalyst for future advances in the intelligent creation community, providing a clear diagnostic roadmap and establishing a standardized, long-lasting evaluation paradigm for next-generation audio editing systems.
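
To make the gap between per-criterion scores and the Exact Match Rate concrete, the sketch below aggregates rubric verdicts in both ways. It assumes that EMR counts a sample only when every one of its rubric criteria is satisfied. The abstract does not define EMR, so this reading, the function name, and the toy verdicts are assumptions.

```python
def rubric_scores(samples):
    """samples: one list of boolean criterion verdicts per edited audio sample.
    Returns (criterion pass rate, exact match rate)."""
    total_criteria = sum(len(verdicts) for verdicts in samples)
    passed_criteria = sum(sum(verdicts) for verdicts in samples)
    exact_matches = sum(all(verdicts) for verdicts in samples)
    return passed_criteria / total_criteria, exact_matches / len(samples)

# Hypothetical verdicts for three samples with 3, 3, and 2 criteria.
verdicts = [
    [True, True, False],
    [True, True, True],
    [False, True],
]
criterion_rate, exact_match_rate = rubric_scores(verdicts)
```

Under this reading a system can satisfy most individual criteria (0.75 here) and still have a low exact match rate (one sample in three), which is consistent with the very low EMR values reported.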

[NLP-18] DEFINED: A Data-Efficient Computational Framework for Fine-Grained Creativity Assessment in Debate Scenarios KDD2026

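[Quick Read]: This paper addresses how to assess human creativity at a fine-grained level in complex, open-ended settings in the era of large language models. Existing methods are limited by their reliance on standardized simple tasks and by the scarcity of high-quality expert annotations, so they struggle to capture the multi-dimensional nature of creativity in realistic scenarios. The paper proposes DEFINED, a framework built around a hierarchical eight-dimensional creativity metric system for debate, implemented as a pre-trained autoregressive language model with a hierarchical scoring head that supports both fine-grained and coarse-grained evaluation. It uses statements and expert scores from authentic debate competitions, applies a constrained data augmentation strategy to mitigate elite bias, and adopts mixed-granularity training to learn robustly from limited expert annotation. An empirical study with debate-naive participants is used to examine ecological validity and the model's applicability to mid-to-low proficiency populations. In the reported experiments, DEFINED is more accurate and more stable than prompt-based LLM evaluators and existing debate scoring methods.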

Link: https://arxiv.org/abs/2606.07226
Authors: Tongzhou Yu, Mingjia Li, Hong Qian, Wenkai Wang, Zongbao Zhang, Yaoyu Jiang, Xiangfeng Wang, Aimin Zhou, Jiajun Guo
Affiliations: Nanjing University; Shanghai Innovation Institute; East China Normal University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted by KDD 2026

Abstract:Human creativity has emerged as a critical competency in the era of large language models. Assessing creativity in complex, open-ended environments is a grand challenge in data mining, currently hindered by a reliance on standardized simple tasks and the scarcity of fine-grained expert data. As an ecologically valid assessment context, debate reflects multiple dimensions of creativity, encompassing both divergent thinking and convergent thinking. Moreover, debate is a data-rich domain, with a large volume of publicly accessible materials. Current mainstream automated scoring methods are poorly suited to complex settings such as debate, and therefore still rely on costly human evaluation. To this end, this paper proposes DEFINED, a data-efficient computational framework for fine-grained creativity assessment in debate scenarios. DEFINED operationalizes debate creativity through a hierarchical eight-dimensional metric system, implemented via a pre-trained autoregressive language model with a hierarchical scoring head that supports both fine-grained and coarse-grained evaluation. Statements and their associated expert scores were obtained from authentic debate competitions, and a constrained data augmentation strategy was employed to address the elite bias inherent in the original data. DEFINED adopts a mixed-granularity training strategy enabling robust learning from limited fine-grained supervision annotated by trained graduate experts. To rigorously validate ecological validity beyond synthetic benchmarks, we incorporate an empirical study with debate-naive participants, utilizing these authentic data to serve as a qualitative case study for mid-to-low proficiency populations. Across our evaluation protocol, our scoring model achieves accurate and stable scoring, outperforming prompt-based large language model evaluators and existing debate scoring methods.

[NLP-19] Adversarial Creation and Detection of AI-Generated Social Bot Content

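[Quick read]: This paper addresses the threat that arises when generative AI meets social bots: malicious actors can generate human-like content at scale to manipulate the information ecosystem. Existing detection models perform poorly in real-world settings, mainly because reliable ground-truth data is lacking. The authors propose an adversarial methodology that models how malicious actors impersonate real social media users, and use it to build a multilingual, cross-platform dataset of paired human-written and AI-generated messages. Training on this adversarial data yields accurate detection of AI-generated text and significantly outperforms existing content-based bot detection models on real-world, out-of-distribution data. The key to the approach is the adversarial modeling framework, which produces more representative training data and thereby improves generalization and robustness in realistic settings.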

Link: https://arxiv.org/abs/2606.07219
Authors: Mykola Trokhymovych,Ricardo Baeza-Yates,Alessandro Flammini,Diego Saez-Trumper,Filippo Menczer
Affiliations: Universitat Pompeu Fabra, Barcelona, Spain; Observatory on Social Media, Indiana University, Bloomington, Indiana, USA; KTH Royal Institute of Technology, Stockholm, Sweden
Categories: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Comments:

Abstract:The convergence of large language models and social bots allows malicious actors to manipulate the information ecosystem by generating human-like content at scale. Existing models for detecting AI-generated content often fail in the wild, primarily due to the lack of ground-truth data. We address this gap through an adversarial methodology that models the impersonation of real social media users by malicious actors. Using this methodology, we curate a multilingual, cross-platform dataset of paired human and AI-generated messages. Training on such adversarial data yields accurate detection of AI-generated text. Our approach significantly outperforms existing models for content-based bot detection in real-world, out-of-distribution data.

[NLP-20] From Correctness to Utility: Gain-Based Prefix Evaluation for LLM Reasoning

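[Quick read]: This paper addresses the bias introduced when process reward models for large language model (LLM) problem solving judge reasoning only by local step correctness. Local correctness indicates whether an intermediate step is sound, but it is only an indirect proxy for the real goal, which is raising the probability of solving the problem. The authors introduce prefix gain, defined as the improvement in solve rate obtained by conditioning a group of lightweight student models on a reasoning prefix. On this basis they train a Prefix Utility Model (PUM) with a simple pairwise ranking objective, so that the model learns outcome-grounded prefix utility. PUM can score both complete reasoning trajectories and partial reasoning prefixes. Experiments on mathematical reasoning show that PUM provides a strong prefix-level supervision signal under Best-of-N selection, beam search, and reinforcement learning, especially when candidate pools are large, search budgets are high, or rule-based rewards are sparse.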

Link: https://arxiv.org/abs/2606.07190
Authors: Yuhang Zhou,Yixin Cao,Guangnan Ye
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Reasoning prefixes shape the future trajectory of LLM problem solving, yet existing process reward models usually evaluate them through local step correctness. We argue that correctness is a useful but indirect proxy for the effect we ultimately care about: whether a prefix increases the probability of successful completion. We define this effect as prefix gain, the solve-rate improvement induced by conditioning a group of lightweight student models on a prefix, and use it to train a Prefix Utility Model (PUM) with a simple pairwise ranking objective. PUM learns outcome-grounded prefix utility and can score both complete trajectories and partial reasoning prefixes. Across Best-of-N selection, beam search, and reinforcement learning on mathematical reasoning, PUM provides a strong prefix-level supervision signal, especially when candidate pools are large, search budgets increase, or rule-based rewards are sparse. We release all data, models, and code at this https URL.
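
A minimal Python sketch of the two quantities this abstract names, under stated assumptions: prefix gain is estimated as the difference between the students' empirical solve rate with and without the prefix, and the pairwise ranking objective is taken to be a logistic (Bradley-Terry style) loss. The function names, the boolean outcome lists, and the choice of loss are illustrative assumptions, not the authors' implementation.

```python
import math

def solve_rate(outcomes):
    """Fraction of attempts that solved the problem (outcomes are booleans)."""
    return sum(outcomes) / len(outcomes)

def prefix_gain(with_prefix, without_prefix):
    """Solve-rate improvement from conditioning on a prefix (assumed estimator)."""
    return solve_rate(with_prefix) - solve_rate(without_prefix)

def pairwise_ranking_loss(score_better, score_worse):
    """Logistic pairwise loss: small when the higher-gain prefix gets the higher score."""
    return math.log1p(math.exp(-(score_better - score_worse)))

# Hypothetical outcomes from a group of student models on one problem.
baseline = [True, False, False, False]   # no prefix: 1 of 4 solved
prefix_a = [True, True, True, False]     # prefix A: 3 of 4 solved
prefix_b = [False, False, True, False]   # prefix B: 1 of 4 solved

gain_a = prefix_gain(prefix_a, baseline)
gain_b = prefix_gain(prefix_b, baseline)
```

In this toy case prefix A has the larger gain, so a ranking model trained on the pair would be pushed to score A above B; a prefix with zero gain can still be locally correct, which is the distinction the abstract draws.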

[NLP-21] Geometry of Semantic Space: Comparative Study of Discrete and Continuous Models

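[Quick read]: This paper addresses the unclear geometric structure and limited interpretability of semantic representations in natural language processing (NLP) models. Transformer-based supervised vector embeddings such as CamemBERT perform well on tasks, but the semantic spaces they induce often show unsatisfactory or hard-to-interpret distributions. The authors compare supervised embeddings with graph models built from lexical co-occurrence to expose how semantic organization differs between the two representations. The key to the approach is a systematic comparison methodology that works either on the structure of the graphs or on the topology of the induced embedding spaces. An empirical study on the French “Great National Debate” corpus finds that the two kinds of model share a similar local topology but differ markedly in overall structure and global topology, with the graph models showing a clearer, more human-readable organization of meaning. The findings point to a complementarity between deep supervised models and graph-based models, and suggest that graph structures could guide neural architectures toward more stable and interpretable convergence.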

Link: https://arxiv.org/abs/2606.07183
Authors: Gabriel Bounias,Sabine Ploux
Affiliations: ISC-PIF (Institut des Systemes Complexes de Paris IdF), CNRS, France; CAMS (Centre d’analyse et de mathématique sociales), CNRS EHESS, Paris, France
Categories: Computation and Language (cs.CL)
Comments: 9 pages, 7 figures

Abstract:This work examines the semantic geometry underlying NLP models. We compare supervised vector embeddings, such as CamemBERT, with lexical co-occurrence graphs that encode semantic relations more directly. While transformer-based embeddings achieve strong performance, their induced geometries often display unsatisfactory distributions. In contrast, graph-based models reveal a clearer and more human-readable organization of meaning. We have implemented a methodology that allows us to perform a comparative analysis based either on the structure of the graphs or on the topology of the embeddings induced by these two approaches. The results of the comparison, applied to the French “Great National Debate” corpus (a collection of citizen contributions to the public debate), show a similar local topology but a very different overall structure and topology. These findings suggest complementary perspectives between deep supervised models and graph-based models, pointing to a new pathway for guiding neural architectures toward more stable and interpretable convergence with graph structures.

[NLP-22] Textual Supervision Enhances Geospatial Representations in Vision-Language Models ICML2026

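[Quick read]: This paper examines geospatial understanding, a critical but underexplored dimension of machine learning systems for tasks such as image geolocation and spatial reasoning. The core difficulty is that existing models show systematic gaps in how they capture and represent geospatial information, and their behavior varies across image categories with different degrees of localizability. By comparing vision-only models (e.g., ViT), vision-language models (e.g., CLIP), and large-scale multimodal foundation models (e.g., LLaVA, Qwen, and Gemma), the study finds that textual supervision enhances the learning of geospatial representations. This indicates that language is an effective complementary modality for encoding spatial context, and that multimodal learning is a key direction for advancing geospatial AI.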

Link: https://arxiv.org/abs/2606.07172
Authors: Marcelo Sartori Locatelli,Fernando Tonucci,Jea Kwon,Luiz Felipe Vecchietti,Bryan Nathanael Wijaya,Cheng Yaw Low,Virgilio Almeida,Meeyoung Cha
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted at ICML 2026

Abstract:Geospatial understanding is a critical yet underexplored dimension in the development of machine learning systems for tasks such as image geolocation and spatial reasoning. In this work, we analyze the geospatial representations acquired by three model families: vision-only architectures (e.g., ViT), vision-language models (e.g., CLIP), and large-scale multimodal foundation models (e.g., LLaVA, Qwen, and Gemma). By evaluating across image clusters, including people, landmarks, and everyday objects, grouped based on the degree of localizability, we reveal systematic gaps in spatial accuracy and show that textual supervision enhances the learning of geospatial representations. Our findings suggest the role of language as an effective complementary modality for encoding spatial context and multimodal learning as a key direction for advancing geospatial AI.

[NLP-23] UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding

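[Quick read]: This paper addresses the lack of high-quality benchmarks that evaluate multilingual large language models in the target language and educational context. It focuses on Urdu, which has more than 230 million speakers but no broad multitask benchmark built from native educational sources. The authors build UrduMMLU, a benchmark of 26,431 Urdu multiple-choice questions spanning 26 subjects and five domains, collected from native Urdu question banks and public examination PDFs, so that it covers both standard academic subjects and Urdu- and region-specific content. The exam-derived portion is labeled through dual human annotation with strict consensus filtering. The authors evaluate 30 large language models (LLMs) zero-shot under English and Urdu prompts, and four open-source models under multiple few-shot settings. Gemini-3.5-Flash reaches 90.20% and 90.34% accuracy under the two prompt languages, while no other model exceeds 85%; many models score 25 to 40 points lower on Urdu-centered Humanities subjects than on STEM, and few-shot prompting brings only modest gains. The results show that Urdu knowledge remains uneven in current LLMs, especially for regionally grounded content.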

Link: https://arxiv.org/abs/2606.07167
Authors: Ahmer Tabassum,Sarfraz Ahmad,Hasan Iqbal,Owais Aijaz,Momina Ahsan,Preslav Nakov
Affiliations: MBZUAI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 27 pages, 18 figures, 17 tables, Submitted to ARR May 2026

Abstract:Meaningful multilingual evaluation must test models in the target language and educational context. Urdu, spoken by more than 230 million people, lacks a broad MMLU-style benchmark built from native educational sources. We introduce UrduMMLU, a benchmark of 26,431 Urdu MCQs across 26 subjects and five domains, collected from native Urdu MCQ banks and public examination PDFs. Unlike translation-based resources, UrduMMLU covers both standard academic subjects and Urdu- and region-specific content. We label the exam-derived portion through dual human annotation with strict consensus filtering. We evaluate 30 LLMs under English and Urdu prompts, yielding 60 zero-shot evaluations, and further evaluate four open-source LLMs under multiple few-shot settings across both prompt languages. Gemini-3.5-Flash performs best, reaching 90.20% and 90.34% accuracy, while no other model exceeds 85%. The strongest open-source model trails by 7.79 and 8.92 points, and many models lose 25 to 40 points on Urdu-centered Humanities subjects compared with STEM. Few-shot prompting yields only modest gains. UrduMMLU shows that Urdu knowledge remains uneven in current LLMs, especially for regionally grounded content.

[NLP-24] Explicit Evidence Grounding via Structured Inline Citation Generation

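[Quick read]: This paper addresses the lack of accurate attribution in AI-generated answers for question answering, focusing on how to cite sources faithfully. Most previous methods provide only loose citations and do not link each claim to both a specific source document and its supporting evidence. The proposed framework, FullCite, generates structured inline citations through three strategies: prompt-based generation, constrained decoding over a citation grammar, and post-hoc span alignment. Using three question answering benchmarks (ASQA, BioASQ, and ExpertQA), the authors evaluate along three dimensions: document-level correctness, evidence span identification, and claim-citation faithfulness. The results show that large language models (LLMs) are generally effective at identifying relevant documents but struggle to identify the precise supporting spans within them. This suggests that faithful attributed QA will require greater research emphasis on precise evidence span identification.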

Link: https://arxiv.org/abs/2606.07130
Authors: Anar Yeginbergen,Amelie Wührl,Anna Rogers,Rodrigo Agerri
Affiliations: University of the Basque Country (UPV/EHU); IT University of Copenhagen
Categories: Computation and Language (cs.CL)
Comments:

Abstract:As AI systems become more widely adopted, the demand for factual and faithful generation grows. Properly attributing information through citations therefore becomes crucial. This work introduces FullCite, a framework that, in contrast to most previous works, generates structured inline citations linking each claim to both its source document and supporting evidence. FullCite proposes three strategies for inline citation generation: prompt-based generation, constrained decoding over a citation grammar, and post-hoc span alignment. Using three question answering benchmarks, namely ASQA, BioASQ, and ExpertQA, we assess citation quality and faithfulness along three dimensions: document-level correctness, evidence span identification, and claim-citation faithfulness. Our evaluation shows that while LLMs are generally effective at identifying relevant documents, they struggle to identify the precise supporting spans within them. This gap suggests that achieving faithful attributed QA will require research to place greater emphasis on precise evidence span identification.

[NLP-25] Learning Perspectivist Social Meaning via Demographic-Conditioned Fusion Embeddings

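[Quick read]: This paper addresses the fact that natural language processing (NLP) systems ignore variation in how social meaning is interpreted. Social meaning in language is perspectival: annotators with different backgrounds, demographics, and ideological positions read the same text differently, yet most NLP models collapse this variation into a single ground-truth label. The authors model social dimensions along a perspectivist spectrum and analyze how interpretations vary across demographic groups on a dataset of 28k human annotations. The key to the approach is fusion embeddings, which integrate textual and demographic representations. The authors benchmark zero-shot, few-shot, and fine-tuned paradigms; the fusion models improve consistently and statistically significantly over text-only baselines across all fusion strategies (+5.9-6.5% relative macro PR-AUC), and shuffle ablations confirm that demographic profiles carry genuine predictive signal rather than spurious correlations.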

Link: https://arxiv.org/abs/2606.07123
Authors: Amanda Cercas Curry,Lucio La Cava,Luca Maria Aiello,Gianmarco De Francisci Morales
Affiliations: University of Calabria; IT University of Copenhagen; CENTAI
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Social meaning in language is inherently perspectival, varying across annotator backgrounds, demographics, and ideological positions. However, most NLP systems collapse this variation into a single ground-truth label, ignoring the diversity of interpretations. In this work, we model social dimensions along a perspectivist spectrum, capturing how interpretations vary across demographic groups on a dataset consisting of 28k human annotations. We benchmark multiple modeling paradigms, including zero-shot, few-shot, and fine-tuned approaches, and propose fusion embeddings that integrate textual and demographic representations. Our fusion models yield consistent and statistically significant improvements over text-only baselines across all fusion strategies (+5.9-6.5% relative macro PR-AUC), with shuffle ablations confirming that demographic profiles carry genuine predictive signal rather than spurious correlations.

[NLP-26] OffQ: Taming Structured Outliers in LLM Quantization by Offsetting

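[Quick read]: This paper addresses the accuracy loss that activation outliers cause when low-bit quantization is used to accelerate inference of large language models (LLMs). The proposed method, OffQ, mitigates these outliers with an offsetting mechanism. It first identifies a low-dimensional outlier subspace in the activations using a proposed top-1 principal component analysis (top-1 PCA), then uses a rotation to concentrate high-magnitude activations into a single channel, and finally absorbs that channel by converting its magnitude into a shared offset, which reduces the standard deviation of the activations. This enables effective W4A4KV4 quantization with deployment-friendly uniform-grid, uniform-precision quantization. Experiments across diverse LLM architectures and benchmarks show that OffQ outperforms state-of-the-art baselines, consistently improving accuracy while preserving low-bit efficiency.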

Link: https://arxiv.org/abs/2606.07116
Authors: Haoqi Wang,Lorenz K. Mueller,Jiawei Zhuang,Mathieu Salzmann,Lukas Cavigelli
Affiliations: EPFL; Huawei; Swiss Data Science Center, ETHZ EPFL
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Low-bit quantization has been widely adopted to accelerate the inference of large language models (LLMs) by significantly reducing computational cost and memory usage. However, activation outliers pose a major challenge to effective quantization, often leading to notable performance degradation. In this paper, we introduce OffQ, a method designed to mitigate activation outliers in low-bit quantization through a novel offsetting mechanism. Specifically, OffQ first identifies a low-dimensional outlier subspace in the activations using a proposed top-1 PCA, and then concentrates high-magnitude activations into 1 channel via rotation. OffQ then absorbs this concentrated outlier channel by converting its magnitude into a shared offset, thereby reducing the standard deviation of the activations. This offsetting strategy enables effective W4A4KV4 quantization of LLMs using deployment-friendly uniform-grid and uniform-precision quantization. Extensive experiments across diverse LLM architectures and benchmarks demonstrate that OffQ outperforms state-of-the-art baselines, consistently improving model accuracy while preserving low-bit efficiency.
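
The effect of a shared offset on uniform quantization can be illustrated with a toy one-dimensional Python sketch. It is only an illustration of why removing a large common component shrinks the quantization grid: it uses the mean as the offset and does not reproduce OffQ's top-1 PCA or rotation steps. All names and values below are hypothetical.

```python
def quantize_uniform(values, bits=4):
    """Symmetric uniform quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1                          # 7 for 4 bits
    scale = max(abs(v) for v in values) / qmax or 1.0   # avoid a zero scale
    levels = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return [q * scale for q in levels]

def mse(reference, approximation):
    """Mean squared error between two equal-length lists."""
    return sum((r - a) ** 2 for r, a in zip(reference, approximation)) / len(reference)

# Hypothetical activations: small variation riding on a large shared component.
activations = [9.5, 10.3, 10.1, 9.8, 10.4, 9.9]

# Plain 4-bit quantization spends its range on the shared component.
plain = quantize_uniform(activations)

# Offset variant: subtract a shared offset, quantize the residual, add it back.
offset = sum(activations) / len(activations)
residual = [v - offset for v in activations]
shifted = [q + offset for q in quantize_uniform(residual)]

mse_plain = mse(activations, plain)
mse_offset = mse(activations, shifted)
```

Because the residual has a much smaller range than the raw activations, the same 4-bit grid becomes finer and the reconstruction error drops; the offset itself is kept at full precision in this sketch.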

[NLP-27] Style or Content? Evaluating Style Classifiers with Controlled Content Overlap

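[Quick read]: This paper addresses style classifiers that rely on content cues rather than genuine style features: in naturally collected data, content correlates with style labels, so models can learn non-robust shortcuts. The authors build a controlled content overlap setup on parallel Bible translations and define a parameter α that measures how much content is shared across style classes, from no shared content (α=0) to fully shared content (α=1). Cross-overlap evaluation and a content retrieval probe show that low-overlap models degrade when content cues are removed, whereas high-overlap models transfer more robustly; content also becomes less recoverable as α increases, and training dynamics show that this removal happens gradually. Controlled overlap thus offers a simple diagnostic for separating style learning from content shortcuts.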

Link: https://arxiv.org/abs/2606.07103
Authors: Zhuo Liu,Haozheng Du,Xiangxiang Xu,Hangfeng He
Affiliations: University of Rochester
Categories: Computation and Language (cs.CL)
Comments: 9 pages

Abstract:Style classifiers can use content cues that correlate with style labels in naturally collected data, yet we lack a systematic way to measure this reliance. We study this problem with a controlled content overlap setup built on parallel Bible translations. Specifically, we define the overlap parameter α as the normalized residual of mutual information between content identity and style label, so that it measures how much content is shared across style classes: from no shared content (α=0) to fully shared content (α=1). Cross-overlap evaluation of RoBERTa-based classifiers shows that low-overlap models degrade when content cues are removed, while high-overlap models transfer more robustly. A cross-style content retrieval probe further shows that content becomes less recoverable as α increases, with training dynamics showing this removal occurs gradually. Together, these results suggest that controlled overlap provides a simple diagnostic for separating style learning from content shortcuts.
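
The abstract does not spell out the formula for the overlap parameter, so the Python sketch below assumes one reading that matches both endpoints: 1 minus the mutual information between content identity and style label, divided by the entropy of the style label. The normalization by style entropy, the function names, and the toy data are all assumptions for illustration.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy in bits of an empirical distribution given as counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def overlap_alpha(pairs):
    """Assumed overlap: 1 - I(content; style) / H(style).

    pairs is a sequence of (content_id, style_label) examples and must
    contain at least two distinct style labels.
    """
    pairs = list(pairs)
    h_content = entropy(Counter(c for c, _ in pairs).values())
    h_style = entropy(Counter(s for _, s in pairs).values())
    h_joint = entropy(Counter(pairs).values())
    mutual_info = h_content + h_style - h_joint
    return 1.0 - mutual_info / h_style

# No shared content: each content id appears under exactly one style.
disjoint = [(0, "A"), (1, "A"), (2, "B"), (3, "B")]
# Fully shared content: every content id appears under both styles.
shared = [(0, "A"), (0, "B"), (1, "A"), (1, "B")]

alpha_disjoint = overlap_alpha(disjoint)
alpha_shared = overlap_alpha(shared)
```

When content identity fully determines the style label, the mutual information equals the style entropy and the value is 0; when content is independent of style, the mutual information is 0 and the value is 1.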

[NLP-28] SigmaScale: LLM Compression with SVD-based Low-Rank Decomposition and Learned Scaling Matrices

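[Quick read]: This paper addresses large language model (LLM) compression with truncated singular value decomposition (SVD), where existing methods typically rely on analytically derived scaling matrices. The proposed method, SigmaScale, instead learns diagonal row and column scaling matrices by optimizing two sets of vectors under an activation-aware compression loss. Learned scaling lowers the effective intrinsic rank of the weight matrices, as reflected by reductions in effective-rank entropy, and this reduction is strongly correlated with the compression loss. Experiments on Llama 3.1 8B Instruct and Qwen3-8B show that SigmaScale is competitive with closely related state-of-the-art SVD-based compression methods on perplexity and zero-shot benchmarks. The advantage observed on specific tasks makes it an option for applications that need lower LLM inference cost.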

Link: https://arxiv.org/abs/2606.07098
Authors: Ernests Lavrinovics,Marco Letizia,Roy Janco,Shai Segal,Johannes Bjerva,Maurizio Pierini
Affiliations: Aalborg University Copenhagen; University of Genoa; INFN, Sezione di Genova; European Organization for Nuclear Research (CERN); Ceva, Inc.
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:We present SigmaScale, a method for learning auxiliary scaling matrices S to aid truncated Singular Value Decomposition (SVD) based Large Language Model (LLM) compression. Instead of deriving scaling matrices analytically, SigmaScale optimizes two sets of vectors that define diagonal row and column scaling transformations under an activation-aware compression loss. We show that learned scaling lowers the effective intrinsic rank of weight matrices, as reflected by reductions in effective-rank entropy, and that this reduction is strongly correlated with compression loss. Experiments on Llama 3.1 8B Instruct and Qwen3-8B show that SigmaScale is competitive with closely related state-of-the-art SVD-based compression methods across perplexity and zero-shot benchmarks. By using learned activation-aware transformations, SigmaScale explores a more flexible route to low-rank LLM compression by adapting to the structure of individual model weights. The advantage observed in specific tasks makes our approach a valid option for applications requiring a reduced LLM-inference computing cost.
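
For readers unfamiliar with effective rank, the Python sketch below computes it under a common definition: the exponential of the Shannon entropy of the singular values after normalizing them to sum to one. Whether the paper uses exactly this definition is an assumption, and the singular values shown are made up for illustration.

```python
import math

def effective_rank_entropy(singular_values):
    """Shannon entropy (nats) of singular values normalized to sum to one."""
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0]
    return -sum(p * math.log(p) for p in probs)

def effective_rank(singular_values):
    """Exponential of the entropy: ranges from 1 up to the number of values."""
    return math.exp(effective_rank_entropy(singular_values))

# Hypothetical spectra: a flat one and one dominated by a single direction.
flat = [1.0, 1.0, 1.0, 1.0]
peaked = [10.0, 0.1, 0.1, 0.1]

rank_flat = effective_rank(flat)
rank_peaked = effective_rank(peaked)
```

A lower entropy means the spectrum is concentrated in fewer directions, so a truncated SVD discards less of the matrix; this is the sense in which a scaling that lowers the entropy would make a weight matrix easier to compress.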

[NLP-29] mmPISA-bench: Do LLMs Reason Equally Well Across 43 Languages?

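[Quick read]: This paper addresses the scarcity of high-quality multilingual benchmarks for evaluating the reasoning ability of large language models (LLMs) beyond English. The authors build mmPISA-bench, a compact, high-quality multilingual reasoning benchmark derived from the OECD Programme for International Student Assessment (PISA). It contains 25 multiple-choice questions that require reasoning, each provided in official human translations into 43 languages and complemented with machine-translated counterparts, for 2,150 data points in total. A key finding is that machine-translated questions do not degrade accuracy relative to official human translations, which suggests that high-quality machine translation may often be adequate for large-scale multilingual reasoning evaluation where official translations are unavailable. The study also finds that LLM usage in some languages is simultaneously more expensive and less accurate.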

Link: https://arxiv.org/abs/2606.07069
Authors: Yerzhan Sapenov,Jaromir Savelka
Affiliations: Carnegie Mellon University; OpenAI; Anthropic
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:

Abstract:We introduce mmPISA-bench, a compact high-quality multilingual reasoning benchmark derived from the OECD Programme for International Student Assessment (PISA). The benchmark consists of 25 multiple-choice questions that require reasoning in order to be answered correctly. Each question is provided in official human translations into 43 languages and complemented with machine-translated counterparts (i.e., 2,150 data points in total). We evaluate two mainstream proprietary LLMs across languages, reasoning effort levels, and translation types in terms of their ability to answer the questions correctly. Our results show that modern LLMs can reason effectively across all evaluated languages and achieve accuracy comparable to human test-takers, with some performance variation across the covered languages. We further find that machine-translated questions do not degrade accuracy relative to official human translations, which suggests that high-quality machine translation (synthetic data) might often be adequate for large-scale multilingual reasoning evaluations where official translations are not available. Finally, we analyze token usage and related inference cost and find that LLM usage in some languages is simultaneously more expensive and less accurate.

[NLP-30] Modeling semantic association in self-paced reading with language model embeddings

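[Quick read]: This paper addresses how to quantify the semantic association between a word and its context, and how that association affects reading comprehension once word predictability is accounted for. Semantic association computed from language model (LM) embeddings has been operationalized in many different ways. The authors compare ten implementations that vary the embedding model and context length, on a corpus of joint electroencephalography (EEG) and self-paced reading of natural Dutch texts. They analyze the effects of semantic association on the N400 and on self-paced reading times using Bayesian hierarchical models and Bayes factors. The results show that the choice of embedding model can alter the estimated effect of semantic association. Only implementations relying on sentence embeddings indicate reliable effects of semantic association beyond word predictability on both neural and behavioral measures, which highlights the potential of sentence embeddings and the importance of methodological choices.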

Link: https://arxiv.org/abs/2606.07066
Authors: Sara Møller Østergaard,Kenneth Enevoldsen,Afra Alishahi,Bruno Nicenboim
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Semantic association between a word and its context has been identified as an important component of reading comprehension, even when word predictability is accounted for. Recent research has highlighted the potential of language model (LM) embeddings to quantify semantic association. Yet, embedding-based semantic association has been operationalized in a myriad of ways. In this study, we use embeddings from LMs to estimate semantic association on a corpus of joint electroencephalography (EEG) and self-paced reading of natural Dutch texts. Semantic association is calculated in ten different implementations that vary the embedding model and context lengths. The effects of semantic association across the different implementations on the N400 and self-paced reading times are examined using Bayesian hierarchical models and Bayes factors. The results show that the choice of embedding model can alter the estimated effect of semantic association on both the N400 and self-paced reading times. Furthermore, the results demonstrate a promising potential of sentence embeddings for capturing semantic association, as only implementations relying on sentence embeddings indicate reliable effects of semantic association beyond word predictability on both neural and behavioral measures. Together, these findings highlight the importance of methodological choices in quantifying semantic association.

[NLP-31] TRACE: Trajectory Reasoning through Adaptive Cross-Step Evidence Aggregation for LLM Agents

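[Quick read]: This paper addresses autonomous LLM agents that pursue hidden malicious objectives through sequences of individually benign actions over long trajectories, which makes sabotage hard to detect with standard trajectory-level monitoring. Existing approaches either evaluate the complete trajectory in a single pass or split it into independently scored windows, so they cannot reliably connect evidence across temporally distant actions. The authors propose TRACE, a monitoring framework built on a Triage-Inspect-Judge (TIJ) loop that identifies high-signal regions, performs targeted inspection while maintaining accumulated evidence across reasoning steps, and synthesizes a trajectory-level verdict. On ten task domains from SHADE-Arena, TRACE achieves an aggregate F1 of 0.713 and recall of 0.844 against state-of-the-art baselines, with the largest gains on tasks that require long-range evidence linking.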

Link: https://arxiv.org/abs/2606.07054
Authors: Vijitha Mittapalli,Shreyaa Jayant Dani,Satya Srujana Pilli,Snigdha Ansu,Mohammadreza Teymoorianfard,Franck Dernoncourt,Hongjie Chen,Yu Wang,Ryan A. Rossi,Nesreen K. Ahmed
Affiliations: University of Massachusetts at Amherst; Adobe Research; Dolby Labs; University of Oregon; Cisco; Anthropic
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Abstract:Autonomous LLM agents can pursue hidden malicious objectives through sequences of individually benign actions, making sabotage difficult to detect using standard trajectory-level monitoring. Existing approaches either evaluate complete trajectories in a single pass or partition them into independently scored windows, limiting their ability to connect evidence across temporally distant actions. We propose TRACE, a monitoring framework for long-horizon LLM agent trajectories. TRACE operates through a TIJ (Triage-Inspect-Judge) loop that identifies high-signal regions, performs targeted inspection while maintaining accumulated evidence across reasoning steps, and synthesizes a trajectory-level verdict. We evaluate TRACE on ten task domains from SHADE-Arena against state-of-the-art baselines. TRACE achieves an aggregate F1 of 0.713 and recall of 0.844, with the largest gains on tasks requiring long-range evidence linking.

[NLP-32] Beyond Rubrics: Exploration-Guided Evaluation Skills for Reward Modeling

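[Quick read]: This paper addresses how open-ended reward modeling can capture subtle, domain-specific preferences when verifiable answers are unavailable. Existing rubric-based methods often generate criteria online for each query, which adds inference overhead and can produce rigid or misaligned guidance. The authors propose Eval-Skill, an exploration-guided method that synthesizes reusable evaluation skills and reframes reward guidance as context evolution rather than parameter training or per-query rubric generation. Using only 100 cases per domain, it evolves reusable domain-level evaluation skills through two progressive stages, workflow generation followed by principle generation, with exploration and selection interleaved across both. Once generated, a skill is injected directly into the judge's context without retraining. Across multiple reward modeling (RM) benchmarks, Eval-Skill consistently improves diverse judge backbones; on RewardBench 2 it yields gains over vanilla judging of +13.44% for Qwen3-8B and +18.51% for DeepSeek-V4-Flash. Further analyses of evolution-time scaling, generalizability, and transferability suggest that compact evaluation skills offer an efficient paradigm for LLM-based evaluation.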

Link: https://arxiv.org/abs/2606.07040
Authors: Xing Yue,Linjuan Wu,Daoxin Zhang,Yongliang Shen,Weiming Lu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 24 pages, 6 images

Abstract:Open-ended reward modeling requires judges that can follow subtle, domain-specific preferences when verifiable answers are unavailable. Existing rubric-based methods often address this by generating criteria online for each query, but the extra generation step can add inference overhead and produce rigid or misaligned guidance. We introduce Eval-Skill, an exploration-guided method that synthesizes reusable evaluation skills for reward modeling and reframes reward guidance as context evolution rather than parameter training or per-query rubric generation. Using only 100 cases per domain for skill evolution, Eval-Skill synthesizes reusable domain-level evaluation skills through two progressive stages, workflow generation followed by principle generation, with exploration and selection interleaved across both stages. Once generated, a skill is directly injected into the judge context. Across multiple RM benchmarks, Eval-Skill consistently improves diverse judge backbones; on RewardBench 2, it yields significant gains over vanilla judging for each main backbone (+13.44% for Qwen3-8B and +18.51% for DeepSeek-V4-Flash). Further analyses of evolution-time scaling, generalizability, and transferability show that compact evaluation skills offer an efficient new paradigm for LLM-based evaluation. Code is available at this https URL.

[NLP-33] Phonetic Error Analysis of Raw Waveform Acoustic Models INTERSPEECH2026

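[Quick read]: This paper analyzes the error patterns of raw waveform acoustic models on TIMIT phone recognition beyond the overall phone error rate (PER). The phone error rate is decomposed across three broad phonetic class (BPC) categorizations, and confusion matrices are built from substitution errors. The analysis finds that bidirectional LSTM (BLSTM) layers benefit transition-dependent classes most, and that transfer learning from WSJ improves consonants roughly three times more than vowels. Confusion patterns are consistent across raw waveform and Filterbank systems, which indicates that the dominant confusions reflect inherent phonetic similarities rather than differences between model architectures.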

Link: https://arxiv.org/abs/2606.07030
Authors: Erfan Loweimi,Zhengjun Yue,Andrea Carmantini,Zoran Cvetkovic,Steve Renals,Peter Bell
Affiliations: Centre for Speech Technology Research (CSTR), University of Edinburgh; Cisco; SLAI CUHK-SZ; King’s College London
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: INTERSPEECH2026

Abstract:We analyse error patterns of raw waveform acoustic models on TIMIT phone recognition beyond the overall phone error rate (PER). PER is decomposed across three broad phonetic class (BPC) categorisations, and confusion matrices are constructed from substitution errors. Our models combine parametric (SincNet, Sinc2Net) or non-parametric CNNs with Bidirectional LSTMs, achieving 13.9%/15.3% PER on Dev/Test, the best reported results for raw waveform models on TIMIT. Transfer learning from WSJ reduces PER to 11.3%/12.3%, surpassing the Filterbank baseline. Per-BPC analysis reveals that BLSTM layers benefit transition-dependent classes most, while WSJ transfer learning improves consonants roughly three times more than vowels. Confusion patterns are consistent across raw waveform and Filterbank systems, indicating that the dominant confusions reflect inherent phonetic similarities.
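
The kind of analysis this abstract describes can be sketched in Python as follows, assuming reference and hypothesis phones have already been aligned into pairs. The phone set, the broad phonetic class (BPC) mapping, and the toy alignment are hypothetical, and the sketch counts substitutions only, whereas a full phone error rate also includes insertions and deletions.

```python
from collections import Counter

# Hypothetical mapping from a few phones to broad phonetic classes.
BPC = {
    "p": "stop", "b": "stop", "t": "stop",
    "s": "fricative", "z": "fricative",
    "iy": "vowel", "ih": "vowel",
}

def substitution_confusions(aligned_pairs):
    """Count (reference, hypothesis) pairs where the phone was substituted."""
    return Counter((ref, hyp) for ref, hyp in aligned_pairs if ref != hyp)

def per_class_substitution_rate(aligned_pairs):
    """Substitution rate per broad phonetic class of the reference phone."""
    totals, errors = Counter(), Counter()
    for ref, hyp in aligned_pairs:
        totals[BPC[ref]] += 1
        errors[BPC[ref]] += ref != hyp   # True counts as 1
    return {cls: errors[cls] / totals[cls] for cls in totals}

# Toy aligned output: (reference phone, recognized phone).
aligned = [
    ("p", "b"), ("p", "p"), ("t", "t"),
    ("s", "z"), ("s", "s"),
    ("iy", "ih"), ("iy", "iy"), ("iy", "iy"),
]

confusions = substitution_confusions(aligned)
rates = per_class_substitution_rate(aligned)
```

Comparing the `confusions` counters produced by two systems on the same test set is one simple way to check whether their dominant confusions coincide.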

[NLP-34] MADE: Beyond Scoring via a Multilingual Agentic Diagnosing Engine for Fine-Grained Evaluation Insights

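[Quick read]: This paper addresses the fact that multilingual and multicultural benchmark results are metric-rich but insight-poor: evaluations cover many languages and cultures yet offer little fine-grained diagnosis. Single large language models (LLMs) and open-ended agents are easily swamped by long, noisy diagnostic input, and no reusable diagnostic taxonomy exists. The authors propose MADE (Multilingual Agentic Diagnosing Engine), which decomposes post-evaluation analysis into five stages: planning, aggregate analysis, instance-level case inspection, multilingual and cultural reflection, and grounded report synthesis. MADE is paired with an expert-led 54-query, 15-language diagnostic set and evaluated on a large-scale multilingual evaluation substrate (33 model families, 11 benchmarks, 26 languages, 34 cultures, 8.66M evaluation records). Experiments show that MADE outperforms the strongest shared baseline by 47% in diagnosis report quality and is preferred by human multilingual experts in 87.9% of pairwise comparisons. Applied with multilingual experts, MADE also surfaces four actionable findings on deployment, iteration, and cross-cultural pitfalls, turning benchmark score tables into model-selection and remediation guidance.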

Link: https://arxiv.org/abs/2606.07020
Authors: Yilun Liu,Miao Zhang,Shimin Tao,Minggui He,Chunguang Zhao,Chenxin Liu,Li Zhang,Chen Liu,Cheng Qian,Liqun Deng,Xiaojun Meng,Daimeng Wei
Affiliations: Huawei
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Multilingual and multicultural benchmarks now cover dozens of languages and model families, but the resulting score landscapes remain metric-rich and insight-poor, necessitating fine-grained multilingual post-evaluation diagnosis. However, single LLMs and open-ended agents are easily swamped by the long, noisy diagnostic input, and no reusable taxonomy exists for it. To address this, we propose MADE, a Multilingual Agentic Diagnosing Engine that decomposes post-evaluation analysis into planning, aggregate analysis, instance-level case inspection, multilingual and cultural reflection, and grounded report synthesis. MADE is paired with an expert-led 54-query and 15-language diagnostic set, evaluated on top of a large-scale multilingual evaluation substrate (33 model families, 11 benchmarks, 26 languages, 34 cultures, 8.66M evaluation records). Experiments show that MADE outperforms the strongest shared baseline by 47% in diagnosis report quality and is preferred by human multilingual experts in 87.9% of pairwise comparisons. Applied with multilingual experts, MADE further surfaces four actionable findings on deployment, iteration, and cross-cultural pitfalls, turning benchmark score tables into model-selection and remediation guidance.

[NLP-35] The Sim-to-Real Gap of Foundation Model Agents: A Unified MDP Perspective KDD2026

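[Quick read]: This paper addresses the sim-to-real gap that foundation model agents face when deployed for real-world decision-making. Robotics and classical control have mature frameworks for this gap, but the foundation model community treats agent robustness as an entirely new phenomenon. The authors propose formalizing the evaluation and training gap of foundation model agents as a classical sim-to-real problem structured around the four elements of a Markov Decision Process (MDP): Observation, Action, Transition, and Reward. They set out a research agenda that translates classical discrepancies into the foundation model domain and advocates adopting established solutions such as domain randomization. Concrete examples, such as multilingual tool calling, show how severe observation-space gaps lead to operationally invalid actions despite correct semantic intent. The agenda aims at a unified vocabulary and standardized stress-test benchmarks to foster highly trustworthy agents for reliable real-world applications.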

Link: https://arxiv.org/abs/2606.07017
Authors: Xiaoou Liu,Tiejin Chen,Weibo Li,Xiyang Hu,Hua Wei
Affiliations: Arizona State University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET)
Comments: 7 pages, 2 figures, 2 tables. Accepted by KDD 2026 Blue Sky Ideas Track

Abstract:Foundation model agents are increasingly deployed for real-world decision-making, but suffer from the sim-to-real gap. While robotics and classical control have mature frameworks to address this gap, the foundation model community is treating agent robustness as an entirely novel phenomenon. Our paper proposes formalizing the foundation model agent evaluation and training gap as a classical sim-to-real problem structured entirely around the four elements of a Markov Decision Process: Observation, Action, Transition, and Reward. In this paper, we set a comprehensive research agenda that translates classical discrepancies into the foundation model domain and advocates for adopting established solutions like domain randomization. We provide concrete examples, such as multilingual tool calling, to demonstrate how severe observation space gaps lead to operationally invalid actions despite correct semantic intent. Ultimately, this agenda aims to drive a paradigm shift, yielding a unified vocabulary and standardized stress test benchmarks to foster a new generation of highly trustworthy agents for reliable real-world applications.

[NLP-36] RASFT: Rollout-Adaptive Supervised Fine-Tuning for Reasoning

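Quick read: This paper addresses a limitation of supervised fine-tuning (SFT) when adapting large language models to reasoning tasks: standard SFT usually treats a single expert trajectory as the target behavior, which can overfit to surface forms and suppress the model's own reasoning distribution. The proposed solution is Rollout-Adaptive Supervised Fine-Tuning (RASFT), a policy-aware SFT framework that estimates problem-level solvability from verified on-policy rollouts and calibrates expert supervision accordingly. When the current policy struggles on a problem, expert guidance is strengthened; when the model already reasons reliably, rigid imitation of the expert trajectory is relaxed and correct self-generated trajectories are incorporated. To preserve useful reasoning priors, RASFT also introduces a clipped inverse ratio between the frozen reference model and the current policy to constrain excessive policy drift. Experiments across multiple models on six mathematical reasoning benchmarks and two code reasoning benchmarks show better overall performance than SFT, SFT variants, and representative reinforcement learning methods.

Link: https://arxiv.org/abs/2606.07006
Authors: Yongliang Miao,Fengyuan Liu,Wei Shi,Yanguang Liu,Fei Sun,Na Zou,Mengnan Du
Institutions: The Chinese University of Hong Kong, Shenzhen; Shanghai Artificial Intelligence Laboratory; New Jersey Institute of Technology; Institute of Computing Technology, CAS
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Click to view abstract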

Abstract:Supervised fine-tuning (SFT) is a prevailing method for adapting large language models to reasoning tasks by imitating offline expert demonstrations, often treating a single expert trajectory as the target behavior. However, reasoning is not simple path imitation: rigidly following one demonstrated solution may overfit to surface forms and suppress the model’s own reasoning distribution. We propose Rollout-Adaptive Supervised Fine-Tuning (RASFT), a policy-aware SFT framework that calibrates expert supervision according to problem-level solvability estimated from verified on-policy rollouts. For each problem, RASFT strengthens expert guidance when the current policy struggles, while relaxing rigid imitation and incorporating correct self-generated trajectories when the model already exhibits reliable reasoning behavior. To preserve useful reasoning priors, RASFT further introduces a clipped inverse ratio between the frozen reference model and the current policy to constrain excessive policy drift. Experiments across multiple models on six mathematical reasoning benchmarks and two code reasoning benchmarks show that RASFT achieves better overall performance than SFT, SFT variants, and representative RL methods. The code is available at this https URL.
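
As a rough illustration of the two ingredients the abstract names (solvability estimated from verified rollouts, and a clipped inverse ratio between the reference model and the current policy), here is a minimal Python sketch. The function names, the linear weighting schedule, and the clip bounds are all assumptions made for illustration; the abstract does not specify RASFT's actual formulas.

```python
import math

def solvability(rollout_correct):
    """Fraction of verified on-policy rollouts that reached a correct answer."""
    return sum(rollout_correct) / len(rollout_correct)

def expert_weight(pass_rate, floor=0.1):
    """Hypothetical schedule: stronger expert supervision when the policy struggles."""
    return max(floor, 1.0 - pass_rate)

def clipped_inverse_ratio(logp_policy, logp_ref, low=0.5, high=2.0):
    """pi_ref / pi_theta for one token, clipped to [low, high] (bounds are assumed)."""
    ratio = math.exp(logp_ref - logp_policy)
    return min(high, max(low, ratio))
```

Under this sketch, a problem solved in half of the rollouts gets an expert weight of 0.5, while a problem solved in every rollout falls back to the floor.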

[NLP-37] Principles of Concept Representation in Sentence Encoders

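Quick read: This paper asks why some sentence encoders produce good concept representations while others do not. It frames the question in terms of representational compositionality: an encoder supports a concept family only when its latent space admits a low-distortion realization of the corresponding semantic operator. Through a controlled ablation over encoder conditions trained on 3.3 million synonym and definition pairs from WordNet and Wiktionary, evaluated on three decontaminated splits and a modifier-labeled noun-phrase benchmark, the study identifies four principles: (P1) fine-tuning recalibrates the latent geometry rather than expanding it; (P2) semantic signal is already concentrated in the final transformer layer before concept-specific training begins, so cross-layer pooling is redundant; (P3) hard negatives improve discrimination and stress-test robustness but not retrieval ranking, indicating that calibration and ranking can be addressed independently; (P4) the effectiveness of supervision depends on the composition type of the target concept: extensional training helps intersective and subsective families but degrades relational and intensional ones, exposing a structural limitation of current training paradigms. The authors also release two new evaluation datasets: a DBpedia semantic-gap benchmark and a modifier-labeled noun-phrase paraphrase suite.

Link: https://arxiv.org/abs/2606.06994
Authors: Isabelle Mohr,John Dujany,Jonathan Souquet,Andre Freitas
Institutions: Idiap Research Institute; Merck KGaA
Categories: Computation and Language (cs.CL); Databases (cs.DB)
Comments:

Click to view abstract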

Abstract:What makes a sentence encoder produce good concept representations? We approach this through the lens of representational compositionality: an encoder supports a concept family only when its latent space admits a low-distortion realization of the corresponding semantic operator. This framing predicts both where current encoders succeed and where they are structurally mismatched to their supervision. Through a controlled ablation over encoder conditions trained on 3.3 million synonym and definition pairs from WordNet and Wiktionary, evaluated on three decontaminated splits and a modifier-labeled noun-phrase benchmark, we identify four principles. Fine-tuning recalibrates the latent geometry rather than expanding it (P1). Semantic signal concentrates in the final transformer layer before concept-specific training begins, making cross-layer pooling redundant (P2). Hard negatives improve discrimination and stress-test robustness without improving retrieval ranking, showing that calibration and ranking are independently addressable (P3). Finally, the effectiveness of supervision depends on the composition type of the target concept. Extensional training helps intersective and subsective families while degrading relational and intensional ones, exposing a structural limitation of current training paradigms (P4). We release two new evaluation datasets: a DBpedia semantic-gap benchmark and a modifier-labeled NP paraphrase suite.

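[NLP-38] Contrastive Training with LLM-generated Near-Misses for Robust Code-Switching Speech Recognition INTERSPEECH2026

Quick read: This paper targets automatic speech recognition (ASR) under code-switching (CS), the alternation between languages within a single utterance, with a focus on recognition accuracy in CS-critical regions. It proposes a Point-of-Interest (POI)-aware contrastive training framework. CS spans are first located with a POI detection method from the literature. Acoustically plausible near-miss hypotheses are then built by perturbing POIs in ASR N-best outputs, and the candidate set is expanded with a large language model. Filtering with acoustic, phonemic, and textual constraints keeps negatives that are hard but plausible. Finally, Whisper-small is fine-tuned with LoRA using a POI-weighted cross-entropy anchor objective together with a multi-negative contrastive ranking loss. On CS-FLEURS (cmn-eng) and ViMedCSS (vie-eng), the method yields consistent reductions of over 2% in both general and CS-aware error rates compared to standard LoRA fine-tuning.

Link: https://arxiv.org/abs/2606.06985
Authors: Tung X. Nguyen,Hieu Minh Truong,Giang-Son Nguyen,Nhu Vo,Wray Buntine,Dung D. Le
Institutions: VinUniversity; University of Technology Sydney; Monash University
Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: Accepted at INTERSPEECH 2026

Click to view abstract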

Abstract:Code-switching (CS), the alternation between multiple languages within a single utterance, remains challenging for Automatic Speech Recognition (ASR). To address this issue, we propose a Point-of-Interest (POI)-aware contrastive training framework that improves recognition at CS-critical regions. We first identify CS spans by adopting POI detection method from literature, then construct acoustically plausible near-miss hypotheses by perturbing POIs in ASR N-best outputs and expanding candidates with a large language model. Hard but plausible negatives are retained through filtering with acoustic, phonemic, and textual constraints. Finally, we fine-tune Whisper-small with LoRA using a POI-weighted cross-entropy anchor objective together with a multi-negative contrastive ranking loss. Experiments on CS-FLEURS (cmn-eng) and ViMedCSS (vie-eng) show consistent reductions of over 2% in both general and CS-aware error rates compared to standard LoRA fine-tuning.
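
The combined objective described above (a POI-weighted cross-entropy anchor plus a multi-negative contrastive ranking loss) can be sketched in plain Python. This is an illustrative reading of the abstract, not the authors' implementation: the function names, the default `poi_weight`, and the softmax-style form of the ranking loss are assumptions.

```python
import math

def poi_weighted_ce(token_logprobs, is_poi, poi_weight=2.0):
    """Weighted mean negative log-likelihood; tokens inside a POI count more."""
    weights = [poi_weight if flag else 1.0 for flag in is_poi]
    total = sum(w * lp for w, lp in zip(weights, token_logprobs))
    return -total / sum(weights)

def multi_negative_ranking_loss(pos_score, neg_scores, temperature=1.0):
    """Softmax ranking loss: -log p(positive) among one positive and many negatives."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    peak = max(logits)  # subtract the max for numerical stability
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_z - logits[0]
```

The ranking term shrinks as the reference transcript scores further above the near-miss negatives, which is the behavior a contrastive objective of this kind is meant to encourage.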

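[NLP-39] Tree-of-Experience: A Structured Experience-Management Solution for Self-Evolving Agents under Low-Repetition and Implicit-Reward Environments

Quick read: This paper addresses the difficulty large language model (LLM) agents face when they self-evolve from experience on low-repetition tasks with implicit rewards, where past experience is hard to reuse and feedback is delayed, noisy, and available only at the outcome level. Existing benchmarks usually assume explicit goals, stable task patterns, and clear feedback. The paper introduces FinEvolveBench, a temporally controlled benchmark for financial sentiment prediction that links daily news-driven predictions to future excess returns. It further proposes Tree-of-Experience (ToE), a structured experience-management method that organizes, retrieves, validates, and updates agent experience. Experiments show that general-purpose experience mechanisms do not consistently outperform no-experience baselines in this setting, while ToE achieves stronger overall performance, which underlines the importance of structured experience management for self-evolving agents in implicit-reward environments.

Link: https://arxiv.org/abs/2606.06960
Authors: Zihao Deng,Yining Zhu,Leiming Wang,Jingfei Lu,Junbo Wang,Chuncheng Ran,Yu Yang,Dixuan Yang,Jikun Shen
Institutions: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Click to view abstract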

Abstract:Experience-based self-evolution is crucial for LLM agents, but existing benchmarks often assume explicit goals, stable task patterns, and clear feedback. We study a more challenging setting: low-repetition tasks with implicit rewards, where past experience is difficult to reuse and feedback is delayed, noisy, and outcome-level. We introduce FinEvolveBench, a temporally controlled benchmark for financial sentiment prediction that links daily news-driven predictions to future excess returns. We further propose Tree-of-Experience (ToE), a structured experience-management method that organizes, retrieves, validates, and updates agent experience. Experiments show that general-purpose experience mechanisms do not consistently outperform no-experience baselines, while ToE achieves stronger overall performance. These results highlight the importance of structured experience management for self-evolving agents in implicit-reward environments.

[NLP-40] OpenHalDet: A Unified Benchmark for Hallucination Detection across Diverse Generation Scenarios

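Quick read: This paper addresses two core problems in evaluating hallucination detection for large language models (LLMs): inconsistent inference configuration and evaluation, which makes reported results hard to compare, reproduce, and generalize; and limited coverage of downstream domains and tasks. The solution is OpenHalDet, a unified hallucination detection benchmark that standardizes the whole evaluation pipeline, from prompt construction and response generation to truthfulness annotation, detector scoring, and metric computation. It supports heterogeneous detector families under different access settings: black-box methods that use only generated outputs, gray-box methods that rely on probability-based signals, and white-box methods that exploit internal model signals. Bringing diverse tasks, models, and detectors into a shared framework enables controlled comparison across detection paradigms, and the benchmark is released as an open, extensible codebase.

Link: https://arxiv.org/abs/2606.06959
Authors: Xinyi Li,Zhen Fang,Yongxin Deng,Jinyuan Luo,Hongnan Ma,Changdae Oh,Zijing Shi,Shanshan Ye,Hanchen Wang,Shu-Lin Chen,Yadan Luo,Mengyue Yang,Sean Du,Sharon Li,Ling Chen
Institutions: University of Technology Sydney; University of Wisconsin–Madison; University of Bristol; The University of Queensland; Nanyang Technological University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Preprint. Code and data are available at this https URL

Click to view abstract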

Abstract:Hallucination detection is essential for the reliable deployment of large language models (LLMs). However, existing evaluations face two core challenges: inconsistent inference configuration and evaluation, and limited coverage of downstream domains and tasks. Consequently, reported detector performance is often difficult to compare, reproduce, and generalize beyond specific experimental settings. We introduce OpenHalDet, a unified benchmark for hallucination detection across diverse generation scenarios. OpenHalDet standardizes the evaluation pipeline, from prompt construction and response generation to truthfulness annotation, detector scoring, and metric computation. It supports heterogeneous detector families under different access settings, including black-box methods that use only generated outputs, gray-box methods that rely on probability-based signals, and white-box methods that exploit internal model signals. By bringing diverse tasks, models, and detectors into a shared framework, OpenHalDet enables controlled comparison and provides a systematic view of how different detection paradigms behave in LLM applications. We release OpenHalDet as an open and extensible codebase to facilitate reproducible evaluation and future development of hallucination detection methods. The code and datasets are available at this https URL.

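[NLP-41] Auditing Training Data in Domain-adapted LLMs: LoRA-MINT

Quick read: This paper addresses how to determine whether a given data sample was part of the training data of a large language model (LLM) fine-tuned with Low-Rank Adaptation (LoRA), that is, the Membership Inference Test (MINT) problem, which matters for managing intellectual property and sensitive data. The approach builds a systematic analysis framework around the relationship between model perplexity and membership status in order to estimate data exposure in fine-tuned LLMs. Experiments on four models and three benchmark datasets report precision between 0.77 and 0.92, outperforming state-of-the-art baselines and indicating robustness and generality. Although the experiments center on LoRA-adjusted LLMs, most of the methodology applies to other LLM adaptation techniques and, more generally, to other domain-adapted AI models, offering a scalable auditing tool for transparent and responsible deployment.

Link: https://arxiv.org/abs/2606.06946
Authors: Gonzalo Mancera,Daniel DeAlcala,Aythami Morales,Julian Fierrez,Ruben Tolosana,Francisco Jurado
Institutions: University of Granada; University of Seville
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: IEEE Conf. on Computers, Software, and Applications (COMPSAC), 2026

Click to view abstract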

Abstract:We present LoRA-MINT, a new methodology for Membership Inference Test (MINT) applied to recent Large Language Models (LLMs) fine-tuned for specific Natural Language Processing (NLP) tasks through Low-Rank Adaptation (LoRA). The primary goal is to assess whether individual samples were part of the training data of these adapted models, providing a useful auditing tool for the management of intellectual property and sensitive data. Our analysis explores the relationship between model perplexity and membership status, providing a systematic framework for estimating data exposure in fine-tuned LLMs. We conducted experiments on four models and three benchmark datasets, obtaining precision values in determining if given data were used for training ranging from 0.77 to 0.92, which outperform state-of-the-art baselines and demonstrate the robustness and generality of the proposed method. In general, our findings underscore the potential of LoRA-MINT as an effective and scalable framework for auditing LLMs, improving transparency, and fostering the ethical and responsible deployment of AI and NLP technologies. For the sake of concreteness and current relevance, our discussion and experiments are centered on LoRA-adjusted LLMs, but note that most of the presented methodology is easily applicable for auditing training data given any other technique for adapting LLMs or, more generally, any other domain-adapted AI models.
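
To make the perplexity-versus-membership idea concrete, the following Python sketch computes perplexity from per-token log-probabilities and applies a simple threshold rule, then scores the decisions with precision (the metric the abstract reports). The threshold rule and all function names are assumptions for illustration; the abstract does not describe the decision procedure LoRA-MINT actually uses.

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood over the sample's tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def is_member(token_logprobs, threshold):
    """Assumed rule: low perplexity under the fine-tuned model suggests membership."""
    return perplexity(token_logprobs) < threshold

def precision(predictions, labels):
    """True positives divided by all positive predictions."""
    tp = sum(1 for p, y in zip(predictions, labels) if p and y)
    fp = sum(1 for p, y in zip(predictions, labels) if p and not y)
    return tp / (tp + fp) if (tp + fp) else 0.0
```

In practice the threshold would have to be calibrated on samples whose membership is known, which is outside the scope of this sketch.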

[NLP-42] Didact: A Cross-Domain Capability Discovery System for Defence CIKM2026

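Quick read: This paper addresses the information fragmentation that policymakers in defence and defence-aligned sectors face when tracking rapidly evolving research alongside operational and strategic priorities. Relevant sources are scattered across heterogeneous formats, disjoint repositories, and siloed update streams, which makes capability discovery slow and difficult to audit. The authors present Didact, a prototype that integrates publicly available Australian defence reports and policy documents with a purpose-built knowledge graph derived from Australian research publications, and that supports natural language conversations for policy-oriented workflows through a composite retrieval-augmented generation (RAG) pipeline. A key feature is the interactive Evidence Rail, which visualises retrieved evidence and the relationships between sources. An evaluation of output quality and runtime highlights the system's utility. Although Didact was co-developed as an academia-industry project for the Australian context, it is adaptable to other domains where knowledge is similarly fragmented.

Link: https://arxiv.org/abs/2606.06942
Authors: Aarya Bodhankar,Aditya Joshi,Bao Gia Doan,Thomas Marchant,Oscar Leslie,Flora Salim
Institutions: University of New South Wales, Sydney, Australia; Cyndr.ai (Australia)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Under Review at CIKM 2026 (System Demonstration Track)

Click to view abstract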

Abstract:Policymakers in defence and defence-aligned sectors must monitor rapidly evolving research alongside sector priorities relevant to operational and strategic needs. In practice, these sources are fragmented across heterogeneous formats, disjoint repositories, and siloed update streams, making capability discovery slow and difficult to audit. We present Didact, a prototype that integrates publicly available defence reports and policy documents from Australia with a purpose-built knowledge graph derived from Australian research publications. Didact provides natural language conversations for policy-oriented workflows, and leverages a composite retrieval-augmented generation (RAG) pipeline. A key feature of Didact is an interactive Evidence Rail that visualises retrieved evidence and source relationships. Our evaluation of the output quality and runtime of Didact highlights its utility. While Didact has been co-developed as an academia-industry project for the Australian context, it is adaptable to other domains where knowledge is similarly fragmented. A demonstration video is available here:

[NLP-43] ThinkBooster: A Unified Framework for Seamless Test-Time Scaling of LLM Reasoning

Quick read: This paper addresses the fact that test-time compute (TTC) scaling strategies and reasoning scorers for large language models (LLMs) remain fragmented, are evaluated under inconsistent protocols, and are rarely analyzed in terms of quality-cost trade-offs. The solution is ThinkBooster, a unified TTC scaling framework consisting of (i) a modular Python library implementing state-of-the-art TTC scaling strategy and scorer families, (ii) a benchmark that jointly evaluates performance and computational efficiency, and (iii) a deployable OpenAI-compatible proxy service for drop-in integration of adaptive reasoning into real-world applications. A demo visual debugger is also provided for inspecting reasoning trajectories, intermediate selection decisions, and alternative reasoning paths. Results on mathematical and coding tasks reveal the performance-compute trade-offs of TTC scaling strategies and scoring methods and show practical gains on real-world tasks.

Link: https://arxiv.org/abs/2606.06915
Authors: Vladislav Smirnov(1),Chieu Nguyen(1),Sergey Senichev(7),Minh Ngoc Ta(1),Ekaterina Fadeeva(2),Artem Vazhentsev(1),Daria Galimzianova(1),Nikolai Rozanov(1 and 3),Viktor Mazanov(6),Jingwei Ni(2),Tianyi Wu(4),Igor Kiselev(5),Mrinmaya Sachan(2),Iryna Gurevych(1),Preslav Nakov(1),Timothy Baldwin(1),Artem Shelmanov(1) ((1) MBZUAI, (2) ETH Zürich, (3) Imperial College London, (4) NUS, (5) Accenture, (6) Innopolis University, (7) Independent Researcher)
Institutions: MBZUAI; ETH Zürich; Imperial College London; NUS; Accenture; Innopolis University; Independent Researcher
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Test-time compute (TTC) scaling has emerged as a powerful paradigm for improving large language model (LLM) reasoning by allocating additional compute during inference, e.g., via multi-sample generation and verifier-based reranking. Existing TTC scaling strategies and reasoning scorers remain fragmented, evaluated under inconsistent protocols, and are rarely analyzed through the lens of quality-cost trade-offs. We introduce ThinkBooster, a unified framework for seamless test-time compute scaling of LLM reasoning, which consists of (i) a modular Python library implementing state-of-the-art TTC scaling strategy and scorer families, (ii) a benchmark that jointly evaluates performance and computational efficiency, and (iii) a deployable OpenAI-compatible proxy service that enables drop-in integration of adaptive reasoning into real-world applications. We further provide a demo visual debugger for inspecting the reasoning trajectories, intermediate selection decisions, and alternative reasoning paths. Empirical results on mathematical and coding tasks reveal the performance-compute trade-offs of TTC scaling strategies and scoring methods and demonstrate that ThinkBooster provides practical gains in real-world tasks. The code is available online under an MIT license.
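
The abstract mentions multi-sample generation with verifier-based reranking as a typical TTC scaling strategy. A generic best-of-N sketch in Python shows the quality-cost bookkeeping involved. This is not ThinkBooster's API: `best_of_n`, `toy_scorer`, and the whitespace-token cost proxy are hypothetical names and choices made only for illustration.

```python
def toy_scorer(candidate):
    """Stand-in for a learned verifier: rewards candidates ending in the right answer."""
    return 1.0 if candidate.strip().endswith("4") else 0.0

def best_of_n(candidates, scorer):
    """Pick the highest-scoring sample and report a crude compute cost.

    Cost is approximated by whitespace-separated token count over all samples,
    since every sample must be generated before reranking.
    """
    best = max(candidates, key=scorer)
    cost = sum(len(c.split()) for c in candidates)
    return best, cost

samples = ["2 + 2 = 5", "2 + 2 = 4", "2 + 2 = 22"]
answer, spent = best_of_n(samples, toy_scorer)
```

Plotting answer quality against `spent` for different values of N is one simple way to look at the performance-compute trade-off the abstract refers to.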

[NLP-44] EASE-TTT: Evidence-Aligned Selective Test-Time Training for Long-Context Question Answering

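Quick read: This paper addresses why smaller language models struggle with long-context question answering (QA) even when the answer-bearing evidence is already in the input. Existing approaches have two limitations. Within-context retrieval methods only expose candidate evidence chunks at the input level and do not adapt the query-side attention parameters that control how the model allocates attention over full-context positions. Lightweight test-time adaptation methods such as query-only test-time training (qTTT) do adapt the query side, but their generic span-level self-supervised objectives cannot identify which context positions support the current answer. The paper proposes Evidence-Aligned SElective Test-Time Training (EASE-TTT), which converts selected evidence chunks into a soft attention supervision target over their token positions and uses that target to guide query-side adaptation. The full context is not replaced by the retrieved chunks; the adapted model generates the final answer from the original full context. On six LongBench QA tasks and three small decoder-only language models, EASE-TTT achieves the strongest macro-average performance among full-context inference, retrieval-only baselines, and qTTT.

Link: https://arxiv.org/abs/2606.06906
Authors: Xiaopeng Yuan,Zebin Wang,Suwen Wang,Zongxin Yang,Haohan Wang,Yushun Dong
Institutions: University of Illinois Urbana-Champaign; Harvard University; Brion, ASML US LP; Florida State University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 13 pages, 4 figures, 3 tables

Click to view abstract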

Abstract:Long-context question answering (QA) remains challenging for smaller language models even when answer-bearing evidence is already present in the input. Existing within-context retrieval methods localize and expose candidate evidence chunks for the question, but they stop at input-level evidence exposure rather than adapting the query-side attention parameters that control how the model allocates attention over full-context positions. In contrast, lightweight test-time adaptation methods, such as query-only test-time training (qTTT), leave evidence localization unresolved because their generic span-level self-supervised objectives do not identify which context positions support the current answer. In this paper, we propose Evidence-Aligned SElective Test-Time Training (EASE-TTT), a within-context retrieval-augmented test-time training framework that converts selected evidence chunks into a soft attention supervision target over their token positions. Instead of replacing the full context with retrieved chunks, EASE-TTT uses the resulting attention target to guide query-side adaptation, with the adapted model generating the final answer from the original full context. Experiments on six LongBench QA tasks and three small decoder-only language models show that EASE-TTT achieves the strongest macro-average performance among full-context inference, retrieval-only baselines, and qTTT, supporting evidence-aligned test-time adaptation in long-context QA.
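
One way to picture "converting selected evidence chunks into a soft attention supervision target over their token positions" is the Python sketch below. It builds a smoothed distribution that concentrates on evidence spans and measures how far a model's attention is from it with a cross-entropy term. The smoothing scheme, the loss form, and the function names are assumptions; the abstract does not give EASE-TTT's exact target or objective.

```python
import math

def soft_attention_target(n_tokens, evidence_spans, smoothing=0.1):
    """Distribution over token positions that concentrates on evidence spans.

    evidence_spans is a list of (start, end) index pairs, end exclusive.
    A small uniform mass keeps non-evidence positions from being zeroed out.
    """
    mask = [0.0] * n_tokens
    for start, end in evidence_spans:
        for i in range(start, min(end, n_tokens)):
            mask[i] = 1.0
    n_evidence = sum(mask)
    if n_evidence == 0:
        return [1.0 / n_tokens] * n_tokens
    return [(1.0 - smoothing) * m / n_evidence + smoothing / n_tokens for m in mask]

def attention_alignment_loss(target, attention, eps=1e-12):
    """Cross-entropy between the target and the model's attention distribution."""
    return -sum(t * math.log(a + eps) for t, a in zip(target, attention))
```

Minimizing a loss of this kind with respect to query-side parameters only would pull attention toward the evidence while the input itself stays the full context.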

[NLP-45] An Expanded Synthetic Conversation Dataset for Multi-Turn Smishing Detection

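Quick read: This paper addresses the performance bottleneck in multi-turn conversational smishing detection caused by insufficient data scale and data quality problems, and in particular why transformer models underperformed on long conversations in the authors' earlier work. The first-generation dataset suffered from contamination, label mismatch, stage-direction bleed, and prompt-design failures. The key contribution is COVA-X, an expanded dataset of 10,985 labeled conversations spanning eight elder-targeted scam categories, produced by an improved generation pipeline that addresses those problems. After retraining all classifiers on the expanded dataset, Longformer reaches 79.71% accuracy and 0.7786 macro F1, surpassing XGBoost (78.43% and 0.7563) on all evaluation metrics. The results support the view that transformer models need larger conversational corpora to realize their contextual advantages, and a pre/post-cleanup sensitivity analysis indicates that dataset refinement recovers genuine label-relevant signal.

Link: https://arxiv.org/abs/2606.06879
Authors: Carl Lochstampfor,Ayan Roy
Institutions: Unknown
Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments:

Click to view abstract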

Abstract:Our prior work introduced COVA, a synthetically generated multi-turn conversational smishing dataset of 3,201 labeled conversations, establishing baseline detection benchmarks across eight models. While XGBoost with TF-IDF features achieved the best performance, with 72.5% accuracy and 0.691 macro F1, transformer models underperformed, which was attributed to input truncation and insufficient training data. We present COVA-X, an expanded dataset of 10,985 conversations spanning eight elder-targeted scam categories, produced by an improved generation pipeline addressing contamination, label mismatch, stage-direction bleed, and prompt-design failures from the first iteration. Retraining all classifiers on the expanded dataset yields the central finding of this work: Longformer now surpasses XGBoost on all evaluation metrics, achieving 79.71% accuracy and 0.7786 macro F1 compared with 78.43% and 0.7563 for XGBoost. This directly confirms that transformer models require larger conversational corpora to realize their contextual advantages. We additionally document a quality life-cycle including a 12.7× improvement in label correction rate, from 49.8% to 3.9%, an architectural intervention reducing virtual-kidnapping artifact rates from 67.1% to 46.5%, and a per-scam-type outcome analysis showing that scam categories modulate results in mechanism-consistent ways. A pre/post-cleanup sensitivity analysis confirms that dataset refinement recovers genuine label-relevant signal across all three classifier architectures.

[NLP-46] Are Large Language Models Suitable for Graph Computation? Progress and Prospects

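Quick read: This survey addresses the open question of when large language models (LLMs) can reliably support graph computation, where tasks require reasoning over structured relationships and algorithmic operations, and how they should be incorporated into graph-solving pipelines. Its key contribution is a role-based taxonomy with two major paradigms: LLMs as executors, where models directly solve graph tasks from graph descriptions and instructions; and LLMs as planners, where models formulate problems, decompose reasoning steps, and invoke external tools or agents for execution. The review finds that LLMs are promising for simple, small-scale tasks but remain unreliable for large-scale and exactness-demanding ones. It also summarizes available datasets and suggests four future directions.

Link: https://arxiv.org/abs/2606.06865
Authors: Yuting Zhang,Yi Han,Kai Wang,Wei Ni,Angela Bonifati,Wenjie Zhang
Institutions: University of New South Wales; Antai College of Economics and Management, Shanghai Jiao Tong University; Edith Cowan University; Lyon 1 University
Categories: Computation and Language (cs.CL)
Comments:

Click to view abstract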

Abstract:Large language models (LLMs) have been increasingly explored for graph computation, where tasks require reasoning over structured relationships and algorithmic operations. Yet, it remains unclear when LLMs can reliably support such computation and how they should be incorporated into graph-solving pipelines. Existing surveys at the intersection of LLMs and graphs primarily focus on graph learning, text-attributed graphs, or graph-language modeling. To bridge this gap, we provide a comprehensive review of LLMs for graph computation through a role-based taxonomy. Specifically, we identify two major paradigms: i) LLMs as executors, where models directly solve graph tasks from graph descriptions and instructions; and ii) LLMs as planners, where models formulate problems, decompose reasoning steps, and invoke external tools or agents for execution. Based on this taxonomy, we analyze the strengths and limitations of current methods. Our review indicates that LLMs are promising for simple, small-scale tasks, but remain unreliable for large-scale and exactness-demanding tasks. Finally, we summarize available datasets and suggest four future directions.

[NLP-47] Interpreting Brain Responses to Language with Sparse Features from Language Models

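Quick read: This paper addresses a central question in cognitive neuroscience, namely which features are represented by human language cortex, and focuses on the correspondence between artificial language models (LMs) and biological neural activity. Studies of this kind are often criticized as relating one black box to another. The paper introduces Augmented Sparse Encoding Models, an encoding framework that replaces dense LM hidden states with hierarchically organized sparse autoencoder (SAE) features and explicitly includes surprisal as a predictor. The approach yields interpretations of neural responses and tests whether model-brain alignment reflects primary or idiosyncratic variation in LM representations. Using a 7T fMRI dataset of eight participants listening to 200 linguistically diverse sentences, the authors first validate the framework by recovering earlier interpretations of voxel populations tuned to processing difficulty and meaning abstractness. They then interpret a previously uncharacterized but reliable voxel population and find it tuned to people-related content. They further show that the fronto-temporal language network is predicted by a common set of features across its regions, while frontal regions are relatively well explained by surprisal alone, even without LM-based features. Finally, brain responses are not predictable from an arbitrary set of LM features; they are best explained by the features that capture the most general information in LM representations, suggesting a nontrivial correspondence between brain and LM language representation.

Link: https://arxiv.org/abs/2606.06857
Authors: Michael A. Lepori,Kendrick Kay,Greta Tuckute
Institutions: Brown University; University of Minnesota; Kempner Institute; Harvard University
Categories: Computation and Language (cs.CL)
Comments:

Click to view abstract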

Abstract:A central goal of cognitive neuroscience is to characterize the features that are represented by human language cortex. Artificial language models (LMs) have emerged as a powerful tool to address this challenge, but studies relating biological and artificial representations are often criticized as relating one black box to another. The present work introduces Augmented Sparse Encoding Models, an encoding framework that replaces dense LM hidden states with hierarchically-organized sparse autoencoder (SAE) features, while explicitly including surprisal as a predictor. Using this approach, we (i) produce interpretations of neural responses and (ii) test whether model-brain alignment reflects primary or idiosyncratic variation in LM representations. Using a high-field 7T fMRI dataset of eight participants listening to 200 linguistically diverse sentences, we first validate our modeling framework by recovering previous interpretations of voxel populations tuned to processing difficulty and meaning abstractness. We then interpret a previously-uncharacterized (but reliable) voxel population and find that it is tuned to people-related content. Next, we show that the fronto-temporal human language network is predicted by a common set of features across its constituent regions, but find that frontal regions are relatively well-explained by surprisal alone, even in the absence of LM-based features. Finally, we show that brain responses during language processing are not merely predictable from an arbitrary set of LM features. Rather, brain responses are best explained by the features that tend to capture the most general information encoded in LM representations, suggesting a nontrivial correspondence between brain and LM language representation.

[NLP-48] CRAFT: A Unified Counterfactual Reasoning Framework for Tabular Question Answering and Fact Verification

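Quick read: This paper addresses the difficulty large language models (LLMs) have with table reasoning, especially tasks that require multi-step inference over long, structured tables. Existing approaches rely predominantly on single-direction reasoning, which limits their ability to explore alternative hypotheses. The proposed solution is CRAFT, a unified Counterfactual Reasoning Framework that reformulates tabular question answering and fact verification as a general bidirectional verification process. The method explicitly constructs declarative statements and their counterfactual variants, extracts evidence from reasoning along both the original and the counterfactual path, and integrates it through a weighted mechanism to reach the final answer. CRAFT consistently surpasses representative baselines on table reasoning datasets such as WikiTQ and TabFact, with especially large improvements on complex question answering, and it significantly narrows the performance gap between different backbone LLMs.

Link: https://arxiv.org/abs/2606.06842
Authors: Chenshuo Pan,Yu Zhao,Jie Zhang,Changzai Pan,Zhenhe Wu,Jiayi Liang,Yujie Mao,Shuangyong Song,Yongxiang Li,Zhongjiang He
Institutions: Xingchen AGI Lab, China Telecom Artificial Intelligence Technology (Beijing) Co., Ltd
Categories: Computation and Language (cs.CL)
Comments: 24 pages, 10 figures

Click to view abstract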

Abstract:Table reasoning remains challenging for large language models (LLMs), particularly in tasks that require multi-step inference over long and structured tables. Existing approaches predominantly rely on single-direction reasoning, which limits their ability to explore alternative hypotheses across tasks. In this work, we propose CRAFT, a unified Counterfactual Reasoning Framework that reformulates Tabular question answering and fact verification into a general bidirectional verification process. Our method explicitly constructs both declarative statements and their counterfactual variants. Evidence is then extracted from reasoning along both the original and counterfactual paths, and integrated via a weighted mechanism to arrive at the final answer. Experimental results show that our approach consistently surpasses representative baselines on table reasoning datasets such as WikiTQ and TabFact, achieving especially large improvements on complex question answering. Our framework also significantly mitigates performance gaps between different backbone LLMs. This indicates that counterfactual reasoning effectively overcomes the limitations of single-direction inference, guiding LLMs toward more discerning reasoning and establishing a more principled paradigm for structured reasoning tasks. Our code will be made publicly available upon acceptance.

[NLP-49] Characterize Then Distill: Mechanistic Reasoning in Large Output Spaces

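Quick read: This paper investigates the mechanism by which modern reasoning models achieve strong zero-shot performance on challenging multi-label tasks that require selecting a small set of relevant options from hundreds of thousands to millions of candidate labels. It characterizes reasoning as a two-phase process: a broad "shortlisting" of candidates, followed by fine-grained reasoning over the resulting set. Evidence across a range of datasets indicates that the two steps can be isolated and are complementary. Building on this characterization, the authors develop a mechanistic distillation strategy that consistently outperforms standard distillation.

Link: https://arxiv.org/abs/2606.06840
Authors: Debjyoti Saha Roy,Byron C. Wallace,Javed A. Aslam
Institutions: Northeastern University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract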

Abstract:Modern reasoning models offer surprisingly strong zero-shot performance on challenging multi-label tasks that require selecting a small set of relevant options from hundreds of thousands to millions of candidate labels. We investigate how they achieve this mechanistically. We characterize reasoning as a two-phase process: A broad “shortlisting” of candidates followed by fine-grained reasoning over the resulting set. We provide evidence across a range of datasets that these steps can be isolated and are complementary. Using this characterization, we develop a mechanistic distillation strategy that consistently outperforms standard distillation.

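[NLP-50] Translate-R1: Cost-Aware Translation Tool Use via Reinforcement Learning

Quick read: This paper addresses the performance gap of large language models (LLMs) across languages. Closing it natively would require pretraining or fine-tuning corpora that do not exist for most languages. Translating the input into the model's dominant language is an alternative, but translating everything is wasteful for languages the model already handles, and leaving the choice to the model fails because LLMs are overconfident and skip the tool even when they cannot understand the input. Prior work relies on language-specific rules, domain heuristics, language identifiers, or external routers, each needing manual engineering. The paper instead learns a single policy that decides when to translate from reward alone, developing language- and domain-adaptive introspection: the model assesses its own comprehension and invokes translation only when it cannot solve the task natively. Using data from the authors' answer-preserving translation pipeline, reinforcement learning is continued on the post-trained Qwen3-4B across 22 languages in three resource tiers (High, Low, XLow) and 5 domains, with confidence-gated GSPO introduced for cost-sensitive tool use. The gated policy lifts reward over the baseline by +4.6 on High, +23.5 on Low, and +17.5 on XLow. Against an unconstrained policy that almost always translates, it preserves full reward at 63% of the cost and is Pareto-optimal across 87% of the cost-sensitivity range. On 2 synthetic languages built to simulate completely unseen languages, it improves by +18.7 over the overconfident baseline. The policy transfers zero-shot to 9 held-out languages, and the paper analyzes how tool use emerges during training, per language and per domain.

Link: https://arxiv.org/abs/2606.06835
Authors: Pratik Jayarao,Chaitanya Dwivedi,Himanshu Gupta,Neeraj Varshney,Adithya M Devraj,Meet Vadera,Priyanka Nigam,Bing Yin
Institutions: Amazon Stores Foundation AI
Categories: Computation and Language (cs.CL)
Comments: 14 pages main text plus appendix, 7 figures, 11 tables

Click to view abstract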

Abstract:The performance gap across languages in LLMs is well documented, and closing it natively requires pretraining or fine-tuning on corpora that, for most languages, do not exist. Translation offers an alternative: converting an input into the model’s dominant language unlocks its full capabilities at once. Applying translation to every input, however, is wasteful for languages the model already handles, while leaving the choice to the model fails in the opposite way, as LLMs are overconfident and skip the tool even when they cannot understand the input. Prior work resolves this with language-specific rules, domain heuristics, language identifiers, or external routers, each requiring manual engineering. We instead learn a single policy that decides when to translate from reward alone, developing language- and domain-adaptive introspection that assesses its own comprehension and invokes translation only when it cannot solve a task natively. Using data built by our answer-preserving translation pipeline, we continue RL on the post-trained Qwen3-4B across 22 languages in 3 resource tiers (High, Low, XLow) and 5 domains, and introduce confidence-gated GSPO for cost-sensitive tool use. The gated policy lifts reward over the baseline by +4.6 on High, +23.5 on Low, and +17.5 on XLow. Against an unconstrained policy that almost always translates, it preserves full reward at 63% of the cost and is Pareto-optimal across 87% of the cost-sensitivity range. Additionally, to simulate behavior on a completely unseen language, we create 2 synthetic languages, where our gated policy improves +18.7 over the overconfident baseline that underutilizes the tool even on these incomprehensible inputs. The policy transfers zero-shot to 9 held-out languages, and we analyze how tool use emerges over training, per language and per domain. 
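
The trade-off at the heart of cost-sensitive tool use can be written down in a few lines of Python. The sketch below charges a fixed penalty whenever translation is invoked and compares the two options. It only illustrates the decision being learned: the additive penalty, the parameter `cost_weight`, and both function names are assumptions, and the paper's confidence-gated GSPO training procedure is not reproduced here.

```python
def cost_aware_reward(task_reward, used_translation, cost_weight):
    """Task reward minus a fixed penalty when the translation tool is called."""
    penalty = cost_weight if used_translation else 0.0
    return task_reward - penalty

def should_translate(reward_native, reward_translated, cost_weight):
    """Translate only if the expected gain outweighs the tool cost."""
    with_tool = cost_aware_reward(reward_translated, True, cost_weight)
    without_tool = cost_aware_reward(reward_native, False, cost_weight)
    return with_tool > without_tool
```

With `cost_weight=0.1`, a language the model already handles well (0.9 native versus 0.95 translated) is answered natively, while a language it barely understands (0.2 versus 0.8) triggers translation. Sweeping `cost_weight` traces out the kind of cost-sensitivity range the abstract mentions.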

[NLP-51] The Dark Regulome: Disentangling Predictability from Regulation in Genomic Foundation Models

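Quick read: High-grade gliomas integrate into neural circuits by forming functional synapses with neurons, and this paper asks which noncoding elements shape synaptogenic gene expression in tumor cells. Sequence foundation models offer a zero-shot route to the regulatory elements of the "dark genome" through in-silico mutagenesis (ISM), but likelihood-based scoring is tautologically coupled to local sequence predictability, which leaves the regulatory interpretation underdetermined. The authors propose a residualization-and-permutation diagnostic that separates predictability-driven from regulation-driven RIS variance. Across three architecturally distinct foundation models (Caduceus-Ph, HyenaDNA, Enformer) and 30,448 noncoding elements at 92 glioma-relevant loci, a sharp 10 kb proximal-regulatory horizon survives every control applied, whereas the LM-derived element-class hierarchy does not: a six-feature linear baseline matches Caduceus top-decile membership at AUC = 0.985. Cross-architecture decomposition shows the two language models co-ranking long, well-predicted transposable elements, while Enformer alone retains residual signal that discriminates candidate cis-regulatory elements (cCREs); the two top-100 lists have zero overlap. Conservation, brain cis-eQTL, and STRING-PPI cross-checks show the top-100 elements to be 3.3-fold enriched per model for matching brain eQTLs (p_emp < 5×10⁻³), while a transposable-element regulatory hypothesis and an NRXN1+NLGN1 protein-pair convergence both fail properly constructed permutation tests. The diagnostic is offered as a general methodological tool for any ISM-based regulatory study.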

Link: https://arxiv.org/abs/2606.06834
Authors: Chahat Baranwal, Aadtya Baranwal, Lakshya Nitin Tandon
Affiliations: IIT Jodhpur; University of Central Florida; Northeastern University
Categories: Computation and Language (cs.CL); Genomics (q-bio.GN)
Comments:

Abstract: High-grade gliomas integrate into neural circuits through functional synapses with neurons, raising the question of which noncoding elements shape synaptogenic gene expression in tumor cells. The regulatory program written across the dark genome, what we call the \textit{dark regulome}, is the natural substrate to probe, and sequence foundation models offer a zero-shot route through in-silico mutagenesis (ISM); yet likelihood-based scoring is tautologically coupled to local sequence predictability, leaving the regulatory interpretation underdetermined. Across three architecturally distinct foundation models (Caduceus-Ph, HyenaDNA, Enformer) and 30,448 dark genome elements at 92 glioma-relevant loci, we introduce a residualization-and-permutation diagnostic that separates predictability-driven from regulation-driven RIS variance. A sharp 10 kb proximal-regulatory horizon survives every control we apply, but the LM-derived element-class hierarchy does not: a six-feature linear baseline matches Caduceus top-decile membership at AUC = 0.985. Cross-architecture decomposition cleanly separates a sequence-predictability layer (the two language models co-rank long well-predicted transposable elements) from a regulatory-output layer (Enformer alone retains residual cCRE-discriminative signal), with literally zero overlap between the two top-100 lists. Conservation, brain cis-eQTL, and STRING-PPI cross-checks then anchor what biology survives: top-100 elements across all three models are $3.3\times$ enriched per model for matching brain eQTLs ($p_{\mathrm{emp}} < 5\times 10^{-3}$), while a tempting transposable-element regulatory layer and a striking NRXN1+NLGN1 protein-pair convergence both fail proper permutation tests once those tests are constructed. We deliver the diagnostic as a general methodological tool for any ISM-based regulatory study.
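
To make the idea of a residualization-and-permutation diagnostic concrete, here is a minimal sketch in Python. It is not the authors' implementation: the abstract does not give their features or test statistic, so the single predictability covariate, the one-feature least-squares fit, and the mean-residual statistic are all assumptions chosen for illustration, and the function names `residualize` and `permutation_pvalue` are hypothetical.

```python
import random
from statistics import mean


def residualize(scores, predictability):
    """Subtract the part of `scores` that a one-feature least-squares fit
    on `predictability` explains (illustrative stand-in for a richer model)."""
    mx, my = mean(predictability), mean(scores)
    sxx = sum((x - mx) ** 2 for x in predictability)
    sxy = sum((x - mx) * (y - my) for x, y in zip(predictability, scores))
    slope = sxy / sxx if sxx else 0.0
    return [y - (my + slope * (x - mx)) for x, y in zip(predictability, scores)]


def permutation_pvalue(residuals, is_regulatory, n_perm=2000, seed=0):
    """Empirical one-sided p-value: how often does a random set of the same
    size reach the mean residual observed for the labelled elements?"""
    k = sum(is_regulatory)
    if k == 0:
        raise ValueError("need at least one labelled element")
    rng = random.Random(seed)
    observed = mean(r for r, flag in zip(residuals, is_regulatory) if flag)
    hits = sum(
        1 for _ in range(n_perm) if mean(rng.sample(residuals, k)) >= observed
    )
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)
```

If the labelled elements score highly only because their sequence is easy to predict, the residuals carry no excess and the p-value stays large; a small p-value after residualization is the signal that is not explained by predictability alone.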

[NLP-52] Progress-SQL: Improving Reinforcement Learning for Text-to-SQL via Progressive Rewards

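Quick read: This paper addresses the limited guidance that existing reinforcement learning methods give to multi-turn, iterative correction in Text-to-SQL generation. Existing methods typically optimize one-shot rewards defined over a single SQL state, which cannot capture the improvement achieved across multi-turn refinement. The proposed Progress-SQL is a multi-turn reinforcement learning framework with progressive rewards. Its central component is the Oracle-guided Diagnostic Tree (ODT), which abstracts SQL queries into clause-level structural profiles and produces diagnostic feedback for next-turn refinement. ODT-based structural alignment is combined with lexical alignment to define a progressive reward that measures the improvement from the initial SQL to the final SQL. A progression latency reward favors earlier correctness, and an execution status reward encourages recovery from invalid SQL. Experiments on BIRD, Spider, and Spider robustness variants show consistent improvements on both primary and robustness evaluations.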

Link: https://arxiv.org/abs/2606.06825
Authors: Shihao Zhang, Xiaoman Wang, Yuan Liu, Yunshi Lan, Weining Qian
Affiliation: East China Normal University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Reinforcement learning has recently shown promise in improving large language models for Text-to-SQL generation, yet existing methods typically optimize one-shot rewards defined over a single SQL state. Such rewards provide limited guidance for iterative SQL correction and are insufficient to capture the improvement of multi-turn SQL refinement. In this paper, we propose Progress-SQL, a multi-turn reinforcement learning framework with progressive rewards for Text-to-SQL. Our approach introduces an Oracle-guided Diagnostic Tree (ODT), which abstracts SQL queries into clause-level structural profiles and produces diagnostic feedback for next-turn refinement. To provide dense and robust reward signals, we combine ODT-based structural alignment with lexical alignment and define a progressive reward that measures the improvement from the initial SQL to the final SQL. We further incorporate a progression latency reward that favors earlier correctness and an execution status reward that encourages recovery from the invalid SQL. Experiments on BIRD, Spider, and Spider robustness variants demonstrate that our method consistently improves Text-to-SQL performance across both primary and robustness evaluations.
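
The three reward terms described above can be sketched as follows. This is an assumption-laden illustration rather than the paper's reward: the abstract does not state the formulas, so the per-turn alignment scores are taken as given inputs, and the mixing weight `alpha`, the weights `w_latency` and `w_status`, and the exact shape of each term are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    structural: float  # assumed ODT-based structural alignment in [0, 1]
    lexical: float     # assumed lexical alignment in [0, 1]
    executes: bool     # did the SQL run without error?
    correct: bool      # did the result match the reference?


def alignment(turn, alpha=0.5):
    """Blend structural and lexical alignment (alpha is a placeholder)."""
    return alpha * turn.structural + (1 - alpha) * turn.lexical


def episode_reward(turns, w_latency=0.2, w_status=0.1):
    # Progressive term: improvement from the initial SQL to the final SQL.
    progress = alignment(turns[-1]) - alignment(turns[0])
    # Latency term: larger when the first correct turn comes earlier.
    first_correct = next((i for i, t in enumerate(turns) if t.correct), None)
    latency = 0.0 if first_correct is None else 1.0 - first_correct / len(turns)
    # Status term: reward recovering from an earlier invalid SQL.
    recovered = any(not t.executes for t in turns[:-1]) and turns[-1].executes
    return progress + w_latency * latency + w_status * (1.0 if recovered else 0.0)
```

Measuring progress as a difference between the last and first turn, rather than scoring the final SQL alone, is what lets a trajectory that repairs a poor first attempt earn more than one that starts and stays mediocre.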

[NLP-53] Quantifying Media Representation Dynamics Across 25 Years of News Reporting on Policing-related Deaths

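Quick read: This paper examines a narrative imbalance in Canadian news coverage of police-involved deaths: an asymmetry between institutional and civilian perspectives. Reporting leans on accounts from state bureaucrats while giving less space to relatives, community members, eyewitnesses, lawyers representing the family, and civil liberties groups. The authors develop a computational model, PerspectiveGap, grounded in prior sociological work on media representation of policing, and apply it to 4,000 articles from the last quarter-century. State bureaucrats' perspectives appear at nearly three times the rate of those from other members of the public, and a considerable fraction of articles contain no civilian point of view, although civilian representation has increased in recent years. Qualitatively, official accounts tend to be clinical and procedural, while civilian discourse carries considerably more emotional valence. The framework can be contextualized to other jurisdictions, offering a scalable way to analyze how media systems construct narratives around policing and accountability.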

Link: https://arxiv.org/abs/2606.06812
Authors: Farhan Samir, Jappun Dhillon, Meghna Ravikumar, Syed Ishtiaque Ahmed, Vered Shwartz
Affiliations: University of Toronto; University of British Columbia
Categories: Computation and Language (cs.CL)
Comments: 9 pages, 6 figures. Websci’26

Abstract:We perform the largest known computational analysis of Canadian news narratives about police-involved deaths, spanning 4,000 articles from the last quarter-century. We develop a novel computational model, PerspectiveGap, grounded in prior sociological work on media representation of policing. We find that reporting on police-involved deaths on average features perspectives from state bureaucrats at a rate nearly three times as much as perspectives from other members of the public, including relatives, community members, eyewitnesses, lawyers representing the family, or civil liberties groups. A considerable fraction of articles contain no points of view from civilian actors, though civilian representation has increased in recent years. Qualitatively, we find that state bureaucrats’ accounts of these deaths tend to be clinical and procedural, while civilian discourse carries considerably more emotional valence. The PerspectiveGap framework developed here can be contextualized to other jurisdictions, offering a scalable approach for analyzing how media systems construct narratives around policing and accountability.

[NLP-54] Korean Culture into LLM Alignment: Toward Cultural Coherence ICML2026

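Quick read: Work on the cultural aspects of large language models (LLMs) is dominated by a negative target, namely which outputs to suppress, and lacks a positive definition of a culturally coherent response. The paper argues for a constructive counterpart and instantiates it for Korean. At the centre is a Korean-culturally-adapted safe-response policy: a per-category guideline grounded in Korean legal frameworks, social norms, and interpretive conventions. Around it the authors build an alignment-data pipeline in which a prompt-based LLM seed generator expands a Korean harm taxonomy and three frontier models each produce a candidate response against the policy. Direct Preference Optimization (DPO) fine-tuning on the resulting triplets improves the Korean cultural safe rate across six open-weight LLMs without large degradation on Korean general-capability benchmarks. Qualitatively, the fine-tuned models name Korean statutes and institutional procedures and, where appropriate, supply constructive Korean-context information alongside refusal.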

Link: https://arxiv.org/abs/2606.06797
Authors: MinJae Jung, Minwoo Kim
Affiliation: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted to ICML 2026 Workshop on Culture X AI

Abstract:Cultural-aspect work on large language models is dominated by a negative target: which outputs to suppress. We argue that a constructive counterpart is also needed, a working definition of what a culturally coherent response is rather than only what it must avoid, and instantiate it for Korean. We design an alignment-data pipeline around a prompt-based LLM seed generator that expands a Korean harm taxonomy, with a Korean-culturally-adapted safe-response policy at its centre: a per-category guideline grounded in Korean legal frameworks, social norms, and interpretive conventions, against which three frontier models each produce a candidate response. DPO fine-tuning on the resulting triplets improves the Korean cultural safe rate across six open-weight LLMs while causing no large degradation on Korean general-capability benchmarks, and qualitative outputs show fine-tuned models naming Korean statutes and institutional procedures and, where appropriate, supplying constructive Korean-context information alongside refusal.

[NLP-55] When Better Codebooks Are Not Enough: Predictive Performance and Behavioral Reliability in LLM Political Event Coding

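Quick read: This paper examines whether LLMs that code text for social-science research with high predictive accuracy are also behaviorally reliable. The setting is political event coding, a challenging source-target relation classification task in which the model must determine what one actor did to another using detailed coding rules. The authors test whether expert-written codebooks become more effective when operationalized into LLM-friendly forms with clearer definitions, examples, retrieved context, and rules for difficult cases. They then evaluate behavioral reliability under controlled changes to label names, codebook order, and label-definition mappings. Clearer codebooks substantially improve classification performance, especially for fine-grained event classification, but these gains do not fully translate into behavioral reliability: models may produce valid labels and recover definitions while still failing reliability tests. The takeaway is that codebook-guided LLM systems should be evaluated not only on accuracy but also on whether they preserve the coding logic that makes coded outputs meaningful for social-science research.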

Link: https://arxiv.org/abs/2606.06781
Authors: Zixian He, Bharath Raahul Murugesan, Patrick Brandt, Yibo Hu
Affiliations: Illinois Institute of Technology; The University of Texas at Dallas
Categories: Computation and Language (cs.CL)
Comments: 14 pages, 3 figures, 11 tables

Abstract: High accuracy does not necessarily make an LLM a faithful coder. This issue matters because many social-science studies rely on expert-written codebooks to turn text into structured data. We study this problem in political event coding, a challenging source-target relation classification task beyond ordinary sentence-level classification, where models must determine what one actor did to another using detailed coding rules. We test whether expert codebooks become more effective when operationalized into LLM-friendly forms with clearer definitions, examples, retrieved context, and rules for difficult cases. We then evaluate behavioral reliability under controlled changes to label names, codebook order, and label-definition mappings. Clearer codebooks substantially improve classification performance, especially for fine-grained event classification. However, these predictive gains do not fully translate into behavioral reliability. Models may produce valid labels and recover definitions while still failing behavioral reliability tests under controlled codebook changes. These findings suggest that codebook-guided LLM systems should be evaluated not only by accuracy, but also by whether they preserve the coding logic that makes coded outputs meaningful for social-science research.

[NLP-56] A Four-Condition Diagnostic Protocol for Evidence Utilization in Long-Context and Retrieval-Augmented Language Models

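Quick read: Final-answer accuracy, retrieval recall, and citation overlap cannot by themselves show whether a long-context or retrieval-augmented language model actually used the evidence it was given: a model may answer from parametric memory, fail despite receiving the right passages, or cite evidence without converting it into the requested answer. The paper proposes a matched four-condition evidence-availability protocol (no evidence, full context, retrieved evidence, and oracle-evidence reference) that diagnoses evidence utilization under fixed examples, prompts, score fields, retrieval settings, and validity checks. ONCU serves as a protocol-bound estimator of recovered oracle-reference evidence advantage and is computed only for denominator-valid groups; denominator-free answer, evidence, retrieval, and failure-audit metrics are reported separately. The empirical study covers five local open-weight models from the Qwen, Gemma, Llama, and Mistral families on Controlled-ONCU-safe16K, HotpotQA-ONCU, and 2WikiMultiHopQA-ONCU, with 18,000 ONCU-compatible predictions. The main finding is a task-dependent bottleneck split: controlled synthetic settings primarily expose full-context utilization failures, whereas the tested realistic multi-hop settings primarily expose retrieval-chain coverage failures. The contribution is a diagnostic protocol that separates no-evidence answerability, oracle-evidence recoverability, full-context utilization, and retrieval-conditioned utilization, rather than a single-score leaderboard.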

Link: https://arxiv.org/abs/2606.06758
Authors: Haizhou Xia
Affiliation: University of Western Ontario
Categories: Computation and Language (cs.CL)
Comments: 52 pages, 34 tables, 1 figure

Abstract:Final-answer accuracy, retrieval recall, and citation overlap do not by themselves identify whether a long-context or retrieval-augmented language model used the evidence it was given. A model can answer from parametric memory, fail despite receiving the right passages, or cite evidence without converting it into the requested answer. This paper proposes a matched four-condition evidence-availability protocol–no evidence, full context, retrieved evidence, and oracle-evidence reference–for diagnosing evidence utilization under fixed examples, prompts, score fields, retrieval settings, and validity checks. ONCU is used as a protocol-bound estimator of recovered oracle-reference evidence advantage and is computed only for denominator-valid groups; denominator-free answer, evidence, retrieval, and failure-audit metrics are reported separately. The empirical study evaluates five local open-weight models from the Qwen, Gemma, Llama, and Mistral families across Controlled-ONCU-safe16K, HotpotQA-ONCU, and 2WikiMultiHopQA-ONCU, with 18,000 ONCU-compatible predictions. The main finding is a task-dependent bottleneck split: controlled synthetic settings primarily expose full-context utilization failures, whereas the tested realistic multi-hop settings primarily expose retrieval-chain coverage failures in denominator-free answer and evidence metrics, with ONCU supporting the same direction on oracle-improving groups. The contribution is a diagnostic protocol for separating no-evidence answerability, oracle-evidence recoverability, full-context utilization, and retrieval-conditioned utilization, rather than a single-score leaderboard for long-context or retrieval-augmented systems.

[NLP-57] PromptPrint: Behavioral Biometrics Through Natural Language Prompting in LLMs

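Quick read: This paper asks whether the short prompts typical of LLM interactions carry a stable, author-identifiable, and distinctive signal. Authorship attribution has traditionally focused on long-form, expressive texts, whereas LLM interactions are brief and task-driven. PromptPrint is a systematic study of prompt-based identity: the hypothesis that a user's habitual vocabulary, syntax, and discourse patterns form a learnable behavioral biometric. Using 20,680 real prompts from 1,034 users, the authors report three findings. First, lexical representations significantly outperform semantic encoders, supporting the "lexical stability hypothesis" that identity is encoded mainly in surface-level word choice rather than abstract intent. Second, stylometric features exhibit a "uniqueness-consistency paradox": users are highly distinctive across the population yet behaviorally inconsistent across contexts. Third, adversarial analysis reveals a vulnerability spectrum: identity signals are robust to minor lexical perturbations but degrade substantially under semantic paraphrasing. Overall, the results show strong identification performance at scale and establish prompt-based identity as a viable behavioral biometric, with implications for user modeling, security, and privacy.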

Link: https://arxiv.org/abs/2606.06755
Authors: Shaiv Patel, Kartik Narayan, Vishal Patel
Affiliation: Johns Hopkins University
Categories: Computation and Language (cs.CL); Emerging Technologies (cs.ET)
Comments: 10 pages, 6 figures

Abstract:Authorship attribution research has traditionally focused on long-form, expressive texts; however, interactions with large language models (LLMs) are typically brief and task-driven prompts. This raises a fundamental question: do such prompts contain a stable, author-identifiable, and distinctive signal? We introduce PromptPrint, a systematic study of prompt-based identity, the hypothesis that a user’s habitual vocabulary, syntax, and discourse patterns form a learnable behavioral biometric. Using 20,680 real prompts from 1,034 users, we establish three key findings. First, lexical representations significantly outperform semantic encoders, supporting the “lexical stability hypothesis”: identity is primarily encoded in surface-level word choice rather than abstract intent. Second, stylometric features exhibit a “uniqueness-consistency paradox”: users are highly distinctive across the population, yet behaviorally inconsistent across contexts. Third, adversarial analysis reveals a clear vulnerability spectrum: identity signals are robust to minor lexical perturbations but degrade substantially under semantic paraphrasing. Overall, our results demonstrate strong identification performance at scale, establishing prompt-based identity as a viable behavioral biometric. This work introduces a new perspective on user modeling in LLM interactions, with important implications for security and privacy. Data and code will be released upon the acceptance of our work.

[NLP-58] Evidence Graph Consistency in Retrieval-Augmented Generation: A Model-Dependent Analysis of Hallucination Detection

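Quick read: Retrieval-Augmented Generation (RAG) reduces but does not eliminate hallucination in large language models (LLMs), and existing detectors rely on flat similarity between generated answers and retrieved passages, ignoring structural relationships among evidence pieces and answer claims. The paper proposes Evidence Graph Consistency (EGC), a framework that constructs a local evidence graph per response and computes five structural consistency measures as hallucination indicators. On the question answering split of RAGTruth across six LLMs (5,767 responses), EGC reveals a consistent model-family split: graph consistency features show the expected diagnostic direction for Llama-2 models but reverse systematically for GPT-4, GPT-3.5, and Mistral-7B. The reversal suggests qualitatively different hallucination patterns across model families and indicates that embedding-based graph consistency cannot serve as a model-independent hallucination detection signal.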

Link: https://arxiv.org/abs/2606.06748
Authors: Jianru Shen
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at the International Conference on Advanced Machine Learning and Data Science; to appear in the IEEE Xplore proceedings

Abstract:Retrieval-Augmented Generation (RAG) reduces but does not eliminate hallucination in large language models. Existing detection methods rely on flat similarity between generated answers and retrieved passages, ignoring structural relationships among evidence pieces and answer claims. We propose Evidence Graph Consistency (EGC), a framework that constructs a local evidence graph per response and computes five structural consistency measures as hallucination indicators. Evaluated on the full question answering split of RAGTruth across six LLMs (5,767 responses), EGC reveals a consistent model-family split: graph consistency features show the expected diagnostic direction for hallucinations in Llama-2 models but exhibit systematic reversal in GPT-4, GPT-3.5, and Mistral-7B. This reversal suggests qualitatively different hallucination patterns across model families and indicates that embedding-based graph consistency cannot serve as a model-independent hallucination detection signal.

[NLP-59] When to Think Deeply: Inhibitory Deliberation for LLM Reasoning

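Quick read: Invoking slow, deliberative reasoning for every input is computationally expensive and often unnecessary. The paper proposes IDPR, a framework for response-conditioned inhibitory deliberation. IDPR first generates a concise fast answer, then an inhibition controller decides whether that specific response should be released or suppressed in favor of slow reasoning. Unlike input-only routers, the controller conditions on the fast answer and fast-side evidence, including confidence, logit margin, parseability, and generation cost. On a held-out 5,000-example mathematical reasoning test set, IDPR invokes slow reasoning on only 8.20% of examples and improves accuracy from 47.90% to 48.92%; under the same slow-call budget, random routing drops to 46.76% and the strongest confidence-based baseline reaches 48.22%. IDPR also achieves the highest corrective precision, indicating that response-conditioned inhibition better identifies the fast answers that benefit from slow reasoning.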

Link: https://arxiv.org/abs/2606.06745
Authors: Zhixuan He, Yue Feng
Affiliation: University of Birmingham, United Kingdom
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Reasoning Large Language Models can improve problem-solving performance through deliberative inference, but invoking slow reasoning for every input is computationally expensive and often unnecessary. We propose IDPR, a framework for response-conditioned inhibitory deliberation. IDPR first generates a concise intuitive answer and then uses an inhibition controller to decide whether that specific response should be released or suppressed in favor of slow reasoning. Unlike input-only routers, the inhibition controller conditions on the fast answer and fast-side evidence, including confidence, logit margin, parseability, and generation cost. We train the controller from paired fast-slow outcomes and select the inhibition threshold on a held-out validation set under an accuracy-first slow-call budget. On a held-out 5,000-example mathematical reasoning test set, IDPR invokes slow reasoning on only 8.20% of examples and improves accuracy from 47.90% to 48.92%. Under the same slow-call budget, random routing decreases accuracy to 46.76%, while the strongest confidence-based baseline reaches 48.22%. IDPR also achieves the highest corrective precision, showing that response-conditioned inhibition better identifies fast answers that benefit from slow reasoning.
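
A response-conditioned gate of this kind might look like the sketch below. It is only an illustration under stated assumptions: the paper trains its controller from paired fast-slow outcomes, whereas the logistic weights here are hand-set placeholders, and `FastEvidence`, `inhibition_score`, `pick_threshold`, and `route` are hypothetical names. The threshold rule shown is one simple way to respect a slow-call budget on a validation set, not necessarily the authors' selection procedure.

```python
import math
from dataclasses import dataclass


@dataclass
class FastEvidence:
    confidence: float    # model confidence in the fast answer, in [0, 1]
    logit_margin: float  # gap between the top two logits
    parseable: bool      # could the fast answer be parsed?
    cost_tokens: int     # tokens spent generating the fast answer


def inhibition_score(ev, weights=(-2.0, -1.0, -1.5, 0.002), bias=2.0):
    """Probability-like score that the fast answer should be suppressed.
    Weights are placeholders; a real controller would learn them."""
    w_conf, w_margin, w_parse, w_cost = weights
    z = (bias + w_conf * ev.confidence + w_margin * ev.logit_margin
         + w_parse * float(ev.parseable) + w_cost * ev.cost_tokens)
    return 1.0 / (1.0 + math.exp(-z))


def pick_threshold(val_scores, budget):
    """Return a threshold such that at most `budget` of the validation
    scores lie strictly above it (so ties never exceed the budget)."""
    ranked = sorted(val_scores, reverse=True)
    k = int(budget * len(ranked))
    return ranked[k] if k < len(ranked) else float("-inf")


def route(ev, threshold):
    """Release the fast answer unless its score exceeds the threshold."""
    return "slow" if inhibition_score(ev) > threshold else "fast"
```

Because the score is computed from the fast answer's own evidence rather than from the input alone, two questions of similar difficulty can be routed differently depending on how the fast attempt actually turned out.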

[NLP-60] HybridCodec: Fast Dual-Stream Semantically Enhanced Neural Audio Codec INTERSPEECH2026

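Quick read: Neural audio codecs are increasingly used as speech tokenizers for Multimodal Large Language Models, which has motivated architectures that disentangle semantic and acoustic information. Two main approaches exist: distilling semantic information from self-supervised learning (SSL) representations into the first RVQ layer, or maintaining separate streams for semantic and acoustic features. HybridCodec unifies the two: it employs separate semantic and acoustic branches while distilling SSL representations into the semantic stream. The design gives strong disentanglement without requiring an SSL model during inference. HybridCodec shows superior semantic specialization (RVQ-1) on the in-domain test set and competitive reconstruction (RVQ-all), is robust in out-of-domain and zero-shot cross-lingual settings, and achieves a 3x speedup over existing dual-stream models.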

Link: https://arxiv.org/abs/2606.06743
Authors: Arjun Gangwar, S Umesh
Affiliation: Indian Institute of Technology, Madras
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 5 pages, 5 tables, 1 figure, Accepted at Interspeech 2026

Abstract:The popularity of neural audio codecs as speech tokenizers has surged with the advent of Multimodal Large Language Models. New codec architectures with semantic and acoustic disentanglement have emerged. There are two main approaches to introduce semantic information into codec models: one distills semantic information from SSL representations into the first RVQ layer, while the other maintains separate streams for semantic and acoustic features. We propose HybridCodec, a unified architecture that combines both paradigms. It employs separate semantic and acoustic branches while distilling SSL representations into the semantic stream. This design ensures strong disentanglement without requiring an SSL model during inference. HybridCodec shows superior semantic specialization (RVQ-1) on in-domain test set and competitive reconstruction (RVQ-all). We demonstrate its robustness in out-of-domain and zero-shot cross-lingual settings, achieving a 3x speedup over existing dual-stream models.

[NLP-61] OpenSkill: Open-World Self-Evolution for LLM Agents

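Quick read: Existing self-evolving agents assume a usable learning loop after deployment, such as curated skills, successful trajectories, or verifier signals, yet real open-world deployments may offer only a task prompt. The paper studies open-world self-evolution, in which an agent must build both its skills and its own verification signals from scratch without target-task supervision. OpenSkill bootstraps this loop: it acquires grounded knowledge and verification anchors from documentation, repositories, and the web, synthesizes them into transferable skills, and refines those skills against self-built virtual tasks grounded in the anchors rather than in target answers. The open world thus supplies both the knowledge to be learned and a supervision-independent practice environment, with target-task supervision reserved for final evaluation. Across three benchmarks and two target agents, OpenSkill attains the best automated pass rate while satisfying the no-supervision constraint; its skills transfer across models without model-specific adaptation, and its self-built verifier aligns with ground-truth outcomes despite never accessing them.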

Link: https://arxiv.org/abs/2606.06741
Authors: Zhiling Yan, Dingjie Song, Hanrong Zhang, Wei Liang, Yuxuan Zhang, Yutong Dai, Lifang He, Philip S. Yu, Ran Xu, Xiang Li, Lichao Sun
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 20 pages, 4 figures and 8 tables. Code is available at this https URL

Abstract: Self-evolving agents require adaptation after deployment, but existing approaches assume a usable learning loop, such as curated skills, successful trajectories, or verifier signals. Real open-world deployments may provide none of these, offering only a task prompt. In this work, we study open-world self-evolution, where an agent must build both its skills and its own verification signals from scratch, using open-world resources but no target-task supervision. We propose OpenSkill, a framework that bootstraps this loop: it acquires grounded knowledge and verification anchors from documentation, repositories, and the web, synthesizes them into transferable skills, and refines those skills against self-built virtual tasks grounded in the anchors rather than in target answers. The open world thus supplies both the knowledge to be learned and a supervision-independent practice environment, with target-task supervision reserved for final evaluation. Across three benchmarks and two target agents, OpenSkill attains the best automated pass rate while satisfying the no-supervision constraint. Analysis shows its skills transfer across models without model-specific adaptation, and its self-built verifier aligns with ground-truth outcomes despite never accessing them.

[NLP-62] Multilingual Multi-Speaker Unit Vocoders: A Systematic Analysis of Discrete Speech Representations INTERSPEECH2026

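Quick read: Discrete speech units obtained by k-means clustering of self-supervised embeddings entangle phonetic, speaker, and language information, which causes speaker mixing and cross-lingual interference in multilingual multi-speaker speech generation. The paper analyzes a BigVGAN-based unit vocoder across four Indian languages, studying the interaction between cluster size and conditioning strategies. Cluster size governs intelligibility by improving phonetic discriminability, while explicit speaker conditioning is indispensable for preventing identity collapse. Language supervision helps mainly at lower cluster sizes, where units remain ambiguous because similar phonemes across languages collapse to the same cluster IDs; larger inventories progressively separate them.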

Link: https://arxiv.org/abs/2606.06740
Authors: Naman Kothari, Arjun Gangwar, Adarsh Arigala, S Umesh
Affiliations: National Institute of Technology, Trichy; Indian Institute of Technology, Madras
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 5 pages, 5 tables, 1 figure, Accepted at Interspeech 2026

Abstract: Discrete speech units obtained via k-means clustering of self-supervised embeddings entangle phonetic, speaker, and language information, causing speaker mixing and cross-lingual interference in multilingual multi-speaker speech generation. Despite growing use in Audio LLMs and speech-to-speech systems, unit vocoders remain underexplored. We analyze a BigVGAN-based unit vocoder across four Indian languages. We study the interaction between cluster size and conditioning strategies using WER, speaker similarity, and unit-level metrics. Results show that cluster size governs intelligibility by improving phonetic discriminability, while explicit speaker conditioning is indispensable for preventing identity collapse. Language supervision yields further gains mainly at lower cluster sizes where units remain ambiguous. Our analysis shows similar phonemes across languages collapse to the same cluster IDs at smaller inventories, with larger clusters progressively separating them.

[NLP-63] Modular Monolingual Adaptation using Pretrained Language Models ACL2026

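Quick read: Monolingual language models for low-resource languages are usually built by fine-tuning an entire pretrained language model on the target language. The paper hypothesizes that full model tuning is often unnecessary and proposes a more modular approach: replace the tokens, freeze the corresponding embeddings, and tune the rest of the model. Experiments use Scottish Gaelic, Irish, and Quechua, the last being a very low-resource language with 8.5k training instances. Evaluation on natural language understanding tasks (mask filling, NER, and POS tagging) shows that the approach improves performance when adapting models to low-resource languages. The paper also analyzes training strategies and the choice of pretrained embeddings and models.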

Link: https://arxiv.org/abs/2606.06738
Authors: Nalin Kumar, Ondřej Dušek
Affiliations: Charles University, Faculty of Mathematics and Physics; Institute of Formal and Applied Linguistics
Categories: Computation and Language (cs.CL)
Comments: Accepted to ACL 2026 Industry Track

Abstract:Building monolingual language models (LMs) for low-resource languages typically relies on adapting pretrained language models (PLMs) by finetuning the whole model on the target language. This approach is widely favored over training from scratch, as it enables effective knowledge transfer. Additionally, prior work has shown that using a language-specific tokenizer can enhance the adaptability. In this work, we hypothesize that full model tuning is often unnecessary and propose a more modular approach. Specifically, we replace the tokens, freeze the corresponding embeddings, and tune the rest of the model. We use Scottish Gaelic, Irish, and Quechua for our experiments, with Quechua being a very low-resource language (8.5k training instances). Evaluation on natural language understanding (NLU) tasks – mask filling, NER, and POS – shows that our proposed approach improves performance when adapting models to low-resource languages. Additionally, we provide a comprehensive analysis of the effectiveness of training strategies, the choice of pretrained embeddings, and models.

[NLP-64] Does Topic Sentiment Cause Perceived Ideology? Comparing Human and LLM Annotations in Political News Articles ACL

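Quick read: This paper asks whether topic sentiment has a causal effect on perceived political ideology, and whether the answer depends on who assigns the ideology label. Using AllSides articles paired with shared sentiment annotations, the authors apply Double Machine Learning (DML) and community-level mediation analysis to ideology labels from four annotation paradigms: expert human annotators, baseline and fine-tuned GPT-4o-mini, and Llama-3.3-70B. Human annotations yield no significant causal effects at the community level. Fine-tuned GPT-4o-mini achieves the highest classification accuracy (F1=72.48) and is the only paradigm that produces significant community-level treatment effects and significant natural direct effects (NDEs). The authors interpret this as evidence of shortcut learning: fine-tuning on ideology-labeled data causes the model to internalize a spurious sentiment-ideology coupling that is not operative in human judgment for this task. The coupling is invisible to F1-based evaluation, which matters when LLM annotations are used as silver labels or as proxies for human judgment in downstream causal analyses.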

Link: https://arxiv.org/abs/2606.06715
Authors: Upasana Chatterjee
Affiliation: Columbia University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to ACL SRW 2026

Abstract:We ask whether topic sentiment has a causal effect on perceived political ideology, and whether the answer depends on who assigns the ideology label. Using articles from AllSides, paired with shared sentiment annotations from Llama-3.3-70b-versatile, we compare ideology labels from expert human annotators, GPT-4o-mini (baseline and finetuned), and Llama-3.3-70B. We apply Double Machine Learning (DML) and community-level mediation analysis across all four annotation paradigms. Human annotations yield no significant causal effects at the community level. Fine-tuned GPT-4o-mini achieves the highest classification accuracy (F1=72.48) and is the only annotator paradigm that produces significant community-level treatment effects and significant natural direct effects (NDEs) in mediation. We interpret this as evidence of shortcut learning: fine-tuning on ideology-labeled data causes the model to internalise a spurious sentiment–ideology coupling not operative in human judgment for this task. This coupling is structurally invisible to F1-based evaluation, with implications for the use of LLM annotations as silver labels and as proxies for human judgment in downstream causal analyses.

[NLP-65] Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation

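Quick read: Converting an autoregressive language model (ARLM) into a diffusion language model (DLM) incurs two distribution shifts. First, moving from a next-token prediction objective to a DLM objective can discard knowledge the ARLM acquired during training. Second, standard DLMs suffer from a train-inference mismatch: the training loss is defined on randomly masked sequences, whereas inference follows trajectories produced by confidence-based decoding. The paper introduces the On-Policy Diffusion Language Model (OPDLM), which uses On-Policy Distillation (OPD) for the conversion. OPDLM is trained via self-OPD: the student, an ARLM with bidirectional attention, generates its own trajectories, and the teacher, the original frozen ARLM, provides target logits on those trajectories. Training on-policy removes the train-inference mismatch, and distillation from the original model improves knowledge retention. OPDLM requires 15x to 7,000x fewer training tokens while performing strongly across a wide variety of tasks, which positions DLM conversion as a form of ARLM post-training.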

Link: https://arxiv.org/abs/2606.06712
Authors: Xingyu Su, Jacob Helwig, Shubham Parashar, Atharv Chagi, Lakshmi Jotsna, Degui Zhi, James Caverlee, Dileep Kalathil, Shuiwang Ji
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:We study the transformation of autoregressive models (ARLMs) into diffusion language models (DLMs). Rather than pretraining from scratch, prior work replaces the causal attention in ARLMs with bidirectional attention and then trains the resulting model using a DLM objective. However, these approaches incur two distribution shifts. First, transitioning from a next-token prediction objective to a DLM objective can discard knowledge acquired by the ARLM during training. Second, standard DLMs suffer from a train-inference mismatch, as the training loss is defined on randomly masked sequences rather than the trajectories encountered at inference produced by confidence-based decoding. To address both challenges, we introduce an On-Policy Diffusion Language Model (OPDLM) in which On-Policy Distillation (OPD) is employed for ARLM-to-DLM transformation. Specifically, OPDLM is trained via self-OPD, where the student, an ARLM with bidirectional attention, generates its own trajectories, and the teacher, the original frozen ARLM, distills its knowledge by providing target logits on these trajectories. By training directly in an on-policy manner, OPDLM eliminates the train-inference mismatch in DLMs, while distillation from the original model enhances knowledge retention from the ARLM. Empirical results demonstrate that OPDLM requires 15x to 7,000x fewer training tokens with strong performance across a wide variety of tasks. OPDLM avoids the prohibitive cost of DLM pretraining and positions DLM transformation as a form of ARLM post-training.

[NLP-66] Signal-Driven Observation for Long-Horizon Web Agents

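[Quick Read]: This paper addresses the progressive context degradation that long-running web agents suffer when they ingest the raw Document Object Model (DOM) and accessibility tree at every action step; such high-frequency observation overloads the context and erodes the agent's reasoning before the task completes. The core solution is Signal-Driven Observation (SDO), which decouples observation frequency from action frequency: a dedicated sub-call reads the full DOM but returns only task-relevant elements and their selectors, and it is re-invoked only when a lightweight signal detector fires. The signals include URL changes, newly visible interactive elements, action failures, and external browser events. The design is intended to cut redundant input and thereby mitigate context degradation, and the paper argues that observation compression should be treated as a core architectural decision in web agent design.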

Link: https://arxiv.org/abs/2606.06708
Authors: Shubham Gaur,Ian Lane
Affiliation: unknown
Categories: Computation and Language (cs.CL)
Comments: 10 pages, 1 figure

Abstract:Web agents operating over long horizons ingest raw DOM and accessibility trees – routinely tens of thousands of tokens – at every action step, causing progressive context degradation that erodes reasoning well before tasks complete. We argue that this coupling of observation frequency to action frequency is an architectural mistake. Drawing on the insight from Recursive Language Models that querying a document outperforms reading it wholesale, we propose Signal-Driven Observation (SDO): a dedicated sub-call reads the full DOM but returns only task-relevant elements and their selectors, and is re-invoked only when a lightweight signal detector fires – triggered by URL transitions, newly visible interactive elements, action failures, or exogenous browser events. We outline the open problems SDO introduces and call on the community to treat observation compression as a core architectural decision in web agent design.
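
A lightweight signal detector of the kind described above might look like the following sketch. The four signal types come from the abstract; the `BrowserSnapshot` structure, its field names, and the function names are hypothetical and only illustrate how observation could be gated on cheap page state instead of being repeated at every action step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BrowserSnapshot:
    """Cheap-to-collect page state; every field name here is hypothetical."""
    url: str
    interactive_ids: frozenset = frozenset()
    last_action_failed: bool = False
    exogenous_events: tuple = ()


def fired_signals(prev, curr):
    """Return the names of the signals that fired between two snapshots."""
    signals = []
    if curr.url != prev.url:
        signals.append("url_transition")
    if curr.interactive_ids - prev.interactive_ids:
        signals.append("new_interactive_elements")
    if curr.last_action_failed:
        signals.append("action_failure")
    if curr.exogenous_events:
        signals.append("exogenous_event")
    return signals


def should_reobserve(prev, curr):
    """Re-invoke the expensive DOM-reading sub-call only when a signal fires."""
    return bool(fired_signals(prev, curr))
```

In an agent loop, `should_reobserve` would be checked after each action, and the full-DOM sub-call would run only when it returns `True`; otherwise the agent keeps acting on the previously returned elements and selectors.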

[NLP-67] RECAP: Regression Evaluation for Continual Adaptation of Prompts

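[Quick Read]: This paper addresses how generative AI systems in production can adapt proactively to constraints that change over time. Existing benchmarks assume either static constraints or reactive protocols with evaluation feedback, so they do not reflect the deployment requirement of complying from the very next interaction. The paper proposes the RECAP benchmark, which follows a strictly proactive adapt-then-test protocol to quantify the continual-learning behavior (forgetting, regression, and forward transfer) of prompt optimization methods as constraints evolve. The six methods evaluated show no significant performance improvement even though they incur higher latency, which indicates that methods designed for offline or reactive settings are inadequate for proactive adaptation. The key point is that a generative AI system must generalize from the new constraint specification alone, before seeing any test data, to remain compliant and robust in a changing production environment; this underscores the need for prompt optimization methods designed for the proactive paradigm.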

Link: https://arxiv.org/abs/2606.06698
Authors: Harsh Deshpande,Kushal Chawla,Sangwoo Cho,William Campbell
Affiliation: Capital One
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Production agentic systems routinely face evolving constraints and must comply from the very next interaction. Scenarios like a tool-call notification changing a compliance threshold or a policy update adding disclosure requirements fit this criterion, having close to no room for errors in production. This proactive adaptation setting is common in deployment, but absent from current benchmarks, which assume either static constraint sets or reactive protocols with evaluation feedback. We introduce RECAP, a benchmark that measures continual-learning phenomena (forgetting, regression, forward transfer) at the constraint level under a strictly proactive adapt-then-test protocol: prompt optimization methods receive only the constraint specification and must generalize before seeing any test data. Evaluating six methods across four LLMs and three schedules with evolving constraints, we find that these methods show no significant improvement in performance, even after incurring a higher latency. These methods, designed for offline or reactive settings, are inadequate for the proactive paradigm. Our work emphasizes the growing need for designing proactive prompt adaptation methods, where the models must remain robust to evolving needs in deployment.

[NLP-68] HKJudge: A Legal Discourse-Annotated Corpus for Interpreting What Courts Find How They Reason and What They Rule

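[Quick Read]: This paper addresses the long-standing lack of expert-annotated corpora for legal discourse analysis of Hong Kong court judgments, which has hindered systematic study of their deep structure and reasoning. The core contribution is the Hong Kong Judgment Discourse Dataset (HKJudge), the first sentence-level expert-annotated legal discourse corpus. It covers criminal judgments from all five levels of Hong Kong's court hierarchy, contains about 290k sentences and 6.5 million tokens, and was annotated by 10 legal linguistics experts with an inter-annotator agreement of κ = 0.8. The key innovation is a two-tier discourse annotation schema: at the sentence level, each sentence is assigned one of 26 rhetorical roles that capture fact-finding, reasoning, and rulings; at the span level, three sentencing elements (charge, imprisonment term, fine) are annotated. On this basis the paper defines two tasks, rhetorical role classification and legal element extraction, and provides the first benchmark of four BERT-based models, two open-source large language models (LLMs), and four commercial LLMs, under zero-shot and fine-tuning settings. The study shows that sentence-level discourse annotation is valuable for modeling the structured logic of Hong Kong judgments and provides a high-quality data foundation for future work such as legal judgment prediction.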

Link: https://arxiv.org/abs/2606.06679
Authors: Xi Xuan,Wenxin Zhang,Yufei Zhou,King-kui Sin,Chunyu Kit
Affiliation: City University of Hong Kong; University of Chinese Academy of Sciences
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:Court judgments are central to legal practice and jurisprudence, yet discourse analysis of Hong Kong judgments has received limited attention, owing largely to the absence of expert-annotated corpora. We introduce the Hong Kong Judgment Discourse Dataset (HKJudge), the first sentence-level expert-annotated legal discourse corpus. HKJudge includes criminal judgments across all five levels of HK’s court hierarchy, comprising ~290k sentences and ~6.5 million tokens, fully annotated by legal linguistics experts. We design a two-tier discourse schema that captures what facts a court finds, how it reasons, and what it rules. At the sentence level, each sentence is assigned one of 26 rhetorical roles. At the span level, sentences are further annotated with three sentencing elements (charge, imprisonment term, fine). Ten legal linguistics annotators produced the annotations with an inter-annotator agreement of κ = 0.8. We formulate two tasks on HKJudge, termed rhetorical role classification and legal element extraction, and provide the first benchmark evaluation of four BERT-based models, two open-source LLMs under zero-shot and fine-tuning settings, and four commercial LLMs on both tasks. Our work demonstrates the value of sentence-level discourse annotation for modeling the structure of HK judgments and provides a rich data foundation for future work on legal judgment prediction. The HKJudge dataset and code are available at this https URL.

[NLP-69] What Do People Actually Want From AI? Mapping Preference Plurality

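[Quick Read]: This paper examines fundamental problems in aligning Large Language Models (LLMs) through Reinforcement Learning from Human Feedback (RLHF). Current practice has several limitations: it aggregates diverse and even conflicting human preferences into a single reward signal, ignoring large differences in individual values; it relies on unrepresentative samples, which distorts preference modeling; and it uses only binary comparisons as feedback, which cannot capture complex, contextual user needs. Drawing on 1,500 open-ended responses from 75 countries (the PRISM dataset), the study shows what people actually want from AI systems and where those wants conflict. Most values are requested by fewer than a quarter of respondents; only truthfulness reaches 49%, and even then respondents define it in fundamentally different ways: some ask for sourced claims, some for expert opinions, and some for non-mainstream views. Capabilities and features such as human-like behavior and AI guardrails are also contested, desired by some users and firmly rejected by others. Users often draw contextual distinctions between default behavior and behavior on request, a nuance that binary comparisons cannot express. The central conclusion is that current alignment practice flattens complex, situated, and contested preferences into a universal preference model, a practice that has been characterised as "epistemic violence". The way forward suggested is to move beyond reliance on a single reward model toward alignment paradigms that can identify and model plural, dynamic, context-sensitive, and possibly conflicting value systems.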

Link: https://arxiv.org/abs/2606.06674
Authors: Julia Sepúlveda Coelho,Scott A. Hale
Affiliation: Oxford Internet Institute, University of Oxford; Meedan
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: Accepted at the 2026 ACM Conference on Fairness, Accountability, and Transparency (FAccT '26)

Abstract:Large Language Models (LLMs) are often fine-tuned through Reinforcement Learning from Human Feedback (RLHF) to align with people’s preferences and values. However, this method has known limitations: it aggregates conflicting preferences, often relies on unrepresentative samples, and uses only binary comparisons. Analysing 1,500 open-ended responses from the PRISM dataset across 75 countries, we examine what people actually want from AI systems and reveal concrete failures of current methods. We find that different people want different things: most values are requested by fewer than a quarter of respondents, with truthfulness the sole exception at 49%. Furthermore, the same words hide divergent meanings: when people describe what they mean by “truthfulness”, they reveal distinct, potentially incompatible, epistemological bases, as some ask for sourced claims, some for expert opinions, and some even ask for unpopular views. Certain capabilities, namely how human-like a model behaves, and some features, like AI guardrails, are outright controversial, with some desiring them and others rejecting them. We additionally find that people often use contextual distinctions (what AI should do “by default” versus “if requested”) that binary comparisons cannot capture. These findings expose fundamental problems in current alignment practices. When 49% request truthfulness but define it differently, this is unlikely to be captured by a single reward model. The persistence of high hallucination rates in well-funded models, despite users’ clear demands for accuracy, suggests that current methods fail to identify actual preferences. This paper sheds light on the situated, contested, imperfect signals that are currently being flattened into universal preference models, a practice others have characterised as epistemic violence. 
Cite as: arXiv:2606.06674 [cs.CL] (or arXiv:2606.06674v1 [cs.CL] for this version). DOI: https://doi.org/10.48550/arXiv.2606.06674 (arXiv-issued DOI via DataCite, pending registration). Related DOI: https://doi.org/10.1145/3805689.3812398

[NLP-70] The Piggyback Hypothesis of Generalization: Explaining and Mitigating Emergent Misalignment

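[Quick Read]: This paper addresses the broad over-generalization that appears in Large Language Models (LLMs) after finetuning, in particular emergent misalignment (EM): after finetuning on a narrow task, the model deviates markedly on semantically unrelated test domains. The central question is why finetuned behavior spreads to out-of-domain queries by "piggybacking" on shared input structure such as chat-template tokens. The paper proposes the Piggyback Hypothesis: chat-template tokens can act as carriers that transfer finetuned behavior onto out-of-domain inputs. Empirically, small perturbations to the prefix (the fixed token sequence preceding the user query), or replacing the prefix representations with those from the unfinetuned model, restore alignment without changing the user input. Building on this, the paper proposes Token-Regularized Finetuning (TReFT), which regularizes the representations of specific tokens during training to suppress cross-domain behavior. Across multiple models and datasets, TReFT reduces EM while preserving in-domain learning; on Llama-3.1-8B finetuned on the legal domain, it achieves 33.5% more EM reduction than data interleaving. TReFT is also effective in other narrow-finetuning settings, including refusal, tool use, and abstention, reducing off-topic generalization by 54.3% on average, which supports the generality of the Piggyback Hypothesis. The work shows that LLMs may unintentionally propagate behavior through shared input features, points to a path toward more controlled finetuning, and calls for further study of how shared input structure carries model behavior across domains.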

Link: https://arxiv.org/abs/2606.06667
Authors: Jiachen Zhao,Zhengxuan Wu,Aryaman Arora,Yiyou Sun,David Bau,Weiyan Shi
Affiliation: Northeastern University; Stanford University; University of California, Berkeley
Categories: Computation and Language (cs.CL)
Comments:

Abstract:The mechanisms behind LLMs’ broad over-generalization beyond training examples remain unclear. Emergent misalignment (EM) offers a striking case study: finetuning on narrow tasks induces broad misalignment to semantically-unrelated test domains. In this work, we propose the Piggyback Hypothesis: the chat-template tokens can piggyback the finetuned behaviour onto out-of-domain queries. We validate this hypothesis by showing that subtle perturbations to the prefix (tokens preceding all user queries), or patching the prefix representations with those from the unfinetuned model, can restore alignment without changing the user query. Building on this finding, we propose Token-Regularized Finetuning (TReFT), which regularizes specific token representations during training to mitigate EM. Across different models and multiple EM-inducing datasets, TReFT reduces EM while preserving in-domain learning. On Llama-3.1-8B finetuned on the legal domain, TReFT achieves 33.5% more EM reduction than data interleaving with a retain set of aligned examples. We further show that TReFT extends to other narrow-finetuning settings, including abstention, tool use, and refusal (off-topic generalization is reduced by 54.3% on average), supporting the Piggyback Hypothesis. Broadly, our work highlights that LLMs may learn and generalize in unintended ways and suggests a path toward more constrained finetuning. It also calls for further study of how shared input features can piggyback model behavior across domains.

[NLP-71] CAF-Gen: A Multi-Agent System for Enriching Argumentation Structures

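[Quick Read]: This paper addresses the challenge of formally modeling complex reasoning in natural text, namely how to automatically build structured argument models that conform to advanced argumentation frameworks such as the Carneades Argumentation Framework (CAF) from unstructured text. Existing Argument Mining (AM) techniques can identify basic claims and premises but struggle to capture deeper structural information such as premise types, proof standards, and argument schemes. The paper proposes CAF-Gen, an automated multi-agent framework built around an iterative Creator-Reviewer mechanism: a creator agent generates an initial argument structure, and a reviewer agent critically validates it to ensure structural integrity and logical consistency. This collaboration mitigates the structural instability common in single-pass generation. Experiments show that the method improves the quality of the generated data, aligns closely with the original annotations, and produces structurally richer argument models, demonstrating the effectiveness and robustness of multi-agent systems for automated formal argument modeling.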

Link: https://arxiv.org/abs/2606.06646
Authors: Jakub Bąba,Jarosław Chudziak
Affiliation: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted for publication in the proceedings of ICCCI 2026

Abstract:Formalizing complex reasoning from natural text is one of the central challenges in computational linguistics. It requires systems to understand not just keywords but also the context and complex reasoning embedded in a text. Current Argument Mining (AM) techniques identify basic claims and premises, yet they often struggle to capture the richer structural information required by advanced schemas such as the Carneades Argumentation Framework (CAF), which incorporates features such as premise types, proof standards, and argument schemes. We address this limitation by introducing CAF-Gen, an automated multi-agent framework designed to enrich shallow argument structures into CAF-compliant argument models. By employing an iterative Creator-Reviewer pipeline, a creator agent’s output is validated by a critical agent to ensure structural integrity. This multi-agent collaboration is crucial for mitigating the structural instability typical of single-pass generative models. Our experiments demonstrate that the iterative feedback loop improves the quality of the resulting data and achieves strong alignment with the original annotations, while producing structurally richer models. Our findings show that the multi-agent system can overcome the limitations of single-pass generation, providing a robust methodology for the automated modeling of formal argumentation.
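
The iterative Creator-Reviewer pipeline can be pictured as a bounded feedback loop. The sketch below is a generic illustration, not the CAF-Gen implementation: the `creator` and `reviewer` callables, their signatures, and the `max_rounds` cap are all assumptions standing in for LLM-backed agents.

```python
def enrich(argument, creator, reviewer, max_rounds=3):
    """Run a Creator-Reviewer loop until the reviewer accepts or rounds run out.

    `creator(argument, feedback)` returns a candidate structure and
    `reviewer(candidate)` returns `(accepted, feedback)`. Both callables are
    hypothetical stand-ins for agents.
    """
    feedback = None
    candidate = None
    for round_index in range(1, max_rounds + 1):
        # The creator sees the reviewer's last critique, if there was one.
        candidate = creator(argument, feedback)
        accepted, feedback = reviewer(candidate)
        if accepted:
            return candidate, round_index
    # Budget exhausted: return the last candidate even though it was not accepted.
    return candidate, max_rounds
```

Capping the number of rounds keeps cost bounded; a caller that needs to tell accepted from unaccepted results apart would also return the final `accepted` flag.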

[NLP-72] How Language Models Fail: Token-Level Signatures of Committed and Persistent Reasoning Failures

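[Quick Read]: This paper studies failures in Large Language Model (LLM) reasoning, which leave identifiable patterns in the reasoning trace. The core question is how to distinguish different types of reasoning failure and develop effective detection and response strategies accordingly. The key is to analyze the reasoning process with token-level uncertainty signals, which reveal two fundamentally different failure modes. The first is committed failure, in which the model locks onto an incorrect reasoning path early; its diagnostic signature is a commitment point, beyond which adding more context makes failure detection worse. The second is persistent uncertainty, in which uncertainty accumulates throughout the reasoning process, so the full trace is needed to separate successes from failures. The diagnostic signatures are consistent across 23 model-dataset configurations, and the framework's falsifiable predictions hold in 20 of 23 cases, well above chance. The study also shows what this failure-mode taxonomy implies for self-consistency, identifying when uncertainty signals complement it and when it can be selectively skipped. The results provide a basis for understanding when LLM reasoning failures become detectable and for designing detection strategies that adapt accordingly.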

Link: https://arxiv.org/abs/2606.06635
Authors: Tanvi Thoria,Kiana Jafari,Marc R. Schlichting,Mykel J. Kochenderfer
Affiliation: Stanford University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Failures in language model reasoning emerge through distinct processes that leave identifiable signatures in the reasoning trace. We characterize these failures using token-level uncertainty signals, finding they arise through two empirically distinguishable processes. The first is committed failure, in which a model locks onto an incorrect reasoning path early in its trace. A central diagnostic signature is the commitment point, beyond which considering additional tokens hurts rather than helps failure detection. In the second, persistent uncertainty, uncertainty instead accumulates throughout, and the full trace is needed to best distinguish failing from successful completions. These signatures reproduce across 23 model-dataset configurations, with the framework’s falsifiable predictions holding in 20 of 23 cases, well above chance across both failure modes. Finally, we demonstrate our failure mode framework has direct implications for self-consistency, identifying when uncertainty signals complement it and when it can be selectively skipped. These results offer a foundation for understanding when LLM reasoning failures become detectable and for adapting detection strategies accordingly.

[NLP-73] UnpredictaBench: A Benchmark for Evaluating Distributional Randomness in LLMs

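[Quick Read]: This paper addresses the failure of Large Language Models (LLMs) to capture a target distribution when simulating the randomness of real systems, because they tend to collapse toward a single plausible answer. The core challenge is that although stronger reasoning or greater output diversity can help, models lack calibration to the target distribution and struggle to generate samples consistent with it. The paper proposes the UnpredictaBench benchmark, whose 448 problems cover canonical statistical distributions, distributions induced by stochastic programs, and random processes described in natural language, and which systematically evaluates how well model-generated samples approximate black-box target distributions. It also introduces KS@N, a general-purpose metric based on the Kolmogorov-Smirnov (KS) test that measures agreement between model samples and ground-truth samples at sample size N, with larger N being harder. Experiments on open and proprietary models show a very wide spread of KS@100 scores, and no model exceeds 40%, which reveals substantial headroom in distributional sampling as a capability.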

Link: https://arxiv.org/abs/2606.06622
Authors: Amirhossein Abaskohi,Amirhossein Dabiriaghdam,Liang Luo,Ellie Dingqiao Wen,Lele Wang,Giuseppe Carenini,Peter West
Affiliation: University of British Columbia
Categories: Computation and Language (cs.CL)
Comments:

Abstract:We introduce UnpredictaBench, an evaluation that tests the ability of large language models (LLMs) to capture true underlying distributions. As LLMs are increasingly used as substitutes for other entities (e.g., for humans in economic simulations), the tendency of many models to collapse towards a single plausible answer means a failure to capture the unpredictability of real systems. Recent work on improving output diversity is insufficient for this setting: simulation requires samples that are calibrated to a target distribution, not merely varied outputs. UnpredictaBench isolates a simplified but fundamental version of this problem: sampling outcomes from individual target distributions, including canonical statistical distributions, distributions induced by stochastic programs, and natural-language scenarios that describe random processes. We introduce 448 such problems together with KS@N, a general-purpose evaluation metric that quantifies how well a model's outputs approximate black-box target distributions via the Kolmogorov-Smirnov statistical test. This is the rate at which we fail to reject model samples of size N against ground-truth samples, with larger N indicating greater difficulty. Tested across open and proprietary models, we find a large spread in distributional capabilities. For instance, when models generate samples of size 100 (KS@100, our standard metric), scores range from near 0 to over 20%. No model is able to achieve over 40% at KS@100, showing significant headroom in distributional sampling as a capability. Although adding reasoning can somewhat increase scores, we find no immediate solution for this issue. UnpredictaBench shows that even simple distributional simulation remains challenging, making it a necessary first step toward using LLMs as stand-ins for complex systems.
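
A KS@N-style score, read as "the fraction of problems on which a two-sample Kolmogorov-Smirnov test fails to reject the model's samples", could be computed roughly as follows. This is a sketch under stated assumptions: the significance level of 0.05, the large-sample critical value, and all function names are illustrative choices, not the benchmark's published procedure.

```python
import math


def ks_statistic(xs, ys):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    n, m = len(xs), len(ys)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        v = min(xs[i], ys[j])
        # Advance both pointers past every value <= v so ties are handled together.
        while i < n and xs[i] <= v:
            i += 1
        while j < m and ys[j] <= v:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d


def fails_to_reject(xs, ys, alpha=0.05):
    """True when the KS test does not reject 'same distribution' at level alpha.

    Uses the asymptotic critical value c(alpha) * sqrt((n + m) / (n * m)),
    with c(alpha) = sqrt(-ln(alpha / 2) / 2); alpha = 0.05 is an assumption.
    """
    n, m = len(xs), len(ys)
    c = math.sqrt(-0.5 * math.log(alpha / 2))
    return ks_statistic(xs, ys) <= c * math.sqrt((n + m) / (n * m))


def ks_at_n(problems, alpha=0.05):
    """Fraction of (model_samples, reference_samples) pairs that pass the test."""
    passed = sum(fails_to_reject(model, ref, alpha) for model, ref in problems)
    return passed / len(problems)
```

A model that collapses to one plausible value fails this check even when that value is a perfectly reasonable draw, because the test compares whole empirical distributions and not individual outputs.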

[NLP-74] Improving Cross-Lingual Factual Recall via Consistency-Driven Reinforcement Learning EMNLP2026

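[Quick Read]: This paper addresses cross-lingual factual inconsistency in large language models: models acquire rich world knowledge from English data but cannot reliably express it in other languages. To study and mitigate the problem, the authors build PolyFact, a large-scale parallel multilingual factual QA dataset that covers 12 typologically diverse languages and contains 100K Wikidata-grounded facts. For Qwen-2.5-7B and OLMo-2-1124-7B, the paper compares light continual pretraining (CPT), supervised fine-tuning (SFT), and reinforcement learning with Group Relative Policy Optimization (GRPO) for improving cross-lingual factual recall. GRPO clearly outperforms SFT in both cross-lingual consistency and generalization to unseen languages, while CPT on parallel data brings only limited gains. Mechanistic analysis shows that GRPO reorganizes multilingual routing by reducing language specialization in MLP layers and attention heads, which promotes more shared cross-lingual representations. The key contribution is proposing and validating a GRPO-centered optimization strategy that strengthens knowledge consistency and generalization in multilingual settings.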

Link: https://arxiv.org/abs/2606.06586
Authors: Jonathan von Rad,Louis Arts,George Burgess,Eleftheria Kolokytha,Harry O’Donnell,Ektor Oikonomidis Doumpas,Eduardo Sanchez,Yao Lu,Pontus Stenetorp
Affiliation: University College London, Centre for Artificial Intelligence
Categories: Computation and Language (cs.CL)
Comments: Under Review at EMNLP 2026

Abstract:Large language models (LLMs) trained predominantly on English data encode substantial world knowledge, yet often fail to express it reliably in other languages, a phenomenon known as cross-lingual factual inconsistency. To study and address this, we introduce PolyFact, a large-scale parallel multilingual factual QA dataset containing 100K Wikidata-grounded facts across 12 typologically diverse languages. Using PolyFact, we compare light continual pretraining (CPT), supervised fine-tuning (SFT), and reinforcement learning via Group Relative Policy Optimization (GRPO) for improving cross-lingual factual recall in Qwen-2.5-7B and OLMo-2-1124-7B. We find that GRPO consistently outperforms SFT, improving both cross-lingual consistency and generalization to unseen languages, while CPT on parallel data yields limited additional gains. Mechanistic analyses further show that GRPO reorganizes multilingual routing by reducing language specialization in MLP layers and attention heads, thereby promoting more shared cross-lingual representations. We release our code, models, and dataset.

[NLP-75] Position: Don't Just “Fix it in Post”: A Science of AI Must Study Training Dynamics ICML

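[Quick Read]: The core problem this paper addresses is that current AI research generally treats models as static artifacts fixed after training and overlooks how model behavior evolves during training, so understanding of key properties such as performance, bias, robustness, and safety remains at the level of post-hoc analysis. The underlying gap is a lack of scientific understanding of training dynamics. The proposed remedy is a "science of AI" that starts from modeling training as a dynamic process and studies how data, objectives, architectures, and optimization dynamics jointly shape the system's evolution. Such a science should make it possible to predict final model capabilities from early training signals, intervene when training trajectories deviate from expectations, and ultimately design training procedures that reliably produce desired properties. The paper notes that scaling laws have succeeded at predicting loss, and that comparable theories are needed for more complex properties such as capabilities, bias, robustness, and safety. To that end it draws on the history and philosophy of science, examines progress in mechanistic interpretability, fairness, memorization, and simplicity bias, and identifies concrete open problems.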

Link: https://arxiv.org/abs/2606.06533
Authors: Stella Biderman,Mohammad Aflah Khan,Niloofar Mireshghallah,Catherine Arnett,Fazl Barez,Naomi Saphra
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted as an oral to the ICML: this https URL

Abstract:What would it mean to have a scientific understanding of AI? Models are not static objects: they are snapshots of time-evolving processes shaped by data, objectives, architectures, and optimization dynamics. Yet much of AI research treats models as fixed artifacts, analyzing behaviors after training rather than asking why they emerge. This position paper argues that a science of AI must move beyond post-hoc fixes and study the training dynamics that produce model behavior. Such a science should support progressively stronger forms of understanding: predicting outcomes from early training signals, intervening when trajectories go wrong, and ultimately designing training procedures that more reliably produce desired properties. Scaling laws have made prediction routine for loss; the challenge is extending this success to capabilities, biases, robustness, and safety-relevant behaviors. We articulate requirements for such theories grounded in the history and philosophy of science, examine progress in mechanistic interpretability, fairness, memorization, and simplicity bias, and identify concrete open problems.

[NLP-76] Human Adults and LLMs as Scientists: Who Benefits from Active Exploration?

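[Quick Read]: This paper addresses a cognitive bias in how adults identify conjunctive causal rules: under passive observation they struggle to infer causal relationships in which several causes must be present together to produce an effect, while they do better with disjunctive rules. This is known as the "conjunctive handicap". The study asks whether the bias persists when learners are given agency through active exploration. Using a modified "blicket detector" task in which adults freely intervene to probe the causal structure, it finds that active exploration substantially improves adults' reasoning about conjunctive rules, although conjunctive rules still take more tests to infer correctly. The study also compares humans with a range of Large Language Models (LLMs) on the same task: some state-of-the-art models approach human-level accuracy on hypothesis inference, but they explore less efficiently and show a similar conjunctive-disjunctive performance gap, which suggests that current generative AI remains limited in active reasoning and in modeling exploratory behavior.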

Link: https://arxiv.org/abs/2606.06464
Authors: Mandana Samiei,Eunice Yiu,Anthony GX-Chen,Dongyan Lin,Jocelyn Shen,Blake A. Richards,Alison Gopnik,Doina Precup
Affiliation: Mila - Quebec AI Institute; McGill University; University of California Berkeley; New York University; Meta FAIR; MIT Media Lab; Montreal Neurological Institute
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at the 48th Annual Conference of the Cognitive Science Society (CogSci 2026)

Abstract:A long-standing finding in the causal learning literature is that adults struggle to identify conjunctive causal rules, where an effect requires the simultaneous presence of multiple causes, while performing better in disjunctive settings. However, most demonstrations of this "conjunctive handicap" rely on passive observation paradigms with limited evidence, where learners have no control over evidence generation. This paper asks whether this bias persists when adults are granted agency through active exploration. Using a modified "blicket detector" task, adult participants freely intervened to identify causal objects under conjunctive or disjunctive rule structures. We show that active exploration substantially improves adults’ conjunctive causal reasoning, although conjunctive rules still require more tests to infer than disjunctive rules. We further compare human performance to a range of large language models in the same setting. While some state-of-the-art models approach human-level performance on hypothesis inference accuracy, they often exhibit less efficient exploration strategies and similar conjunctive-disjunctive performance gaps.

[NLP-77] Severity-Aware Curriculum Learning with Multi-Model Response Selection for Medical Text Generation

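[Quick Read]: This paper addresses the difficulty existing large language models have in adapting response quality and contextual relevance to case severity in medical question answering; their performance is unstable across the progressive complexity from mild to critical cases. The core solution is a severity-aware multi-model framework that combines a curriculum learning strategy with relevance-based multi-model response selection. The framework uses a three-stage curriculum: several independent large language models are each trained in sequence on mild, moderate, and critical cases so that they accumulate domain knowledge progressively. At inference time, all models generate candidate responses in parallel, and the best output is selected by relevance assessment. On the MAQA dataset, the method reaches a BERTScore of 90.30% after fine-tuning, clearly outperforming the baseline models, which supports the effectiveness of combining curriculum learning with multi-model response selection for improving the accuracy and contextual fit of medical text generation.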

Link: https://arxiv.org/abs/2606.05510
Authors: Ahmed Alansary,Molham Mohamed,Ali Hamdi
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 6 pages, 3 figures, IMSA2026

Abstract:Telehealth systems have become increasingly important for delivering accessible and timely medical information. Existing large language models often struggle to provide consistent and contextually appropriate medical responses across varying levels of case severity. This limitation highlights the need for models that can effectively adapt to the progressive complexity in medical queries. To address this challenge, we introduce a severity-aware multi-model framework that integrates curriculum training strategy with relevance-based response selection. The proposed framework employs a three-stage curriculum learning strategy, where each model is trained sequentially on mild, moderate, and critical cases to progressively acquire domain knowledge. The approach utilizes five large language models, each independently trained under the same curriculum scheme. During inference, all models generate candidate responses, and the most appropriate response is selected as the final output. The framework is trained and evaluated on the MAQA dataset, which provides annotated medical question-answer pairs. Experimental results evaluated using BERTScore demonstrate that the proposed method achieves superior performance compared to both baseline and fine-tuned models, attaining 86.71% in the baseline setting and 90.30% after fine-tuning. These results highlight the effectiveness of combining curriculum learning with multi-model response selection in improving response quality and relevance in medical text generation.

[NLP-78] Zero-Shot Embedding Drift Detection: A Lightweight Defense Against Prompt Injections in LLMs NEURIPS2025

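[Quick Read]: This paper addresses prompt injection attacks on deployed Large Language Models (LLMs). These attacks use indirect input channels such as emails and user-generated content to bypass alignment safeguards and induce harmful or unintended outputs. Despite progress in alignment, mainstream LLMs remain broadly vulnerable, so a general and efficient detection method that needs no model-specific fine-tuning is required. The key to the solution is Zero-Shot Embedding Drift Detection (ZEDD), which identifies direct and indirect prompt injections by quantifying the semantic shift (embedding drift) between benign and suspect inputs in embedding space. ZEDD does not depend on model internals, prior knowledge of attack types, or task-specific training; it uses only adversarial-clean prompt pairs and computes the embedding difference with cosine similarity to capture the subtle semantic manipulation found in real-world attacks. Experiments on LLMail-Inject, a comprehensive dataset assembled and re-annotated by the authors that covers five attack categories, support embedding drift as a transferable and robust detection signal. Across architectures including Llama 3, Qwen 2, and Mistral, the method achieves detection accuracy above 93% with a false positive rate below 3%, offering a lightweight, scalable defense layer that integrates into existing LLM systems.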

Link: https://arxiv.org/abs/2601.12359
Authors: Anirudh Sekar,Mrinal Agarwal,Rachel Sharma,Akitsugu Tanaka,Jasmine Zhang,Arjun Damerla,Kevin Zhu
Affiliation: Algoverse AI Research; University of California, Berkeley
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted to NeurIPS 2025 Lock-LLM Workshop

Abstract:Prompt injection attacks have become an increasing vulnerability for LLM applications, where adversarial prompts exploit indirect input channels such as emails or user-generated content to circumvent alignment safeguards and induce harmful or unintended outputs. Despite advances in alignment, even state-of-the-art LLMs remain broadly vulnerable to adversarial prompts, underscoring the urgent need for robust, productive, and generalizable detection mechanisms beyond inefficient, model-specific patches. In this work, we propose Zero-Shot Embedding Drift Detection (ZEDD), a lightweight, low-engineering-overhead framework that identifies both direct and indirect prompt injection attempts by quantifying semantic shifts in embedding space between benign and suspect inputs. ZEDD operates without requiring access to model internals, prior knowledge of attack types, or task-specific retraining, enabling efficient zero-shot deployment across diverse LLM architectures. Our method uses adversarial-clean prompt pairs and measures embedding drift via cosine similarity to capture subtle adversarial manipulations inherent to real-world injection attacks. To ensure robust evaluation, we assemble and re-annotate the comprehensive LLMail-Inject dataset spanning five injection categories derived from publicly available sources. Extensive experiments demonstrate that embedding drift is a robust and transferable signal, outperforming traditional methods in detection accuracy and operational efficiency. With greater than 93% accuracy in classifying prompt injections across model architectures like Llama 3, Qwen 2, and Mistral and a false positive rate of 3%, our approach offers a lightweight, scalable defense layer that integrates into existing LLM pipelines, addressing a critical gap in securing LLM-powered systems to withstand adaptive adversarial threats.
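
The drift measurement can be illustrated with plain cosine similarity between the embedding of a clean prompt and the embedding of its suspect counterpart. In this sketch the embeddings are taken as given lists of floats; the definition of drift as one minus cosine similarity, the threshold value, and the function names are assumptions for illustration, not the paper's calibrated procedure.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def embedding_drift(clean_vec, suspect_vec):
    """Drift as 1 - cosine similarity: 0 for identical directions, larger for bigger shifts."""
    return 1.0 - cosine_similarity(clean_vec, suspect_vec)


def flag_injection(clean_vec, suspect_vec, threshold=0.3):
    """Flag the suspect input when drift exceeds a (hypothetical) threshold."""
    return embedding_drift(clean_vec, suspect_vec) > threshold
```

In practice the threshold would have to be chosen on held-out clean and adversarial pairs, since a useful cut-off depends on the embedding model.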

[NLP-79] Multiscale POD of Transformer Attention Fields: Scale-Selective Analysis via Morlet Scalogram

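[Quick Read]: This paper addresses the difficulty of quantifying the scale dependence and hierarchical complexity of attention in large language models, in particular how to identify energetically dominant multiscale patterns in the attention field. The core solution is scale-selective Proper Orthogonal Decomposition (POD). Inspired by the use of POD in the statistical analysis of turbulence, the method applies the Morlet continuous wavelet transform to identify the dominant temporal scales of the attention lag structure across a document ensemble, then performs POD on the ensemble of attention fields at each scale to extract the energetically dominant modes. By the POD optimality theorem (Theorem 1), the extracted modes minimize the average L² reconstruction error over the ensemble, which gives a data-driven effective rank for each layer; a spectral concentration index defined from the POD eigenvalue decay rate distinguishes the attention field complexity of different layers. The method requires no architectural modification and no linguistic annotation: ensemble statistics of the attention field alone reveal a layer-wise organization that shifts from fine scales in early layers to coarse scales in later layers, giving an unsupervised and interpretable analysis of the internal structure of attention.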

Link: https://arxiv.org/abs/2606.06573
Authors: Athanasios Zeris
Affiliation: Independent Researcher, Athens, Greece.
Categories: Fluid Dynamics (physics.flu-dyn); Computation and Language (cs.CL); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: 23 pages, 3 figures, 4 tables

Abstract:We introduce scale-selective Proper Orthogonal Decomposition (POD) for transformer attention fields, inspired by the use of POD for extracting energetically dominant modes from turbulent flow ensembles. The Morlet continuous wavelet transform identifies dominant temporal scales in the attention lag structure across a document ensemble; POD then extracts the energetically dominant modes at each scale from the ensemble of attention fields. The resulting modes reveal layer-dependent scale organisation, with early layers emphasising fine scales and later layers shifting toward coarser scales. We define a spectral concentration index from the POD eigenvalue decay rate and show empirically that it differentiates layers by their attention field complexity. By the classical POD optimality theorem, the extracted modes minimise the average L2 reconstruction error over the ensemble (Theorem 1), giving a data-driven effective rank for each layer. The method requires no architectural modification and no linguistic annotations: dominant attention patterns emerge from ensemble statistics alone. The turbulence analogy is structural rather than physical: we borrow ensemble covariance and modal analysis, not fluid dynamics itself.
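
For readers unfamiliar with POD, the classical optimality property that the abstract invokes can be written as below. The notation (snapshots $a_i$, modes $\phi_k$, eigenvalues $\lambda_k$) is assumed here for illustration and may differ from the paper's; the energy fraction $E_r$ is a common convention for choosing a rank and is not necessarily the paper's spectral concentration index.

```latex
% Snapshots a_1, ..., a_M in R^d: flattened attention fields at one scale (notation assumed).
C = \frac{1}{M} \sum_{i=1}^{M} a_i a_i^{\top}, \qquad
C \phi_k = \lambda_k \phi_k, \qquad
\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d \ge 0 .

% Optimality: for every r, the leading r eigenvectors solve
\{\phi_k\}_{k=1}^{r} \in \arg\min_{\{\psi_k\}\ \text{orthonormal}}
\frac{1}{M} \sum_{i=1}^{M} \Bigl\| a_i - \sum_{k=1}^{r} (a_i^{\top}\psi_k)\,\psi_k \Bigr\|_2^2 ,
\qquad \text{with minimum value } \sum_{k>r} \lambda_k .

% Cumulative energy fraction captured by the first r modes
E_r = \frac{\sum_{k=1}^{r} \lambda_k}{\sum_{k=1}^{d} \lambda_k}.
```

Under this convention, one possible effective rank for a layer is the smallest $r$ for which $E_r$ exceeds a chosen fraction such as 0.9; that threshold is an illustrative choice.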

Information Retrieval

[IR-0] Your UnEmbedding Matrix is Secretly a Feature Lens for Text Embeddings

Link: https://arxiv.org/abs/2606.07502
Authors: Songhao Wu,Zhongxin Chen,Yuxuan Liu,Heng Cui,Cong Li,Rui Yan
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: preprint

Abstract:Large language models exhibit impressive zero-shot capabilities across a wide range of downstream tasks. However, they struggle to function as off-the-shelf embedding models, leading to suboptimal performance on massive text embedding benchmarks. In this paper, we identify a potential cause underlying this deficiency. Our motivation stems from an unexpected observation: text embeddings tend to align with frequent but uninformative tokens when projected onto the vocabulary space. We argue that this excessive expression of high-frequency tokens suppresses the model’s ability to capture nuanced semantics. To address this, we introduce EmbedFilter, a simple linear transformation designed to refine text embeddings derived from LLMs directly. Specifically, we uncover that the unembedding matrix within LLMs encodes a latent space that is actively writing these frequent tokens into embedding space. By filtering out this subspace, EmbedFilter suppresses the influence of high-frequency tokens, thereby enhancing semantic representations. As a compelling byproduct, this enables an inherent dimensionality reduction, lowering index storage and speeding up retrieval while fully preserving the refined embedding quality. Our experiments across multiple LLM backbones demonstrate that LLMs equipped with EmbedFilter achieve superior zero-shot downstream performance even with significantly reduced embedding dimensions. We hope our findings provide deeper insights into the mechanisms of LLM-based representations and inspire more principled designs to improve text embedding training. Our code is available at this https URL.

[IR-1] Bradley-Terry Rankings for Recommender Systems Across Dataset Taxonomies KDD’26

Link: https://arxiv.org/abs/2606.07492
Authors: Ekaterina Grishina,Stepan Kuznetsov,Askar Tsyganov,Ilya Ivanov,Daria Korovaitceva,Margarita Rusanova,Uliana Parkina,Alexander Derevyagin,Evgeny Frolov,Sergey Samsonov,Anton Lysenko
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: KDD’26

Abstract:The ranking of recommendation algorithms is a challenging problem since model performance is sensitive to dataset characteristics such as sparsity, sequential structure, and scale. This drives a demand for a proper methodology for fair comparison between algorithms. Naive aggregation of performance metrics (e.g., averaging NDCG over benchmarks) can yield misleading rankings, undermining practical selection. To address this problem, we introduce a novel, data-driven ranking methodology based on the Bradley-Terry (BT) model. We demonstrate that the obtained ranking depends on key dataset statistics. Additionally, we propose a novel metric for evaluating ranking consistency and demonstrate robustness of our ranking to incomplete data. Finally, we introduce a dataset-specific methodology for ranking algorithms on unseen datasets without running the models, relying on extensions of the Bradley-Terry framework, including BT trees and BT models with covariates.
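
For readers unfamiliar with the Bradley-Terry model: it assigns each competitor a strength p_i > 0 and models the probability that i beats j as p_i / (p_i + p_j). The sketch below fits the strengths with the classic minorisation-maximisation update. It is a generic illustration, not the paper's pipeline: the `wins` matrix is hypothetical, and the BT trees and covariate extensions mentioned in the abstract are not covered.

```python
def fit_bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths with the classic MM update.

    wins[i][j] = number of comparisons (e.g. datasets) where i beat j.
    Assumes every competitor has at least one win and one loss.
    """
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n) if j != i
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        scale = sum(new_p)
        p = [v / scale for v in new_p]  # strengths are scale-invariant
    return p


# Hypothetical pairwise outcomes for three recommenders.
wins = [
    [0, 7, 9],
    [3, 0, 6],
    [1, 4, 0],
]
strengths = fit_bradley_terry(wins)
ranking = sorted(range(len(strengths)), key=lambda i: -strengths[i])
print(ranking)
```

Because the fit uses pairwise outcomes rather than averaged metric values, one benchmark with an unusually large metric range cannot dominate the resulting order.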

[IR-2] PaperFlow: Profiling, Recommending, and Adapting Across Daily Paper Streams

Link: https://arxiv.org/abs/2606.07454
Authors: Fuqiang Wang,Song Tan,Zheng Guo,Jiaohao Fu,Xinglong Xu,Bihui Yu,Jie Dong,Zheng Sun,Siyuan Li,Jingxuan Wei,Cheng Tan
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 48 pages, 13 figures, 22 tables

Abstract:Scientific paper recommendation is typically evaluated as static ranking over a fixed candidate set, yet real scientific reading unfolds as a daily, longitudinal process in which interests shift and feedback accumulates. We introduce PaperFlow, a framework that organizes it into three coupled stages: Profiling, which constructs and maintains a structured, inspectable scholarly profile from heterogeneous cold-start evidence; Recommending, which ranks each date-specific paper stream through multi-signal aggregation under a fixed display budget; and Adapting, which updates user state from semantically distinct feedback signals and models interest drift across days. We further define a longitudinal user-day benchmark that fixes users, dates, candidate pools, visible inputs, and hidden simulated relevance labels under a shared temporal information boundary. The benchmark contains 24 simulated research users, 50 daily paper streams, 1,200 user-day episodes, 20,727 unique papers, and 497,448 episode-paper records. We additionally specify a blind human-evaluation protocol to validate alignment between automatic metrics and expert judgments. Experiments against five scientific recommendation baselines show that PaperFlow achieves the strongest oracle-based ranking, the highest behavioral alignment with simulated reading selections, and the best blind human-evaluation score.

[IR-3] Gated Bidirectional Linear Attention for Generative Retrieval SIGIR2026

Link: https://arxiv.org/abs/2606.07317
Authors: Artem Matveev,Vladislav Tytskiy,Sergei Makeev,Sergei Liamaev
Categories: Information Retrieval (cs.IR)
Comments: 5 pages, 2 figures, 7 tables. Accepted at SIGIR 2026

Abstract:In recommender systems, generative retrieval typically uses an encoder-decoder setup: an encoder processes a user interaction history, and an autoregressive decoder then generates recommended items. In large-scale streaming services, active users accumulate very long histories over time. As histories grow, the encoder becomes a major latency bottleneck because softmax attention scales quadratically with sequence length. In our experiments, using bidirectional attention in the encoder substantially improves quality. However, most sub-quadratic attention methods focus on causal attention. We propose Gated Bidirectional Linear Attention (GBLA), a linear-time bidirectional attention layer that extends kernelized linear attention with three lightweight components: local causal mixing (Conv1D), sequence-level key gating for soft forgetting, and a gated RMSNorm output. On a large-scale Yandex Music dataset, a hybrid encoder that interleaves self-attention (SA) and GBLA in a 1:2 ratio (one SA block followed by two GBLA blocks) matches bidirectional self-attention quality. On H100 GPUs, GBLA reaches up to an 8.2× single-layer speedup at a history length of 32768, compared to FlashAttention-v3. Finally, we show that the same hybrid design generalizes beyond our proprietary setting, consistently preserving self-attention retrieval quality on public Amazon benchmarks.

Related DOI: https://doi.org/10.1145/3805712.3808495
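
The linear-time claim rests on kernelized linear attention, which the following sketch illustrates in its plain bidirectional form. It is not the authors' GBLA layer: the Conv1D mixing, key gating, and gated RMSNorm are omitted, and the `elu(x) + 1` feature map is an assumption (a common choice in the linear-attention literature). The point is that summing key-value products once gives the same output as the pairwise computation, at a cost linear in sequence length.

```python
import math


def phi(x):
    """elu(x) + 1, a positive feature map (assumed here for illustration)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def linear_attention(Q, K, V):
    """Bidirectional kernelized attention in O(n * d * d_v) time."""
    d, dv = len(K[0]), len(V[0])
    S = [[0.0] * dv for _ in range(d)]  # sum_j phi(k_j) v_j^T
    z = [0.0] * d                       # sum_j phi(k_j)
    for k, v in zip(K, V):
        fk = phi(k)
        for a in range(d):
            z[a] += fk[a]
            for b in range(dv):
                S[a][b] += fk[a] * v[b]
    out = []
    for q in Q:
        fq = phi(q)
        norm = sum(fq[a] * z[a] for a in range(d))
        out.append([sum(fq[a] * S[a][b] for a in range(d)) / norm
                    for b in range(dv)])
    return out


def quadratic_reference(Q, K, V):
    """Same quantity computed pairwise in O(n^2) time, for checking."""
    out = []
    for q in Q:
        fq = phi(q)
        w = [sum(a * b for a, b in zip(fq, phi(k))) for k in K]
        total = sum(w)
        out.append([sum(wj * v[b] for wj, v in zip(w, V)) / total
                    for b in range(len(V[0]))])
    return out


Q = [[0.1, -0.3], [0.5, 0.2], [-0.4, 0.9]]
K = [[0.2, 0.1], [-0.6, 0.4], [0.3, -0.2]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
fast, slow = linear_attention(Q, K, V), quadratic_reference(Q, K, V)
print(fast)
```

Because every query reads the same summed state `S` and `z`, each position attends to the whole sequence, which is what makes this form bidirectional rather than causal.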

[IR-4] Constrained Dominant Sets for Multimodal Document Question Answering

Link: https://arxiv.org/abs/2606.07252
Authors: Ambuj Mehrish,Sebatiano Vascon
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:Long multimodal document question answering is limited by which evidence reaches the reader, rather than by the quantity retrieved. In lengthy documents, findings often recur across figures, captions, and introductory sentences, causing similarity-based retrievers in modern multimodal retrieval-augmented generation (RAG) systems to allocate resources to near-duplicates while overlooking complementary evidence. This work introduces a retriever that selects evidence as a Constrained Dominant Set (CDS) on a query-augmented affinity graph, offering three advantages that similarity ranking does not. First, the query is encoded as a hard structural constraint, ensuring that every selected element is directly connected to the question through the cluster anchor. Second, the relevance-redundancy balance is determined automatically by a spectral bound, eliminating the need for manually tuned trade-offs required by diversity-aware selectors. Third, the selection process achieves a global equilibrium via replicator dynamics, thereby avoiding the distortions introduced by greedy heuristics. The method is inherently graph-based and does not require training. Using a Qwen3-VL-32B reader, CDS establishes a new state of the art on VisDoMBench (66.99 average) and improves over the no-retrieval baseline by 37.1 points on VisDoMBench and 4.8 on MMLongBench-Doc.
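
The selection step can be sketched with the standard constrained dominant-set formulation: maximise x^T (A - alpha * I_free) x over the simplex, where I_free penalises the diagonal of every node outside the constraint set, and run replicator dynamics. This is a generic illustration under stated assumptions, not the paper's retriever: the toy affinity matrix is hypothetical, and the maximum row sum is used as a simple upper bound on the spectral threshold for alpha.

```python
def constrained_dominant_set(A, constraint, iters=500, eps=0.1):
    """Replicator dynamics on the constrained dominant-set objective.

    A: symmetric non-negative affinity matrix with zero diagonal.
    constraint: indices that must appear in the result (e.g. the query node).
    Assumes at least one node lies outside the constraint set.
    """
    n = len(A)
    free = [i for i in range(n) if i not in constraint]
    # alpha must exceed the largest eigenvalue of A restricted to the free
    # nodes; the maximum row sum is a simple upper bound on that eigenvalue.
    alpha = max(sum(A[i][j] for j in free) for i in free) + eps
    # Penalise the diagonal of free nodes, then shift every entry by alpha
    # so payoffs stay non-negative (a constant shift keeps the maximisers).
    B = [[A[i][j] + alpha for j in range(n)] for i in range(n)]
    for i in free:
        B[i][i] = 0.0  # 0 - alpha + alpha
    x = [1.0 / n] * n
    for _ in range(iters):
        Bx = [sum(B[i][j] * x[j] for j in range(n)) for i in range(n)]
        xBx = sum(x[i] * Bx[i] for i in range(n))
        x = [x[i] * Bx[i] / xBx for i in range(n)]
    return x


# Hypothetical graph: node 0 is the query, nodes 1-2 are related evidence,
# nodes 3-4 are a tight near-duplicate pair weakly tied to the query.
A = [
    [0.0, 0.9, 0.9, 0.1, 0.1],
    [0.9, 0.0, 0.8, 0.1, 0.1],
    [0.9, 0.8, 0.0, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.0, 0.95],
    [0.1, 0.1, 0.1, 0.95, 0.0],
]
x = constrained_dominant_set(A, {0})
selected = [i for i, w in enumerate(x) if w > 1e-3]
print(selected)
```

In this toy graph the near-duplicate pair has the strongest single edge, yet it drops out of the support because it is only weakly connected to the query node.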

[IR-5] FLOWREADER: Min-Cost Flow Optimization for Multi-Modal Long Document QA

Link: https://arxiv.org/abs/2606.07235
Authors: Ambuj Mehrish,Sebatiano Vascon
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Abstract:Long, multimodal documents force retrieval-augmented systems to assemble answers from evidence fragmented across text, tables, and slides: broken across cells in a long table, spread over multiple slides, or split between a figure and its discussion. Top-k chunk retrieval treats each fragment independently and cannot represent how evidence connects. We introduce FLOWREADER, which reframes evidence assembly as a min-cost flow problem on a multimodal node graph: a single scoring vector h controls source selection (via MMR), sink selection (via a length-aware answerability proxy), and the costs and capacities of every edge. The optimal flow is decomposed into candidate evidence paths, a compact non-redundant subset is selected by entropy-regularized replicator dynamics, and parallel VLM workers under a dual-process gate produce the answer with a single System-2 refinement pass triggered when answer consistency is low or the routed flow is strained. On VisDoMBench, FLOWREADER is best on the two subsets dominated by fragmented evidence, PaperTab (58.40, +1.30 over G^2-Reader) and SlideVQA (72.93, +0.62), and competitive on SPIQA, FetaTab, and SciGraphQA. Macro-averaged across all five subsets, FLOWREADER (65.47) is within 0.74 of the strongest baseline (G^2-Reader, 66.21). Overall, these results show that min-cost flow performs well on fragmented multimodal evidence, where top-k retrieval fails. It also provides a unified way to control scoring, routing, selection, and adaptive compute together.

[IR-6] HKVM-RAG: Key-Value-Separated Hypergraph Evidence Organization for Multi-Hop RAG ICDE2027

Link: https://arxiv.org/abs/2606.07218
Authors: Mingyu Zhang,Ying Ma
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: Submitted to ICDE 2027. 13 pages, 3 figures

Abstract:Multi-hop RAG poses a data-engineering problem beyond passage matching: under fixed retrieval budgets, a system must organize retrieved text into evidence units that expose answer chains. Dense retrievers score passages independently, while graph-based memories make associations explicit but often rely on pairwise or entity-centered keys that fragment multi-hop evidence. We present HKVM-RAG, a key-value-separated evidence-organization layer. It assembles answer-path hyperedges from cached passage-level LLM evidence tuples and uses them as retrieval keys, while retaining passage text as answer values. To isolate key-space design, our fixed-substrate protocol holds the tuple cache, candidate passages, reader, and evaluation budget constant across pairwise graph and hypergraph variants. Weighted hypergraph key-value retrieval improves over KG-PPR by +3.426 F1 on 2WikiMultiHopQA and +3.592 F1 on MuSiQue; HotpotQA shows that higher structured support coverage need not yield standalone answer-F1 gains. We therefore study WHG-KV as an evidence-control signal rather than a dense-retrieval replacement. Oracle and train-to-dev analyses identify support selection as repairable, and a dense-aware controller combines frozen ColBERTv2 and HKVM rank/score features using out-of-fold HKVM predictions. It reaches 88.846, 65.073, and 85.810 F1 on the three benchmarks, improving over ColBERTv2 by +11.084, +6.763, and +5.966 F1. Source-level ablations show that matched non-WHG structured signals do not match the WHG-KV gains. These results provide bounded evidence that key-value-separated hypergraph organization can serve as a reusable evidence-control mechanism for multi-hop RAG.

[IR-7] RISE: A Rust Library for Inverted Index Search Engines

Link: https://arxiv.org/abs/2606.07187
Authors: Angelo Savino,Rossano Venturini
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:Inverted indexes are a crucial data structure for efficient information retrieval in large text corpora. They enable fast full-text search by mapping each term to the documents in which it appears, on top of which efficient algorithms quickly retrieve the documents relevant to a user query. We present RISE, a novel inverted index library implemented in Rust, designed to deliver high performance and efficiency for information retrieval tasks. RISE leverages Rust’s safety and performance to provide a robust solution for building and querying inverted indexes, while offering accessible extensibility through its expressive trait system. While developing RISE, we revisited the inverted-index literature, thereby reproducing numerous prior works using this new test bench. We evaluated RISE against existing libraries, demonstrating competitive query performance across various datasets and workloads, with speedups of up to 2x over the current state of the art. Our results indicate that RISE is a promising tool for researchers and practitioners in the field of information retrieval.
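
As a reminder of the data structure the abstract describes, here is a minimal inverted index with conjunctive (AND) queries. It is a textbook sketch in Python, not RISE's Rust API: all function names are hypothetical, and real libraries add posting-list compression, scoring, and dynamic pruning on top of this idea.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to a sorted list of ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def intersect(a, b):
    """Merge-intersect two sorted posting lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def search_and(index, query):
    """Ids of documents containing every query term."""
    # Start from the shortest list so intermediate results stay small.
    lists = sorted((index.get(t, []) for t in query.lower().split()), key=len)
    if not lists:
        return []
    result = lists[0]
    for plist in lists[1:]:
        result = intersect(result, plist)
    return result


docs = [
    "rust inverted index library",
    "inverted index compression",
    "rust trait system",
]
index = build_index(docs)
print(search_and(index, "inverted index"))  # [0, 1]
print(search_and(index, "rust index"))      # [0]
```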

[IR-8] Beyond Matching: Category-Guided Latent Intent Reasoning for Generative Retrieval in E-Commerce

Link: https://arxiv.org/abs/2606.07075
Authors: Fuwei Zhang,Xiaoyu Liu,Jiajie Jin,Jiale Mao,Wei Chen,Dongbo Xi,Yifan Yang,Peng Yan,Zichao Hao,Zhao Zhang,Fuzhen Zhuang
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:Generative retrieval offers a new paradigm for e-commerce search by mapping user queries directly to product Semantic Identifiers (SIDs). However, e-commerce queries are often short, noisy, attribute-heavy, and associated with multiple category-consistent products, creating a substantial representation gap between natural-language shopping intent and artificially constructed item SIDs. Explicit Chain-of-Thought (CoT) reasoning can help bridge this gap, but its extra generation cost is difficult to reconcile with the low-latency requirements of online e-commerce systems. To address this challenge, we propose CaLIR (Category-guided Latent Intent Reasoning), a category-guided latent intent reasoning framework for e-commerce generative retrieval. Rather than generating explicit textual rationales, CaLIR learns continuous latent intent states before SID decoding and uses product category hierarchies as a natural scaffold for coarse-to-fine intent reasoning. Specifically, we introduce hierarchical semantic reasoning to align latent states with category-level shopping intent, and query-wise reasoning enhancement to model diverse intent paths under multi-positive queries. CaLIR further combines a query-specific dynamic prefix trie, assembled from pre-indexed category-level tries, with reasoning-aware constrained decoding. Experiments on multilingual e-commerce search datasets show that CaLIR achieves a better balance between retrieval effectiveness and inference efficiency than existing methods, while also demonstrating transferability and robustness across induced hierarchies and different generative backbones.

[IR-9] Decision-Theoretic Stopping Rules for Document Screening

Link: https://arxiv.org/abs/2606.07071
Authors: Aaron H.A. Fletcher,Mark Stevenson
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:Deciding when to stop reviewing the results of a search is a common problem with multiple applications. Existing stopping rules developed within Technology-Assisted Review (TAR) aim to achieve a pre-specified recall target and do not take into account the reason for examining the results, potentially leading to sub-optimal recommendations. This paper applies decision theory to the problem and uses it to derive three practical stopping policies based on the Expected Value of Perfect Information. The approach is applied to two professional search tasks: patent examining and systematic reviewing. Experiments on CLEF-IP and medical systematic review datasets show that the proposed approach generally produces more appropriate stopping decisions than existing methods, as demonstrated by higher net utility under the evaluated cost and payoff settings.
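
The Expected Value of Perfect Information (EVPI) is the gap between the expected payoff when the true state is known before acting and the best expected payoff under current uncertainty. The sketch below computes it for a two-state stop/continue decision. The prior, the payoffs, and the state names are hypothetical; the paper's three stopping policies are not reproduced here.

```python
def evpi(prior, utility):
    """Expected Value of Perfect Information for a discrete decision.

    prior[s]: probability of state s.
    utility[a][s]: payoff of taking action a when the state is s.
    """
    states = list(prior)
    # Best action chosen once, under uncertainty about the state.
    best_without = max(
        sum(prior[s] * utility[a][s] for s in states) for a in utility
    )
    # Best action chosen separately for each state, then averaged.
    best_with = sum(
        prior[s] * max(utility[a][s] for a in utility) for s in states
    )
    return best_with - best_without


# Hypothetical screening decision: are relevant documents still unreviewed?
prior = {"relevant_left": 0.25, "none_left": 0.75}
utility = {
    "stop":     {"relevant_left": -8.0, "none_left": 0.0},
    "continue": {"relevant_left": 4.0, "none_left": -2.0},
}
value = evpi(prior, utility)
print(value)  # 1.5
```

One way to read the result: EVPI is an upper bound on what further information is worth, so if reviewing more documents costs more than this bound, further review cannot pay for itself under the assumed payoffs.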

[IR-10] Meaning in Order, Order in Meaning: Semantic R-precision for Keyphrase Evaluation

Link: https://arxiv.org/abs/2606.07057
Authors: Shamira Venturini,Steffen Kinkel
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments:

Abstract:Evaluating the quality of automatically generated keyphrases remains a complex challenge. Traditional metrics either rely on exact lexical matching or consider semantic similarity while ignoring prediction ranking, both of which misalign with how humans judge informativeness and relevance. We introduce Semantic R-Precision (SemR-p), a novel evaluation metric that integrates semantic similarity into the rank-aware R-Precision framework. Designed from a human-centric perspective and inspired by Information Retrieval metrics, SemR-p rewards semantically relevant keyphrases that appear early in the output list. We conducted extensive analyses to assess its semantic sensitivity, ranking awareness, and discriminative power across models and datasets. The results suggest that SemR-p offers a complementary lens for evaluating keyphrase predictions, helping to better reflect user-centred notions of relevance alongside traditional lexical and semantic matching metrics.

[IR-11] SSRLive: Live Streaming Recommendation with Dynamic Semantic ID

Link: https://arxiv.org/abs/2606.06970
Authors: Teng Shi,Zhaoheng Li,Yuanhang Qu,Yi Liu,Lixiang Lai,Yuning Jiang
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:Live streaming has emerged as one of the fastest-growing forms of online media, enabling instant content broadcasting and real-time engagement between users and streamers. Despite the effectiveness of existing recommendation algorithms in this domain, they often suffer from limited utilization of computational resources, with low FLOPs that hinder further performance enhancement. Generative recommendation techniques, which have gained traction in various industrial tasks, offer a promising avenue for improving live streaming recommendations. However, directly applying generative methods to live streaming is non-trivial due to two major challenges: (1) static semantic IDs (SIDs) cannot reflect the rapidly changing nature of live room content; and (2) generative pipelines generally do not incorporate user–streamer interaction signals (e.g., likes, orders), which are critical for modeling user intent toward both the streamer and showcased products. To address these challenges, we introduce SSRLive: Dynamic Semantic ID-guided Streaming Recommendation for Live platforms. The proposed framework integrates a generative module and a discriminative module in a unified architecture. The generative component employs an encoder-decoder design to produce both static and dynamic SIDs, enabling timely representation of live room content while leveraging multimodal information. The discriminative component refines task-specific representations by combining SIDs with user features, augments them with user-streamer interaction data, and performs multi-task predictions. Online A/B tests in real-world deployment demonstrate tangible benefits: watch time (+3.38%), GMV (+0.72%), follower growth (+3.12%), and interaction volume (+2.92%). These improvements highlight the effectiveness and business value of SSRLive, which is now fully deployed, serving hundreds of millions of active users.

[IR-12] DREAM: Dynamic Refinement of Early Assignment Mappings

Link: https://arxiv.org/abs/2606.06947
Authors: Liwei Guan,Huanjie Wang,Hongwei Zhang,Linxun Chen,Zhaojie Liu
Categories: Information Retrieval (cs.IR)
Comments: 12 pages, 4 figures, 5 tables

Abstract:Generative recommendation advances item retrieval by reformulating it as autoregressive generation of Semantic IDs (SIDs), compact token sequences that encode item semantics. While SIDs offer a strong semantic prior, current SID-based methods assign each item a single static identifier through offline tokenization before sufficient user feedback is observed. For cold-start items, this one-shot commitment produces poorly discriminative codes, generating misaligned paths that remain unrefined because the associated tokens are rarely sampled during training. We identify this early static commitment, not model capacity, as the fundamental cold-start bottleneck in SID-based generative recommendation. To overcome this bottleneck and bridge the disjoint objectives of tokenization and generation, we propose DREAM (Dynamic Refinement of Early Assignment Mappings), a three-stage framework that resolves this flaw through progressive refinement. First, an intent-aware tokenizer rebuilds the SID space through counterfactual contrastive learning, generating a diverse pool of behavior-aligned candidates per cold-start item. Second, the frozen recommendation backbone serves as an evaluator, selecting the most reliable candidate based on multi-context user support without retraining. Third, a dynamic beam mechanism maintains multiple weighted SID hypotheses throughout training and inference, preventing premature collapse to a single assignment. Extensive experiments on three Amazon benchmarks show that DREAM substantially outperforms state-of-the-art generative and sequential baselines on cold-start metrics.

[IR-13] Towards Retrieving Interaction Spaces for Agentic Search

Link: https://arxiv.org/abs/2606.06880
Authors: Shengyao Zhuang,Yuansheng Ni,Hengxin Fun,Jimmy Lin,Xueguang Ma
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:Retrieval for search agents is still inherited from non-agentic information retrieval: a retriever ranks the corpus and the agent reads a small set of returned documents. Recent direct corpus interaction (DCI) work shows that agents can instead interact with the raw corpus through shell tools such as grep and file reads. But unbounded interaction does not scale: every broad shell command is a scan over the whole corpus, and latency degrades sharply as the corpus grows. We argue that the role of retrieval for agentic search is not just to select documents that fit in the LLM context window, but to construct an interaction space: a bounded subset of the corpus the agent can explore with associated tools. Two design consequences follow. The space needs a boundary supplied by retrieval, and the objects within it should be processed for interaction. As a proof of concept, we propose RISE (Retrieving Interaction SpacE): we use BM25 to construct the interaction space; meanwhile, its documents are processed during indexing for shell-style navigation. On BrowseComp-Plus, RISE matches the pure-shell DCI baseline at 78% accuracy with gpt-5.4-mini at roughly one quarter of the per-query cost. At 1M documents, RISE-BM25 reaches 81% on gpt-5.4-mini, whereas DCI on gpt-5.4-nano degrades to 60% with 33 of 100 wall-clock failures.

[IR-14] TA-RAG: Tone-Aware Retrieval-Augmented Generation for Peer-Support Health Communication CIKM2026

Link: https://arxiv.org/abs/2606.06794
Authors: Yong-Bin Kang,Anthony McCosker
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 5 pages, 5 figures, CIKM 2026 submission manuscript

Abstract:Retrieval-augmented generation (RAG) successfully grounds large language model (LLM) outputs in trusted documents, but factual grounding alone is insufficient for sensitive peer-support health communication. In domains such as HIV peer support, responses must also be accessible, stigma-free, empathetic, and tailored to the recipient. This paper presents TA-RAG, a lightweight, prompt-based tone-aware RAG framework that embeds explicit tone control into a RAG pipeline without requiring model fine-tuning. We operationalise tone across four core components: stigma-free rewriting, readability adjustment, recipient adaptation, and empathy rephrasing. We evaluate TA-RAG through component-level tests using questions derived from HIV Online Learning Australia (HOLA), UNAIDS terminology guidance, readability metrics, peer-support standards from the National Association of People with HIV Australia (NAPWHA), and a public empathy dataset. Results show that TA-RAG’s components improve their targeted communication quality while preserving key content. These findings emphasise that prompt-based tone control is a potential direction for making RAG outputs suitable for sensitive peer-support health communication.

[IR-15] Mind the Gap: Bridging Behavioral Silos with LLMs in Multi-Vertical Recommendations

Link: https://arxiv.org/abs/2606.06779
Authors: Nimesh Sinha,Raghav Saboo,Martin Wang,Sudeep Das
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:In multi-vertical e-commerce platforms like DoorDash, relatively newer product verticals such as grocery and retail present a significant opportunity for personalization innovation. A key challenge lies in solving the “cold start” problem for users. This paper introduces a novel framework for enhancing recommendation quality by transferring knowledge from data-rich verticals (e.g., restaurants at DoorDash) to data-sparse ones. We leverage Large Language Models (LLMs) to perform generative inference, synthesizing sparse, high-dimensional features that encapsulate latent user affinities. Specifically, we employ a hierarchical Retrieval-Augmented Generation (RAG) pipeline to derive multi-level taxonomic features from user restaurant order histories and search queries. These generated features, encoding both long-term cross-vertical preferences and short-term intent, are integrated into a production Multi-Task Learning (MTL) ranking model. We demonstrate through extensive offline and online evaluation that this approach significantly improves personalization and engagement in emerging business verticals, effectively bridging the behavioral data gap.

[IR-16] Trading Engagement for Sustainability: Carbon-Aware Re-ranking for E-commerce Recommendations RECSYS

Link: https://arxiv.org/abs/2606.04550
Authors: Noah Lund Syrdal,Anders Vestrum,Jorgen Bergh
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments: 23 pages, 30 figures. Code available at this https URL

Abstract:E-commerce recommender systems strongly influence which products users consider and purchase, yet sustainability signals such as Product Carbon Footprint (PCF) are almost never available at catalog scale. We study carbon-aware product recommendation in the realistic setting where PCF labels are missing for most items and must be inferred. We first estimate product-level carbon footprints via a retrieval-augmented PCF estimation pipeline that transfers supervision from the Carbon Catalogue, a small set of life-cycle-assessed products, to a large unlabeled e-commerce catalog using semantic similarity search, few-shot LLM prompting, and a nearest-neighbour fallback. We then apply a carbon-aware post-hoc re-ranking strategy on top of relevance scores produced by three established recommendation models: BPR, NeuMF, and LightGCN. The method trades off predicted user-item engagement against estimated carbon footprint through a single tunable parameter, lambda. In this offline study, engagement is operationalized through Amazon review interactions, which serve as implicit feedback and as a proxy for user interest or purchase behavior. We evaluate the framework on the Amazon Reviews dataset across three product categories: Home and Kitchen, Sports and Outdoors, and Electronics. By sweeping lambda, we construct Pareto frontiers that characterize the achievable engagement and carbon trade-off for each model and category. Substantial carbon reductions are achievable at minimal engagement cost across all models and categories. However, the available carbon headroom varies by model and category, underscoring the importance of model choice and domain context.
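
A post-hoc re-ranker with a single trade-off parameter can be sketched as follows. The linear form (relevance minus lambda times normalised carbon), the min-max normalisation, and the example items are assumptions made for illustration; the abstract states only that one tunable parameter lambda trades engagement against estimated carbon footprint.

```python
def rerank(items, lam):
    """Re-rank by relevance minus lam times min-max normalised carbon.

    items: list of (item_id, relevance, carbon_kg) tuples.
    lam = 0 reproduces the pure relevance ranking.
    """
    carbons = [carbon for _, _, carbon in items]
    lo, hi = min(carbons), max(carbons)
    span = (hi - lo) or 1.0  # avoid dividing by zero when all are equal

    def score(item):
        _, relevance, carbon = item
        return relevance - lam * (carbon - lo) / span

    ranked = sorted(items, key=score, reverse=True)
    return [item_id for item_id, _, _ in ranked]


# Hypothetical candidates: (id, predicted relevance, estimated kg CO2e).
items = [("kettle", 0.90, 40.0), ("mug", 0.85, 2.0), ("blender", 0.60, 25.0)]
print(rerank(items, 0.0))  # ['kettle', 'mug', 'blender']
print(rerank(items, 0.5))  # ['mug', 'kettle', 'blender']
```

Sweeping `lam` over a grid and recording an engagement metric together with the total estimated carbon of the top-k list at each value gives the points from which a Pareto frontier like the one described in the abstract can be drawn.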

Human-Computer Interaction

[HC-0] How reliable are LLMs when it comes to playing dice?

Link: https://arxiv.org/abs/2606.07515
Authors: Luca Avena,Gianmarco Bet,Bernardo Busoni
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Probability (math.PR)
Comments:

Abstract:We investigate the probabilistic reasoning capabilities of large language models through a controlled benchmarking study on discrete probability problems. We constructed two datasets, respectively a set of standard exercises and a set of counterintuitive exercises, designed to trigger heuristic reasoning, and evaluated 8 state-of-the-art models, each tested with and without Chain-of-Thought prompting. Models achieve an average accuracy of 0.96 on standard problems but only 0.59 on counterintuitive ones. We further provide empirical evidence of token bias: performance drops by over 20% when canonical formulations are replaced by disguised variants. Embedding misleading suggestions in the prompt reduces performance by up to 34%, with no model proving immune. Taken together, the reported findings suggest that current LLMs are not yet genuine probabilistic reasoners, despite their success in advanced mathematical problems.

[HC-1] Re-imagining ISO 26262 in the Age of Autonomous Vehicles: Enhancing Controllability through Transferability and Predictability

Link: https://arxiv.org/abs/2606.07437
Authors: Chaitanya Shinde,Hadi Hajieghrary,Paul Schmitt,Adam Shoemaker,Bodo Seifert,Steve Kenner
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Software Engineering (cs.SE); Systems and Control (eess.SY)
Comments:

Abstract:The ISO 26262 standard defines functional safety for road vehicles through risk assessments based on Severity, Exposure, and Controllability, grounded in a human-driven vehicle paradigm. In the context of autonomous vehicles (AVs), the absence of a human driver necessitates revisiting these principles. This paper decomposes the Controllability placeholder into two auditable evidence dimensions of ISO 26262 by introducing two measurable sub-concepts: Transferability and Predictability. Transferability extends Controllability to capture AV systems’ ability to hand off control to dedicated fallback safety mechanisms, while Predictability captures how easily external agents can anticipate AV behavior. Predictability is formally defined from human-robot interaction-inspired principles, and a mathematical framework is provided to quantify it. A designed-versus-achievable gap is introduced to distinguish architectural fallback claims from scene-conditioned achievable fallback capability. The proposed metrics align with ISO 26262 and ISO/PAS 21448 (SOTIF), rendering fallback and interaction claims falsifiable and traceable across ODD slices. These dimensions complement rather than replace existing standards, and the enhancements preserve the structure of ISO 26262 while extending its applicability to driverless automated systems operating at SAE Levels 4 and 5.

[HC-2] A Model of Integrated Information Processing in Human-AI Interaction

Link: https://arxiv.org/abs/2606.07283
Authors: Tim Schrills,Thomas Franke
Categories: Human-Computer Interaction (cs.HC)
Comments: 22 pages

Abstract:For Human-AI Interaction (HAII) research to move forward, theoretical work linking psychological mechanisms to interface design is needed. Such work should extend rather than replace established HCI and automation research, adapting to the increasing autonomy and agency of AI systems. Building on prior frameworks focused on roles and levels in human interaction with automation, a gap remains from a psychological view: a task-centered, process-oriented account that links mechanisms of action regulation to concrete design and evaluation levers for human-AI coupling, expressed in a unified vocabulary for human and machine. Moreover, existing models may describe how a system is designed (e.g., function allocation in automation) but fall short in showing how this design affects human behavior. We present the Integrated Information Processing (IIP) model, a task-centered, cybernetic model that conceptualizes humans, machines, and their joint activity as coupled control loops. The IIP model uses a unified modeling language for human and artificial agents, making psychological models of action regulation accessible for AI system design. As a core feature, we argue that efficacy within a shared task is characterized by three integration qualities, input adequacy, reference consonance, and output operativity, which critically influence benchmarks of human-centeredness such as transparency and controllability. The model maps interface choices (e.g., XAI techniques) to theory-driven expectations of user behavior, guiding interface design and evaluation. To this end, we present (1) a continuity-preserving theoretical discourse that extends HAII to agency in AI; (2) the IIP model with three information-processing qualities; and (3) applications of the IIP model to exemplary use cases demonstrating implications for interface design.

[HC-3] Moodie: An Early-Stage Design Exploration for Supporting Fear of Missing Out with LLM-based Chatbots

Link: https://arxiv.org/abs/2606.07231
Authors: Hsin-Yu Tsai,Jingxian Liao,Fu-Yin Cherng,Tzu-Hsiang Huang
Subjects: Human-Computer Interaction (cs.HC)
Comments: 7 pages, 1 figure, 1 table. Preliminary work submitted to the ACM CUI 2026 Works-in-Progress (WiP) track

Abstract:The excessive use of social media has led to the challenge known as Fear of Missing Out (FoMO). Existing studies fail to provide accessible, interactive tools that focus on the emotional and cognitive aspects of FoMO. This work presents Moodie, a chatbot designed using Large Language Models to support emotion regulation and reduce FoMO. We conducted a formative study to understand the needs of individuals with FoMO and developed Moodie. Then, we conducted a preliminary evaluative study (N=21) to observe how participants interact with Moodie and a baseline chatbot (GPT-4o) over one week. The results show that while both Moodie and a baseline chatbot reduced FoMO to a similar extent, Moodie resulted in greater engagement and social connection. This finding raises interesting questions about the advantages of purpose-built chatbots compared to general-purpose models for mental health support. Future research will include chat log analysis, prototype refinements, and longitudinal evaluations.

[HC-4] CANote: Empowering Fact-checking Note Writing Through Scaffolded and Provenance-based Human-AI Collaboration

Link: https://arxiv.org/abs/2606.07101
Authors: Shuning Zhang,Jingruo Chen,Yuwei Chuai,Dai Shi,Yifan Wang,Xin Yi,Hewu Li
Subjects: Human-Computer Interaction (cs.HC)
Comments:

Abstract:Crowdsourced fact-checking mechanisms, such as X’s Community Notes, play a critical role in mitigating the spread of misinformation. However, drafting high-quality, evidence-based debunking notes imposes a substantial burden on contributors. We present CANote, an AI-assisted debunking note writing system featuring evidence correlation and structured co-drafting. CANote scaffolds the workflow by extracting subclaims from social media posts, providing provenance through explicit links between subclaims and retrieved evidence, and generating neutral, structural drafts to support human reasoning. We evaluated CANote against manual writing (N=52 fact-checkers, N=52 lay users) on a simulated X platform, where we found that CANote significantly improves note quality. Notably, CANote enables lay users to write notes of comparable quality to those written by experts. While the task completion time and perceived cognitive load remain comparable to manual drafting, CANote significantly increases user satisfaction. However, this assistance introduces a trade-off, resulting in a reduced sense of user ownership and control over the debunking note.

[HC-5] A Multi-Operator Mixed-Reality Interface for Multi-Robot Control and Coordination: Co-Located and Private Workspace Collaboration

Link: https://arxiv.org/abs/2606.07013
Authors: Omotoye Shamsudeen Adekoya,Antonio Sgorbissa,Carmine Tommaso Recchiuto
Subjects: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
Comments: Submitted to RO-MAN 2026

Abstract:Multi-operator control of robot teams requires not only access to the same mission information, but also mechanisms for maintaining shared awareness and preventing conflicting interventions. Building on our previous HORUS interface (Holistic Operational Reality for Unified Systems) we present a mixed-reality interface that extends single-operator multi-robot supervision to collaborative multi-operator use. The system supports two complementary modes: a co-located shared workspace, in which operators observe and manipulate the same mini-map in the same physical location, and a private-workspace mode, in which operators work on the same mission through independently placed local workspaces. The architecture combines registration-driven scene construction, lightweight shared-session synchronization, and per-robot control leases to support collaborative monitoring, tasking, and teleoperation while preventing conflicting commands. We evaluated the approach in a human-subject study with 36 participants (18 pairs) controlling three Nova Carter mobile robots in two search environments. The performance of the objective task was comparable across the two modes, indicating that both modes supported effective mission execution. However, the co-located shared workspace significantly improved perceived collaboration, shared understanding, and handoff clarity, and was the preferred collaborative mode. These results indicate that physically co-locating the MR workspace improves how operators coordinate even when the underlying robot-control tools remain unchanged.

[HC-6] Personality Anchoring for Social Simulation: Linking Personality, Social Behavior, and Interaction Success with LLM Agents

Link: https://arxiv.org/abs/2606.06936
Authors: Vahid Sadiri Javadi,Aksa Aksa,Fryderyk Róg,Lucie Flek,Johanne R. Trippas
Subjects: Human-Computer Interaction (cs.HC)
Comments:

Abstract:Social interactions are shaped by the interplay of dispositional traits and situational context, yet systematically investigating how personality configurations between individuals jointly influence social behavior across diverse social contexts remains methodologically challenging. We address this gap by introducing a simulation pipeline adapted from the CHARISMA framework, which employs well-known movie characters and public figures as psychologically grounded agents for multi-LLM social simulation using a method we term personality anchoring. We present a large-scale empirical study examining how dyadic Agreeableness composition influences social interaction outcomes across 1,010 simulated conversations. Our results reveal a monotonic relationship between dyadic Agreeableness composition and shared goal achievement, with Homogeneous-Agreeable pairs achieving success at 10 times the rate of Homogeneous-Disagreeable pairs (62% vs. 6%). Behavioral mediation analysis reveals that Agreeableness shapes goal achievement partially through cooperative strategy selection, though it continues to predict outcomes within the same dominant strategy, indicating pathways beyond observable conversational behavior. Robustness analyses confirm high consistency of results across repeated simulations (ICC = 0.89) and stable personality expression across diverse scenarios, validating personality anchoring as a viable operationalization strategy.

[HC-7] Toward a Metaphysics of Learning Analytics: Ontological Positioning of Data, Inference, and Normativity

Link: https://arxiv.org/abs/2606.06851
Authors: Kensuke Takii
Subjects: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: 25 pages, 1 figure

Abstract:The Learning Analytics (LA) community has undergone rapid development over the 15 years since the first LAK conference was held. However, while epistemological and ethical debates regarding the philosophical foundations of LA have been vigorous, metaphysical discussions have been sparse, signifying a lack of effort to derive the identity of LA from its internal principles. In this paper, we attempt to establish a metaphysics of LA by addressing the ontological question of “What is LA?” We do so by tracing back to LA’s own definitions and principles to derive an answer from within LA itself. Specifically, we address what kind of existence the data LA operates on constitutes, identify eight agents including learners as ontological prerequisites, and clarify, via the is/ought problem, that LA does not derive norms from data. In particular, this system reveals that a class of LA practices, here termed norm-embedded LA, conflates LA’s purpose with its operations, creating an ontological tension with the first principle. We also discuss connections with related fields and the limitations of this system. The metaphysics outlined here is not imposed from outside LA, but surfaces what LA itself has always implicitly presupposed.

[HC-8] Exploring Reinforcement Learning for Fluid Transitions Between Clinical Mental Healthcare and Everyday Wellness Support

Link: https://arxiv.org/abs/2606.06800
Authors: Tony Wang,Qian Yang
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Abstract:Mental health struggles wax and wane, yet clinical and wellness interventions typically operate separately, causing frequent breakdowns at care transitions. We explore reinforcement learning (RL) as a means to build digital health systems that deliver clinical and wellness interventions proactively, as part of a coherent care journey. We ask: what complexities does designing such a system involve? We built a contextual bandit that dynamically selects journaling prompts from clinical and wellness repertoires to optimize for an overarching health goal (sustained journaling) and deployed it in a four-week exploratory study (N=38). We found that, first, many benefits of RL-optimized intervention sequences appeared only after interventions ended, raising the question: Should systems that offer coherent clinical-wellness care journeys include stepping-back periods? If so, when and how? Second, participants most engaged with RL-generated interventions deepened their engagement over time, while those most engaged with a constant intervention tended to burn out and drop out later. This raises the question: When should a system blending clinical and wellness interventions reduce intensity to prevent burnout, versus sustain it to maximize treatment gains?
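
The contextual-bandit idea in this abstract can be sketched roughly as follows. This is a minimal epsilon-greedy illustration, not the authors' system: the abstract does not name the algorithm, so the class name, the context label, the prompt names, the reward definition (1.0 if the participant journals, 0.0 otherwise), and the epsilon value are all hypothetical.

```python
import random
from collections import defaultdict


class EpsilonGreedyPromptBandit:
    """Toy contextual bandit: one running-mean reward per (context, prompt)."""

    def __init__(self, prompts, epsilon=0.1, seed=0):
        self.prompts = list(prompts)
        self.epsilon = epsilon          # probability of exploring at random
        self.rng = random.Random(seed)
        self.counts = defaultdict(int)
        self.values = defaultdict(float)

    def select(self, context):
        # Explore with probability epsilon, otherwise pick the best estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.prompts)
        return max(self.prompts, key=lambda p: self.values[(context, p)])

    def update(self, context, prompt, reward):
        # Incremental running mean of the observed rewards.
        key = (context, prompt)
        self.counts[key] += 1
        self.values[key] += (reward - self.values[key]) / self.counts[key]


# Hypothetical usage: epsilon=0.0 makes the choice purely greedy.
bandit = EpsilonGreedyPromptBandit(
    ["clinical_prompt", "wellness_prompt"], epsilon=0.0
)
bandit.update("low_mood", "clinical_prompt", 1.0)   # participant journaled
bandit.update("low_mood", "wellness_prompt", 0.0)   # participant skipped
choice = bandit.select("low_mood")
```

A deployed system would likely use richer context features and a less naive exploration strategy; the point here is only the select-then-update loop.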

[HC-9] Explain Like I'm 5 or Whatever I Choose: Evaluating the Interactive Potential of Language Model Responses

Link: https://arxiv.org/abs/2606.06788
Authors: Indu Panigrahi,Tal August
Subjects: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: Preprint

Abstract:Evaluations of large language models (LLMs) in scientific information seeking tasks have become increasingly use-centric, such as conducting live or multi-turn evaluations with real users. These evaluations still assume a single, static chat interface, but as models are integrated into new interfaces, evaluations must shift to incorporate interface-specific criteria. We propose a new evaluation framework based on a formative study with 16 participants that tests models’ ability to generate multiple responses to one query that differ along an interpretable axis of language (language complexity), inspired by direct manipulation interfaces from human-centered design literature. We evaluate GPT-5.1, GPT-5 mini, Claude Sonnet 4.5 + Thinking, and DeepSeek-V3.1 by generating 5 responses at different levels of language complexity for 98 scientific queries. While models vary complexity across responses, most changes remain inconsistent, with the best performing model (Claude Sonnet 4.5) only shifting reliable complexity measures in the correct direction 46% of the time. Our findings hold with increased sample size and alternative complexity levels.

[HC-10] Adversarial Co-Thinking: Calibration and Triangulation Across Multiple GenAI Tools in HCI Writing

Link: https://arxiv.org/abs/2606.06702
Authors: Pia Tukkinen
Subjects: Human-Computer Interaction (cs.HC)
Comments:

Abstract:This paper examines what happens when GenAI tools are fully embedded in the drafting of an academic paper rather than confined to late-stage polishing. To investigate how an intensive multi-tool GenAI workflow differs from conventional academic writing, I drafted this paper from the first sentence in parallel with three GenAI tools - Claude, ChatGPT, and Gemini - comparing their outputs against my own intended contribution. Across this process, a recurring pattern took shape that I call adversarial co-thinking: using past peer reviews to calibrate the tools, then setting their outputs against one another to be tested rather than deferred to. I argue that surfacing genuine critique from tools that default to praise is a central practical challenge of working with these tools, and that the skill at stake is evaluative rather than generative. Adversarial co-thinking is a high-skill epistemic practice: it can amplify expertise where it exists, but it can also mask its absence. I further argue that current disclosure frameworks are poorly equipped to capture this shift. The paper offers four propositions for workshop discussion concerning autonomy, supervision, equity of access, and disclosure.

[HC-11] LinkNav: Surfacing Interconnected Information in Scientific Articles ACL2026

Link: https://arxiv.org/abs/2606.06650
Authors: Sebastian Joseph,Jennifer Healey,Junyi Jessy Li,Ani Nenkova
Subjects: Human-Computer Interaction (cs.HC)
Comments: 10 pages, 3 figures, ACL 2026 (Demo Track)

Abstract:We present LinkNav, an enhanced experience for reading academic papers which makes explicit connections between related but non-adjacent passages. To create the experience, we instruct a language model to generate questions that may arise while reading a passage and then search for answer passages elsewhere in the document, forming intra-document connections when answers are found. We confirm that these building blocks work well to power the experience, with an answer detection pipeline that works with high precision, resulting in a reasonable number of connections being made for a document. On a dataset of academic papers, we find that connected passages are on average ten segments away from each other, making explicit connections that a reader may have otherwise missed.

[HC-12] Re-Centering Humans in LLM Personalization

Link: https://arxiv.org/abs/2606.06614
Authors: Lechen Zhang,Jiarui Liu,Tal August
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Despite growing interest, most evaluations of large language models’ (LLMs’) personalization abilities have relied on synthetic data. It remains unclear how well current personalization systems work for real users. In this paper, we study the gap in LLM personalization performance when using synthetic versus human data. We collect human conversations (550 conversations) and judgments across three stages of personalization: extracting user attributes from conversations (5,949 judgments), pairing relevant attributes with new prompts (11,919), and incorporating relevant attributes into a personalized response (1,101). Incorporating human data reveals system limitations at each stage. Models struggle to extract attributes from human conversations, disagree with human judgments on relevant attributes, and generate personalized responses that humans judge no better than generic responses (though LLM judges widely rate them as better). We introduce two lightweight training-based interventions that shift automated personalization evaluation closer to human data in our first two stages. However, in our third stage we find that learned reward models achieve only modest correlation with human ratings, suggesting that human-aligned personalization quality judgments are difficult to model directly. Our collected data provides a foundation for studying how models should extract, select, and incorporate user information in ways that humans find useful.

[HC-13] AI Level of Detail: Distance-Aware ML Model Precision Selection for Real-Time Human Motion Prediction in Games SIGGRAPH

Link: https://arxiv.org/abs/2606.06565
Authors: Mathew Varghese
Subjects: Graphics (cs.GR); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: Camera-ready for SIGGRAPH Technical Workshops 2026
ACM classes: I.3.7; I.2.6

Abstract:Modern game engines spend significant compute animating NPCs with learned motion models. This paper proposes AI Level of Detail (AI LOD), a framework in which machine learning inference precision is adapted based on the distance between each NPC and the player camera. The core idea mirrors classical geometry LOD: substitute a cheaper approximation where the difference is imperceptible. Here, the approximation is a lower-precision quantized machine learning model rather than a lower-polygon mesh. The contribution of this work is the AI LOD concept itself: that inference-time quantization can serve as the LOD axis for AI-driven character animation - and more broadly, for any AI-based runtime system where perceptual sensitivity varies with context. The convolutional sequence-to-sequence model of Li et al. is used as a representative example to demonstrate the concept, with its trained checkpoint exported into three ONNX Runtime variants (FP32, FP16, and INT8 per-tensor), intended to be routed by a distance-based selector at runtime. Evaluation on the CMU Mocap dataset provides initial evidence that each precision tier can be served at its assigned distance range with negligible perceptible degradation, supporting the broader premise that distance-aware ML model precision selection is a viable LOD strategy for AI-based character animation.
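
A distance-based selector of the kind this abstract mentions might look like the sketch below. The abstract names the three precision tiers (FP32, FP16, INT8) but gives no distance ranges, so the thresholds, the tier labels as strings, and the function name are assumptions made purely for illustration.

```python
from bisect import bisect_right

# Hypothetical camera-distance boundaries (arbitrary units) between tiers.
TIER_BOUNDARIES = (10.0, 30.0)
# Tiers ordered from nearest (highest precision) to farthest (lowest).
TIER_NAMES = ("fp32", "fp16", "int8")


def select_precision(distance, boundaries=TIER_BOUNDARIES, tiers=TIER_NAMES):
    """Return the model tier to run for an NPC at the given camera distance."""
    if distance < 0:
        raise ValueError("distance must be non-negative")
    # bisect_right counts how many boundaries are <= distance.
    return tiers[bisect_right(boundaries, distance)]


near = select_precision(5.0)     # closer than the first boundary
mid = select_precision(10.0)     # exactly on a boundary goes to the next tier
far = select_precision(120.0)    # beyond the last boundary
```

In an engine, the returned label would index into a table of preloaded inference sessions; adding hysteresis around each boundary is a common way to avoid flicker when an NPC hovers near a threshold.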

[HC-14] MacArena: Benchmarking Computer Use Agents on an Online macOS Environment ICML2026

Link: https://arxiv.org/abs/2606.06560
Authors: Victor Muryn,Maksym Shamrai,Sofiia Mazepa,Yehor Khodysko
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Accepted to the Second Workshop on Agents in the Wild: Safety, Security, and Beyond (AIWILD) at ICML 2026

Abstract:Computer-use agents (CUAs) operate graphical user interfaces (GUIs) through vision and control primitives, and their capabilities have advanced rapidly, driven in part by standardized online evaluation benchmarks such as OSWorld, which serve both as evaluation tools and as training environments for reinforcement learning. However, macOS remains underserved in this landscape: the only existing benchmark, macOSWorld, covers a narrow slice of first-party applications with simpler tasks, and runs on x86 virtual machines incompatible with Apple Silicon. We introduce MacArena, a benchmark of 421 manually verified tasks spanning 50 applications that combines a curated port of OSWorld tasks, content sourced from macOSWorld, and 49 new macOS-native tasks, all running on Apple’s native Virtualization framework on Apple Silicon. We argue that macOS presents distinct GUI challenges beyond what Linux-based benchmarks capture, and our evaluation supports this claim: strong model performance on existing benchmarks can reflect familiarity with task distributions rather than genuine cross-platform GUI competence. Notably, model rankings invert between ported and macOS-native tasks, with a leading model trailing by over 26% on the MacArena subset, suggesting that macOS poses a genuinely harder environment for current GUI agents.

[HC-15] Real-Time AttentionBender: Granular Interactive Network Bending of Video Diffusion Transformers

Link: https://arxiv.org/abs/2606.06497
Authors: Adam Cole,Mick Grierson
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: In review. 5 pages, 4 figures

Abstract:Generative video models have achieved remarkable visual fidelity, yet their prompt-only interface offers thin creative agency and obscures the model’s material process from the artists working with it. We present Real-Time AttentionBender, a tool that extends the practice of network bending across the full depth of the video diffusion transformer (DiT) and brings it into live, interactive generation. Built as a plugin within the DayDream Scope ecosystem and wrapping open-source real-time Wan pipelines, the tool exposes self-attention, cross-attention, and the feed-forward network as independently manipulable surfaces, with targeting down to individual diffusion steps, DiT layers, prompt tokens, and hidden neurons. The immediacy of live manipulation affords what we call “material intimacy” with the model: a responsive, near-mechanistic feel for how specific layers and neurons shape generated video. We position the tool as simultaneously an XAIxArts probe into transformer internals and an expressive instrument for discovering aesthetics outside the model’s default representational space.

[HC-16] Sustainability by Design in Decentralized Autonomous Organizations: An Empirical Review of Governance Innovation and Institutional Design

Link: https://arxiv.org/abs/2606.05667
Authors: Yutian Wang,Luyao Zhang
Subjects: Computers and Society (cs.CY); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Social and Information Networks (cs.SI); General Economics (econ.GN)
Comments:

Abstract:Recent innovation theories in economics remain largely grounded in assumptions of hierarchical firms and closed organizational boundaries, offering limited insight into how innovation unfolds within decentralized, digitally native organizations. Decentralized Autonomous Organizations (DAOs) represent an emerging form of innovation ecosystem characterized by blockchain-based transparency, open participation, and token-driven governance, in which sustainability can be embedded directly into organizational design. This study compares two standards, ERC-8004 and Google A2A, which address the same agent interoperability question; the former is governed by a DAO and the latter by a corporate consortium. They are examined through an LLM-powered comparative pipeline for large-scale governance discourse analysis, integrating automated annotation, neural topic modeling, and multi-layer network analysis to study socio-technical power structures. The study provides evidence-based insights for scholars, policymakers, and designers seeking to align innovation, technological governance, and sustainability in future organizational forms.

Computer Vision

[CV-0] UniSHARP: Universal Sharp Monocular View Synthesis

Link: https://arxiv.org/abs/2606.07514
Authors: Meixi Song,Dizhe Zhang,Hao Ren,Ruiyang Zhang,Bo Du,Ming-Hsuan Yang,Lu Qi
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:In this work, we focus on extending SHARP, the popular photorealistic view synthesis method, for universal monocular rendering across a continuum of camera systems, from conventional perspective cameras to wide-field-of-view, fisheye and omnidirectional panoramic settings. To overcome the pinhole-specific assumptions of SHARP, our key idea is to align various images in a unified omnidirectional latent space. Thus, we propose UniSHARP, which performs implicit alignment in both feature and Gaussian spaces. Specifically, Gaussian primitives are arranged along rays and radial distances in a ray-based universal representation, while 2D semantic and 3D spatial features extracted from UniK3D-inspired encoders are jointly decoded to generate the complete Gaussian cloud. To comprehensively evaluate our method, we construct a benchmark covering diverse imaging systems across various scenes. The benchmark is further stratified by field of view (FoV) to enable fine-grained assessment of the universal monocular rendering task. Extensive experiments on the proposed benchmark demonstrate the effectiveness of UniSHARP, outperforming alternative methods by a large margin. The project page can be found at: this https URL

[CV-1] Streaming Video Generation with Streaming Force Control

Link: https://arxiv.org/abs/2606.07508
Authors: Hanhui Wang,Yiming Xie,Haiwen Feng,Zhaoyang Lv,Shenlong Wang,Huaizu Jiang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We introduce StreamForce, a streaming video generation framework that enables physically grounded control through continuous force inputs. Unlike prior video models that train separate models for different force types, assume fixed forces, or rely on non-causal processing, StreamForce is a causal and unified model that responds instantly and coherently to both local and global, time-varying forces. To achieve this, we design a unified force representation as a control signal and develop a distillation pipeline for force-controllable video generation. Our model combines autoregressive efficiency with force responsiveness, sustaining stable photometric and dynamic realism. StreamForce runs at up to 16.6 FPS on a single GPU, achieving state-of-the-art performance in both force adherence and motion realism. Project website: this https URL

[CV-2] Differences in Detection: Explainability Where it Matters CVPR

Link: https://arxiv.org/abs/2606.07503
Authors: Johannes Theodoridis,Johannes Maucher,Andreas Schilling
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops 2026 - How Do Vision Models Work? (HOW)

Abstract:We propose Differences in Detection (DnD), an intuitive method to compare two object detection models. Based on the same matching algorithm, it complements the standard metrics of mean Average Precision (mAP) and TIDE error analysis with the ability to compare two models directly. More specifically, we calculate the intersection of ground truth labels that are recognized by both models, followed by the corresponding difference sets and the complement set of ground truth labels that are missed by both models. The resulting comparison is more direct and intuitive than a comparison of independent summary statistics. It reveals individual and shared mistakes and becomes particularly interesting when combined with error types. In this case, the differences in detection errors can be analyzed naturally in a standard confusion matrix. While valuable in itself, we believe that one of the best applications of DnD is to guide explainability methods such as ODAM towards metric-relevant examples, grounded in structured subsets. The code for our method is available here: this https URL
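
The set computation described in this abstract (intersection, the two difference sets, and the complement) can be illustrated with plain Python sets. This is a sketch under simplifying assumptions, not the authors' released code: it assumes detections have already been matched to ground truth, so each model is summarized by the IDs of the ground-truth labels it recognized. The function name and the integer label IDs are hypothetical.

```python
def differences_in_detection(ground_truth, detected_a, detected_b):
    """Partition ground-truth label IDs by which of two models recognized them."""
    gt = set(ground_truth)
    a = set(detected_a) & gt   # ignore IDs that are not ground-truth labels
    b = set(detected_b) & gt
    return {
        "both": a & b,            # recognized by both models
        "only_a": a - b,          # recognized by model A only
        "only_b": b - a,          # recognized by model B only
        "neither": gt - (a | b),  # missed by both models
    }


# Hypothetical example with five ground-truth labels.
result = differences_in_detection(
    ground_truth={1, 2, 3, 4, 5},
    detected_a={1, 2, 3},
    detected_b={2, 3, 4},
)
```

The four subsets are disjoint and together cover every ground-truth label, which is what makes them convenient to cross-tabulate against error types.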

[CV-3] Implicit Data Synthesis for Contrastive Unsupervised Data Augmentation

Link: https://arxiv.org/abs/2606.07498
Authors: Patrick Kage,Trevor Hedges,N. Siddharth,Pavlos Andreadis
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 3 figures, 2 tables

Abstract:Scientific observations generate large quantities of unlabeled data which is laborious to hand-label, making unsupervised learning techniques valuable for processing datasets. Among these approaches, contrastive learning provides a convenient mechanism for extracting structural representations from unannotated datasets. For natural imagery, the general approach is to use a variety of data-space augmentation methods in order to generate synthetic samples; however, for scientific observations data-space perturbations can fundamentally alter the underlying data. Our proposed method is to generate contrastive samples by perturbing the network weights rather than the underlying data, thus more closely preserving the structure of the data. We demonstrate this technique using a SimCLR-based pipeline applied over radar observations of meteors, and show performance gains under matched protocols.
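
To make the weight-perturbation idea concrete, the sketch below produces two "views" of one sample by adding noise to an encoder's weights while leaving the data untouched. It is only a toy illustration: the abstract does not say how the weights are perturbed, so the Gaussian noise, the `sigma` value, the single linear layer, and all function names are assumptions.

```python
import random


def encode(weights, x):
    """Apply a single linear layer (a list of weight rows) to the vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def perturb(weights, sigma, rng):
    """Return a copy of the weights with Gaussian noise added to each entry."""
    return [[w + rng.gauss(0.0, sigma) for w in row] for row in weights]


def contrastive_pair(x, weights, sigma=0.05, seed=0):
    """Encode the same, unmodified sample under two independent perturbations."""
    rng = random.Random(seed)
    view_a = encode(perturb(weights, sigma, rng), x)
    view_b = encode(perturb(weights, sigma, rng), x)
    return view_a, view_b


weights = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical 2x2 encoder
sample = [2.0, 3.0]                  # the input itself is never augmented
view_a, view_b = contrastive_pair(sample, weights)
clean_a, clean_b = contrastive_pair(sample, weights, sigma=0.0)
```

In a SimCLR-style pipeline the two views would then be treated as a positive pair in the contrastive loss, in place of two data-space augmentations of the same image.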

[CV-4] Planning-aligned Token Compression for Long-Context Autonomous Driving

Link: https://arxiv.org/abs/2606.07464
Authors: Zhixuan Liang,Yuxiao Chen,Yurong You,Peter Karkus,Wenhao Ding,Boyi Li,Alexander Popov,Yan Wang,Maximilian Igl,Yiming Li,Danfei Xu,Nikolai Smolyanskiy,Boris Ivanovic,Ping Luo,Marco Pavone
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages

Abstract:Monolithic vision-action models represent an emerging paradigm in autonomous driving. However, this architecture produces token sequences that quickly exceed real-time computational budgets when encoding extended temporal context for complex interactions. While approaches like linear transformers and external memory try to make the context lightweight, token compression is most compatible with the architecture as it requires no backbone modifications. Yet existing compression adopts rule-based heuristics like temporal decay, decoupled from planning, risking loss of decision-critical information. We propose COMPACT-VA, a planning-aligned working memory framework built on conditional VQ-VAE, compressing extended context into bounded representations. Compression is conditioned on both historical trajectory and a learned planning intent that the posterior encoder distills from future trajectories during training, while the prior encoder learns to predict it from compressed observations. The compressed memory, concatenated with the predicted latent, feeds the policy for end-to-end optimization, planning with retained decision-critical information. We evaluate on high-signal dynamic scenarios where historical context is most critical for behavior correctness (e.g., stop, yield, or proceed), and accordingly design behavioral metrics. Under comparable token budgets, we achieve a 6% improvement (68.3%) in success rate with consistent gains across metrics. Ablations validate planning-aligned coupling effectiveness. Closed-loop evaluation confirms that COMPACT-VA maintains general driving performance with a 3.3× speedup and 2.7× memory reduction over uncompressed processing.

[CV-5] Skill-3D: Evolving Scene-Aware Skills for Agentic 3D Spatial Reasoning

Link: https://arxiv.org/abs/2606.07436
Authors: Haoyuan Li,Zhengdong Hu,Jun Wang,Hehe Fan,Yi Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This paper explores agentic 3D spatial understanding, i.e., MLLM agents performing 3D reasoning through tool use. Existing methods often misuse tools and exhibit biased tool preferences under 3D scenarios, leaving the agentic paradigm with only marginal gains over non-agentic strategies. We reveal that 3D spatial reasoning tasks are heterogeneous across scenes, while these agents apply a uniform tool-use strategy to all scenes rather than selecting tools according to the specific scene and task. To address this, we propose Skill-3D, a framework that learns self-evolving scene-aware skills. Specifically, Skill-3D identifies the task scene and records the agent’s tool-use trajectory into a Scene Memory, where successful trajectories from similar scenes are aggregated and distilled into a reusable scene-aware skill, with failed ones attached to the skill as lessons. During training, once a similar scene recurs, the corresponding skill is injected to guide the agent, producing new trajectories whose successes and failures further refine the skill, forming a loop in which the memory and the skill library co-evolve. Experiments show that Skill-3D substantially improves tool utilization in 3D spatial reasoning (from 39% to 78% on VSI-Bench), driving the agent toward correct and sufficient tool use. For instance, it improves Gemini-3-Flash by 67% on MMSI-Bench. Furthermore, we conduct agentic post-training over skill-guided trajectories, which boosts Qwen3-VL-8B by 43% on VSI-Bench.

[CV-6] Watch, Remember, Reason: Human-View Video Understanding with MLLMs

Link: https://arxiv.org/abs/2606.07433
Authors: Jiahao Meng,Yue Tan,Qi Xu,Kuan Gao,Weisong Liu,Yanwei Li,Jason Li,Lingdong Kong,Haochen Wang,Qianyu Zhou,Jiangning Zhang,Guangliang Cheng,Yunhai Tong,Lu Qi,Minghsuan Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments:

Abstract:Video understanding is being rapidly transformed by multimodal large language models (MLLMs), as research moves from short clips to long, multimodal, and knowledge-intensive video scenarios. These scenarios require models to handle sparse evidence, long-range dependencies, multimodal alignment, and reliable inference under limited computational budgets. This work presents a human-view perspective on LLM-based video understanding, organized around three functional abilities: watching, remembering, and reasoning. Rather than treating video tasks as isolated benchmarks, this view provides a unified structure for analyzing how video MLLMs acquire evidence, preserve context, and produce grounded outputs. We introduce a formulation that characterizes video understanding systems by their perceptual representations, memory states, reasoning traces, and final predictions. Based on this formulation, we identify challenges in spatio-temporal perception, efficient long-video processing, memory modeling, streaming understanding, and faithful reasoning. Representative methods are organized by their roles in video MLLM systems. Watching covers fine-grained, comprehensive, audio-visual, and efficient perception. Remembering includes offline and streaming memory, while reasoning covers text-only reasoning and thinking with videos. We further examine application domains such as egocentric, sports, instructional, medical, and narrative videos, and cover training datasets and evaluation benchmarks across task types, supervision formats, modalities, and capability dimensions. Finally, we outline open problems and future directions for scalable, memory-aware, and evidence-grounded video intelligence. Related works will be continuously traced at this https URL.

[CV-7] OpenGlass: Open-Source Smart Glasses for On-Device Event-Based Gesture Recognition

Link: https://arxiv.org/abs/2606.07431
Authors: Pietro Bonazzi,Julian Moosmann,Ahmet Celik,Philipp Mayer,Michele Magno
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Smart eyewear enables unobtrusive, context-aware interaction through multimodal sensors and on-device intelligence, but is severely limited by power, memory, and compute constraints in a compact form factor. Open-hardware platforms supporting event-based vision and embedded ML at this scale are rare. This work introduces an open-source smart glasses platform for rapid prototyping of novel sensors and algorithms. Its modular design uses a flexible FPC interposer to support both event-based and frame-based cameras without full PCB redesign. A hardware-software co-designed power management system combines a configurable PMIC with event-driven wake-up via an nRF5340 coordinator, keeping the GAP9 RISC-V SoC powered down between inferences. The prototype achieves up to 11.8 hours of continuous on-device ML from a 200 mAh battery. As a demonstration, an egocentric hand gesture recognition pipeline was evaluated on the LynX dataset using polarity-separated event histograms from a Prophesee GENX320 camera. R(2+1)D achieved the best cross-subject accuracy of 83.94% (macro F1 = 0.781) under leave-two-subjects-out validation, with 33.9 ms end-to-end latency on the GAP9. Temporal augmentation and removal of ambiguous classes provided the largest gains (+8.9 pp). All hardware designs, firmware, and models are released open source.
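
The "polarity-separated event histograms" mentioned above can be illustrated with a minimal sketch. It assumes each event is an `(x, y, timestamp, polarity)` tuple and that ON and OFF events are counted into two separate per-pixel grids; the tuple layout, the function name, and the tiny 2x2 sensor size are illustrative assumptions, not the authors' actual preprocessing.

```python
from typing import Iterable, List, Tuple

# Hypothetical event layout: (x, y, timestamp_seconds, polarity), polarity 1 = ON, 0 = OFF.
Event = Tuple[int, int, float, int]


def polarity_histograms(events: Iterable[Event], width: int, height: int) -> List[List[List[int]]]:
    """Count events per pixel into a [2][height][width] grid (channel 0 = OFF, channel 1 = ON)."""
    hist = [[[0] * width for _ in range(height)] for _ in range(2)]
    for x, y, _timestamp, polarity in events:
        # Ignore events that fall outside the sensor area.
        if 0 <= x < width and 0 <= y < height:
            channel = 1 if polarity > 0 else 0
            hist[channel][y][x] += 1
    return hist


events = [(0, 0, 0.001, 1), (0, 0, 0.002, 1), (1, 0, 0.003, 0), (5, 5, 0.004, 1)]
hist = polarity_histograms(events, width=2, height=2)
```

A stack of such two-channel grids over consecutive time windows would form a video-like tensor that a model such as R(2+1)D could consume; the windowing itself is not shown here.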

[CV-8] DisPOSE: Projected Polystochastic Diffusion for Self-Supervised Multi-View 3D Human Pose Estimation

Link: https://arxiv.org/abs/2606.07419
Authors: Tony Danjun Wang,Tolga Birdal,Nassir Navab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recovering 3D human poses for multiple individuals from different camera views is a fundamental bottleneck for analyzing interacting behaviors. Existing self-supervised approaches leverage synthetic catalogues of 3D poses; however, this leads to poor generalization in real-world scenarios due to distribution shifts. We therefore introduce DisPOSE, a self-supervised framework that approximates the inherently discrete multi-view person-assignment problem as a generative diffusion process over the space of polystochastic tensors. By employing differentiable Sinkhorn projections during denoising, our model learns to guide solutions toward valid and feasible assignments based on 2D image priors. The complete 3D skeletons of localized individuals are then regressed using a Hypergraph-Convolutional Decoder that explicitly models relational structures and articulated joints across multiple views. The proposed approach outperforms current state-of-the-art self-supervised methods on standard datasets and demonstrates strong performance on a newly proposed benchmark featuring highly occluded scenes from surgical operating rooms. Our diffusion-based localization demonstrates high label efficiency, retaining 99% of its performance with only 10% of the pseudo-labels. Notably, disentangling the assignment and root regression components while maintaining differentiability makes DisPOSE nearly agnostic to different camera arrangements.
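
As a rough illustration of the Sinkhorn projection idea, the sketch below handles only the two-dimensional case: it alternately normalizes the rows and columns of a positive score matrix until it is approximately doubly stochastic. The paper works with polystochastic tensors across multiple views and uses the projection inside a diffusion denoiser, neither of which is reproduced here; the function name and the example scores are assumptions.

```python
import math
from typing import List


def sinkhorn(scores: List[List[float]], n_iters: int = 200) -> List[List[float]]:
    """Project a square score matrix toward the set of doubly stochastic matrices."""
    n = len(scores)
    # Exponentiate so every entry is strictly positive.
    m = [[math.exp(s) for s in row] for row in scores]
    for _ in range(n_iters):
        # Row normalization: every row sums to 1.
        m = [[v / sum(row) for v in row] for row in m]
        # Column normalization: every column sums to 1.
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return m


scores = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.5], [0.1, 0.0, 3.0]]
plan = sinkhorn(scores)
```

Every step is a composition of exponentials, sums, and divisions, which is why the same loop stays differentiable when written in an autodiff framework.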

[CV-9] RealDocBench: A Benchmark for Field-Level QA and Layout Understanding on Real-World Regulated Documents

Link: https://arxiv.org/abs/2606.07401
Authors: Ameya Joshi,Joon Kim,Gus Eggert,Joseph Bajor,Cindy Hao,Jing Reyhan,Kushal Byatnal,Eli Badgio
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Document parsing systems are increasingly deployed in high-stakes, regulated workflows such as mortgage underwriting, financial reporting, supply-chain logistics, and clinical records. Yet most public benchmarks evaluate parsers on clean academic layouts or synthetic prose, and report a single OCR or markdown-level similarity score. Such documents and metrics correlate poorly with what downstream agents actually need: the correct value for a specific field on a messy real-world page. We introduce RealDocBench, a two-track benchmark built from real regulated documents. The QA track contains 1,356 field-level questions over 581 documents spanning four domains, where each question is paired with a typed gold_dict of key-to-value answers and parsers are scored on both per-field and strict per-question accuracy. The layout track contains 1,500 human-verified page images annotated with COCO-style bounding boxes under a nine-class public taxonomy, scored with a Hungarian matcher that includes adjacency-aware split/merge recovery. We evaluate eighteen systems, spanning commercial parsing APIs, general-purpose VLMs, and open-source OCR models, under a uniform extraction-and-scoring protocol, and report accuracy alongside per-page cost and cache-busted latency. RealDocBench exposes a wide performance spread that single-number benchmarks hide, a persistently hard medical sub-domain, and sharp cost/latency trade-offs across operating points. We release the datasets, parser adapters, and evaluation harness to support reproducible, field-level comparison of document parsing systems.
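
The difference between per-field and strict per-question accuracy can be sketched as follows. The sketch assumes each question carries a gold dictionary of key-to-value answers and that a field counts as correct on a case-insensitive, whitespace-normalized exact match; the matching rule, the field names, and the helper names are illustrative assumptions rather than the benchmark's actual scoring code.

```python
from typing import Dict, List, Optional, Tuple


def normalize(value: Optional[str]) -> Optional[str]:
    """Lowercase and collapse whitespace; a missing prediction stays None."""
    if value is None:
        return None
    return " ".join(str(value).split()).lower()


def score(questions: List[Tuple[Dict[str, str], Dict[str, str]]]) -> Tuple[float, float]:
    """Return (per-field accuracy, strict per-question accuracy) over (gold, predicted) pairs."""
    total_fields = correct_fields = strict_correct = 0
    for gold, pred in questions:
        hits = sum(1 for key, value in gold.items() if normalize(pred.get(key)) == normalize(value))
        total_fields += len(gold)
        correct_fields += hits
        # A question is strictly correct only if every one of its fields matches.
        strict_correct += int(hits == len(gold))
    return correct_fields / total_fields, strict_correct / len(questions)


questions = [
    ({"borrower_name": "Jane Doe", "loan_amount": "250000"},
     {"borrower_name": "jane doe", "loan_amount": "25000"}),
    ({"invoice_date": "2024-01-31"}, {"invoice_date": "2024-01-31"}),
]
per_field, strict = score(questions)
```

In this toy input two of three fields match but only one of two questions is fully correct, which shows why the strict metric is the harsher of the two.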

[CV-10] Mind the Gap: Disentangling Performance Bottlenecks in Video Instance Segmentation

Link: https://arxiv.org/abs/2606.07394
Authors: Danial Hamdi,Fardin Ayar,Mahdi Javanmardi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In Video Instance Segmentation (VIS), classification, segmentation, and tracking objectives are jointly evaluated, but their individual contributions to performance loss remain opaque. We introduce a diagnostic framework that formulates identity and class assignment as an Integer Linear Program (ILP), yielding a model-agnostic oracle that hierarchically isolates each error source. Applied to seven VIS methods spanning online and offline paradigms across YouTube-VIS 2019/2021 and a diagnostic subset of OVIS, our analysis reveals a consistent picture. Tracking instability is a critical bottleneck for online methods, with gaps exceeding 20 AP under heavy occlusion, and grows sharply with video length and instance density. While semantic classification contributes meaningfully on standard benchmarks, its impact becomes negligible where tracking fails most. Although stronger backbones substantially lift default scores, they leave AP tracking gaps largely intact, confirming that temporal fragility is algorithmic rather than purely representational. To complement the oracle, we introduce TrackLens, a visual tool that translates gap magnitude into observable, query-level failure modes. Together, these tools provide a systematic foundation for targeting VIS’s core challenge: robust long-term temporal association.

[CV-11] Mitosis Detection in the Wild: Multi-Tumor and Context-Aware Generalization in the MIDOG 2025 Challenge

Link: https://arxiv.org/abs/2606.07368
Authors: Marc Aubreville,Jonas Ammeling,Sweta Banerjee,Viktoria Weiss,Taryn A. Donovan,Robert Klopfleisch,Jiaqi Lv,Shan E Ahmed Raza,Raphaël Bourgade,Thomas Walter,Yasemin Topuz,Songül Varlı,Charles-Antoine Collins-Fekete,Zhuoyan Shen,Navya Sri Kelam,Nitin Singhal,Christian Marzahl,Brian Napora,Tengyou Xu,Hongyan Gu,Mario Vento,Gennaro Percannella,Norbert Ropiak,Izabela Wasiak,Jie Xiao,Shaojun Liu,Seungho Choe,April Khademi,Vidushi Walia,Sujatha Kotte,Andrew Broad,Alex Wright,Guillaume Balezo,Esha Sadia Nasir,Mostafa Jahanifar,Yosuke Yamagishi,Shouhei Hanaoka,Mattia Sarno,Francesco Tortorella,Biwen Meng,Jingxin Liu,Sara Krauss,Daniel Hieber,Lavish Ramchandani,Dev Kumar Das,Mieko Ochi,Yuan Bae,Piotr Giedziun,Mateusz Maniewski,Vangala Govindakrishnan Saipradeep,Naveen Sivadasan,Leire Benito-Del-Valle,Adrian Galdran,Kaustubh Atey,Sameer Anand Jha,Adinath Dukre,Imran Razzak,Maxime W. Lafarge,Viktor H. Koelzer,Nils Porsche,Nikolas Stathonikos,Mitko Veta,Dominik Hirling,Zsanett Zsófia Iván,Peter Horvath,Katharina Breininger,Christof A. Bertram
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Automated mitosis detection is a well-established task in computational pathology. While previous benchmarks focused on scanner-induced domain shift, clinical “real-world” application requires models to be robust across the vast variance to be expected in the histological landscape. The MItosis DOmain Generalization (MIDOG) 2025 challenge was designed to evaluate algorithmic performance across unprecedented biological and contextual diversity. We curated a test dataset of 365 cases, encompassing 12 distinct human, canine and feline tumor types, digitized across multiple scanning platforms. Moving beyond hand-selected hotspots, the challenge required detection also in random tissue areas (representative of the whole slide detection situation) and challenging areas (areas rich in hard negatives). In the second track, we introduced the classification of atypical mitotic figures (AMFs). There were 18 teams submitting to the detection track, with F1 scores ranging up to 0.740. In the AMF detection track, we had 21 submissions with balanced accuracy values up to 0.908. Our analysis reveals that while most models perform reliably in traditional hotspots, significant performance degradation occurs in challenging ROIs, where false positive rates tripled. Furthermore, performance varied significantly across the 12 tumor types, highlighting “blind spots” in current state-of-the-art architectures when encountering rare or highly pleomorphic malignancies. Moreover, we evaluated the effectiveness of ensembling and found mean increases of 1.5 and 1.3 percentage points in F1 score and balanced accuracy, respectively. In contrast, test-time augmentation (TTA) showed no relevant improvement. MIDOG 2025 demonstrates that “in the wild” mitosis detection remains a significant hurdle. The transition from hotspot-only evaluation to a multi-contextual framework provides a more realistic proxy for clinical reliability.
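
For readers less familiar with the two metrics reported above, the following sketch computes them from raw confusion counts using the standard textbook definitions. The counts are made-up illustrative numbers, not challenge data, and the function names are assumptions.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    return 2 * tp / (2 * tp + fp + fn)


def balanced_accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Mean of sensitivity (TP rate) and specificity (TN rate) for a binary task."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Illustrative detection counts: 70 hits, 20 false alarms, 30 misses.
detection_f1 = f1_score(tp=70, fp=20, fn=30)

# Illustrative atypical-vs-normal classification counts.
amf_bal_acc = balanced_accuracy(tp=90, fn=10, tn=80, fp=20)
```

Because F1 has no true-negative term, a jump in false positives in regions rich in hard negatives lowers it directly, which is consistent with the degradation the abstract reports for challenging ROIs.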

[CV-12] Dash2Sim: Closed-Loop Driving Simulation from in-the-wild Dashcam Videos

Link: https://arxiv.org/abs/2606.07366
Authors: Anurag Ghosh,Francesco Pittaluga,Khiem Vuong,Angela Chen,Juan Alvarez-Padilla,Manmohan Chandraker,Srinivasa Narasimhan
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments:

Abstract:Self-driving simulations typically rely on data collected in a small number of cities or on hand-authored synthetic scenarios. Dashcam videos cover a far broader range of locations and situations, including rare or long-tailed scenarios. They are considered less usable for simulation because it is difficult to recover accurate 4D scenes from monocular in-the-wild videos. Work zones are one such class of long-tailed situations that dashcams capture. We present Dash2Sim, a framework that turns in-the-wild monocular dashcam videos into metric, geo-referenced 4D driving logs compatible with existing simulators, and verifies each one against an independently maintained map without annotations. We apply Dash2Sim to a large video corpus to create the ROADWork4D benchmark dataset, which spans 4,244 scenes with 2.7M 3D objects across 17 cities. On a verified subset ROADWork4D-CL (2,201 scenes), we study privileged closed-loop planners and find that work zone scenarios are difficult: while rule-based and hybrid planners generalize better than learning-based ones, all fall short, failing to make the lane changes that temporary work zone channels require. Beyond planning, dense depth recovered by Dash2Sim improves novel-view synthesis quality by up to 19% on perceptual metrics, suggesting its potential to provide rich conditioning for closed-loop sensor simulation from monocular videos.

[CV-13] Spatial-Temporal Decoupled Adapter for Micro-gesture Online Recognition IJCAI2026

Link: https://arxiv.org/abs/2606.07355
Authors: Xucheng Shen,Kun Li,Fei Wang,Wei Qian,Jin Jiang,Dan Guo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report. 1st Place in Micro-gesture Online Recognition in 4th MiGA at IJCAI 2026

Abstract:Micro-gesture online recognition aims to temporally localize and classify subtle gestures in untrimmed videos. Owing to their extremely short duration, low motion amplitude, and ambiguous visual cues, capturing discriminative spatiotemporal representations remains highly challenging. Existing parameter-efficient adapters typically employ a single branch to model spatial and temporal cues jointly, which may fail to capture the fine-grained patterns of micro-gestures. To address this limitation, we propose a Spatial-Temporal Decoupled Adapter that decomposes video adaptation into independent temporal and spatial branches via lightweight depthwise convolutions. In addition, to address the long-tail distribution problem in the benchmark dataset, we introduce Adaptive Soft Balanced Augmentation, which dynamically allocates augmentation intensity based on class rarity and learning difficulty, without manual thresholds. Our method achieves an F1 score of 0.43808, ranking 1st in Track 2 of the 4th EI-MiGA-IJCAI Challenge.

[CV-14] VeriDrive: Verifiable Counterfactual Supervision for Cost-Efficient Vision-Language Planning

Link: https://arxiv.org/abs/2606.07338
Authors: Zikai Zhang,Hubert P. H. Shum,Toby P. Breckon
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision-language driving models increasingly use reasoning supervision to bridge perception, prediction, and planning, but existing driving rationales are often free-form and expensive to generate with frontier models. We present VeriDrive, a framework for constructing planning-oriented, verifiable counterfactual supervision. VeriDrive converts driving reasoning into a structured Perception-Evaluation-Revision chain that grounds key objects in future motion, evaluates alternative ego trajectories with rule-checkable evidence, revises risky intent toward expert behavior, and produces final planning targets. To scale data construction, VeriDrive combines local generation with validator-guided selective correction, escalating only invalid or difficult samples. We build the VeriDrive dataset on nuScenes and train under the Omni-Q protocol. Controlled open-loop experiments show that VeriDrive improves L2, Collision, and Intersection over OmniDrive while reducing logged token usage, generation time, and actual paid LLM/VLM cost. These results show that auditable intermediate fields and structured revision targets can improve vision-language planning supervision under realistic annotation budgets. Code, prompts, and validator scripts will be released after the review process.

[CV-15] Varifold Moment Invariants for Sustainable and Explainable Contour Feature Extraction

Link: https://arxiv.org/abs/2606.07333
Authors: G. Longari,J.-C. Alvarez Paiva,A.B. Tumpach
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 29 pages, 12 figures

Abstract:We introduce Varifold Moment Invariants (VMI) as a unifying framework for many previously introduced Moment Invariants. These invariants are deeply related to other contour features that are invariant under translations and rotations, such as the Extended Gaussian Image, Elliptic Fourier Descriptors, or Shape Distributions. The advantage of the varifold approach to moments lies in its ability to combine the geometry of the region, its boundary, and the family of lines tangent to it, in order to create a substantial number of invariant features with high discriminating power and clear geometric meaning. By coupling our VMI feature extraction with the lightweight classifiers Random Forest or Multi-Layer Perceptron, we outperform state-of-the-art contour-based approaches while drastically decreasing the computational cost, to the point of allowing our algorithm to run on lightweight devices. We tested our approach on classification tasks across a large number of widely used datasets of various types (leaves, objects, cells) and achieved high accuracy with a small number of geometrically interpretable features.

[CV-16] AnchorWorld: Embodied Egocentric World Simulation with View-based Evolution Customization

Link: https://arxiv.org/abs/2606.07326
Authors: Yu Li,Menghan Xia,Gongye Liu,Xintao Wang,Conglang Zhang,Lei Ke,Yuxuan Lin,Ruihang Chu,Pengfei Wan,Kun Gai,Yujiu Yang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Despite being a pivotal frontier, interactive world modeling remains underexplored in terms of the versatile controllability required by practical scenarios. To bridge this gap, we present AnchorWorld, a framework that advances egocentric simulation through enhanced interaction integrity and a flexible mechanism for world customization. First, we utilize 3D human motion as the primary interaction modality. To complement the out-of-view or truncated body parts in egocentric views, we introduce an auxiliary training supervision that incorporates exogenous viewpoints decoupled from the agent’s first-person sensorium. It allows the model to observe the agent’s full-body positioning relative to the environment, facilitating a more robust spatial grounding of human-world interactions. Furthermore, we propose a simple yet effective mechanism for customizing self-evolving worlds. This is achieved by defining anchor views within a unified world coordinate system, coupled with textual descriptions dictating the dynamic evolution of local scenes. Experimental results show that AnchorWorld significantly outperforms state-of-the-art baselines, while ablation studies validate the effectiveness of our key designs. Notably, our customization scheme exhibits promising spatio-temporal geometric consistency and adheres strictly to the prescribed evolutionary dynamics.

[CV-17] CULTURESCORE: Evaluating Cultural Faithfulness in Video Generation Models

Link: https://arxiv.org/abs/2606.07311
Authors: Anku Rani,Wei Dai,Shravan Nayak,Pattie Maes,Mahdi M. Kalayeh,Paul Pu Liang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:As video generation models like Veo 3.1 and LTX-2 advance, their ability to accurately represent diverse global cultures remains a critical yet understudied frontier. Current metrics, such as VideoScore, only measure visual quality but offer no mechanism for assessing cultural faithfulness. Consequently, a model that replaces a Namaste with a handshake receives the same score as one that generates the gesture correctly. We propose CultureScore, a compositional evaluation framework that decomposes cultural faithfulness into three granular dimensions: Identity (who is represented), Context (culturally localized background), and Behavior (normative gestures and interactions). We operationalize this framework through an evaluation suite spanning 10 countries, yielding 6,180 generated videos across three state-of-the-art models. Our evaluation reveals that no current model achieves culturally faithful video generation: the best-performing model reaches only 56.8% overall CultureScore, with Behavior the most challenging dimension, which remains below 52% across all models. Furthermore, human preference rankings align directionally with CultureScore but are inverted relative to VideoScore; the highest-scoring model on visual quality was ranked last by annotators, underscoring that cultural faithfulness is an essential criterion for equitable video generation.

[CV-18] Closed-Form Spectral Regularization for Multi-Task Model Merging

Link: https://arxiv.org/abs/2606.07289
Authors: Yongxian Wei,Runxi Cheng,Xingxuan Zhang,Li Shen,Chun Yuan,Peng Cui,Dacheng Tao
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Model merging combines several independently fine-tuned experts into a single multi-task model without any training data, reducing the storage, serving, and decentralized-development costs of large foundation models. State-of-the-art merging methods formulate merging as a layer-wise quadratic interference minimization problem. Although this problem admits an exact closed-form pseudoinverse solution, that solution underperforms hundreds of iterations of gradient descent in practice. The iterative loop dominates the cost of the pipeline, yet its effectiveness has remained unexplained. We revisit this regime and show that the iterative solver does not primarily act as an optimizer; rather, it serves as an implicit spectral regularizer for an ill-posed normal equation, where small-eigenvalue directions of the per-layer interference operator amplify proxy noise. Building on this finding, we formalize multi-task model merging as a noisy linear inverse problem and propose a spectral filtering estimator parameterized by a per-direction filter. We instantiate this estimator with SWUDI, a closed-form method that combines a soft exponential filter, which matches the gradient-flow trajectory of iterative descent, with a hard top-K truncation that suppresses noise-amplifying small-eigenvalue directions. Furthermore, we propose SWUDI-A, an adaptive variant that replaces the global rank hyperparameter with per-layer rank rules, further improving robustness across architectures. Both variants share a single symmetric eigendecomposition per linear layer and require no training data or optimizer state. Across four general benchmarks and a multimodal merging benchmark spanning VQA, Geometry, Chart, OCR, Grounding, and modality merging, our proposed spectral solvers match or outperform state-of-the-art merging methods. Crucially, they reduce wall-clock time by 28-72x and peak GPU memory by up to 50%.
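
The spectral filtering idea can be sketched at the level of eigenvalues alone. For a symmetric positive semidefinite system, the pseudoinverse multiplies each eigen-direction by 1/λ, whereas gradient flow started from zero and run for time t multiplies it by (1 - exp(-tλ))/λ, which never exceeds t. The sketch below combines that soft factor with a hard top-K cut. The function name, the parameter `t`, and the example eigenvalues are assumptions for illustration; the exact filter used by SWUDI may differ.

```python
import math
from typing import List


def filter_factors(eigenvalues: List[float], t: float, top_k: int) -> List[float]:
    """Per-direction multipliers used in place of the pseudoinverse factor 1/lambda."""
    # Indices of the top_k largest eigenvalues (hard truncation).
    order = sorted(range(len(eigenvalues)), key=lambda i: eigenvalues[i], reverse=True)
    keep = set(order[:top_k])
    factors = []
    for i, lam in enumerate(eigenvalues):
        if i in keep and lam > 0:
            # Soft exponential filter: the gradient-flow solution at time t.
            factors.append((1.0 - math.exp(-t * lam)) / lam)
        else:
            # Truncated direction: contributes nothing to the estimate.
            factors.append(0.0)
    return factors


eigenvalues = [10.0, 1.0, 1e-6]
pinv_factors = [1.0 / lam for lam in eigenvalues]  # the last entry is 1e6
filtered = filter_factors(eigenvalues, t=1.0, top_k=2)
```

The pseudoinverse amplifies the smallest-eigenvalue direction by a factor of one million, while the filtered estimator zeroes it out, which is the noise-suppression behavior the abstract attributes to the iterative solver.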

[CV-19] ExMesh: EXplicit Mesh Reconstruction with Topology Adaptation CVPR2026

Link: https://arxiv.org/abs/2606.07288
Authors: Chuanjin Fan,Lifan Wu,Wenjie Chang,Hanzhi Chang,Wenfei Yang,Tianzhu Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Accepted at the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026 (CVPR 2026)

Abstract:Reconstructing surface meshes from multi-view images has remained a core challenge in recent years. Most existing methods, whether implicit or explicit, depend on intermediate representations and post-processing steps like Marching Cubes or TSDF fusion, often resulting in artifacts and fragmented geometry. Directly optimizing explicit meshes is a promising approach. However, it presents two critical challenges. The first is how to adaptively refine mesh topology to capture detail without introducing degenerate faces. The second is how to maintain consistent UV coordinates for high-fidelity texturing as the mesh structure evolves. To overcome these, we propose ExMesh, a novel framework that directly optimizes explicit meshes by integrating differentiable optimization with discrete topology updates. Specifically, we introduce an adaptive vertex splitting and merging strategy, along with real-time UV maintenance, to enable coarse-to-fine optimization while preserving geometric integrity. To our knowledge, ExMesh is the first framework to seamlessly integrate discrete topology operations into a continuous differentiable optimization pipeline. Extensive experiments demonstrate that ExMesh achieves a balance among accuracy, computational efficiency, and mesh conciseness.

[CV-20] Geometric-Aware Hypergraph Reasoning for Novel Class Discovery in Point Cloud Segmentation CVPR

Link: https://arxiv.org/abs/2606.07280
Authors: Zihao Zhang,Aming Wu,Yang Li,Yahong Han,Jialie Shen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

Abstract:Novel class discovery in point cloud segmentation aims to transfer knowledge from known classes to automatically identify and segment unlabeled novel classes in point clouds. Existing methods mainly rely on pairwise associations for class assignment and novel class reasoning, which limits their ability to capture complex relationships among known and novel classes and may lead to inaccurate semantic segmentation. To address this issue, we introduce a hypergraph-based framework that models high-order associations among classes and enables collaborative reasoning from known classes to novel classes beyond traditional pairwise relations. Moreover, existing methods tend to focus on semantic feature extraction while paying insufficient attention to geometric information in point clouds. To better exploit spatial structure, we propose Geometric-Aware Prototypes to enhance the representation of class-level geometric cues. By propagating geometric information through hyperedges, the proposed method improves the understanding of spatial distributions across classes and leads to more accurate segmentation. Experiments on the SemanticKITTI and SemanticPOSS datasets demonstrate the effectiveness and superiority of our method.

[CV-21] Reconstructing Multi-Decadal Forest Disturbances: A Spatio-Temporal Transformer Approach

Link: https://arxiv.org/abs/2606.07249
Authors: Linus Scheibenreif,Anton Raichuk,Maxim Neumann
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Accurate monitoring of forest disturbances is essential for understanding carbon dynamics and land management, yet traditional approaches typically rely on pixel-wise analysis of satellite time-series, ignoring spatial context. We present a deep learning framework that maps 38 years (1984-2022) of forest disturbance across the contiguous United States by modeling temporal trajectories and spatial neighborhoods simultaneously. By leveraging a vision transformer architecture, our approach effectively filters noise from weak supervision signals to produce spatially coherent disturbance maps. We perform exhaustive evaluations across multiple satellites (Landsat, Sentinel-1, Sentinel-2) and temporal windows (38 years and the more recent 6 years), validating performance against a novel, manually annotated validation dataset (n=300) and an independent fire perimeter dataset (n=706). The results highlight the complexity of the task: while our spatio-temporal model demonstrates high precision (up to 98.2% for ±1 year detection on MTBS and up to 71.3% on the CONUS validation datasets, with F1-scores up to 75.8% and 47.3%, respectively) and effectively reduces spatial artifacts, it exhibits performance trade-offs across different disturbance regimes compared to pixel-wise baselines. Our method offers a promising foundation for consistent forest monitoring.

[CV-22] Beyond Waypoints: A Trajectory-Centric Waypointing Paradigm for Vision-Language Navigation

Link: https://arxiv.org/abs/2606.07244
Authors: Haoxiang Shi,Xiang Deng,Haoyu Zhang,Qiaohui Chu,Yaowei Wang,Liqiang Nie
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision-Language Navigation in Continuous Environments (VLN-CE) requires agents to follow natural-language instructions while navigating in real-world-like environments. Most VLN-CE approaches adopt a three-stage framework: a waypoint predictor proposes navigable waypoints, and a navigator selects the best waypoint, with a low-level controller executing the movement to it. However, this decoupled paradigm often leads to unreachable waypoints or inconsistencies between planning and control. In this work, instead of predicting isolated waypoints, we introduce a novel paradigm called Trajectory Waypoint, which grounds each candidate waypoint in an executable trajectory. To realize this, we design a Trajectory Waypoint Predictor formulated as a TSDF-guided diffusion policy, which steers trajectory generation away from obstacles, inherently ensuring the reachability of the predicted waypoints. We further propose a trajectory-enhanced navigator that injects the associated trajectory as additional information for planning, enabling strict consistency between high-level semantic decisions and low-level execution. Extensive experiments on the VLN-CE benchmark show that our Trajectory Waypoint paradigm achieves superior performance over the baselines.

[CV-23] Does Appearance Help? A Systematic Study of Image-Based Re-Identification in Online 3D Multi-Pedestrian Tracking

Link: https://arxiv.org/abs/2606.07233
Authors: Eduardo Borges,Luís Garrote,Urbano J. Nunes
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Accepted for publication at the 35th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN 2026)

Abstract:LiDAR-based 3D Multi-Object Tracking (MOT) typically relies solely on geometric information, which is often insufficient to distinguish between targets during prolonged occlusions or in crowded human-populated environments. While integrating RGB-based Re-Identification (ReID) offers a theoretical solution for preserving identity context, existing approaches often rely on computationally expensive parallel detectors that hinder real-time robot responsiveness. This work presents a systematic study of image-based ReID in online 3D MOT, utilizing a lightweight projection-based framework to decouple geometric and appearance modeling for mobile robots. A comprehensive analysis of feature extraction architectures is conducted, employing lightweight CNNs and Vision Transformers, and evaluating various multi-modal data association strategies to balance computational latency with robust tracking. Experiments on the Pedestrian class of the KITTI dataset reveal that naive linear fusion of appearance and motion costs degrades performance due to visual noise. Conversely, a cascaded matching strategy successfully recovers occluded tracks without compromising overall precision, effectively preventing identity switches to maintain human-robot interaction continuity. We show that lightweight architectures can offer an optimal trade-off between the low latency required for safe navigation and the discriminative power needed for social awareness.

[CV-24] DualGate-Net: A Prior-Gated Dual-Encoder Framework for Histopathology Cell Detection

Link: https://arxiv.org/abs/2606.07222
Authors: Bahman Jafari Tabaghsar,Son Tran,K. Devaraja,Atul Sajjanhar
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 15 pages, 4 figures

Abstract:Cell detection in histopathology images strongly depends on surrounding tissue context, where visually similar cells may belong to different classes under different microenvironments. Recent tissue-aware methods incorporate contextual priors, but often rely on static fusion strategies that may propagate noisy information. In this work, we propose DualGate-Net, a prior-aware dual-encoder framework that combines a ConvNeXtV2-based local encoder and a SegFormer-based global encoder through a learnable prior-gated fusion mechanism. The proposed module adaptively regulates the influence of tissue priors across spatial locations, while an auxiliary foreground reconstruction branch preserves high-frequency cellular structures during training. In addition, auxiliary cellness-guided cues are incorporated to further improve localization robustness. Experiments on the OCELOT benchmark demonstrate consistent improvements, achieving macro F1-scores of 0.7722 on the validation set and 0.7345 on the test set, highlighting the effectiveness of adaptive prior integration for robust histopathology cell detection.

[CV-25] Robotic Policy Adaptation via Weight-Space Meta-Learning

Link: https://arxiv.org/abs/2606.07217
Authors: Christian Bianchi,Siamak Yousefi,Alessio Sampieri,Andrea Roberti,Luca Rigazio,Fabio Galasso,Luca Franco
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Vision-Language-Action (VLA) models are emerging as a promising paradigm for robotic manipulation, enabling general-purpose policies trained from large corpora of demonstrations and action labels. However, adapting these models to new tasks still typically requires task-specific demonstrations, action annotations, and additional fine-tuning, making deployment costly and difficult to scale. We propose WIZARD, a weight-space meta-learning framework that sidesteps task-specific fine-tuning by generating task-specific LoRA parameters for a frozen VLA policy. Given only a language instruction and a short demonstration video, WIZARD predicts the corresponding adaptation weights in a single forward pass, without target-task action labels or test-time optimization. During meta-training, WIZARD learns to map task evidence directly to expert LoRA updates, capturing relationships between tasks in weight space. Experiments on LIBERO show that WIZARD improves performance by up to ~2x on unseen dataset collections and up to ~14x on unseen tasks. On a Franka Emika Panda, WIZARD consistently improves over a real-domain adapted baseline, showing that generated adapters provide task-level specialization beyond simulation.
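
To make "generating LoRA parameters for a frozen policy" concrete, the sketch below shows only how a low-rank update is applied to a frozen weight matrix in standard LoRA: the adapted weight is W + (alpha / r) * B A. How WIZARD predicts the factors B and A from an instruction and a video is not shown, and the `alpha / r` scaling, the helper names, and the toy matrices are assumptions taken from the common LoRA convention rather than from the paper.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of an (n x k) and a (k x m) matrix."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def apply_lora(w: Matrix, lora_b: Matrix, lora_a: Matrix, alpha: float) -> Matrix:
    """Return W + (alpha / r) * B @ A without modifying the frozen weight W."""
    rank = len(lora_a)  # r = number of rows of A
    delta = matmul(lora_b, lora_a)
    scale = alpha / rank
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))] for i in range(len(w))]


frozen_w = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[1.0], [2.0]]   # 2 x 1, hypothetical generated factor
lora_a = [[0.5, -0.5]]    # 1 x 2, hypothetical generated factor
adapted = apply_lora(frozen_w, lora_b, lora_a, alpha=1.0)
```

Because only the small factors change per task, a hypernetwork-style generator needs to output far fewer numbers than a full weight matrix, which is what makes single-pass adapter prediction plausible.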

[CV-26] AdaTok: Self-Budgeting Image Tokenization with Quality-Preserving Dynamic Tokens

Link: https://arxiv.org/abs/2606.07185
Authors: Xiaocheng Lu, Yuxi Chen, Jie Zhang, Jian Liu, Jingcai Guo, Fangqi Zhu, Tao Han, Song Guo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint; 11 pages, 4 figures

Abstract:Image tokenizers, from 2D grids to recent 1D sequences, typically encode every image with the same fixed number of tokens. Yet visual complexity is highly heterogeneous, so a uniform budget overspends on simple inputs and underserves complex ones. Existing elastic tokenizers expose variable-length reconstructions, but often leave token length as a deployment-time operating point, a search target, or an external prediction rather than an output of the tokenizer itself. In this work, we ask whether a discrete visual tokenizer can budget itself in one pass. Our central finding is that actionable elasticity requires a representation–allocation co-design: prefixes must remain decodable across budgets, and the tokenizer must learn which prefix each image needs. We propose AdaTok, a self-budgeting discrete 1D tokenizer. AdaTok combines Prioritized Representation Learning, which orders tokens with nested tail masking and resolves budget-dependent semantic shift through Multi-Head LoRA decoder heads, with Adaptive Token Allocation, which trains a lightweight deterministic-group GRPO policy over candidate budgets. Dynamic Pareto Weighting balances fidelity and efficiency during policy training without manual trade-off sweeps. On ImageNet-1K, AdaTok-Full reaches rFID 1.31 at 256 tokens, while AdaTok-Adaptive attains rFID 1.50 using only ~118 tokens on average, outperforming discrete 1D baselines at comparable budgets. In autoregressive image generation, the shorter adaptive representation yields ~2.1x throughput over a fixed 256-token decode, suggesting that visual token count can be learned as a content-conditioned output rather than set as a fixed hyperparameter.
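
The idea of keeping prefixes decodable across budgets can be pictured with a small tail-masking sketch: keep the first `budget` tokens and mask the rest, sampling a different budget per training step. `MASK_ID`, `tail_mask`, and `sample_nested_view` are hypothetical names, and the uniform budget sampling is an assumption; this is not AdaTok's actual training recipe.

```python
import random

MASK_ID = -1  # hypothetical placeholder id for masked-out tail tokens

def tail_mask(tokens, budget):
    """Keep the first `budget` tokens and replace the tail with MASK_ID."""
    budget = max(0, min(budget, len(tokens)))
    return tokens[:budget] + [MASK_ID] * (len(tokens) - budget)

def sample_nested_view(tokens, candidate_budgets, rng):
    """Pick one budget at random, as a training step might (assumption)."""
    budget = rng.choice(candidate_budgets)
    return budget, tail_mask(tokens, budget)

tokens = [11, 42, 7, 93, 5, 68, 20, 31]
masked = tail_mask(tokens, 3)
budget, view = sample_nested_view(tokens, [2, 4, 8], random.Random(0))
```

Since every masked view shares the same leading tokens, a decoder trained on such views is pushed to place the most important information first.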

[CV-27] OPTIMUS-Prime: Minimal and Sufficient Concept Explanations for Deep Vision Models

Link: https://arxiv.org/abs/2606.07180
Authors: Arthur Hoarau, Chenrui Zhu, Vu Linh Nguyen
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:The growing demand for transparency in automated decision-making has propelled eXplainable Artificial Intelligence (XAI) to the forefront of machine learning research. In computer vision, however, existing explanation methods often prioritize end-user accessibility at the expense of formal guarantees, leaving a critical gap between practical utility and theoretical rigor. In this paper, we address this gap by introducing OPTIMUS, a novel framework for generating concept-based visual explanations for deep classification models. OPTIMUS explanations take the form of visual heatmaps that not only remain interpretable to end users, but are grounded in the well-established theory of prime implicants, providing formal guarantees that have been largely absent from existing saliency-based methods. Specifically, OPTIMUS explanations satisfy two desirable properties: sufficiency, ensuring that the highlighted concepts provably guarantee the classifier’s prediction, and minimality, ensuring that no strict subset of those concepts retains this guarantee. Together, these properties yield explanations that are both logically tight and visually coherent. We validate our approach on a visual classification benchmark, demonstrating that OPTIMUS heatmaps naturally and faithfully surface the decision-relevant concepts underlying model predictions.
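
Sufficiency and minimality together describe a subset-minimal sufficient set, which can be found with a standard deletion procedure when a sufficiency check is available. The sketch below uses a toy, hand-written oracle (`toy_oracle`) and assumes the check is monotone (adding concepts never breaks sufficiency); the way OPTIMUS verifies sufficiency for a real classifier is not specified in the abstract.

```python
def minimal_sufficient_subset(concepts, is_sufficient):
    """Greedy deletion: drop each concept whose removal keeps sufficiency.

    Assumes `is_sufficient` is monotone (adding concepts never breaks it).
    """
    kept = list(concepts)
    if not is_sufficient(set(kept)):
        raise ValueError("the full concept set must be sufficient")
    for concept in list(concepts):
        trial = [c for c in kept if c != concept]
        if is_sufficient(set(trial)):
            kept = trial
    return kept

# Toy oracle (hypothetical): the prediction is guaranteed whenever both
# "stripes" and "four_legs" are highlighted.
def toy_oracle(subset):
    return {"stripes", "four_legs"} <= subset

explanation = minimal_sufficient_subset(
    ["grass", "stripes", "sky", "four_legs"], toy_oracle
)
```

Under the monotonicity assumption, a single pass is enough: any concept that survives was necessary for a superset of the final result, so it is necessary for the final result as well.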

[CV-28] EvoGS: Constructing Continuous-Layered Gaussian Splatting with Evolution Tree for Scalable 3D Streaming

Link: https://arxiv.org/abs/2606.07179
Authors: Yuang Shi, Simone Gasparini, Géraldine Morin, Wei Tsang Ooi
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Image and Video Processing (eess.IV)
Comments: Project page: this https URL

Abstract:Streaming 3D Gaussian Splatting requires highly scalable, progressive representations. Existing progressive methods rely on discrete layering, accumulating separate splat sets for each level of detail. This structural independence between layers inherently leads to error accumulation, severe splat redundancy, and uncontrolled quality transitions. We propose EvoGS, the first continuous-layering representation. Organized as an Evolution Tree, EvoGS generates finer details via an explicit, wavelet-inspired parent-child refinement. This empowers child nodes to structurally correct ancestral errors, yielding inherently sparse and highly compressible inter-layer signals. Extensive experiments show EvoGS reduces splat redundancy from over 65% to under 25%. Compared to state-of-the-art baselines, it reduces transmission payload and GPU VRAM footprint by up to 2.4x and 5.5x, respectively, and achieves smooth quality transitions optimal for real-time adaptive streaming. Project page: this https URL

[CV-29] Seeing Without Exposing: Adaptive Privacy Control for Open-World Context-Hungry MLLMs

Link: https://arxiv.org/abs/2606.07175
Authors: Siyuan Xu, Yibing Liu, Peilin Chen, Yung-Hui Li, Shiqi Wang, Sam Kwong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Multimodal large language models (MLLMs) have raised new privacy challenges. On the data side, user-provided inputs often include unpredictable sensitive information; while on the downstream task side, model reasoning depends on rich visual context that may itself be privacy-sensitive. Existing privacy protection methods, however, rely on predefined sensitive categories and fixed obfuscation strategies, struggling to tackle such challenges in MLLMs. To address this dilemma, we propose Anchored Privacy Drifting (APD), a training-free method that drifts privacy-sensitive elements toward semantically equivalent alternatives while anchoring contextual cues to the source image. To systematically evaluate this dual objective of privacy protection and contextual preservation, we introduce AdaptShield, a comprehensive benchmark covering 22 privacy categories, which combines conventional privacy metrics with MLLM-based assessments of contextual utility. Extensive experiments show that our method achieves balanced improvements in both privacy sanitization and content retention, with average gains of 10.4% on textual categories and 8.5% under MLLM-based evaluation across four MLLM series, i.e., Qwen2.5, Qwen3, InternVL3, and InternVL3.5.

[CV-30] When Recovery Matters: The Blind Spot of Surrogate Privacy in MLLM Editing

Link: https://arxiv.org/abs/2606.07171
Authors: Siyuan Xu, Yibing Liu, Peilin Chen, Yung-Hui LI, Shiqi Wang, Sam Kwong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Multimodal Large Language Models (MLLMs) enable flexible instruction-driven image editing, but privacy risks arise when user images expose diverse and user-specific private content. Canonical privacy protection strategies typically substitute sensitive regions with surrogate content before cloud editing. Yet, the resulting output is often an edited surrogate rather than the desired edited source image, so local recovery is neglected in both design and evaluation. To this end, we introduce SPPE (Surrogate-based Privacy-Preserving Editing), the first recovery-oriented benchmark covering 36 fine-grained privacy categories and 65 editing instructions. It defines two complementary tasks: 1) editability assessment, which estimates before cloud interaction whether a surrogate can induce an edit consistent with the original image; and 2) surrogate-to-source edit recovery, which evaluates whether the edited surrogate can be transferred back to the private source with the edit effect preserved. We address each task with a dedicated method: ERMA predicts surrogate editability through instruction-aware multimodal relation modeling, while C2E-S2SER performs cycle-consistent recovery by using the surrogate editing pair as visual edit evidence and the source image as a source-preserving anchor. Experiments on SPPE and InstructPix2Pix show consistent improvements on both tasks. For editability assessment, ERMA improves over the best-performing baselines by 13.9% in SRCC and 12.3% in PLCC. For surrogate-to-source edit recovery, C2E-S2SER outperforms SOER across all 8 source integrity and edit consistency metrics on SPPE.

[CV-31] TraRA: Trajectory-level Recognition Aggregation for Video Text Spotting in Urban Surveillance

Link: https://arxiv.org/abs/2606.07161
Authors: Duc Tri Tran, Trung Thanh Nguyen, Vijay John, Phi Le Nguyen, Yasutomo Kawanishi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 22nd IEEE International Conference on Advanced Visual and Signal-Based Systems

Abstract:Video Text Spotting (VTS) is essential for urban surveillance and intelligent transportation systems, enabling automated reading of street signs, vehicle markings, and scene text in video streams. However, reliable recognition remains challenging due to dynamic video factors common in surveillance scenarios, including motion blur, occlusion, and scale variation, which degrade frame-level recognition. Existing VTS methods typically perform recognition independently on each frame, leading to inconsistent and inaccurate results across sequences. To address these limitations, we propose TraRA (Trajectory-level Recognition Aggregation for VTS), a plug-and-play method that performs trajectory-level text recognition by leveraging temporal and multimodal consistency. TraRA integrates two key modules: (1) the Temporal Clustering and (2) the Vision-Language Aggregation. The former refines noisy trajectories by grouping temporally and visually coherent text instances, while the latter employs a Low-Rank Adaptation-enhanced Vision-Language model to fuse visual cues with linguistic context across frames. By aggregating information over entire text trajectories, TraRA achieves robust text recognition even under challenging surveillance conditions. Extensive experiments on four public benchmarks, including road and urban scene datasets (RoadText, BOVText, ArTVideo, and ICDAR15), demonstrate that TraRA consistently improves tracking and recognition performance over state-of-the-art VTS methods. The source code is available at this https URL.

[CV-32] Consistent-Inversion: Reverse Consistency Guidance for Structure-Preserving Visual Editing

Link: https://arxiv.org/abs/2606.07145
Authors: Xiaocheng Lu, Jingcai Guo, Song Guo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to IEEE Transactions on Multimedia; 10 pages, 4 figures

Abstract:Text-guided diffusion models have become effective tools for real-image visual editing, where the edited image must follow a target instruction while preserving editing-irrelevant structure. Most training-free editors rely on inversion: a source image is mapped to a noisy latent trajectory and the terminal latent is reused for target-prompt denoising. This reuse is useful for preservation, but it also couples source reconstruction and target editing. The resulting trajectory mismatch may either damage background/layout details or over-constrain the intended edit. This paper presents Consistent-Inversion, a training-free reverse consistency guidance framework for structure-preserving visual editing. Instead of treating the inverted source latent as a fixed initialization, Consistent-Inversion checks whether an intermediate target trajectory can be reversed toward the source inversion trajectory under the source prompt. To make this check well-defined, we construct an auxiliary target-side noise representation, perform source-guided reverse denoising, and use the resulting reverse consistency discrepancy as a correction signal for selected early target denoising steps. The method does not update model parameters, is compatible with inversion-based editors, and introduces only a small inference overhead when applied sparsely. Experiments on PIE-Bench show that Consistent-Inversion improves background and structural fidelity under a unified SD3.5 protocol while maintaining target-prompt alignment, and compatibility experiments further verify the same correction principle on classical Stable-Diffusion inversion pipelines.

[CV-33] Native3D: End-to-End 3D Scene Generation via Unified Mesh-Texture Modeling and Semantic Alignment

Link: https://arxiv.org/abs/2606.07117
Authors: Yibo Liu, Ziwei Zhang, Haozhou Pang, Menghao Li, Lanshan He, Gan Qi
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper presents Native3D, the first end-to-end 3D scene generation framework that completely bypasses 2D intermediate representations. Traditional approaches typically require adapting 3D representations to the 2D domain to leverage pre-trained diffusion models, which inevitably introduces domain adaptation issues including geometric structural distortion and texture detail degradation. To address these limitations, we design a unified mesh-texture joint representation that simultaneously models both geometric structures and texture features through a Transformer-based scene encoder, effectively maintaining spatial relationships and visual consistency among objects within scenes. We further propose the 3D Representation Alignment Loss (3D REPA Loss), which employs an improved contrastive learning mechanism to align multi-level semantic representations in the latent space, significantly enhancing geometric and textural fidelity. Experimental results demonstrate that Native3D outperforms existing methods in both generation quality and editing flexibility, providing a novel solution for 3D scene editing.

[CV-34] 3DMorph: Single-Image-Guided Local 3D Shape Editing and Morphing IJCNN2026

Link: https://arxiv.org/abs/2606.07115
Authors: Tobias Preintner, Yunfei Deng, Phillip Müller, Sebastian Illing, Adrian König, Thomas Bäck, Elena Raponi, Niki van Stein
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Accepted to IJCNN 2026

Abstract:Despite recent progress in 3D generation, intuitive editing of existing shapes remains limited. Unlike images, which benefit from well-established inpainting tools, general 3D objects such as meshes still lack simple and effective methods for local shape editing. Existing approaches are often global, domain-specific, require complex user interaction, or focus on appearance (color and texture) rather than geometry. We introduce 3DMorph, a training-free framework for single-image-guided local 3D shape editing and morphing. Given an edited image showing a desired shape modification, our method automatically localizes the relevant 3D region and transfers 2D modifications to 3D while preserving unmodified areas. 3DMorph also enables intermediate shape generation between the original and edited objects, facilitating design exploration. To benchmark editing quality, we introduce Delta3D, an image-guided local 3D editing benchmark with paired ground-truth edits. Experimental results show that 3DMorph translates intuitive 2D edits into 3D, outperforming state-of-the-art generative and editing methods.

[CV-35] GP-Adapter: Gaussian Process CLIP-Adapter for Few-Shot Out-of-Distribution Detection IJCNN2026

Link: https://arxiv.org/abs/2606.07102
Authors: Taisei Saito, Koretaka Ogata, Takafumi Hiroi
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 8 pages, 6 figures, Accepted at IJCNN 2026

Abstract:We propose GP-Adapter, a training-free framework that augments CLIP (Contrastive Language-Image Pre-training) with Gaussian Process (GP) uncertainty modeling for few-shot classification and out-of-distribution (OOD) detection. While CLIP achieves strong zero-shot recognition, it yields deterministic similarity scores and offers limited uncertainty information, which is critical under distribution shift and data scarcity. GP-Adapter constructs modality-specific, class-wise one-class GPs on top of frozen CLIP embeddings, using an RBF kernel for image features and a linear kernel for text prompts, and fuses their predictive statistics to produce a variance-aware confidence score for OOD detection. The method requires no fine-tuning of the CLIP backbone and relies only on a small K-shot cache and lightweight hyperparameter selection, with memory cost scaling as O(CK^2) for C classes and K shots. Experiments on ImageNet and multiple OOD benchmarks show that GP-Adapter provides competitive few-shot performance and consistently improves OOD detection when combined with prompt-learning baselines, highlighting the complementarity between GP-based uncertainty modeling and prompt learning. Overall, our results suggest that integrating probabilistic inference with large pre-trained vision-language models can improve reliability in low-data and distribution-shifted settings. Code is available at this https URL
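
To make the variance-aware score concrete, the sketch below computes the standard GP predictive variance with an RBF kernel for one class, using a K-shot support set of toy 2-D vectors in place of CLIP embeddings. The length scale, the noise term, and the function names are assumptions for illustration; the fusion with the text-side linear-kernel GP is not shown. Each class stores one K x K Gram matrix, which is where the O(CK^2) memory figure comes from.

```python
import math

def rbf(u, v, length_scale=1.0):
    """RBF kernel between two vectors."""
    sq = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq / (2.0 * length_scale ** 2))

def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def predictive_variance(support, query, noise=1e-3):
    """k(q, q) - k_*^T (K + noise * I)^{-1} k_* for one class's support set."""
    gram = [[rbf(a, b) + (noise if i == j else 0.0)
             for j, b in enumerate(support)] for i, a in enumerate(support)]
    k_star = [rbf(a, query) for a in support]
    alpha = solve(gram, k_star)
    return rbf(query, query) - sum(k * a for k, a in zip(k_star, alpha))

support = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]]   # K = 3 toy embeddings
var_near = predictive_variance(support, [0.05, 0.05])
var_far = predictive_variance(support, [5.0, 5.0])
```

A query close to the support set gets a small variance, while a distant one approaches the prior variance of 1, which is the behaviour an OOD score would exploit.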

[CV-36] LARA: Latent Action Representation Alignment for Vision-Language-Action Models

Link: https://arxiv.org/abs/2606.07100
Authors: Mengya Liu, Baoxiong Jia, Jiangyong Huang, Jingze Zhang, Siyuan Huang
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:Visual-language action (VLA) models enable robots to predict actions directly from observations and language instructions, but their performance depends on large-scale, high-quality data and is limited by the scarcity of real-world robot action datasets. To facilitate VLA model learning with abundant unlabeled human videos, Latent Action Models (LAM) learn latent action representations from visual dynamics to provide additional supervision for VLA learning. However, LAM and VLA are typically trained separately, leaving LAM ungrounded during VLA training and VLA models constrained by frozen LAM representations. To address these issues, we propose Latent Action Representation Alignment (LARA), a plug-and-play framework that jointly optimizes LAM and VLA via representation alignment. This enables reciprocal benefits where LAMs learn with action trajectories to avoid spurious visual changes, while VLAs are regularized by forward dynamics learned within LAMs to reduce hallucinations of functionally ineffective trajectories. We demonstrate LARA's versatility and effectiveness for pre-training, post-training enhancement of pre-trained VLA models, and LAM refinement, achieving an average of ~10%, ~5%, and ~15% improvement across 3 simulation and 1 meticulously designed real-world robotic manipulation benchmarks.

[CV-37] Detecting Temporally Localized Manipulations in Authentic Video Streams

Link: https://arxiv.org/abs/2606.07090
Authors: Okan Umur, Ali Emre Güşlü, Ibrahim Delibasoglu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The rapid advancement of video editing and generative artificial intelligence technologies has made realistic video manipulation increasingly accessible. Although existing datasets have significantly advanced research in deepfake detection, object removal, and video inpainting, they do not adequately model scenarios in which a short manipulated segment is inserted into an otherwise authentic video and the original video continues afterward. In this study, we review representative datasets from the literature, analyze their characteristics, and discuss their limitations with respect to temporally localized realistic manipulation detection. Based on this analysis, we motivate the need for a new dataset specifically designed for authentic videos containing short and highly realistic manipulated intervals. Finally, we evaluate two complementary approaches on our custom-curated test set to establish an initial benchmark for this challenging scenario. The first employs a linear probe on DINOv3 features, assessed under three thresholding strategies. The second leverages DINOv3 features with a consecutive frame similarity-based method to detect temporal manipulation boundaries. Together, these experiments provide an initial benchmark for partially manipulated video detection and highlight the need for content-adaptive thresholding mechanisms. The dataset, code, and supplementary materials are publicly available at this https URL.
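
The second approach, detecting boundaries from the similarity of consecutive frame features, can be sketched as follows. The 2-D vectors stand in for per-frame DINOv3 features, and the fixed `threshold` is an assumption made for this example; the abstract itself points to content-adaptive thresholding as the open need, so a constant cut-off should be read only as the simplest baseline.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def find_boundaries(frame_features, threshold=0.8):
    """Return indices i where frames i-1 and i are less similar than threshold."""
    return [i for i in range(1, len(frame_features))
            if cosine(frame_features[i - 1], frame_features[i]) < threshold]

features = [
    [1.0, 0.0], [0.98, 0.05], [0.99, 0.02],   # authentic frames
    [0.1, 1.0], [0.12, 0.97],                 # inserted segment
    [1.0, 0.03], [0.97, 0.0],                 # authentic frames resume
]
boundaries = find_boundaries(features)
```

In this toy sequence the two similarity drops mark where the inserted segment starts and where the original video continues.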

[CV-38] An Adaptive Data cleaning Framework for Noisy Label Detection

Link: https://arxiv.org/abs/2606.07086
Authors: Chen-Hsuan Fang, Wei-Hsinag Chen, Pin-Hsuan Yu, Jung-Hua Wang, Tsung-Wei Pan
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Deep neural networks (DNNs) excel in computer vision tasks given large annotated datasets. In real-world applications, however, labels are often corrupted by ambiguity, human error, or dynamic environments. Over-parameterized DNNs easily memorize these noisy labels during training, degrading model accuracy and generalization. Existing data-cleaning and sample-selection strategies often rely on manually specified thresholds, prior knowledge of the noise ratio, or a single metric (either learning dynamics or geometric structure), making them unstable in complex data regimes. This paper proposes a self-adaptive data-cleaning framework that integrates local, global, and learning dynamics cues for robust noisy-label detection. Samples are mapped into a unified low-dimensional feature space through a modular feature concatenation paradigm. We provide two instantiations: a 2D metric integrating class-adaptive KNN-based local disagreement with k-means-based global centroid distance, and a 3D multi-metric that additionally incorporates a z-normalized score. Unlike conventional 1D Gaussian Mixture Models applied to a single scalar metric, our framework performs multi-metric clustering on the feature space to adaptively partition samples into clean-dominant and noise-dominant components without requiring manual thresholds or noise priors. Experiments on CIFAR-10, MNIST, and ImageNet-100 with 5% to 40% symmetric label noise show high recall across settings, including near-perfect recall (=98%) on ImageNet-100 at 40% noise. Subsequent training yields accuracy gains across evaluated settings, especially under severe corruption on ImageNet-100. These findings suggest that multi-metric integration provides a threshold-free, practical, and low-tuning strategy for noisy label detection.
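
The 2D metric pairs a local cue with a global one. The sketch below computes a KNN label-disagreement score and a centroid distance for each sample on toy data. Two simplifications are assumptions of this example: the centroid is the plain per-label mean rather than a k-means centroid, and the final step just picks the highest-scoring sample instead of fitting a mixture model over the 2D features as the paper describes.

```python
import math

def dist(u, v):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def knn_disagreement(points, labels, index, k=3):
    """Fraction of the k nearest neighbours whose label differs."""
    others = sorted((j for j in range(len(points)) if j != index),
                    key=lambda j: dist(points[index], points[j]))
    neighbours = others[:k]
    return sum(labels[j] != labels[index] for j in neighbours) / len(neighbours)

def centroid_distance(points, labels, index):
    """Distance to the mean of all points sharing this sample's label."""
    same = [p for p, y in zip(points, labels) if y == labels[index]]
    centroid = [sum(coord) / len(same) for coord in zip(*same)]
    return dist(points[index], centroid)

points = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
          [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1]]
labels = [0, 0, 0, 1, 1, 1, 1, 1]   # sample 3 carries a flipped label
features_2d = [(knn_disagreement(points, labels, i),
                centroid_distance(points, labels, i))
               for i in range(len(points))]
suspect = max(range(len(points)), key=lambda i: features_2d[i])
```

The mislabeled sample stands out on both axes at once: all of its neighbours disagree with its label, and it lies far from the centroid of the class it claims to belong to.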

[CV-39] AsyncPatch Diffusion: spatially-flexible image generation

Link: https://arxiv.org/abs/2606.07079
Authors: Samuele Papa, Valentin De Bortoli, Guillaume Couairon, Daniel Sýkora, Romuald Elie, Klaus Greff
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 36 pages, 14 figures

Abstract:Standard diffusion models corrupt an entire sample with a single shared noise level, forcing all spatial regions to follow the same denoising trajectory. We introduce AsyncPatch Diffusion, a joint-diffusion framework that assigns distinct noise levels to different input dimensions, such as image pixels or latent tokens. We show how this asynchronous corruption defines a valid generative process while supporting a richer family of spatially heterogeneous denoising trajectories, and prove the first valid ELBO for this process. We show that a single pretrained model can perform spatially adaptive generation, where different regions are denoised on different schedules. A key challenge is training: naive independent noise-level sampling overemphasizes highly heterogeneous configurations and underrepresents homogeneous noise levels, which are crucial during sampling. We address this with a controlled noise-level sampler that regulates both the average corruption level and its spatial variability. AsyncPatch achieves generation quality comparable to conventional diffusion on ImageNet 256 and LSUN, while being natively suited for inpainting without task-specific fine-tuning. We further introduce input guidance, which uses clean or partially corrupted regions to guide the generation of unknown regions, improving local consistency and texture matching. Finally, we demonstrate adaptive generation strategies including uncertainty-guided acceleration and autoregressive sampling.
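
One hypothetical way to picture a sampler that controls both the average corruption level and its spatial variability is to draw per-patch levels around a chosen mean with a chosen spread. The function name, the uniform offsets, and the clipping to [0, 1] are all assumptions of this sketch; the sampler actually used by AsyncPatch is not specified in the abstract.

```python
import random

def sample_patch_noise_levels(num_patches, mean_level, spread, rng):
    """Draw one noise level per patch around `mean_level`.

    `spread` = 0 gives a homogeneous configuration (all patches share one
    level); larger values allow more spatial variability. Levels are clipped
    to [0, 1].
    """
    levels = []
    for _ in range(num_patches):
        offset = (rng.random() * 2.0 - 1.0) * spread
        levels.append(min(1.0, max(0.0, mean_level + offset)))
    return levels

rng = random.Random(0)
homogeneous = sample_patch_noise_levels(16, 0.4, 0.0, rng)
heterogeneous = sample_patch_noise_levels(16, 0.4, 0.3, rng)
```

Sampling `spread` itself from a distribution that puts real mass near zero would keep homogeneous configurations well represented during training, which is the failure mode of naive independent sampling that the abstract highlights.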

[CV-40] Constructing VAE Latent Spaces with Prescribed Topology

Link: https://arxiv.org/abs/2606.07058
Authors: Jilles S. van Hulst, Jakub M. Tomczak, W.P.M.H. Heemels, Duarte J. Antunes
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Algebraic Topology (math.AT); Machine Learning (stat.ML)
Comments: 16 pages, 7 figures

Abstract:Variational autoencoders (VAEs) learn low-dimensional latent representations of high-dimensional data. When the data lies on a manifold with non-Euclidean topology, the standard Gaussian prior introduces a topological mismatch that degrades reconstruction quality and prevents faithful representation. We present a constructive mathematical framework that resolves this mismatch for all manifolds that admit a product covering space. These are manifolds expressible as products of elementary factors (circles, intervals, or lines) or as quotients of such products by a finite symmetry group. The class includes cylinders, tori, Möbius strips, Klein bottles, and real projective spaces. Factorized distributions over the elementary factors yield product topologies with closed-form, decoupled KL divergences, so that each latent factor can be shaped independently while keeping training tractable. We catalogue reparametrizable encoder-prior pairs for periodic, bounded, and unbounded supports, and provide coordinate transformations that allow standard neural networks to output non-Euclidean parameters with smooth gradients. For quotient manifolds, the decoder receives group-invariant features of the covering-space coordinates, so that identified points produce identical outputs. Anchor constraints fix the coordinate system relative to the data or create soft topological holes. Experiments on synthetic manifolds and real-image datasets (rotated and cyclically shifted MNIST) confirm that a topology-matched prior aligns KL regularization with the data manifold. The resulting topology-aware models outperform the Gaussian baseline at all practically relevant regularization strengths. The code is available at this https URL.
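
Two of the ingredients, letting an ordinary network output a periodic parameter and feeding the decoder group-invariant features, can be illustrated on the simplest case of a circle. The `atan2` mapping and the double-angle features below are common textbook choices used here as assumptions; they are not necessarily the transformations catalogued in the paper.

```python
import math

def to_angle(a, b):
    """Map two unconstrained network outputs to an angle in (-pi, pi]."""
    return math.atan2(b, a)

def circle_features(theta):
    """Periodic embedding: theta and theta + 2*pi give the same features."""
    return (math.cos(theta), math.sin(theta))

def antipodal_invariant_features(theta):
    """Features invariant under theta -> theta + pi (a Z_2 quotient of the circle)."""
    return (math.cos(2.0 * theta), math.sin(2.0 * theta))

theta = to_angle(0.3, -1.2)
f1 = antipodal_invariant_features(theta)
f2 = antipodal_invariant_features(theta + math.pi)
```

Because `f1` and `f2` coincide, a decoder that only sees these features cannot distinguish identified points, so it produces identical outputs for them by construction.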

[CV-41] TrioPose: Native Triple-Stream Diffusion Transformers for Pose-Guided Text-to-Image Generation

Link: https://arxiv.org/abs/2606.07053
Authors: Dian Gu, Zhengyi Yang
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 15 pages (9 pages main body, 6 pages references and appendix), 3 figures, 5 tables

Abstract:Pose-guided text-to-image generation often suffers from limb distortions and feature crosstalk in complex multi-person scenarios. While existing UNet-based adapters struggle with long-range spatial dependencies, emerging Multimodal Diffusion Transformers (MM-DiTs) offer superior global modeling. However, naive signal concatenation in MM-DiTs severely disrupts pre-trained latent distributions. To address this, we propose TrioPose, a native pose-driven framework built upon the SD3.5M architecture. Specifically, we introduce a Triple-Stream Pose-Aware DiT (TSPA-DiT) that treats pose as an independent modality. It employs layer-wise activation and zero-initialized dual-residual injection to smoothly enforce geometric constraints while preserving pre-trained latent stability. To resolve severe multi-instance occlusions, we design a Learnable Relational Bias Mask that categorizes topological connectivity into fine-grained physical states, mapping them into continuous attention soft constraints to effectively decouple inter-instance interference. Furthermore, a Pose-Guided Spatial Loss Weighting strategy modulates the native diffusion objective using heatmap-derived error maps, focusing anatomical supervision strictly on distortion-prone regions. Extensive experiments demonstrate that TrioPose achieves state-of-the-art performance across challenging benchmarks, including Human-Art, CrowdPose, and OCHuman. Notably, it attains an AP of 64.33 on Human-Art, representing a 30% improvement over prior arts, while setting new standards for visual fidelity and text-image semantic alignment in complex multi-human generation.

[CV-42] STREAM: Stochastic Riemannian Flow Matching with Anisotropic Decoder for Digital Histopathology Image Generation

Link: https://arxiv.org/abs/2606.07036
Authors: Won June Cho, Daeky Jeong, Hyeongyeol Lim, Hongjun Yoon
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments: 27 pages, 7 figures

Abstract:Synthetic histopathology image generation addresses critical challenges in computational pathology, including patient privacy and the growing need for large-scale training data for foundation models. Latent diffusion models have dominated the image generation domain, with recent works emphasizing that the choice of latent space is critical to the quality of generated images. Existing state-of-the-art generative models in histopathology use pretrained Vision Foundation Models (VFMs) as conditioning signals, and we observe that this leads to “conditioning collapse,” where the conditioning signal dominates the latent space and lowers the quality and diversity of generated samples. Therefore, we instead use pretrained histopathology VFMs as the latent space itself, leveraging their patch-token features that encode rich semantic information. We empirically show that these features are $\ell_2$-normalized and lie on the unit hypersphere $\mathcal{S}^{d-1}$ with strong angular dominance and intrinsic curvature, making them naturally suited for a Riemannian formulation. We therefore present STREAM, the first framework to apply Riemannian flow matching in the pathology domain. STREAM consists of two stages: 1) a bridge-type stochastic perturbation that establishes per-token rectifiability on $\mathcal{S}^{d-1}$ for training a Diffusion Transformer (DiT) in latent space, and 2) a novel anisotropic decoder that allocates robustness to low-energy directions of the velocity-field Jacobian while preserving fidelity along its high-energy directions. Together, STREAM achieves state-of-the-art reconstruction and generation performance on breast and colorectal cancer datasets. The code will be publicly released upon acceptance.
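
For features that live on the unit hypersphere, flow-matching paths are naturally taken along geodesics rather than straight lines. The sketch below shows the standard spherical interpolation between two unit vectors; it illustrates only the deterministic geodesic and does not reproduce STREAM's bridge-type stochastic perturbation. The function names are chosen for this example, and the antipodal case, where the geodesic is not unique, is deliberately left unhandled.

```python
import math

def normalize(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def slerp(x0, x1, t):
    """Point at fraction t along the great-circle arc from x0 to x1."""
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(x0, x1))))
    omega = math.acos(dot)
    if omega < 1e-8:          # nearly identical points: nothing to interpolate
        return list(x0)
    s0 = math.sin((1.0 - t) * omega) / math.sin(omega)
    s1 = math.sin(t * omega) / math.sin(omega)
    return [s0 * a + s1 * b for a, b in zip(x0, x1)]

x0 = normalize([1.0, 0.0, 0.0])
x1 = normalize([0.0, 1.0, 0.0])
mid = slerp(x0, x1, 0.5)
```

Unlike linear interpolation followed by renormalization, every intermediate point stays exactly on the sphere and moves at constant angular speed.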

[CV-43] ForensicConcept: Transferable Forensic Concepts for AIGI Detection ICML2026

Link: https://arxiv.org/abs/2606.07034
Authors: Menyanshu Zhou, Ziyin Zhou, Ke Sun, Yunpeng Luo, Jiayi Ji, Xiaoshuai Sun, Rongrong Ji
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICML 2026

Abstract:AI-generated image detectors achieve high accuracy on in-distribution data but often fail on unseen generators. A key obstacle to understanding this failure is the black-box nature of current detectors: they do not reveal which evidence drives their decisions. We propose ForensicConcept, a framework that extracts explicit forensic concepts from detectors and enables their transfer across backbones. Our method localizes decision-critical patches via Transformer attribution, clusters them into a compact concept codebook, and uses a concept-aligned projection to produce auditable evidence readouts. Motivated by prior studies showing that DINO representations can guide diffusion generation and exhibit concept-level correspondence with diffusion features, we introduce a generation-trace reference based on CleanDIFT diffusion features and quantify backbone-trace alignment via neighborhood-structure consistency (CKNNA). We further propose concept codebook injection to transfer diffusion-derived concepts into target backbones. Experiments on GenImage, GAN-family, and Chameleon benchmarks show consistent improvements over prior methods. We also find that CKNNA alignment predicts transfer effectiveness, providing a principled explanation for why some backbones yield more transferable forensic evidence than others.

[CV-44] Hierarchical Semantic-Constrained Heterogeneous Graph for Audio-Visual Event Localization

Link: https://arxiv.org/abs/2606.07033
Authors: Zhe Yang, Ruyi Zhang, Hongtao Chen, Wenrui Li, Hengyu Man, Wangmeng Zuo, Xiaopeng Fan
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Open-vocabulary audio-visual event localization (OV-AVEL) jointly models audio-visual cues to recognize and temporally localize events, including categories unseen during training. Existing methods primarily learn joint audio-visual representations in Euclidean space, but still face two significant challenges. First, the lack of supervision signals for unseen categories makes it difficult to maintain audio-visual consistency across multiple temporal scales. Second, the lack of hierarchical constraints between segment- and video-level semantics prevents the model from establishing semantic consistency across different levels. To address these challenges, we propose a hierarchical semantic constrained heterogeneous graph (HSCHG) for audio-visual event localization framework. We first construct a heterogeneous hierarchical graph in Euclidean space, which includes audio and visual segment nodes and their corresponding video-level nodes. We use multi-directional temporal edges to capture complete temporal information within each modality. Simultaneously, we employ a dual-threshold filtering gated fusion strategy, introducing cross-modal information only when the alignment confidence is high. Furthermore, we introduce bidirectional semantic constraints between segment- and video-level representations to achieve semantic consistency across different levels. Based on this, we map the multi-level audio-visual representations and text prototypes uniformly into hyperbolic space. We use a hierarchical entailment regularization loss to characterize the hierarchical relationships between videos and segments. Extensive experimental results show that our method outperforms existing methods on the OV-AVEL benchmark. Ablation studies further validate the effectiveness of our method.
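
The abstract does not say which model of hyperbolic space is used. As a hedged sketch, assuming the Poincare ball with curvature -c, Euclidean features are commonly lifted into the ball with the exponential map at the origin, exp_0(v) = tanh(sqrt(c)·||v||) · v / (sqrt(c)·||v||). The function name `expmap0` is an assumption for illustration.

```python
import math

def expmap0(v, c=1.0):
    """Map a Euclidean (tangent) vector at the origin into the Poincare ball of curvature -c."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)  # the origin maps to itself
    scale = math.tanh(math.sqrt(c) * norm) / (math.sqrt(c) * norm)
    return [scale * x for x in v]
```

The output norm is tanh(sqrt(c)·||v||)/sqrt(c), so every mapped point lies strictly inside the ball of radius 1/sqrt(c).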

[CV-45] Never Seen Before: Benchmarking Genuine Zero-Shot Composed Image Retrieval with Consistent Video-Sourced Datasets

Link: https://arxiv.org/abs/2606.07032
Authors: Zhenyu Yang, Zemin Du, Shengsheng Qian, Changsheng Xu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve a target image based on a query composed of a reference image and a relative caption without training samples. Existing ZS-CIR datasets often suffer from complete irrelevance between reference and target images due to noisy image sources, and do not achieve a true zero-shot scenario as they use public image datasets that models like CLIP have been trained on. To tackle these challenges, we introduce ZeroSight, a novel benchmark for ZS-CIR. It includes a dataset with consistent reference-target pairs sourced from videos, a data construction pipeline, and evaluation methods that consider the ranking of multiple positive and negative target images. We ensure visually and semantically consistent reference-target pairs by extracting frames from a single video and generating relative captions using LLM-assisted methods. To ensure a true zero-shot scenario, we use video data published after March 31, 2022, ensuring it was not included in CLIP’s pre-training data. Additionally, we propose a training-free MLLM-driven method, SC4CIR (Symmetric Consistency for CIR), which can effectively identify hard negative targets through 3 symmetric consistency checks. This method is plug-and-play, seamlessly integrating with various CIR methods and significantly improving performance. Our experimental results from 27 methods reveal that current ZS-CIR datasets and evaluation metrics result in inflated retrieval performance, exaggerating the capabilities of CIR methods. Our benchmark and models can be accessed at this https URL.

[CV-46] GuideCAD: A Lightweight Multimodal Framework for 3D CAD Model Generation via Prefix Embedding

Link: https://arxiv.org/abs/2606.07024
Authors: Minseong Kim, Jinyeong Park, Sungho Park, Jibum Kim
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Multi-modal approaches used for 3D CAD generation require substantial computational resources, necessitating efficient training. To address this, we propose GuideCAD, which leverages semantically rich visual-textual representations having only a small number of trainable parameters to generate 3D CAD models. Specifically, GuideCAD uses a mapping network that converts image embeddings into prefix embeddings, enabling a pretrained large language model to integrate visual and textual information. As a result, a transformer-based decoder predicts the construction sequence using the visual-textual embeddings in order to generate the 3D CAD model. For experimental evaluation, we construct a new dataset, referred to as GuideCAD, which consists of text-image pairs. Each pair includes a text prompt that represents a 3D CAD construction sequence and its corresponding 3D CAD image. Our experimental results show that GuideCAD generates comparably high-quality 3D CAD models while using approximately four times fewer parameters and achieving twice the training efficiency compared to fine-tuning approaches. We have released the source code and dataset for our method at: this https URL

[CV-47] Don't Pause: Streaming Video-Language Synchrony for Online Video Understanding

Link: https://arxiv.org/abs/2606.06991
Authors: Zhenyu Yang, Kairui Zhang, Shengsheng Qian, Weiming Dong, Changsheng Xu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Online Video Large Language Models (Video-LLMs) have advanced toward seamless human-AI interaction through frame-by-frame processing and proactive responding. However, a critical challenge remains in streaming scenarios: existing models typically pause video perception while generating responses, breaking real-time video-language synchrony and causing stutters. To address this, we introduce a novel paradigm for online video understanding: Streaming Video-Language Synchrony (SVLS), and present LyraV, a live streaming assistant built upon a hierarchical control framework with two core innovations. First, the Frame-Driven Transition Controller (FDTC), a training-free verification-based finite-state machine, makes high-level semantic decisions on when to continue speaking, start a new response, or stay silent. Second, the Streaming Token Pacer (SToP), a plug-and-play lightweight predictive module, dynamically adapts the language generation rate to match the pace of the visual content. Concretely, LyraV performs per-frame incremental, sub-budget decoding: within each frame interval it emits only a small chunk of tokens that fits the real-time budget, so perception is never blocked for a full sentence. Together, these components enable LyraV to seamlessly interleave incoming video frames with generated word tokens, achieving a fine-grained synchrony. Extensive experiments conducted on five online and three offline benchmarks demonstrate that LyraV preserves the backbone’s general understanding ability while substantially improving streaming synchrony and narrative fluency, delivering a 98.29% synchrony with video playback and a real-time processing speed of 3.89 FPS. Interestingly, we observe an empirical capability in LyraV: dynamic reasoning over streaming tokens, enabling continuous interpretation and “thinking” alongside visual input.
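
The per-frame budget idea can be illustrated with simple arithmetic. The sketch below is a simplified assumption, not LyraV's implementation: it computes how many tokens fit in one frame interval for a hypothetical per-token decoding latency, then interleaves frames with token chunks of at most that size. The names `tokens_per_frame` and `interleave` and the 30 ms latency are illustrative only.

```python
def tokens_per_frame(fps, seconds_per_token):
    """Largest token count whose decoding time fits inside one frame interval."""
    frame_interval = 1.0 / fps
    return int(frame_interval // seconds_per_token)

def interleave(frames, tokens, budget):
    """Pair each incoming frame with at most `budget` of the pending tokens."""
    schedule, pos = [], 0
    for frame in frames:
        chunk = tokens[pos:pos + budget]
        pos += len(chunk)
        schedule.append((frame, chunk))
    return schedule
```

At the reported 3.89 FPS a frame interval is about 257 ms, so an assumed 30 ms per token would leave room for 8 tokens per frame.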

[CV-48] CL-CLIP: CLIP-Based Continual Learning Framework with Cost-Volume Category Decoupling for Object Detection

Link: https://arxiv.org/abs/2606.06978
Authors: Zihan Liu, Yuguang Yang, Shengjie Su, Jianing Pang, Linlin Yang, Chunyu Xie, Nikolai Yu. Zolotykh, Baochang Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Continual Object Detection (COD) requires a detector to acquire new categories over time while preserving previously learned ones. This goal is closely related to open-vocabulary detection, since both settings require reasoning over categories that are not fully covered by the annotations available at the current training stage. Recent CLIP-based open-vocabulary detectors have shown strong zero-shot generalization, and frameworks such as F-ViT demonstrate that vision-language pretraining can provide powerful zero-shot detection ability for unseen categories. However, real-world deployments cannot remain purely zero-shot: once these detectors are continually updated on newly introduced categories, they suffer severe catastrophic forgetting and quickly lose their previously calibrated detection ability. We therefore propose CL-CLIP, a CLIP-based COD framework that equips open-vocabulary detectors with better continual learning ability through cost-volume-guided category decoupling. Specifically, following CAT-Seg, we compute a CLIP image-text similarity cost volume, defined as dense category-wise response maps between visual tokens and class text embeddings. This zero-shot spatial prior decomposes shared region features into class-specific pathways, which are then processed by a Multi-Expert RoI head. Extensive experiments on PASCAL VOC and MS-COCO show that CL-CLIP substantially improves the F-ViT baseline under continual fine-tuning and achieves competitive performance with existing continual object detectors, especially in adapting to newly introduced categories while preserving competitive base-class performance.
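
The cost volume described here, dense category-wise response maps between visual tokens and class text embeddings, can be sketched as a table of cosine similarities. This is a minimal illustration with plain lists rather than CLIP tensors; the function names are assumptions, and the paper's actual computation may differ in normalization or layout.

```python
import math

def cosine(u, v):
    """Cosine similarity between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def cost_volume(visual_tokens, text_embeddings):
    """Return cost[c][i]: similarity of class c's text embedding to visual token i."""
    return [[cosine(text, token) for token in visual_tokens] for text in text_embeddings]
```

Each row `cost[c]` is one class's response map over the tokens, which is what lets shared region features be split into class-specific pathways.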

[CV-49] From Vision to Text: A Compact Multimodal Approach for Robust Cross-Domain Presentation Attack Detection on ID Cards

Link: https://arxiv.org/abs/2606.06966
Authors: Qingwen Zeng, Juan E. Tapia, Sneha Das, Christoph Busch
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Publication under the revision process on IEEE

Abstract:Cross-domain shifts challenge Presentation Attack Detection (PAD) on ID Cards, given the restricted data available due to privacy concerns. This work proposes a compact multimodal model, based on new generative and discriminative blocks, which combines visual and textual data for PAD on genuine and synthetic ID images. While multimodal models exhibit strong generalisation after supervised fine-tuning, they fail in zero-shot settings. Our findings underscore that model capacity and real-world data are essential for reliable PAD, while existing synthetic datasets may not reflect real-world challenges. We argue for a re-evaluation of synthetic data as a benchmark and emphasise the need for more realistic, diverse datasets to advance PAD research.

[CV-50] MVSegNet: A Lightweight Boundary-Aware Network for Fetal Lateral Ventricle Segmentation and Atrial Width Estimation in Prenatal Ultrasound

Link: https://arxiv.org/abs/2606.06958
Authors: Arafat Hossain Sayem
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 3 figures, 4 tables. Code and trained models will be released upon acceptance. Supplementary material available upon request

Abstract:Fetal ventriculomegaly is assessed by measuring the atrial width of the lateral ventricle in prenatal ultrasound. Accurate segmentation is essential for this measurement, but acoustic shadowing, speckle noise, and poor contrast make it difficult. We developed MVSegNet, a lightweight encoder-decoder network combining multi-scale feature extraction and boundary-aware refinement. The model was trained and evaluated on 584 expert-annotated transventricular ultrasound frames using a 70/15/15 split. Performance was compared against six segmentation baselines using overlap, boundary, and measurement metrics. MVSegNet achieved a Dice score of 80.79%, IoU of 68.47%, Hausdorff distance of 4.07 mm, and atrial width mean absolute error of 3.40 mm. The model contains 2.31 million parameters and runs at 165.6 frames per second on an NVIDIA T4 GPU. MVSegNet outperformed all evaluated baselines on boundary and measurement metrics while maintaining low computational cost, supporting its use in automated fetal ultrasound analysis.
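
For readers unfamiliar with the overlap metrics quoted above, the following sketch computes Dice and IoU for a pair of binary masks. It uses flat 0/1 lists for simplicity, and the empty-mask convention is an assumption; the paper's evaluation code may handle edge cases differently.

```python
def dice_and_iou(pred, target):
    """Dice and IoU for two equally sized flat binary masks (lists of 0/1)."""
    inter = sum(p & t for p, t in zip(pred, target))
    pred_sum, target_sum = sum(pred), sum(target)
    union = pred_sum + target_sum - inter
    if union == 0:  # both masks empty: treat as perfect agreement
        return 1.0, 1.0
    dice = 2.0 * inter / (pred_sum + target_sum)
    return dice, inter / union
```

For a single mask pair the two are linked by IoU = Dice / (2 - Dice), though averages over a dataset need not satisfy that identity exactly.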

[CV-51] When is 3D Worth It? A Resource-Performance Frontier for CNNs and Transformers in Lung CT

Link: https://arxiv.org/abs/2606.06950
Authors: Md Enamul Hoq, Sharafat Hossain, Imraul Emmaka, Linda Larson-Prior, Lawrence Tarbox, Jonathan Bona, Donald Johann Jr., Fred Prior
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 8 pages, 6 figures

Abstract:Three-dimensional models are widely assumed preferable for volumetric medical imaging, yet their practical value depends on whether performance gains justify added computational cost and complexity. Rather than proposing a new architecture, we study how input dimensionality (2D, 2.5D, 3D) affects model behavior across convolutional neural networks (CNNs) and Vision Transformers (ViTs) under a fixed training protocol. Using a leakage-free NLST cohort (n = 1,977) with supporting LIDC-IDRI data, we find that the 2.5D CNN offers the most favorable discrimination-stability trade-off in our comparison (ROC-AUC 0.682, 95% CI [0.546, 0.799]) with a stable operating point. In contrast, 3D CNNs show threshold instability, and transformers exhibit degenerate predictions, such as all-positive predictions. Confidence intervals are wide and overlapping, so we present these results as a controlled resource-performance frontier and a failure-mode taxonomy rather than as definitive superiority claims. For class-imbalanced lung cancer screening classification, 2D and 2.5D inputs provide a more reliable trade-off between performance, stability, and computational efficiency than full 3D representations.
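
The abstract does not spell out how its 2.5D inputs are built. One common convention, sketched here as an assumption, stacks a CT slice with its neighbors as channels so that a 2D network sees limited through-plane context. The name `make_25d_input` and the clamping at the volume edges are illustrative choices.

```python
def make_25d_input(volume, index, context=1):
    """Stack slice `index` with `context` neighbors on each side as channels.

    `volume` is a list of 2D slices; indices past either end are clamped,
    so edge slices are repeated rather than zero-padded.
    """
    last = len(volume) - 1
    picks = [min(max(index + off, 0), last) for off in range(-context, context + 1)]
    return [volume[i] for i in picks]
```

With `context=1` the result has three channels, so the memory cost is a small multiple of a 2D input rather than a full 3D volume.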

[CV-52] SS-TPT: Stability and Suitability-Guided Test-Time Prompt Tuning for Adversarially Robust Vision-Language Models ICML2026

Link: https://arxiv.org/abs/2606.06943
Authors: Sunoh Kim, Daeho Um
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted in ICML2026

Abstract:Vision-language models (VLMs) such as CLIP achieve strong zero-shot recognition but remain highly fragile under adversarial perturbations. Recent test-time adaptation defenses improve robustness by leveraging many augmented views, but this leads to impractical slowdown and a clear robustness-throughput trade-off. To address this challenge, we present Stability and Suitability-guided Test-time Prompt Tuning (SS-TPT), evaluating the quality of each augmented view via two complementary scores: (1) stability, measuring prediction invariance to weak augmentations, and (2) suitability, measuring feature-space density among views. These stability and suitability (SS) scores guide both adaptation and inference through an SS-guided consistency loss and an SS-weighted prediction, amplifying trustworthy views while suppressing corrupted ones. Extensive experiments demonstrate that SS-TPT significantly outperforms prior state-of-the-art methods, achieving superior robustness-throughput trade-offs across diverse datasets and varying numbers of views, thereby demonstrating both strong practicality and generality. Our code is available at this https URL.

[CV-53] When CLIP Sees More It Fights Back Harder: Multi-View Guided Adaptive Counterattacks for Test-Time Adversarial Robustness CVPR2026

Link: https://arxiv.org/abs/2606.06938
Authors: Sunoh Kim, Daeho Um
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in CVPR2026

Abstract:Vision-language models such as CLIP have achieved remarkable zero-shot recognition capabilities, yet their robustness against adversarial perturbations remains limited. Test-time counterattack (TTC) was recently proposed to improve CLIP’s robustness by perturbing an input image to steer it away from a corrupted state during inference. However, TTC remains fragile under strong attacks because its counterattack relies on a directly corrupted original view and employs a noise-driven hard-gating scheme that cannot adapt to varying corruption severity. To address these limitations, we introduce Multi-view guided Adaptive Counterattack (MAC), which performs counterattacks for multi-view with corruption-aware soft weighting. Specifically, MAC first constructs augmented views of an input image to obtain diverse embeddings. It then performs counterattacks to refine corrupted embeddings of views. Next, MAC adaptively scales the counterattack intensity for each view based on its estimated corruption degree. Finally, the adaptively counterattacked views are aggregated to yield a robust final prediction. Extensive experiments across 20 datasets and diverse attack scenarios demonstrate that MAC substantially improves robustness while preserving high inference speed and memory efficiency with its tuning-free design. Our code is available at this https URL.

[CV-54] SVHighlights: Towards Extremely Long Sport Video Highlight Detection KDD2026

Link: https://arxiv.org/abs/2606.06926
Authors: Donggyu Lee, Youngbin Ki, Jeonghun Kang, Taehwan Kim
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: Accepted to KDD 2026 (Datasets and Benchmarks Track). Project Page: this https URL

Abstract:While highlight detection for long-form videos is of great practical importance, most existing methods remain limited to short-form content, largely due to the absence of a suitable benchmark. To bridge this gap, we introduce SVHighlights, to the best of our knowledge, the first benchmark for highlight detection in extremely long sports videos, each exceeding one hour in duration, across multiple sports categories. SVHighlights is constructed from pairs of full-length sports videos and their corresponding official highlight videos using a dataset generation pipeline, enabling scalable label generation without conventional per-clip saliency annotation. The benchmark comprises 320 videos with an average duration of 2.00 hours and a total of 640.18 hours, substantially exceeding previous datasets. Existing methods also face fundamental challenges on long videos: models trained on short clips fail to generalize to hour-long content, and their clip-level scoring lacks the broader context needed to identify highlights. To address this and provide a strong baseline, we present TF-SELECTOR, a training-free segment-based approach that divides each video into context-aware segments by merging adjacent shots sharing the same semantic content, and predicts segment-level saliency scores using a large language model with multimodal inputs including visual captions, transcripts, and audio volume. Experiments demonstrate that TF-SELECTOR achieves superior performance across most metrics compared to Video Temporal Grounding (VTG)-tuned baselines, with improvements of +3.12 in HIT@1, +4.06 in HIT@K, and +2.95 in IoU. These results establish SVHighlights as a challenging testbed for long-form highlight detection and demonstrate that a simple segment-based strategy can effectively scale to hour-long videos.

[CV-55] DRIFT: From Robustness Gaps to Invariance Manifolds for AI-Generated Image Detection ECCV2026

Link: https://arxiv.org/abs/2606.06918
Authors: Abhishek Ameta, Sayan Banerjee, Shreyas Pandith, Harshit, Ankita Chatterjee, Akshay Janardan Bankar, Amit Satish Unde
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to ECCV 2026

Abstract:The rapid evolution of generative image models challenges existing AI-generated image detectors, particularly in open-world settings with unseen generators. Recent training-free approaches measure robustness gaps in frozen vision foundation models (VFMs), detecting fakes via perturbation-induced embedding drift. However, these methods rely on fixed invariance geometry inherited from pretraining and lack principled adaptation to the detection task. We instead formulate AI-generated image detection as learning a structured invariance manifold of real images under one-class supervision. Building upon a frozen VFM, we introduce lightweight projection heads that decompose representation space into complementary robust and fragile subspaces. The robust subspace is explicitly trained to suppress variations induced by physically plausible imaging transformations, approximating tangent directions of a real-image manifold, while the fragile subspace retains sensitivity to edit-like perturbations. A structured ordering margin enforces hierarchical separation between physical invariance and edit-induced variability, enabling detection as a margin-violation test relative to the learned manifold. At inference, multi-scale patch-wise drift under both transformation families yields a dual-channel invariance signature and interpretable localization. Extensive experiments demonstrate strong open-world generalization across unseen generators and resolutions, consistently outperforming training-free robustness-based baselines while providing interpretable invariance-violation maps.

[CV-56] polyDAG: Polynomial Acyclicity Constraints for Efficient Continuous Causal Discovery in Visual Semantic Graphs

Link: https://arxiv.org/abs/2606.06908
Authors: Wenhao Zhang, Ramin Ramezani, Tao Han, Kai Hwang, Minyi Guo
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Modern image-analysis pipelines often convert images into structured semantic variables, such as facial attributes, object concepts, and scene descriptors. Learning directed dependencies among these variables can produce interpretable visual semantic graphs, but continuous directed acyclic graph learning is limited by the cost of enforcing acyclicity. We present polyDAG, a polynomial acyclicity framework for efficient continuous causal discovery in visual semantic graphs. polyDAG replaces the matrix-exponential acyclicity constraint with a finite polynomial trace constraint and proves that the new constraint is zero exactly for acyclic graphs. We further derive a geometric-series implementation that avoids the explicit summation loop while preserving the same acyclicity condition. Experiments on synthetic Erdős-Rényi graphs and CelebA facial visual attributes show that polyDAG improves efficiency and structure recovery. Averaged over the revised synthetic protocol with $d \in \{100, 200, 500\}$, polyDAG reduces mean structural Hamming distance from 318.4 to 285.4 and improves mean F1 score from 0.725 to 0.756. At 100 nodes, the geometric variant runs in 3.44 seconds compared with 5.16 seconds for the exponential baseline, corresponding to a 33.4 percent speedup. Code and data are publicly available at this https URL.
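
The abstract does not give polyDAG's exact constraint, so the sketch below shows one generic finite polynomial trace penalty as an assumption: h(W) = sum over k = 1..d of trace((W∘W)^k), where W∘W is the elementwise square. It is written in pure Python for small matrices and uses the explicit summation loop rather than any geometric-series shortcut.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def poly_acyclicity(w):
    """h(W) = sum_{k=1..d} trace((W*W)^k), with W*W the elementwise square.

    trace(A^k) sums the weights of closed walks of length k, and any cycle
    in a d-node graph has length at most d, so h(W) == 0 exactly when the
    weighted graph is acyclic.
    """
    d = len(w)
    a = [[x * x for x in row] for row in w]  # nonnegative edge weights
    power, total = a, 0.0
    for _ in range(d):
        total += sum(power[i][i] for i in range(d))
        power = matmul(power, a)
    return total
```

Truncating at degree d is what distinguishes this from the matrix-exponential form, whose power series has no finite cutoff.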

[CV-57] ActionMap: Robot Policy Learning via Voxel Action Heatmap

Link: https://arxiv.org/abs/2606.06904
Authors: Pei Yang, Hai Ci, Yanzhe Chen, Qi Lv, Han Cai, Mike Zheng Shou
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Vision-language-action (VLA) models have advanced rapidly across backbones, training recipes, and data scale, yet the action decoder, which converts the backbone’s hidden state into a continuous control signal, has barely changed and remains a single-point predictor across the majority of current VLAs. Whether implemented via autoregressive token bins, L1 regression, or flow-matching denoising, the resulting decoder treats the action space as unstructured, leaving the geometric proximity of neighboring actions unexploited during training. To advance this, we introduce ActionMap, a voxel heatmap action head that drops into an existing VLA in place of its native action decoder. For each new action, the head predicts a voxel heatmap over the action space, where each voxel directly stores the probability of the corresponding action. Across LIBERO simulation and real-world Franka manipulation, our heatmap head surpasses two architecturally distinct backbones at matched training steps (e.g., +8.2% over OpenVLA-OFT’s L1 regression head on the LIBERO four-suite average), converges at comparable or faster rates on both backbones, and remains markedly more data-efficient at low training data. The cross-backbone consistency indicates that action representation is a real lever for VLA performance, distinct from further backbone or recipe scaling. Project Page: this https URL.
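
To illustrate how a voxel heatmap can stand in for a single-point action prediction, the sketch below decodes a continuous action as the center of the most probable voxel. It assumes a 3D action space (for example end-effector translation) and a hypothetical `decode_action` helper; the abstract does not specify ActionMap's dimensionality or decoding rule.

```python
def decode_action(heatmap, low, high):
    """Return the center of the most probable voxel as a continuous 3D action.

    `heatmap[i][j][k]` is the probability of voxel (i, j, k); `low` and `high`
    bound the action space along each of the three axes.
    """
    shape = (len(heatmap), len(heatmap[0]), len(heatmap[0][0]))
    best = max(
        ((i, j, k) for i in range(shape[0]) for j in range(shape[1]) for k in range(shape[2])),
        key=lambda idx: heatmap[idx[0]][idx[1]][idx[2]],
    )
    # Convert the winning voxel index to the coordinate of its center.
    return [
        low[a] + (best[a] + 0.5) * (high[a] - low[a]) / shape[a]
        for a in range(3)
    ]
```

Because neighboring voxels correspond to neighboring actions, a heatmap target can spread probability over nearby voxels, which is the geometric proximity a point regressor leaves unused.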

[CV-58] Beyond Skeletons: Learning Animation Directly from Driving Videos with Same2X Training Strategy ICLR2026

Link: https://arxiv.org/abs/2606.06903
Authors: Yuan Zeng, Yujia Shi, Yuhao Yang, Dongxia Liu, Zongqing Lu, Wenming Yang, Qingmin Liao
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to ICLR 2026

Abstract:Human image animation aims to generate a video from a static reference image, guided by pose information extracted from a driving video. Existing approaches often rely on pose estimators to extract intermediate representations, but such signals are prone to errors under occlusion or complex poses. Building on these observations, we present DirectAnimator, a framework that bypasses pose extraction and directly learns from raw driving videos. We introduce a Driving Cue Triplet consisting of pose, face, and location cues that captures motion, expression, and alignment in a semantically rich yet stable form, and we fuse them through a CueFusion DiT block for reliable control during denoising. To make learning dependable when the driving and reference identities differ, we devise a Same2X training strategy that aligns cross-ID features with those learned from same-ID data, regularizing optimization and accelerating convergence. Extensive experiments demonstrate that DirectAnimator attains state-of-the-art visual quality and identity preservation while remaining robust to occlusions and complex articulation, and it does so with fewer computational resources. Our project page is at this https URL.

[CV-59] LUCID: Learning Unified Control for Image Deflaring and Exposure Mastery in Nighttime Photography SIGGRAPH2026

Link: https://arxiv.org/abs/2606.06901
Authors: Tingyu Yang, Yuan Cheng, Xiaoyun Yuan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by SIGGRAPH 2026

Abstract:Photography is the art of painting with light, yet nighttime scenes are shaped by competing degradations: intense flares obscure scene structure, while photon-limited regions collapse into noise. Conventional approaches address these factors in isolation, overlooking the fact that these degradations are fundamentally entangled. To bridge this gap, we introduce LUCID, a unified framework that reframes nighttime restoration as a continuous and controllable process rather than a fixed correction. We decompose nighttime restoration into two cooperative components: a flare disentanglement module that lifts the ‘curtain’ of optical artifacts to provide reliable structural guidance, and a diffusion-driven module that leverages generative priors to reconstruct clean and well-exposed imagery. Crucially, LUCID introduces explicit controllability through a novel four-mode training strategy, enabling users to steer the restoration process via classifier-free guidance (CFG) and allowing selective control over light sources and their associated flare and ghosting artifacts, while also supporting high dynamic range (HDR) reconstruction through continuous exposure control. Extensive experiments demonstrate that LUCID consistently outperforms state-of-the-art methods across diverse real-world nighttime scenarios.

[CV-60] Lighting-Aware Representation Learning under Controllable Lighting Variation

Link: https://arxiv.org/abs/2606.06899
Authors: Lizhen Zhu, Charantej Reddy Pochimireddy, James Z Wang, Brad Wyble
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Variations in illumination remain a major challenge for visual representation learning, as they induce substantial appearance changes both across and within environments. While existing approaches typically address this issue through data augmentations that encourage models to become invariant to lighting changes, such strategies do not explicitly model lighting information during learning. Inspired by theories of human vision, we propose a lighting-aware representation learning framework that incorporates illumination variation as an explicit training signal rather than a nuisance factor to be suppressed. Our method extends contrastive learning by introducing an auxiliary objective that captures illumination-dependent variation in rendered scenes, enabling the model to jointly learn representations that preserve semantic consistency while remaining sensitive to lighting-dependent visual structure. We evaluate the proposed model on image classification and object detection tasks across the ImageNet, ExDark, and PASCAL VOC benchmarks. Results demonstrate that the proposed lighting-aware training consistently improves downstream performance over standard contrastive learning baselines, while maintaining the same architecture and training budget. Furthermore, our approach shows promising performance in supervised learning frameworks and under settings involving simpler lighting variation, suggesting broad applicability beyond complex illumination scenarios. These results indicate its potential to enhance model robustness and adaptability in complex visual environments as well as in more conventional image processing tasks.

[CV-61] Stream3D-VLM: Online 3D Spatial Understanding with Incremental Geometry Priors

Link: https://arxiv.org/abs/2606.06891
Authors: Hanxun Yu, Xuan Qu, Lei Ke, Boqiang Zhang, Yuxin Wang, Jianke Zhu, Dong Yu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Abstract:Despite advances in 3D scene understanding, existing 3D Large Multimodal Models operate in offline settings, requiring complete scene observations or predefined video clips. In this paper, we present an online 3D vision-language model that enables real-time spatial understanding from streaming video. Our approach adopts an autoregressive streaming control modeling based on the LLM’s next-token prediction objective to learn when to respond, and employs a lightweight Visual-Spatial Feature Integration (VSFI) module to incrementally inject temporally aligned geometry priors into the visual stream. To alleviate long-context decoding overhead, we propose a plug-and-play Geometry-Adaptive Voxel Compression (GAVC) module for efficient visual token compression. To address the scarcity of streaming 3D-language data, we further develop a scalable data generation pipeline that curates over 1M online spatio-temporal 3D QA pairs and establishes a comprehensive benchmark spanning 29 tasks. Extensive experiments show that our approach significantly outperforms both proprietary and open-source models across online and offline 3D spatial understanding, reasoning, and grounding tasks. The project page is available at this https URL

[CV-62] Diagnosing Visual Ignorance in Vision-Language Models

Link: https://arxiv.org/abs/2606.06890
Authors: Runyu Zhou, Qi Zhang, Qixun Wang, Yisen Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Vision-Language Models (VLMs) frequently rely on language priors, producing confident answers that are weakly grounded in visual evidence. While this behavior is widely observed, its internal mechanisms and its impact on benchmark evaluation remain insufficiently understood. In this work, we study language-prior reliance from both mechanistic and behavioral perspectives. Internally, we combine counterfactual layer replacement with supervised layer-wise MLP probing to trace how ground-truth visual semantics and language-prior semantics compete across the language decoder. Our analysis reveals a multi-stage bottleneck: intermediate layers often fail to effectively retrieve visual information, while later layers can further suppress surviving visual signals in favor of text-space biases. Externally, we introduce a progressive visual decay metric based on multi-step Gaussian blurring, which identifies instances whose answers remain invariant even as visual content is increasingly destroyed. Across twelve visual question-answering benchmarks and three representative VLMs, we find that a substantial fraction of examples remain answerable under severe or total visual obfuscation, indicating that current benchmarks can inadvertently reward visual ignorance. These findings demonstrate that language-prior reliance is a systematic routing failure affecting both model internals and benchmark validity. Finally, we outline critical pathways for future research, highlighting the necessity of designing training distributions and evaluation protocols built on structurally isolated or counterfactual data to enforce genuine cross-modal grounding.
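
The behavioral test described here can be approximated with a very small proxy. The sketch below is an assumption, not the paper's progressive visual decay metric: given each example's answers at increasing blur levels (collected beforehand from whatever model is being studied), it reports the fraction of examples whose answer never changes. The name `visual_ignorance_rate` is hypothetical.

```python
def visual_ignorance_rate(answers_by_level):
    """Fraction of examples whose answer never changes across blur levels.

    `answers_by_level[e]` lists a model's answers to example e, ordered from
    the clean image to the most heavily blurred one.
    """
    if not answers_by_level:
        return 0.0
    invariant = sum(1 for answers in answers_by_level if len(set(answers)) == 1)
    return invariant / len(answers_by_level)
```

A high rate suggests the answers are driven by language priors rather than by the image, since destroying the visual content did not alter them.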

[CV-63] ARAPDiffusion: ARAP Regularization for Diffusion-Based Deformable Shape Space Learning

Link: https://arxiv.org/abs/2606.06887
Authors: Haibo Liu, Jinghan Ke, Haitao Yang, Xiangru Huang, Georgios Pavlakos, Qixing Huang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This paper introduces ARAPDiffusion, a latent diffusion model that learns the underlying continuous shape space of a deformable shape collection. The key innovation is injecting the as-rigid-as-possible (ARAP) deformation model as regularization losses into latent diffusion (LD), relaxing the requirement of abundant 3D training data for learning generative models. In contrast to the standard LD, we show how the ARAP model can be used to improve both the encoder/decoder and the LD model. The training procedure alternates between using the synthetic distribution defined by the LD model to develop a regularization loss that enhances the shape encoder/decoder and using the shape decoder to develop a regularization loss to improve the LD model. We also show the benefit of the LD paradigm in combining a representation-free LD process and an implicit shape decoder that is applicable to unorganized point clouds. The experimental results of unconditional and conditional shape generation demonstrate the advantages of ARAPDiffusion over baseline approaches.

[CV-64] FreeAnimate: Training-Free Human Image Animation with Preview-Guided Denoising ICASSP2026

Link: https://arxiv.org/abs/2606.06885
Authors: Yuan Zeng, Yujia Shi, Zongqing Lu, QingMin Liao
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to IEEE ICASSP 2026

Abstract:Human Image Animation has seen significant advancements, primarily driven by diffusion models. However, existing methods typically demand substantial training data and resources to achieve high-quality results, limiting generalization and accessibility. In this work, we introduce FreeAnimate, a training-free framework that leverages the inherent capabilities of image diffusion models to enable temporal consistency, identity preservation, and background stability. Our approach incorporates a novel preview generation strategy that provides temporal and structural priors from generated preview frames, effectively guiding pose alignment and background consistency without training. Additionally, FreeAnimate introduces Inversion-Boosted Attention and Reference-Anchored Self-Attention modules to guarantee temporal consistency and identity preservation. Experimental results demonstrate that FreeAnimate outperforms existing training-free competitors and training-based baseline methods, achieving generation quality comparable to state-of-the-art methods and offering robust generalization across diverse datasets. Our project page is at this https URL.

[CV-65] A Cross-view Fusion Framework for Robust 6-DoF Grasp Pose Estimation

Link: https://arxiv.org/abs/2606.06878
Authors: Kangjian Zhu, Haobo Jiang, Jianjun Qian, Jin Xie
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Corresponding author: Jin Xie

Abstract:In this paper, we propose a cross-view fusion framework that enhances the robustness of 6-DoF grasp pose estimation in corner views. Our framework alleviates occlusion by incorporating an auxiliary view and avoids the time-consuming, task-agnostic multi-view reconstruction through a post-fusion strategy. To enhance cross-view fusion, we propose a self-supervised contrastive learning strategy that leverages cross-view associations to regularize point cloud features. In brief, a cross-view point pair is considered a match if the two points correspond to the same 3D location, and a non-match if they represent distinct grasp directions. The learning strategy significantly enhances the spatial consistency and direction distinctiveness of point features, thereby facilitating cross-view fusion and improving estimation robustness. Furthermore, we propose a cross-view-aligned cylinder integration module to fuse grasp-relevant geometry into a comprehensive representation. Specifically, the module first aligns the cross-view points and features according to their similarity to enhance the robustness against noise. Subsequently, these points are registered into the cylindrical coordinate frame, emphasizing the rotation-symmetric geometry which is important for grasping. Finally, local self-attention and seed cross-attention layers are alternately employed, respectively enabling interactions within single views and across views, which supports fine-grained representation of grasp-relevant geometry. Our framework achieves strong performance on the GraspNet-1Billion benchmark and in real-world applications. Code is available at this https URL.

[CV-66] Unified Safe In-context Image Generation in Multimodal Diffusion Transformers via Restricting Unsafe Information Flows ICML26

Link: https://arxiv.org/abs/2606.06875
Authors: Xiang Yang, Feifei Li, Mi Zhang, Geng Hong, Xiaoyu You, Mi Wen, Min Yang
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments: ICML26

Abstract:Diffusion transformers (DiTs) equipped with multimodal attention (MM-Attn) have become a dominant paradigm for image generation. However, preventing the generation of harmful content remains a critical challenge, particularly in image-to-image (I2I) editing tasks. Existing safety mechanisms are primarily designed for text-to-image (T2I) synthesis or U-Net-based architectures, which limits their effectiveness for unified safety mitigation in DiT-based frameworks. To bridge this gap, we propose Unified Visual Safety Regulator (UVR), a training-free safe generation framework that regulates unsafe semantics in generated images. UVR is grounded in an analysis of attention dynamics from the perspective of information flow in MM-Attn. We identify a task-independent start-up stage, during which unsafe semantics in output patches rapidly emerge and can be accurately localized, followed by task-specific semantic amplification and interference stages, where harmful signals are further propagated and entangled with benign content. Based on these observations, UVR mitigates unsafe generation through unified, targeted attention modulation and explicit restriction of harmful information flow over the identified unsafe output patches. Experiments across various concepts show that UVR achieves state-of-the-art safety performance, with erase rates of 91% and 77% in image synthesis and editing tasks, respectively, while preserving visual quality and fidelity with minimal degradation. Code is available at this https URL.

[CV-67] EgoPressDiff: Multimodal Video Diffusion for Egocentric UV-Domain Hand-Pressure Estimation ICASSP2026

Link: https://arxiv.org/abs/2606.06872
Authors: Yuan Zeng, Zilue Gao, Yujia Shi, Zongqing Lu, Wenming Yang, QingMin Liao
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to IEEE ICASSP 2026

Abstract:Estimating hand-surface contact pressure from an egocentric view is crucial for AR/VR devices, robotic imitation, and ergonomic analysis. Existing methods often discretize the pressure signal and process frames independently, leading to quantization errors and temporal inconsistencies. We present EgoPressDiff, a conditional video diffusion framework that generates UV-pressure maps from visual input. The core of our approach is a multi-modal conditioning strategy, introducing a PoseNet and a Vertex Encoder to efficiently extract features from hand pose and 3D mesh vertices. These signals, along with depth information, guide the generative process to ensure the pressure fields are physically grounded. To effectively fuse these heterogeneous features, we further propose a Distribution-Calibrated Spatial Layer, which aligns their statistical properties before combination. Evaluated on the EgoPressure ego-view setting, EgoPressDiff achieves state-of-the-art results, improving Volumetric IoU by over 34% relative to the prior baseline, while reducing MAE and maintaining high temporal accuracy. Our project page is at this https URL.

[CV-68] Multi-FRuGaL: Multimodal Flexible Redundancy-aware Decomposed Gated Learning for Cancer Diagnosis and Prognosis

Link: https://arxiv.org/abs/2606.06867
Authors: Sanket Kachole, Siddhesh Thakur, Shubham Innani, Sanyukta Adap, Suhang You, Carla Pitarch-Abaigar, Spyridon Bakas
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Modern medicine relies on heterogeneous data sources spanning radiology, pathology, text reports, and structured clinical information. However, real-world patient data are frequently incomplete, with missing or sparsely acquired modalities, limiting the effectiveness of standard multimodal fusion approaches. To this end, we propose the Multimodal Flexible Redundancy-aware decomposed GAted Learning (Multi-FRuGaL) framework, a decomposition-aware, adaptive gated intermediate-fusion framework that performs modality-level representation learning under missing data. Multi-FRuGaL integrates per-modality encoders with a signal decomposition layer, an input-conditioned gating network, and an information-aware fusion objective to separate redundant from modality-specific complementary signals, selectively upweighting informative modalities and suppressing redundant or noisy inputs, and remaining well-defined even when multiple modalities are absent. We evaluate Multi-FRuGaL on two multimodal head and neck cancer cohorts: the HANCOCK challenge dataset (N = 763) comprising five modalities and two prognostic endpoints (5-year survival and 2-year recurrence), and the HECKTOR challenge dataset (N = 588) comprising three modalities for human papillomavirus (HPV) status classification. Multi-FRuGaL consistently achieves higher mean performance than the evaluated baselines across multiple tasks, improving AUC from 0.601 to 0.8496 for survival, from 0.672 to 0.8102 for recurrence, and achieving 0.975 AUC for HPV prediction on HECKTOR. For survival analysis, it further achieves a concordance index of 0.6814 for overall survival, 0.7421 for recurrence-free survival, and 0.7143 for progression-free survival on HANCOCK, and 0.7203 for recurrence-free survival on HECKTOR. Qualitative analyses further show that Multi-FRuGaL learns discriminative and robust multimodal representations, even under severe missing-modality conditions.

[CV-69] LRMIL: Efficient Low-Resolution Multiple Instance Learning via High-Resolution Knowledge Distillation for Whole Slide Image Classification

Link: https://arxiv.org/abs/2606.06864
Authors: Yonghan Shin, Won-Ki Jeong
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Multiple instance learning (MIL) has become a standard paradigm for whole slide image (WSI) analysis in digital pathology, as it enables slide-level prediction without dense annotations. Existing MIL methods typically rely on exhaustive extraction and encoding of high-resolution patches. However, this practice suffers from two critical limitations in real-world clinical settings: it struggles to capture global visual cues at lower magnifications, and incurs substantial computational overhead due to the massive number of high-resolution patches per slide. To address these limitations, we propose an efficient low-resolution multiple instance learning (LRMIL) framework that transfers high-resolution knowledge to low-resolution representations. LRMIL adopts a two-stage distillation strategy. First, patch-level cross-resolution distillation aligns low-resolution patch embeddings with high-resolution representations. Second, slide-level knowledge distillation trains a low-resolution student MIL model under both slide-level supervision and teacher guidance. At inference time, LRMIL operates exclusively on low-resolution patches, substantially reducing data preprocessing and computational cost. Extensive experiments on multiple WSI benchmarks demonstrate that LRMIL consistently outperforms state-of-the-art MIL methods while achieving more efficient inference. These results highlight LRMIL as a practical and scalable solution for WSI analysis in clinical pathology.

[CV-70] FS-DVS: A Frequency-Selective Dynamic Visual Sensing Paradigm for Enhancing Information Completeness

Link: https://arxiv.org/abs/2606.06856
Authors: Feiyu Ji, Xiaokang Yang, Xiaoyun Yuan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Dynamic vision sensors (DVS) offer exceptional temporal resolution and dynamic range by asynchronously reporting pixel-level intensity changes. However, conventional DVS rely on a per-pixel independent triggering mechanism, ignoring the spatial integration performed by biological retinal ganglion cells (RGCs). Consequently, they lack the contrast sensitivity function (CSF) and its inherent sensitivity to mid-spatial frequencies, which inevitably leads to information incompleteness due to sub-threshold signal loss. To bridge this gap, we propose FS-DVS (Frequency-Selective Dynamic Vision Sensor), a novel paradigm that integrates a learnable spatial filter strictly preceding the event triggering process to mimic the RGC aggregation mechanism. By developing a differentiable event simulation framework, the spatial filter can be optimized end-to-end with downstream tasks. Our study reveals that starting from a delta function, the learned spatial filters spontaneously evolve into center-surround patterns that emphasize mid-frequency components, consistently aligning with human CSF. Beyond achieving substantial performance gains in object detection and action recognition, the consistent convergence to human-like CSF characteristics across different tasks underscores the universality of this mid-frequency selective mechanism. Compared to naively increasing sensor sensitivity or relying on post-processing, our paradigm achieves selective information enhancement with high noise resilience, providing a robust, biologically plausible blueprint for next-generation neuromorphic sensors.

[CV-71] MotionEnhancer: Leveraging Video Diffusion for Motion-Enhanced Vision-Language Models CVPR2026

Link: https://arxiv.org/abs/2606.06853
Authors: Yifan Xu, Chao Zhang, Ruifei Ma, Fei Gao, Zhifei Yang, Jiaxing Qi, Zhipeng Chen
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by CVPR 2026

Abstract:Recent work has shown a remarkable ability to extend Vision-Language Models (VLMs) to video understanding tasks. While current VLMs excel at event- or story-level understanding, their ability to capture fine-grained motion details remains limited, primarily due to their focus on high-level static semantic structures and macro-event logic. In contrast, Video Diffusion Models (VDMs) are adept at modeling dynamic motion patterns, benefiting from large-scale video data and the intrinsic requirement of temporal generation. In this paper, we introduce MotionEnhancer, a novel approach that leverages motion priors distilled from a powerful video diffusion model as auxiliary supervision to enhance the motion understanding capability of a VLM via attention alignment. MotionEnhancer comprises two simple parameter-free modules, Motion-sensitive Head Selection (MHS) and Motion-salient Text Token Identification (MTTI), to directly extract and optimize motion-related attentions from the VDM in a computation-only manner. MotionEnhancer provides a scalable solution for motion understanding without additional training parameters, modifications to existing architectures, or tool calling. Extensive experiments demonstrate that MotionEnhancer can achieve consistent improvements over state-of-the-art VLMs on two motion-level video understanding benchmarks, especially on motion-related metrics.

[CV-72] CFRNet: Cycle-Consistent Fixed-Point Training for Real-Time Blind Face Restoration on Consumer Embedded NPUs

Link: https://arxiv.org/abs/2606.06850
Authors: Fuchen Li, Xinyang Wang, Yahui Zhang, Yuhan Chen, Jiahong Guo, Zhuohan Qin, Wenbo Ma
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages. Code and project page will be released

Abstract:Blind face restoration on consumer devices has to balance image quality against speed and memory. Strong methods such as GFPGAN and CodeFormer give good perceptual quality, but they rely on large pretrained generative priors and on operators such as attention, codebook lookup, and style modulation that are hard to compile and quantize on the small neural processing units (NPUs) used in consumer hardware. Small convolutional restorers run fast enough, but they tend to over-smooth and to leave artifacts around the eyes, nose, and mouth. We present CFRNet, a 2.0M-parameter ResNet-style restorer for on-device use at 256×256, the common face-crop size on consumer NPUs. The main idea is Cycle-Consistent Fixed-Point Training (CCFP). Instead of training the network for one pass and then running it several times by hand, we train it to act as a fixed-point operator, so that applying it again to a restored face does not change the face. CCFP uses three training losses, namely progressive multi-cycle supervision, an idempotence loss, and a re-degradation cycle loss, and it adds no cost at inference. To compare fairly under our deployment limits, we retrain all baselines from scratch at the same 256×256 resolution. On a 300-image test set, CFRNet reaches the best perceptual score (LPIPS 0.250 at three cycles, which is 31% lower than one cycle) and also the best PSNR and SSIM at two cycles. It runs in about 23 ms per cycle in INT8 on a HiSilicon Hi3402 NPU, while the same baselines cannot be compiled to that chip. The cycle count k acts as a simple quality knob that needs no retraining: PSNR is best at k = 2 and LPIPS keeps improving up to k = 3. We further show that the same idea works with a plain CNN that is even easier to deploy, and we run the model in real time on an in-car driver-monitoring board.
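
The fixed-point idea can be illustrated with a toy example. The sketch below is an assumption-laden illustration, not CFRNet's training code: `toy_restore` is a hypothetical stand-in for the network, and the mean absolute difference is used as the idempotence measure only because the abstract does not state the actual loss form.

```python
def apply_cycles(restore, x, k):
    """Apply the restorer k times in sequence (k is the inference-time cycle count)."""
    for _ in range(k):
        x = restore(x)
    return x

def idempotence_residual(restore, x):
    """Mean absolute change when the restorer is applied to its own output.

    A perfect fixed-point operator would give 0 here (assumed metric).
    """
    once = restore(x)
    twice = restore(once)
    return sum(abs(a - b) for a, b in zip(twice, once)) / len(once)

# Hypothetical toy restorer: moves each value halfway toward a clean target.
TARGET = [1.0, 0.0, 0.5]

def toy_restore(x):
    return [xi + 0.5 * (ti - xi) for xi, ti in zip(x, TARGET)]

degraded = [0.0, 1.0, 0.5]
print(apply_cycles(toy_restore, degraded, 2))       # two cycles
print(idempotence_residual(toy_restore, degraded))  # residual on the raw input
```

In this toy setting the residual shrinks as more cycles are applied, which mirrors the abstract's description of the cycle count k as a quality knob.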

[CV-73] Think Like a Pilot: Fine-Grained Long-Horizon UAV Navigation

Link: https://arxiv.org/abs/2606.06836
Authors: Xiangyi Zheng, Xiangyu Wang, Qinan Liao, Zimu Tang, Yue Liao, Dongyue Lyu, Guodong Wang, Junjie Liu, Si Liu
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Language-guided UAV agents must execute long-horizon semantic instructions while producing smooth, physically feasible continuous flight commands, yet existing Vision-Language Navigation (VLN) benchmarks typically use discrete or coarse actions and existing UAV Vision-Language-Action (VLA) tasks focus on short, atomic maneuvers. To address this gap in UAV task settings, we introduce FLIGHT, a Fine-grained Long-horizon Instruction-Guided benchmark for Hybrid UAV navigation and reasoning Tasks, which combines multi-stage instructions with dense 6-DoF trajectory annotations across two dataset splits: Fine-grained VLN and Long-horizon Flow. To endow the UAV agent with the capability of real-time in-flight reasoning over task execution status and mission planning, while simultaneously accommodating high-frequency, real-time precise control, we further propose FLIGHT VLA, an asynchronous architecture that decouples a low-frequency Streaming Pilot Vision-Language Model (VLM) for task-state reasoning from a high-frequency diffusion action model for continuous control, supervised by explicit Pilot Reasoning texts that summarize the current flight state and anticipate the next subgoal. In closed-loop evaluation, FLIGHT VLA consistently surpasses representative VLN and VLA baselines on our FLIGHT benchmarks, achieving stronger multi-stage completion, subgoal adherence, and terminal control. Its trained Streaming Pilot Reasoning VLM further improves UAV video reasoning, validating the effectiveness of our design.

[CV-74] AdaGRPO: A Capability-Aware Adaptive Enhancement for Flow-based GRPO

Link: https://arxiv.org/abs/2606.06828
Authors: Jiazi Bu, Pengyang Ling, Yujie Zhou, Yibin Wang, Yuhang Zang, Tianyi Wei, Xiaohang Zhan, Jiaqi Wang, Tong Wu, Xingang Pan, Dahua Lin
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Project Website: this https URL

Abstract:Group Relative Policy Optimization (GRPO) has demonstrated remarkable success in aligning text-to-image (T2I) flow models with human preferences. However, we have identified that the learning loop of current flow-based GRPO is fundamentally decoupled from the learner's current capability, suffering from critical blind spots at both prompt selection and advantage estimation: (i) Existing methods sample prompts randomly, overlooking the substantial impact of data selection on reinforcement learning (RL) efficacy, a factor proven crucial in GRPO for large language models; (ii) They evaluate sample quality relying solely on intra-group statistics, lacking a global perspective to accurately measure true policy improvement. To address these issues, we propose Adaptive GRPO (AdaGRPO), a novel capability-aware RL algorithm tailored for flow models. Specifically, AdaGRPO consists of two principal components: (i) Online Curriculum Filtering Strategy: Dynamically tracks the model's proficiency and adaptively selects prompts that best match its current learning boundary; (ii) Cross-Level Advantage Fusion: Synergistically integrates fine-grained intra-group advantages with macro-level global advantages, providing a comprehensive and unbiased policy evaluation. As a lightweight, plug-and-play module, AdaGRPO can be seamlessly integrated with existing frameworks such as Flow-GRPO, DanceGRPO, and Flow-CPS. Extensive experiments demonstrate that AdaGRPO consistently drives performance gains while significantly stabilizing GRPO training for flow models.
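
To make the contrast between intra-group and global advantages concrete, here is a minimal sketch. It is not AdaGRPO's actual fusion rule, which the abstract does not specify: the weighted sum, the `alpha` parameter, and the externally supplied `global_mean` / `global_std` (for example, running statistics over all prompts) are assumptions made for illustration. Only the group-normalized advantage follows the standard GRPO recipe.

```python
import statistics

def intra_group_advantages(rewards, eps=1e-6):
    """Standard GRPO-style advantage: normalize rewards within one prompt's group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def fused_advantages(rewards, global_mean, global_std, alpha=0.5, eps=1e-6):
    """Hypothetical fusion: weighted sum of intra-group and global advantages.

    global_mean / global_std are assumed to be reward statistics tracked
    across all prompts; alpha is an assumed mixing weight.
    """
    local = intra_group_advantages(rewards, eps)
    global_adv = [(r - global_mean) / (global_std + eps) for r in rewards]
    return [alpha * l + (1.0 - alpha) * g for l, g in zip(local, global_adv)]

# A group where every sample earns the same reward: intra-group statistics
# alone give zero signal, while the global term still credits the group.
print(intra_group_advantages([2.0, 2.0]))
print(fused_advantages([2.0, 2.0], global_mean=1.0, global_std=1.0))
```

The design point this illustrates is that purely group-relative normalization cannot tell a uniformly good group from a uniformly bad one, which is one plausible reading of the "global perspective" the abstract mentions.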

[CV-75] VideoSEG-O3: A Multi-turn Reinforcement Learning Framework for Reasoning Video Object Segmentation ICML2026

Link: https://arxiv.org/abs/2606.06819
Authors: Ming Dai, Sen Yang, Boqiang Duan, Boyuan Tong, Jiedong Zhuang, Wankou Yang, Jingdong Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICML2026

Abstract:Reasoning Video Object Segmentation (RVOS) demands a sophisticated integration of temporal dynamics, spatial details, and linguistic reasoning to achieve precise pixel-level localization. Existing methods are limited to reasoning over fixed initial inputs and lack the capacity to actively acquire further visual evidence, which is often essential for resolving complex references in long or intricate videos. To address this, we propose VideoSEG-O3, the first multi-turn reinforcement learning framework for RVOS that emulates the human "coarse-to-fine" cognitive process. It employs a multi-turn temporal-spatial chain-of-thought to capture fine-grained details by iteratively pinpointing critical intervals and keyframes. Additionally, to enable the policy to perceive segmentation quality beyond mere text probability of [SEG] during the RL stage, we introduce SEG-aware logit calibration, which integrates pixel-wise segmentation feedback directly into the token-level logits. Furthermore, we design a decoupled thinking trace to hierarchically decompose the reasoning process into temporal, spatial, and linguistic dimensions, and construct VTS-CoT, a specialized cold-start dataset featuring comprehensive reasoning trajectories. The code and models will be released at this https URL.

[CV-76] Breaking the Lock-in: Diversifying Text-to-Image Generation via Representation Modulation ICML2026

Link: https://arxiv.org/abs/2606.06813
Authors: Dahee Kwon, Haeun Lee, Jaesik Choi
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to ICML 2026. Code is available at: this https URL

Abstract:Recent text-to-image models built on large-scale Transformer backbones and flow-based objectives deliver strong text-image alignment and high visual quality, yet often produce overly similar samples under a fixed prompt. Existing diversity-enhancement methods alleviate this issue, but typically require expensive sampling or auxiliary optimization, incurring non-trivial overhead. To investigate the root cause of this homogeneity, we examine intermediate Transformer features and observe that the zero-frequency spatial average (DC) component rapidly converges across seeds early in generation, causing early trajectory lock-in that limits downstream variation. Building on this observation, we propose DC Attenuation for diVersity Enhancement (DAVE), a training-free representation-level intervention that selectively attenuates this component in the early regime. DAVE preserves the sampling pipeline with negligible overhead, improving prompt-consistent diversity while maintaining competitive image quality.
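
The DC-attenuation intervention can be sketched as a simple operation on token features. This is a minimal illustration under assumptions, not DAVE itself: the function name `attenuate_dc`, the scalar `gamma`, and the choice to treat the per-channel mean over spatial tokens as the "zero-frequency spatial average" are all assumed here, and the abstract does not say which layers or timesteps are modified.

```python
def attenuate_dc(tokens, gamma):
    """Scale the per-channel spatial mean (DC component) by gamma, keep the rest.

    tokens: list of N token feature vectors, each a list of C floats.
    gamma:  1.0 leaves features unchanged, 0.0 removes the DC component
            entirely (assumed parameterization).
    """
    n = len(tokens)
    channels = len(tokens[0])
    # DC component: mean over all spatial tokens, computed per channel.
    dc = [sum(tok[c] for tok in tokens) / n for c in range(channels)]
    # Subtracting (1 - gamma) * dc rescales the mean to gamma * dc while
    # leaving the token-to-token differences untouched.
    return [[tok[c] - (1.0 - gamma) * dc[c] for c in range(channels)]
            for tok in tokens]

features = [[1.0, 4.0], [3.0, 8.0]]
print(attenuate_dc(features, 0.5))
```

Because only the shared mean is scaled, the spatial variation between tokens is preserved, which is the property that lets such an intervention change the global trajectory without erasing local structure.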

[CV-77] MedSIGHT: Towards Grounded Visual Comprehension in Medical Large Vision-Language Models ICML2026

Link: https://arxiv.org/abs/2606.06760
Authors: Aofei Chang, Le Huang, Alex James Boyd, Parminder Bhatia, Taha Kass-Hout, Fenglong Ma, Cao Xiao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICML 2026

Abstract:Medical large vision-language models (Med-LVLMs) have recently achieved remarkable progress in vision-language comprehension and medical image segmentation. However, existing models still struggle to unify these two capabilities, which is essential for achieving clinical reasoning that connects visual findings with semantic interpretation. We present MedSIGHT, a unified framework that equips Med-LVLMs with structured, pixel-level understanding for grounded visual comprehension. MedSIGHT introduces a novel Region Perceiver module that produces region-centric tokens, encoding spatial information directly into the representation space of the language model. We further introduce a medical region codebook into the LLM vocabulary, allowing the model to generate discrete region codes as symbolic representations of anatomical and pathological regions. These codes are decoded through the Region Perceiver to reconstruct segmentation masks, achieving end-to-end spatial grounding. Lastly, MedSIGHT combines the Region Perceiver, Codebook, and LLM using our proposed progressive training strategy, which gradually and stably aligns these modules. Trained on only 72K multimodal instruction pairs, MedSIGHT achieves state-of-the-art performance across diverse imaging modalities on both medical comprehension and segmentation tasks.

[CV-78] Anchored Not Graded: Vision-Language Models Fail at Slant-from-Texture Perception

Link: https://arxiv.org/abs/2606.06714
Authors: Qian Zhang, Michal Golovanevsky, Fulvio Domini, James Tompkin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Human perception of surface slant from texture exhibits systematic, graded biases that emerge reliably in psychophysical experiments. Prior work showed that unsupervised CNNs reproduce several human-like biases, while supervised CNNs do not. Do Vision-Language Models (VLMs) exhibit similar competences? Across multiple VLM families and model scales, zero-shot and in-context prompting both produce distinctive failures: slant is predicted at only a small set of anchors (e.g., 0°, ±25°, ±45°) with little dependence on stimulus field of view, optical slant, or surface curvature. Supervised fine-tuning partially remediates the failure, but residual anchoring persists. While success in high-level vision-language benchmarks might not require sensitivity to low-level geometric cues, we interpret anchoring as a failure at the representation-to-output language interface: not necessarily an absence of geometric encoding, but a failure to express it in a graded form.

[CV-79] USU-Corn-WeedDB: A UAV RGB Image Dataset for Multi-Species Weed Detection in Forage Corn

Link: https://arxiv.org/abs/2606.06709
Authors: Utsav Bhandari, Saroj Burlakoti, Rhonda Miller, Sierra Young, Eric Westra, Aaron Etienne
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 4 figures, 1 table

Abstract:Weed pressure in forage corn production causes yield losses of up to 31.5%, yet site-specific weed management (SSWM) systems built on UAV imagery and deep learning remain constrained by the scarcity of field-representative training datasets. We present USU-Corn-WeedDB, a publicly available UAV RGB image dataset collected from a commercial forage corn field in Cache Valley, Utah, designed to support multi-class weed detection under both supervised and semi-supervised learning frameworks. RGB imagery was acquired on 27 June 2025 using an Autel EVO II Dual 640T V2 drone at ~10 m above ground level, yielding a ground sampling distance of approximately 0.48 cm/pixel. A total of 366 full-resolution images were tiled into 8,800 patches at 640 x 640-pixel resolution. Of these, 800 images were manually annotated for three weed species: common lambsquarters (Chenopodium album), redroot pigweed (Amaranthus retroflexus), and green foxtail (Setaria viridis), comprising 10,539 bounding-box instances, with the remaining 8,000 tiles retained as an unlabeled pool for semi-supervised experiments. This dataset reflects a natural class imbalance where redroot pigweed constitutes 53.86% of annotated instances, which was preserved intentionally to mirror real field conditions. To validate dataset utility, we trained 28 object detection models spanning five architecture families including YOLOv8, YOLOv9, YOLOv10, YOLO11, YOLO26, and RT-DETR under identical conditions without hyperparameter tuning. Test set mAP@0.5 ranged from 0.773 to 0.840, with lightweight models achieving competitive performance relevant to edge-deployed UAV systems. USU-Corn-WeedDB is publicly available at this https URL.

[CV-80] MMBU: A Massive Multi-modal Biomedical Understanding Benchmark to Probe the Perception Capabilities of Vision-Language Models

Link: https://arxiv.org/abs/2606.06696
Authors: Ryan D’Cunha, Alejandro Lozano, Xiaoxiao Sun, Daniel Vela Jarquin, Min Woo Sun, Josiah Aklilu, James Burgess, Yuhui Zhang, Ryan Nayebi, Paola Avila, Robayo, Jin Ye, Ming Hu, Zhongying Deng, Junjun He, Xin Chen, Yue Yao, Robert Tibshirani, Jeffrey J. Nirschl, Serena Yeung-Levy
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Vision and language models (VLMs) hold immense promise to transform biomedical imaging workflows, from detecting lesions in chest X-rays to profiling cellular features in microscopy. Realizing this potential, however, requires robust and fine-grained visual perception. Models need to correctly interpret subtle features in images, and they must do so across diverse biomedical modalities, scales, and contexts. Nevertheless, current benchmarks remain limited. To address these gaps, we introduce the Massive Multimodal Biomedical Understanding (MMBU) benchmark. It is the largest biomedical vision and language benchmark to date, covering 35 submodalities with rich structured metadata. It includes both open and closed versions of ungrounded classification, grounded classification, and object detection, enabling systematic evaluation of model performance across biological scales, clinical settings, and imaging modalities. Evaluating 15 open-weight and 2 frontier VLMs, we find that while medical adaptation provides measurable gains for some models, the high accuracy often reported on established benchmarks can mask deficiencies in visual perception and domain generalization.

[CV-81] S23DR 2026 Winning Solution

Link: https://arxiv.org/abs/2606.06695
Authors: Jan Skvrna, Miroslav Purkrabek, Lukas Neumann
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This text presents the winning solution to the S23DR 2026 challenge for structured 3D wireframe reconstruction from sparse SfM, fitted depth, and semantic segmentations. The method treats vertices as a conditional set and denoises 64 vertex tokens with a flow-matching DiT conditioned on Perceiver-style scene tokens. A global pass predicts the coarse structure, a hull-cropped second pass refines it, and a small multi-sample consensus step keeps the stochastic sampler well behaved. The final system ranked first on the private leaderboard, achieving HSS = 0.654.

[CV-82] RPC-GS: Gaussian Splatting with native RPC Rendering for Satellite Imagery

Link: https://arxiv.org/abs/2606.06690
Authors: Valentin Wagner,Sebastian Bullinger,Christoph Bodensteiner,Michael Arens
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We present RPC-GS, the first Gaussian Splatting framework for satellite imagery that operates natively with Rational Polynomial Camera (RPC) models. The RPC model is the de facto standard for representing the complex imaging geometry of modern pushbroom satellite sensors. To simplify rendering, prior satellite Gaussian Splatting methods replace the RPC model with perspective or affine camera approximations, leading to geometric errors during reconstruction. RPC-GS avoids these approximations by projecting Gaussian means and covariances directly through the RPC model during the splatting process. We embed the RPC model in a chain of carefully selected geo-coordinate transformations representing a mapping from splatting-suitable scene coordinates to image coordinates. To map the Gaussian covariance matrices, we derive a numerically robust Jacobian-based covariance projection for the (partially nonlinear) coordinate transformations. Since RPCs lack an explicit notion of camera depth, we integrate a metric ray-based depth formulation. We benchmark RPC, perspective, and affine camera models in a unified framework, with our native RPC renderer consistently achieving the lowest reconstruction error on leading satellite benchmark datasets, improving mean altitude error over perspective and affine approximations by 29.6% and 63.8% on DFC2019, and by 9.9% and 37.9% on IARPA2016. We release our code to support future research of Gaussian Splatting in the satellite imaging domain.

[CV-83] RigPAPR: Rig-Based Animation of Static Neural Point Clouds from a Fixed-Viewpoint Video

Link: https://arxiv.org/abs/2606.06685
Authors: Shichong Peng,Yanshu Zhang,Ke Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: An overview video is available at this https URL

Abstract:Static neural point reconstructions capture a subject at high fidelity from posed images. Given such a reconstruction, we aim to animate it to follow a monocular fixed-viewpoint driving video of the subject, whether captured or produced by image-to-video (I2V) generation, and to recover a rigged, re-posable 3D asset. Existing methods deform Gaussian splats through direct linear blend skinning (LBS) or mesh proxies, both of which are prone to joint-boundary artifacts under articulation, even with per-primitive corrections. We trace the artifact to the representation: each splat carries an individual shape calibrated in the canonical pose to tile with its neighbours. Under rigid LBS, each splat moves with its bone but cannot bend, so the canonical tiling breaks at joint boundaries into gaps and spikes. Proximity attention point rendering (PAPR) instead carries no per-primitive shape; each pixel is recomposed at render time from the deformed primitives’ positions, so the surface re-forms naturally with the articulation. We present RigPAPR, which auto-rigs a static PAPR cloud and drives it under direct LBS from a single fixed-viewpoint video, without mesh proxy, pose-dependent correction, or category template. On synthetic subjects, RigPAPR matches the strongest baseline at the supervised view and exceeds mesh-based and Gaussian-splatting baselines at novel views by 3+dB PSNR, with cleaner joint-boundary renderings of both synthetic and real subjects.
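
The direct linear blend skinning (LBS) step named in the abstract can be illustrated with a minimal 2D sketch. This is textbook LBS, not the authors' code: the helper names `rigid`, `apply`, and `lbs`, the 2D setting, and the toy bone transforms are all assumptions made for illustration. Each point is moved by every bone's rigid transform, and the results are blended with the point's skinning weights.

```python
import math

def rigid(theta, tx, ty):
    """2D rigid transform as a 2x3 matrix: rotate by theta, then translate."""
    c, s = math.cos(theta), math.sin(theta)
    return ((c, -s, tx), (s, c, ty))

def apply(T, p):
    """Apply a 2x3 rigid transform T to a 2D point p."""
    return (T[0][0] * p[0] + T[0][1] * p[1] + T[0][2],
            T[1][0] * p[0] + T[1][1] * p[1] + T[1][2])

def lbs(points, weights, bone_transforms):
    """Linear blend skinning: p' = sum_j w_j * (T_j applied to p)."""
    out = []
    for p, w in zip(points, weights):
        x = y = 0.0
        for wj, T in zip(w, bone_transforms):
            qx, qy = apply(T, p)
            x += wj * qx
            y += wj * qy
        out.append((x, y))
    return out

# Hypothetical two-bone rig: bone 0 stays put, bone 1 rotates by 90 degrees.
bones = [rigid(0.0, 0.0, 0.0), rigid(math.pi / 2, 0.0, 0.0)]
posed = lbs([(1.0, 0.0)] * 3, [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5)], bones)
```

In this toy rig, the point weighted equally between the two bones lands at (0.5, 0.5), between the two rigidly moved copies. Only point positions are deformed here; how a renderer re-forms the surface from those positions is outside this sketch.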

[CV-84] Adaptive Band Selection for Hyperspectral Classification with Spatially Disjoint Evaluation

Link: https://arxiv.org/abs/2606.06684
Authors: Ikram El-Hajri(1),Ouassim Karrakchou(1),Alejandro Mousist(2) ((1) International University of Rabat, Rabat, Morocco, (2) Thales Alenia Space, Spain)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 2 figures, 3 tables

Abstract:Hyperspectral band selection methods based on differentiable selectors can be sensitive to initialization and to extracting a final discrete subset, while prescribed band counts limit flexibility. We propose SGBR-HC (Spectral-Group Band Ranking with Hard-Concrete initialization), a two-stage method that uses a supervised spectral ranking to initialize trainable sparse gates rather than treating ranking as a fixed selection rule, letting the number of selected bands be determined by training. Stage-1 scores candidate bands from training pixels by class discriminability and spectral diversity; this ranking seeds the gate logits for Stage-2, which trains the sparse gates jointly with a spatial classifier. Under spatially disjoint evaluation on Pavia University and Houston 2013, verified by retraining a fresh classifier on the selected bands, SGBR-HC achieves the highest mean overall accuracy and Cohen’s kappa with approximately twenty bands. Bypassing Stage-1 degrades OA by 8.84 pp on Pavia University and 22.15 pp on Houston 2013, confirming the ranking prior’s role. Random pixel splits inflate OA on Pavia University by 30.56 pp, underscoring spatial leakage as a critical evaluation confound.

[CV-85] JA-SIREN: Deterministic Initialization for Sinusoidal Networks via Spectral Matching

Link: https://arxiv.org/abs/2606.06671
Authors: Mohammed Alsakabi,Kejia Hu,John M. Dolan,Ozan K. Tonguz
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Existing implicit neural representation (INR) approaches suffer from stochastic initialization that does not guarantee consistent or high-quality performance across runs, with variations reaching more than 2.5 dB (78%) in image regression. This variation is problematic for scientific computing and simulation, where result reproducibility is crucial. To address this problem, we present Jacobi-Anger Sinusoidal Representation Network (JA-SIREN), a deterministic initialization scheme for sinusoidal networks grounded in classical spectral analysis. By computing the Discrete Sine Transform (DST) of the target signal and leveraging the Jacobi-Anger expansion, we derive closed-form weights for a two-layer sinusoidal MLP that analytically match the network’s initial spectral response to the target signal, requiring no random seed or additional hyperparameter tuning. On the Kodak dataset, JA-SIREN achieves a mean PSNR of 67.18 dB, a 21.30 dB improvement over the best baseline. This is achieved with zero run-to-run variance, confirming that spectrally-informed initialization is a more effective and reproducible alternative to stochastic initialization for sinusoidal INRs.

[CV-86] Architecture-Adaptive Uncertainty Fusion for Deepfake Detection

Link: https://arxiv.org/abs/2606.06666
Authors: Ritesh Sharma,Mohammad Ghasemigol,Yuichi Motai
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Deepfake detection systems achieve near-perfect accuracy on benchmarks, yet forensic deployment demands reliable prediction uncertainty. Existing uncertainty quantification (UQ) methods rely on single sources and ignore that optimal uncertainty composition varies across architectures. We propose Correlation-Optimized Fusion (COF), an architecture-adaptive framework that fuses five complementary uncertainty sources – epistemic, aleatoric, calibration, conformal, and distributional – by maximizing Pearson correlation between fused uncertainty scores and prediction errors via constrained optimization on the probability simplex. COF requires no model modifications and only 42 s of weight optimization, compared to 20–45 h for a 5-model Deep Ensemble. Evaluation across eleven architectures on FaceForensics++ reveals a fundamental trade-off: under matched train/evaluation protocol, non-linear methods achieve approximately 5–6% higher in-domain correlation than COF (mean r = 0.438), but this reverses under distribution shift. On CelebDF, COF outperforms Random Forest in 9/11 architectures with up to 7.3x higher correlation (MaxViT-B: r = 0.249 vs. 0.034); RF degrades 85% cross-domain to r = 0.071, whereas COF retains substantially more signal (74% drop to r = 0.116). Cross-dataset evaluation on CelebDF and DFDC reveals catastrophic generalization failure across all methods: in-domain correlations of 0.41–0.47 collapse to near-zero externally (mean degradation 90.7%), with seven of eleven architectures exhibiting uncertainty inversion. These results establish COF as a practical, interpretable framework for controlled-distribution deployment and identify domain-adaptive UQ as the central open challenge for forensic deployment.
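
The fusion objective described in the abstract (pick simplex weights that maximize the Pearson correlation between the fused uncertainty score and the prediction errors) can be sketched as follows. This is an illustration under stated assumptions, not the paper's optimizer: a coarse grid search over the simplex stands in for the constrained optimization, and the function names and the toy data are hypothetical.

```python
import itertools
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def simplex_grid(k, steps):
    """Yield all k-dim weight vectors with entries i/steps that sum to 1."""
    for combo in itertools.product(range(steps + 1), repeat=k - 1):
        if sum(combo) <= steps:
            last = steps - sum(combo)
            yield tuple(c / steps for c in combo) + (last / steps,)

def fit_fusion_weights(sources, errors, steps=10):
    """Grid-search simplex weights maximizing corr(fused uncertainty, errors)."""
    best_w, best_r = None, -2.0
    for w in simplex_grid(len(sources), steps):
        fused = [sum(wi * s[i] for wi, s in zip(w, sources))
                 for i in range(len(errors))]
        r = pearson(fused, errors)
        if r > best_r:
            best_w, best_r = w, r
    return best_w, best_r

# Toy data (made up): two uncertainty sources and binary prediction errors.
errors = [0, 0, 1, 1, 0, 1]
src_a = [0.1, 0.2, 0.9, 0.8, 0.3, 0.7]
src_b = [0.5, 0.1, 0.4, 0.2, 0.9, 0.3]
weights, r_fused = fit_fusion_weights([src_a, src_b], errors)
```

Because the grid contains the simplex vertices, the fused score is never less correlated with the errors than the best single source on the data used for fitting. The abstract's cross-dataset results are a reminder that this guarantee does not carry over to shifted data.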

[CV-87] Inside the Visual Mind: Neuroscience-Motivated Concept Circuits for Interpreting and Steering Vision Transformers

Link: https://arxiv.org/abs/2606.06664
Authors: Tang Li,Yanlin Chen,Mengmeng Ma,Xi Peng
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: In Proceedings of the International Conference on Machine Learning, 2026. (acceptance rate 26.6%)

Abstract:Despite high accuracy, Vision Transformer (ViT) predictions can be driven by spurious cues, raising the need to understand their inner workings before safe deployment. Sparse autoencoders (SAEs) provide a promising lens for decomposing model representations into human-interpretable concepts, yet adapting SAE-based interpretation to ViTs remains challenging due to limited control over concept coverage and subjective, non-scalable feature interpretation. To fill the gaps, motivated by neuroscience-inspired principles, we propose ViSAE, a mechanistic interpretability toolbox for understanding ViT inner workings through concept circuits. ViSAE consists of three components: (1) A probing suite with 64K images and a 16K visually grounded concept vocabulary, improving concept coverage efficiency by 20x over ImageNet and interpretation accuracy by 28.7% over existing concept sets. (2) Top-down concept reading and Bottom-up circuit tracing algorithms that automatically recover ViT inner workings via concept circuits. (3) Applications for auditing and steering ViT behavior. Through concept editing, ViSAE improves the worst-group accuracy on WaterBirds by 48.2%, outperforming existing methods by 23.8%. Our data and code: this https URL.

[CV-88] From Pixels to Newtons: Predicting In Vivo Joint Contact Forces from Monocular Video

Link: https://arxiv.org/abs/2606.06631
Authors: Jessy Lauer
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Joint contact forces govern implant longevity, cartilage health, and rehabilitation outcomes, shaping who develops osteoarthritis, who recovers well from joint replacement, and who benefits from biomechanical interventions. Yet they remain measurable only invasively, in a few dozen patients with instrumented implants. I present a physics-free pipeline to predict instantaneous 3D hip and knee contact forces from an uncalibrated monocular video: no markers, force plates, electromyography, subject-specific imaging, or musculoskeletal model. Parametric body meshes are recovered per frame, encoded as kinematic features, and decoded into forces by a transformer whose pose stream is adaptively modulated at every layer by body shape, joint, side, activity text, and self-supervised video tokens (V-JEPA 2), unifying hip and knee in a single model. Under leave-one-subject-out cross-validation across 26 patients and 25 activity categories from the in vivo OrthoLoad database, the pipeline matches the accuracy of subject-specific musculoskeletal simulations (0.32 ± 0.08 BW RMSE for hip; 0.23 ± 0.03 BW for knee) and resolves peak force changes smaller than those reported for gait retraining and osteoarthritis progression. Applied zero-shot to an independent instrumented cohort, it rivals or outperforms prior published methods. Even without curated activity labels, video features alone preserve accuracy and enable end-to-end inference on raw footage. Driven by the predictor, a generative motion prior produces biomechanically plausible variants with reduced peak loading, rediscovering strategies from the predictive simulation literature. This pipeline establishes uncalibrated monocular video as a viable modality for estimating joint loading, opening a path toward retrospective analysis of archived clinical recordings, primary-care screening, and at-home rehabilitation tracking.

[CV-89] What Matters When Cotraining Robot Manipulation Policies on Everyday Human Videos?

Link: https://arxiv.org/abs/2606.06627
Authors: Richard Li,Aditya Prakash,Andrew Wen,Saurabh Gupta,Yilun Du,Pulkit Agrawal
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: The project website is here: this https URL

Abstract:Human video datasets used for cotraining robot manipulation policies largely consist of curated demonstrations where motions are orchestrated to resemble robot behavior and 3D hand poses are captured with specialized hardware. A more plentiful source of data is everyday Internet video, but it is an open question what factors enable transfer from such videos to robots. We investigate this using a new dataset of 532 human videos with 28 hours of high-quality triangulated hand labels and natural motions. We find that hand pose quality affects transfer, but even with accurate hands, the inherent motion gap hinders transfer unless the vision and policy networks specialize to each embodiment. Our cotraining recipe yields consistent improvements, with an absolute success rate gain of 29.7% in the low-robot-data regime across six manipulation tasks.

[CV-90] Direct 3D-Aware Object Insertion via Decomposed Visual Proxies ICML2026

Link: https://arxiv.org/abs/2606.06601
Authors: Jingbo Gong,Yikai Wang,Yushi Lan,Yuhao Wan,Ziheng Ouyang,Rui Zhao,Ming-Ming Cheng,Qibin Hou,Chen Change Loy
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: ICML 2026; Project Page: this https URL

Abstract:Object insertion aims to seamlessly composite a reference object into a specified region of a background image. Recent diffusion-based methods achieve high visual quality but formulate insertion as a simple 2D inpainting task, providing no explicit control over the object’s 3D pose and limiting their practical applicability. We propose DIRECT (Decomposed Injection for Reference Composition and Target-integration), a novel framework that integrates interactive pose manipulation with high-fidelity 2D image synthesis to enable pose-controllable object insertion. Our method decomposes the insertion conditions into three complementary components: appearance guidance capturing visual details from the reference object, geometry guidance derived from the user-adjusted 3D proxy, and context guidance from the target background. By injecting them through separate pathways, DIRECT avoids feature entanglement and simultaneously preserves reference appearance, follows the user-specified pose, and adapts the object to the target scene. We also introduce an automated data construction pipeline to improve the diversity and quality of training data. Experiments show that DIRECT outperforms previous methods in both geometric controllability and visual quality.

[CV-91] Synthetic Benchmarks Overstate Forward-Forward Scaling: Real-Data Limits of Layer-Local Training

Link: https://arxiv.org/abs/2606.06539
Authors: Yucheng Chen
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: 23 pages, 6 figures

Abstract:Forward-Forward (FF) learning [Hinton, 2022] replaces backpropagation with strictly layer-local goodness updates. Recent FF-CNN work has narrowed the gap to BP on 32x32 benchmarks, raising the question of whether layer-local training is becoming a viable alternative at realistic scale. To probe this rigorously, we develop DTG-FF – dynamic temperature goodness, decoupled normalization, and multi-layer fusion – as an instrument that sets FF-family state of the art across nine real-data benchmarks (91.8% CIFAR-10 and the first FF baseline at ImageNet-100 224x224), and use it to audit how far layer-local training actually scales. (1) Real-data scaling. Under identical recipe and backbone, an architecture-matched BP-DeepSup baseline beats DTG-FF by 2.40/5.93 pp on CIFAR-10/CIFAR-100, and the gap widens with class count. At 224x224 the same instrument reaches only 49.4% – the first FF baseline at this scale, versus typical BP above 75% [Tian et al., 2020] – exposing a real-data ceiling invisible at 32x32. (2) Synthetic vs. real K-conflict. DTG-FF increasingly outperforms BP as class count K grows on synthetic teacher-student tasks, yet on real images the FF-BP gap reverses sign and widens with K. A within-dataset CIFAR-100 coarse vs. fine probe isolates label-hierarchy from image distribution: synthetic K-sweeps confound output dimensionality with fine-grained discrimination difficulty and thereby overstate FF transferability. (3) Systems audit. FF can be implemented without storing depth-wide activations, but on commodity 8 GB hardware standard BP+gradient-accumulation reaches 4.18 GB / 157 imgs/s versus DTG-FF’s 7.90 GB / 138 imgs/s, so a memory-based justification for FF at this scale is not supported under fair baselines. 

[CV-92] WorldBench: A Challenging and Visually Diverse Multimodal Reasoning Benchmark

Link: https://arxiv.org/abs/2606.06538
Authors: Yida Yin,Harish Krishnakumar,Chung Peng Lee,Boya Zeng,Wenhao Chai,Shengbang Tong,Wenhu Chen,Hu Xu,Xingyu Fu,Gabriel Sarch,Aleksandra Korolova,Zhuang Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:In real-world applications, models are expected to perform reliably across diverse settings. Yet, many existing multimodal benchmarks expand task types without capturing the visual diversity needed to handle open-ended visual inputs. We present WorldBench, a challenging and visually diverse reasoning benchmark to evaluate Multimodal Large Language Models (MLLMs). We build a taxonomy of thousands of visual concepts across multiple domains (e.g., living things). Guided by this taxonomy, we curate a broad collection of images from search engines and existing datasets to comprehensively represent the visual world. Through structured trial-and-error, we manually design challenging questions that frontier MLLMs fail to answer. On quantitative and human evaluations, WorldBench achieves higher visual diversity than any existing diverse benchmark. Evaluating 15 MLLMs on WorldBench reveals weaknesses in visual understanding: even the strongest model reaches only 64.0% accuracy, while some models perform marginally above chance-level. We hope our work highlights the importance of visual diversity in building multimodal benchmarks.

[CV-93] Attention-Guided Autoencoder Fusion for Insulator Defect Detection Using UAV Transmission-Line Imaging

Link: https://arxiv.org/abs/2606.06536
Authors: Malak Allam,Khaled Shaban,Ali Hamdi
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Automated defect detection in high-voltage transmission-line insulators remains challenging due to severe class imbalance, large scale variation, and the small spatial extent of defect instances in Unmanned Aerial Vehicle (UAV) imagery. To address these challenges, this paper proposes AE-YOLO, an Attention-Guided AutoEncoder-Enhanced YOLO framework for robust insulator defect detection. The architecture integrates lightweight bottleneck autoencoders within a Feature Pyramid Network-Path Aggregation Network (FPN-PAN) neck. This preserves anomaly-sensitive information during multi-scale feature fusion. Convolutional Block Attention Modules (CBAM) are used throughout the backbone, enhancing feature discrimination and suppressing background interference. The framework also introduces a variance-maximizing autoencoder regularization strategy, which encourages diverse, defect-discriminative latent representations. The network trains using a unified objective that combines focal loss, Complete IoU (CIoU) loss, and autoencoder regularization to address foreground-background imbalance and improve localization accuracy. During inference, Weighted Boxes Fusion (WBF) combines predictions from YOLOv8, YOLOv10, and YOLO11. An autoencoder-guided confidence boosting mechanism improves sensitivity to rare defect categories. Experiments on the Insulator-Defect Detection dataset show that AE-YOLO with an EfficientNetV2 backbone achieves 95.10 percent mAP at 0.5, 96.40 percent precision, and 93.80 percent recall. This performance surpasses the strongest YOLO-family baseline by 5.0 points in mAP at 0.5 and 6.7 points in recall. These results confirm the effectiveness and adaptability of the framework. The model is a practical and scalable solution for UAV-based transmission-line inspection and defect monitoring.

[CV-94] GOPAgen: Motion-Aware and Efficient Agentic Long-Video Understanding with Structural Memory and Hierarchical Reasoning

Link: https://arxiv.org/abs/2606.06532
Authors: Haozhe Chi,Yang Jin,Yadong Mu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Despite significant progress in agentic long video understanding, existing methods still lack detailed motion comprehension coupled with an efficient memory architecture. In this paper, we propose GOPAgen, a novel approach that first integrates video codec into the video understanding framework via a meticulously designed motion agent trained on Groups of Pictures (GOPs) from video codec. We further develop a GOP tree reasoning algorithm, which is naturally aligned with video codec and enhances the model’s ability to understand local detailed motions in videos. Additionally, we carefully design a structural memory mechanism that integrates local motion information with detailed captions in structural pages, and propose an efficient coarse-to-fine zoom-in algorithm to fully exploit the structural memory. Furthermore, we incorporate a motion vector database into the framework to enable efficient retrieval of motion vectors at different granularities. Overall, our method achieves superior Video Question Answering (VQA) performance on various video understanding benchmarks, including MotionBench and Egoschema, thereby demonstrating the superiority of our proposed framework.

[CV-95] Applying Deep Learning for cockpit segmentation in the context of mixed reality

Link: https://arxiv.org/abs/2606.06520
Authors: Alexandre Leles Sousa,Pedro de Oliveira Nielson,Erick Oliveira Rodrigues,Rafael Francisco dos Santos,Giovani Bernardes Vitor
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: XXV Congresso Brasileiro de Automática - CBA 2024

Abstract:Computer vision is an area that has been growing continuously. With the advance of technologies with a first-person view, new development opportunities have emerged inside the area. Mixed reality promotes virtual environments with objects from the physical world shown in real time. For that, it’s necessary to be concerned with the immersion of the user in this simulated environment, increasingly seeking to bring it closer to a possible desired reality. This paper proposes the development of image processing in order to perform the segmentation of images to identify what is foreground and background in order to facilitate the union of virtual and real images. Thus, the present work obtains real images of the user using the off-highway truck simulator CAT793F, through a camera, to be able to perform the segmentation of such images with artificial intelligence. The convolutional neural network architectures “U-net” and “DeepLabV3+” are applied to perform image segmentation. As a result, metrics with around 90% accuracy were presented and the best model was determined.

[CV-96] A Geometric Gaussian Mixture Representation of Plane Curves

Link: https://arxiv.org/abs/2606.06505
Authors: Ali Darijani,Benedikt Stratmann,Jürgen Beyerer
Categories: Computational Geometry (cs.CG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Differential Geometry (math.DG)
Comments:

Abstract:We introduce a user defined probabilistic polygonal representation for plane curves. Given a curve, we select vertices on the curve and connect consecutive vertices by line segments to obtain a polygonal approximation. Each segment is equipped with a user defined uncertainty parameter in the normal direction. This yields a collection of thin probabilistic geometric primitives that retain the geometry of the underlying curve while extending it beyond the idealized deterministic one dimensional formulation. For each segment, we define a random variable that is uniformly distributed in the tangent direction of the segment and Gaussian distributed in the normal direction of the segment. By matching the first and the second central moments, this construction induces a Gaussian component whose mean lies at the segment midpoint and whose covariance encodes both tangential and normal uncertainty. Combining the segment wise components with appropriate weights yields a Gaussian Mixture Model (GMM) representation of the user defined probabilistic polygonal representation of the plane curve. The proposed framework provides an analytically tractable probabilistic model that preserves local geometry and uncertainty in the normal direction. It applies to smooth, closed, open, non regular, and self intersecting plane curves, allows adaptive discretization and varying uncertainty in the normal direction, and as a result supports uncertainty aware geometric modeling. Experiments on a collection of canonical plane curves show that the resulting GMMs capture local tangent, local normal, and local arc length, so that the global shape of the underlying curves is faithfully captured as well. The representation is particularly relevant for applications in uncertainty aware CAD and digital twins, probabilistic obstacle modeling in robotics, and probabilistic trajectory planning.
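
The moment-matching construction described in the abstract can be sketched as follows. A uniform distribution over a segment of length L has tangential variance L²/12, so each segment maps to a Gaussian with its mean at the midpoint and covariance (L²/12)·t tᵀ + σ²·n nᵀ, where t and n are the unit tangent and normal. The function names are hypothetical, and weighting components by segment length is an assumption; the abstract only says "appropriate weights".

```python
import math

def segment_to_gaussian(a, b, sigma_n):
    """Moment-match segment a->b (uniform along tangent, Gaussian along normal)."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    length = math.hypot(dx, dy)
    tx, ty = dx / length, dy / length   # unit tangent
    nx, ny = -ty, tx                    # unit normal
    var_t = length ** 2 / 12.0          # variance of a uniform on [-L/2, L/2]
    var_n = sigma_n ** 2                # user-chosen normal uncertainty
    mean = ((a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0)
    off = var_t * tx * ty + var_n * nx * ny
    cov = [[var_t * tx * tx + var_n * nx * nx, off],
           [off, var_t * ty * ty + var_n * ny * ny]]
    return mean, cov, length

def curve_to_gmm(vertices, sigmas):
    """Return (weight, mean, cov) per segment; weights proportional to length (assumed)."""
    comps = [segment_to_gaussian(vertices[i], vertices[i + 1], sigmas[i])
             for i in range(len(vertices) - 1)]
    total = sum(c[2] for c in comps)
    return [(c[2] / total, c[0], c[1]) for c in comps]

# Toy open polyline with two segments and per-segment normal uncertainty.
gmm = curve_to_gmm([(0.0, 0.0), (2.0, 0.0), (2.0, 1.0)], [0.1, 0.2])
```

For the first segment, from (0, 0) to (2, 0) with σ = 0.1, this gives mean (1, 0) and covariance diag(1/3, 0.01), so the component is elongated along the segment and thin across it.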

[CV-97] Semantic-Structural Alignment for Generative Pictorial Charts

Link: https://arxiv.org/abs/2606.06498
Authors: Zhida Sun,Yulin Zhang,Zheng Gu,Min Lu,Bongshin Lee,Daniel Cohen-Or,Hui Huang
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 17 figures, Accepted to ACM TOG

Abstract:Traditional statistical graphics are precise but often lack the visual appeal, memorability, and engagement of pictorial charts. We present a generative framework for the automated synthesis of pictorial charts that bridges the gap between semantic expression and structural faithfulness. Rather than treating charts merely as images to be stylized, we frame the problem as a dual-conditioned generation task guided by two parallel external control signals: a text prompt capturing the semantic context of the editing intent, and a context image providing the abstract statistical chart’s global structure. To reinforce these controls within a Multi-Modal Diffusion Transformer, we introduce two complementary feature-level mechanisms: structural alignment to anchor spatial layouts to the input chart, and semantic alignment to transfer expressive textures from reference images. Generalizing across major visual channels (i.e., length, area, angle, and position) and diverse semantic domains, our method produces pictorial charts that are both artistically compelling and structurally consistent. Extensive quantitative evaluations and perceptual user studies demonstrate that our framework outperforms traditional controllable generation and image editing baselines, providing a foundation for high-fidelity, data-driven generative modeling in expressive visual storytelling. Project page: this https URL.

[CV-98] Impact of Synthetic Lesional MR Images in Automated Focal Cortical Dysplasia Detection in Low-Data Scenarios

Link: https://arxiv.org/abs/2606.07381
Authors: Prabhjot Kaur,Hakim Ouaalam,Sedat Kandemirli,Sanjay P. Prabhu,Simon K. Warfield
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Background and Purpose: Automated detection of focal cortical dysplasia (FCD) requires large volumes of voxelwise lesion-delineated MRI data, which are difficult to acquire. This study aims to generate synthetic MRI data exhibiting FCD, assess their realism, and evaluate their impact on automated FCD detection, particularly in reducing the need for manual annotations. Methods: T1-weighted (T1w) and T2-weighted Fluid-Attenuated Inversion Recovery (FLAIR) MRI scans from 131 FCD patients and 90 healthy controls from multiple (3) sites were retrospectively studied. Synthetic MRIs were generated by conditioning a generative network on binary FCD masks. Two neuroradiologists identified real images from a random set of 14 real and 14 synthetic scans. Three nnU-Net models were trained to detect FCD using: (i) real-only (35 FCD / 35 controls), (ii) real (35 FCD / 35 controls) plus synthetic augmentation, and (iii) expanded real data (70 FCD / 70 controls). Results: Experts showed limited ability to distinguish real from synthetic images, with classification accuracy of 60% for T1w and 70% for FLAIR (inter-rater agreement kappa = 0.86). Augmenting automated FCD detection with synthetic data increased sensitivity by 8.14% (p = 0.12) and improved model confidence at true lesion sites (0.83 +/- 0.11 to 0.89 +/- 0.12; p = 0.02). The expanded real-data model further improved sensitivity to 73.8% (p < 0.001) and confidence to 0.90 +/- 0.14 (p = 0.01). Conclusion: Conditional generative networks can generate realistic synthetic FCD-MRIs, reducing labeled data needs by approximately 20% while maintaining equivalent sensitivity. Equivalent amounts of real data, when available, remain more effective than synthetic augmentation.
Related DOI: https://doi.org/10.1111/jon.70137

[CV-99] Beyond Backscatter: InSAR coherence from detected SAR images

Link: https://arxiv.org/abs/2606.07374
Authors: Francescopaolo Sica,Andrea Pulella,Michael Schmitt
Categories: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV)
Comments: 27 pages, 20 figures

Abstract:In this work, we propose a deep learning framework for coherence regression directly from detected SAR images, without the need for accurate coregistration. A Residual U-Net is trained using coherence maps derived from precisely coregistered Sentinel-1 SLC data to learn the relationship between backscatter magnitudes and coherence. The model is trained on 12-day SLC pairs and evaluated across different datasets, including coregistered SLC products and open access analysis-ready data, covering diverse radiometric properties, geometries, and locations. Experimental results demonstrate that the proposed method achieves high-resolution coherence regression with improved accuracy compared to existing intensity-based approaches. The network generalizes well across diverse geographical locations and even across different temporal baselines that were never seen at training time. Additionally, the ability to operate on globally available analysis-ready data, such as ground range detected data, e.g., distributed through Google Earth Engine, enables its large-scale application in mission design, change monitoring, and diverse mapping tasks.

[CV-100] Beyond Universality: The GCC-FER Dataset and Culture-Aware Adaptation for Dynamic Facial Expression Recognition

链接: https://arxiv.org/abs/2606.07063
作者: Sonalika Singh,Jyotirindra Dandapat,Avishi Razdan,Kshipra V. Moghe,Puneet Gupta,Lalan Kumar
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Dynamic Facial Expression Recognition (DFER) is a key enabling technology in affective computing, human-computer interaction, and intelligent multimedia systems. Despite the significant influence of cultural nuances on FER performance, most existing FER systems assume that emotional expressions are universally consistent across populations. This variation can be attributed to systematic differences in facial muscle activation patterns across cultures. A major challenge in advancing cross-cultural FER lies in the scarcity of culturally diverse benchmark datasets. To address this, a new hybrid multicultural video dataset termed Global Cross-Cultural Facial Expression Recognition (GCC-FER) is introduced. GCC-FER comprises 23,934 video samples spanning four cultural groups (African, Caucasian, East Asian, and South Asian) across seven basic expressions, combining psychologically supervised in-house data collection for underrepresented populations with rigorous ethnicity filtering of existing sources. To the best of our knowledge, GCC-FER is the first large-scale global cross-cultural DFER dataset designed to address these demographic gaps. Leveraging this dataset, behaviorally grounded cultural priors are derived for each cultural group and a global prior for practical deployment. A Culture-Aware FER (CA-FER) system is proposed to mitigate cultural bias by adaptively recalibrating latent facial representations. Extensive experiments on GCC-FER and DFEW demonstrate that the proposed system consistently improves FER performance across multicultural settings.

[CV-101] An Integrated Roadside Sensing and Communication Framework for Vulnerable Road User Safety at Signalized Intersections

链接: https://arxiv.org/abs/2606.07016
作者: Parvez Anowar
类目: Applications (stat.AP); Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 5 figures, 2 tables. Preprint

点击查看摘要

Abstract:Vulnerable road users (VRUs) account for approximately half of urban traffic deaths globally, with intersections concentrating a disproportionate share of these casualties. Recent reviews of sensing technology for VRU protection have cataloged dozens of single-sensor and dual-sensor deployments, yet none of the surveyed systems couples multi-modal sensing with edge-side near-miss analytics and bidirectional vehicle-to-everything (V2X) and pedestrian-to-everything (P2X) messaging in a single intersection cabinet. This paper presents an integrated framework for VRU protection at signalized intersections, combining LiDAR, radar, RGB camera, and thermal camera at the perception layer, edge-based prediction and surrogate-safety analytics at the computation layer, V2X and P2X messaging at the communication layer, and adaptive signal control at the actuation layer. The framework is grounded in an empirical case study using R-LiViT, the first publicly released roadside LiDAR-Visual-Thermal dataset, which provides 200 multi-modal sequences and 2,400 annotated RGB-T frames at three German intersections. Analysis of 53,319 detection annotations reveals that VRUs comprise approximately 49% of all road-user observations, that day-to-night density drops by 38% for pedestrians and 45% for vehicles while the night distribution shows a higher close-proximity share, that per-frame close-proximity event counts vary approximately 10-fold across the eight unique locations at three intersections, and that 83% of pedestrian bounding boxes are small in image space, indicating that VRUs are typically far from any single sensor. These findings support multi-modal sensing, edge-side analytics, and adaptive context-sensitive deployment rather than uniform single-sensor solutions.

[CV-102] DaX: Learning General Pathology Representations Across Scales

链接: https://arxiv.org/abs/2606.06983
作者: Bokai Zhao,Yiyang Zhang,Long Bai,Tai Ma,Hanqing Chao,Minfeng Xu
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Computational pathology requires visual representations that transfer across diverse clinical endpoints and remain robust to variation in magnification, staining, scanner type, slide preparation, and input resolution. We present DaX, a pathology vision foundation model that adapts DINOv3-style self-supervised learning to whole-slide histopathology. DaX is initialized from natural-image DINOv3 weights and incorporates continuous magnification training, cross-scale tissue views, orientation-agnostic and acquisition-robust augmentation, multi-input-size training, and Gram-anchored dense consistency. These designs aim to connect local cellular morphology with global tissue architecture while stabilizing dense token-level representations across input scales. We further construct a WSI-level benchmark comprising 161 clinically meaningful tasks from 44 public datasets, covering 28,182 patients and 34,394 slides across four clinical domains and nine task categories. All models are evaluated under a fixed patient-level cross-validation protocol with fold-level statistical ranking, enabling reproducible comparisons that are less sensitive to split-dependent variation. Across this benchmark, DaX achieves the highest mean performance across tasks and consistently strong task-level ranking scores, with gains spanning diagnostic pathology, biomarker and molecular profiling, tissue/specimen context, and risk, response, and prognosis. These results support DaX as a transferable visual encoder for computational pathology and provide a standardized evaluation framework for future pathology foundation models. Project page: this https URL.

[CV-103] Physics-Driven Semantic Scattering Structure Understanding of Aircraft Target in SAR Images

链接: https://arxiv.org/abs/2606.06847
作者: Yifei Yin,Xiaogang Yu,Hao Shi,Liang Chen,Wei Li
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Synthetic aperture radar (SAR) has become indispensable for target interpretation owing to its all-day and all-weather observation capability. In SAR target interpretation, electromagnetic scattering information provides a physically grounded cue beyond visual texture and has been widely exploited for target interpretation. However, existing methods remain dominated by local scattering center representations. Such unordered and component-agnostic representations are highly unstable for aircraft targets. As a result, physically existing components with weak scattering responses are often missed, resulting in an incomplete reconstructed topology. To address this limitation, we establish Semantic Scattering Structure Understanding as a new paradigm for SAR aircraft interpretation. Semantic scattering keypoints are defined to associate local electromagnetic responses with physically meaningful aircraft components, while visibility-aware attributes are introduced to retain weakly observable yet physically existing components. The keypoints are further organized into a stable semantic scattering structure. Building upon this, we propose S3U-SAR, a physics-driven framework to localize semantic scattering keypoints and construct the complete representation constrained by multi-dimensional physical priors covering scattering heterogeneity, rigid-body topology, and speckle uncertainty. A confidence-gated joint supervision strategy is further introduced to alleviate optimization conflicts. We construct KP-SAR-Aircraft-1.0, the first fine-grained benchmark for semantic scattering structure understanding. Extensive experiments demonstrate that S3U-SAR achieves the best performance compared with baselines. Cross-category and cross-dataset evaluations further verify its robustness and transferability.

[CV-104] Compute-Optimal Network Design for Echocardiography Myocardial Segmentation and Perfusion Quantification using Neural Scaling Laws

链接: https://arxiv.org/abs/2606.06725
作者: Clara Rodrigo González,Matthieu Toulemonde,Lasha Gvinianidze,Cameron A. B. Smith,Oscar Bates,Roxy Senior,Fu Siong Ng,Meng-Xing Tang
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 4 figures, 5 tables, journal

点击查看摘要

Abstract:Myocardial perfusion quantification using contrast-enhanced ultrasound offers a bedside non-ionizing alternative to nuclear imaging modalities. However, its clinical adoption is hindered by time-consuming manual labelling. Automated segmentation has proved challenging due to a paucity of in-domain training data. Adapting strategies currently used to optimise large language models for large datasets, we apply neural scaling laws to predict network performance for myocardial segmentation. We extrapolate performance on subsets of the data to determine optimal network size on the CAMUS echocardiography dataset and a 25-patient contrast-enhanced ultrasound (CEUS) dataset. Finally, we validate the clinical utility of our models by comparing the final myocardial perfusion parameters with those obtained by a senior cardiologist. Extrapolation based on the scaling law is predictive of test loss at the full dataset size, allowing us to select two networks that obtained state-of-the-art performance on CAMUS with a 240-fold reduction in parameter count. We observe the gradient of the scaling law transfers from CAMUS to the CEUS dataset with a bias in the predicted losses. The automatically segmented masks perform equivalently to a senior cardiologist in myocardial perfusion quantification. These results establish neural scaling laws as a practical tool for data-driven compute-optimal model design for small imaging datasets.
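The extrapolation step described in this abstract can be illustrated with a simple power-law fit. The sketch below is an assumption-laden illustration rather than the authors' procedure: it assumes a pure power law `loss = a * n**(-b)` with no irreducible-loss term, fits it by least squares in log-log space, and uses the hypothetical helper names `fit_power_law` and `predict_loss`.

```python
import math

def fit_power_law(sizes, losses):
    """Fit loss = a * n**(-b) by ordinary least squares on (log n, log loss)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(v) for v in losses]
    k = len(xs)
    mean_x = sum(xs) / k
    mean_y = sum(ys) / k
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # log loss = log a - b * log n, so a = exp(intercept) and b = -slope.
    return math.exp(intercept), -slope

def predict_loss(a, b, n):
    """Extrapolate the fitted law to a dataset size n."""
    return a * n ** (-b)
```

In practice one would fit on losses measured at several training-subset sizes and compare the extrapolated value against the loss observed on the full dataset; the gradient `b` is the quantity the abstract reports as transferring between datasets.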

[CV-105] ErA: Error-Aware Deep Unrolling Network for Single Image Defocus Deblurring

链接: https://arxiv.org/abs/2606.06540
作者: Tu Vo,Chan Y. Park
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce ErA (Error-Aware Deep Unrolling Network), an end-to-end framework for single-image defocus deblurring. ErA jointly learns a compact kernel basis and per-pixel weights, while an error-aware term in Augmented Lagrangian unrolling corrects kernel estimation errors via alternating updates and ResUNet denoisers. It achieves state-of-the-art PSNR/SSIM on DPDD, RealDOF, and RTF, and shows strong generalization on CUHK without ground truth.

[CV-106] DSU-Net: An Attention-Enhanced Dense Skip U-Net for Breast Lesion Segmentation in Mammographic Images

链接: https://arxiv.org/abs/2606.06537
作者: Reza Bozorgpour,Mohammadreza Soltany Sadrabadi
类目: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Breast cancer remains one of the leading causes of cancer-related mortality among women worldwide, making early detection essential for effective treatment. Mammography is the primary screening modality; however, accurate delineation of suspicious lesions remains challenging and subject to inter-observer variability. Automated segmentation methods can assist radiologists by providing consistent and efficient lesion localization. This study presents DSU-Net, an attention-enhanced Dense Skip U-Net architecture for automated breast lesion segmentation in mammographic images. The proposed framework integrates dense skip connections and attention mechanisms to improve feature propagation, preserve spatial information, and enhance lesion boundary delineation. Experiments were conducted using the Curated Breast Imaging Subset of the Digital Database for Screening Mammography (CBIS-DDSM). To address severe foreground-background imbalance, a composite loss function combining Dice loss, focal loss, and binary cross-entropy loss was employed during training. The proposed model achieved a Dice Similarity Coefficient of 0.9421, an Intersection over Union of 0.8905, an accuracy of 0.9711, and an AUC-ROC of 0.9878 on the validation dataset. Qualitative evaluation demonstrated accurate delineation of lesions with varying sizes and morphologies, while quantitative results confirmed robust discrimination between lesion and background regions. These findings demonstrate that DSU-Net provides accurate and reliable breast lesion segmentation in mammographic images and highlights the potential of attention-guided deep learning for computer-aided breast cancer screening and diagnosis. 
Cite as: arXiv:2606.06537 [q-bio.QM]. Submitted (v1) by Reza Bozorgpour: Wed, 3 Jun 2026 23:09:03 UTC.
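As a reading aid for the DSU-Net entry, the sketch below shows one way a composite Dice + focal + binary cross-entropy loss can be assembled, together with the identity IoU = Dice / (2 - Dice) that links the two overlap metrics for a single confusion matrix. It is a plain-Python illustration, not the authors' implementation: the equal weighting of the three terms, the focal exponent `gamma=2.0`, and the function names are all assumptions.

```python
import math

def dice_to_iou(dice):
    """Convert a Dice coefficient to IoU (valid for one confusion matrix)."""
    return dice / (2.0 - dice)

def composite_loss(probs, targets, gamma=2.0, eps=1e-7):
    """Dice loss + mean focal loss + mean BCE, equally weighted (assumed).

    probs: predicted foreground probabilities, one per pixel.
    targets: ground-truth labels (0 or 1), one per pixel.
    """
    # Soft Dice loss on the whole mask.
    inter = sum(p * t for p, t in zip(probs, targets))
    dice_loss = 1.0 - (2.0 * inter + eps) / (sum(probs) + sum(targets) + eps)
    bce = 0.0
    focal = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)      # clamp for numerical stability
        pt = p if t == 1 else 1.0 - p        # probability of the true class
        bce += -math.log(pt)
        focal += -((1.0 - pt) ** gamma) * math.log(pt)  # down-weights easy pixels
    n = len(probs)
    return dice_loss + focal / n + bce / n
```

Applied to the reported Dice of 0.9421, the identity gives an IoU of about 0.8905, which agrees with the value quoted in the abstract. Metrics averaged per image need not satisfy the identity exactly.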

[CV-107] Advanced Flood Prediction with Physics-Guided Deep Learning: Combining UNet, FNO, and SAR/Optical Imagery

链接: https://arxiv.org/abs/2606.06524
作者: Tewodros Syum Gebre,Jagrati Talreja,Leila Hashemi-Beni
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: This paper has been accepted for publication in the Proceedings of the IEEE Radar Conference (RadarConf 2026). The final authenticated version will be available through IEEE Xplore

点击查看摘要

Abstract:Accurate and scalable flood mapping remains challenging due to limited ground observations, heterogeneous terrain conditions, and the difficulty of enforcing hydrodynamic consistency within data-driven models. This work introduces a physics-guided deep learning framework that integrates multi-modal remote sensing (Sentinel-1 SAR, Sentinel-2 optical imagery, and DEM-derived terrain features) with constraints from the depth-averaged shallow water equations (SWE). The proposed hybrid architecture combines a UNet to capture fine-scale spatial details with a Fourier Neural Operator (FNO) to model basin-scale hydraulic interactions, while physics-informed residual losses ensure mass and momentum consistency. Evaluated across diverse floodplain settings, the hybrid model achieves an Intersection over Union of 0.82 and an F1 score of 0.90 for flood extent prediction, outperforming UNet-only and FNO-only baselines. Using hydrodynamic simulations as reference data, the model achieves an RMSE of 0.21 m for water depth and 0.15 m/s for flow velocity. Physics consistency is maintained, with low residuals and mass imbalance below 2.1%. Ablation studies confirm that removing physics-based regularization significantly degrades performance, underscoring the value of physical constraints for stability and generalization. These results demonstrate that embedding hydrodynamic principles into deep learning yields more accurate, reliable, and physically coherent flood predictions, offering strong potential for operational monitoring and large-scale deployment.

人工智能

[AI-0] Sparse Subspace-to-Expert Sharing for Task-Agnostic Continual Learning

链接: https://arxiv.org/abs/2606.07500
作者: Fatema Siddika,Md Anwar Hossen,Tanwi Mallick,Ali Jannesari
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 19 pages. arXiv admin note: text overlap with arXiv:2601.17616

点击查看摘要

Abstract:Continual learning in Large Language Models (LLMs) is hindered by the plasticity-stability dilemma, where acquiring new capabilities often leads to catastrophic forgetting of previous knowledge. Existing methods typically treat parameters uniformly, failing to distinguish between specific task knowledge and shared capabilities. We introduce Mixture of Sparse Experts for Task Agnostic Continual Learning (SETA), a framework that resolves the plasticity-stability conflict through adaptive sparse subspace decomposition into task-specific expert modules. Unlike standard updates, where tasks compete for the same parameters, SETA separates knowledge into unique experts, designed to isolate task-specific patterns, and shared experts, responsible for capturing common features. This structure is maintained through adaptive elastic anchoring and a routing-aware regularization that jointly protect shared knowledge at both the weight and routing levels and enable a unified gating network to automatically retrieve the correct expert combination during inference. Extensive experiments across diverse domain-specific benchmarks demonstrate that SETA achieves competitive or superior overall performance relative to state-of-the-art continual learning baselines, with particularly strong retention of early-task knowledge and improved backward transfer on LLaMA-2 7B and Qwen3-4B.

[AI-1] Twelve quick tips for designing AI-driven HPC workflows

链接: https://arxiv.org/abs/2606.07491
作者: Jamie J. Alnasir
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
备注: 12 pages, 1 figure. Formatted using the bioRxiv LaTeX preprint style

点击查看摘要

Abstract:High-performance computing (HPC) clusters remain the backbone of large-scale scientific computation, traditionally executing deterministic, linear pipelines optimised for predictable performance. However, the pervasive integration of artificial intelligence (AI) and foundation models into scientific research has introduced a fundamentally new computational paradigm. AI-driven workflows are characteristically iterative, data-driven, and probabilistic, introducing unique challenges regarding data gravity, heterogeneous resource management, and complex workflow orchestration. This guide provides twelve practical tips designed to help researchers design efficient, scalable, and reproducible AI-driven HPC workflows. By addressing critical system-level bottlenecks - such as containerisation for environment portability, strategic deployment of job arrays, explicit feedback loop mechanics, and I/O optimisation for small files - this article offers a framework for transitioning from rigid execution pipelines to adaptive, intelligent computational environments. While these architectural principles are broadly applicable across distributed environments, they are particularly tailored to the resource-intensive throughput demands of modern computational biology.
MSC classes: 68T05, 68W10. ACM classes: C.2.4; D.1.3; I.2.6. Cite as: arXiv:2606.07491 [cs.DC].

[AI-2] How AI Agents Reshape Knowledge Work: Autonomy, Efficiency, and Scope

链接: https://arxiv.org/abs/2606.07489
作者: Jeremy Yang,Kate Zyskowski,Noah Yonack,Jerry Ma
类目: Artificial Intelligence (cs.AI); General Economics (econ.GN)
备注:

点击查看摘要

Abstract:Frontier AI systems are bridging the gap between intelligence and utility by shifting from conversational assistants to autonomous agents that execute tasks end to end. Using production data from Perplexity’s Search and Computer products, we study this transition by examining how AI agents accelerate and reshape knowledge work. Three key empirical findings emerge. First, using sessions with near-identical initial query pairs as natural experiments for the same underlying task attempted with both products, Computer performs 26 minutes of autonomous work per user session, versus 33 seconds for Search. Computer automates task decomposition and execution that Search users might otherwise manually orchestrate and implement. As a result, Computer shifts follow-up query distribution toward higher-order work such as verification and extension. Autonomy also increases execution quality, with per-query dissatisfaction rates 55% lower on Computer than on Search. Second, due to its autonomy advantage, Computer reduces completion time from 269 to 36 minutes on matched tasks, lowering estimated time and cost by 87% and 94%, respectively, compared to humans equipped with Search alone. Third, Computer changes the scope of work that users attempt: Computer queries more often cross occupational boundaries, require higher-order cognition, draw on broader expertise, take the form of composite tasks that bundle interdependent subtasks into a single query, and unlock work activities that are essentially absent from Search usage among the same users. Together, the evidence indicates that AI agents accelerate workflows, enhance output quality, reduce costs, and expand the breadth and depth of automated work.

[AI-3] Graph Neural Network leveraging Higher-order Class Label Connectivity for Heterophilous Graphs

链接: https://arxiv.org/abs/2606.07475
作者: Takuto Takahashi,Itsuki Nakayama,Takahiro Mitani,Ryosuke Kikuchi,Yuya Sasaki,Makoto Onizuka
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Node classification in graph neural networks (GNNs) has been widely applied in various fields of graph analysis. GNNs achieve high-accuracy node classification in homophilous graphs, where nodes with the same class label tend to be connected. However, their performance remains limited in heterophilous graphs, where nodes with different class labels are more likely to be connected. In particular, current GNNs derived from graph convolutional networks cannot capture higher-order class label connectivity, which is frequently observed in real-world heterophilous graphs. To address this issue, we propose a novel classifier, Label Context Classifier (LCC), designed to capture higher-order class label connectivity in directed graphs. LCC estimates the class label of a target node by leveraging label context embeddings that are generated through four distinct types of walks. In addition, our approach allows the integration of LCC and any GNN by adaptively learning their importance. Experimental results demonstrate that GNNs integrated with LCC outperform SOTA methods and the label context embeddings improve the node classification performance in heterophilous directed graphs.

[AI-4] Whisper Hallucination Detection and Mitigation via Hidden Representation Steering and Sparse AutoEncoders

链接: https://arxiv.org/abs/2606.07473
作者: Georgii Aparin,Vadim Popov,Tasnima Sadekova,Assel Yermekova
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Whisper, a widely adopted ASR model, is known to suffer from hallucinations - coherent transcriptions generated for non-speech audio entirely disconnected from the input. We investigate whether hallucinations can be detected and mitigated through Whisper’s internal representations. We extract audio encoder activations and evaluate two representation spaces: raw Whisper activations and Sparse AutoEncoder (SAE) latents. We show that both spaces encode linearly separable hallucination-related information, with discriminative power concentrated in a sparse feature subset and increasing toward deeper encoder layers. We propose two steering strategies: activation-space steering and SAE latent-space steering. SAE-based steering reduces hallucination rate from 72.63% to 14.11% for Whisper small and from 86.88% to 27.33% for Whisper large-v3 on the full non-speech test set, with small WER degradation on speech data, approaching the performance of fine-tuning-based methods.
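To make the idea of activation-space steering concrete, here is a minimal Python sketch of one common recipe: estimate a direction as the difference between mean activations on hallucinating and non-hallucinating inputs, then remove a scaled projection onto that direction at inference. The abstract does not specify the authors' exact procedure, so the difference-of-means direction, the projection-removal rule, the `alpha` parameter, and all function names are assumptions.

```python
import math

def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def steering_direction(halluc_acts, clean_acts):
    """Unit vector pointing from the clean mean to the hallucination mean."""
    diff = [h - c for h, c in zip(mean_vector(halluc_acts), mean_vector(clean_acts))]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0:
        raise ValueError("class means coincide; no direction to steer along")
    return [d / norm for d in diff]

def steer(activation, direction, alpha=1.0):
    """Subtract alpha times the projection of the activation onto the direction."""
    proj = sum(a * d for a, d in zip(activation, direction))
    return [a - alpha * proj * d for a, d in zip(activation, direction)]
```

With `alpha=1.0` the steered activation has no remaining component along the estimated direction; smaller values trade hallucination suppression against changes to the transcription of real speech. The same arithmetic applies in an SAE latent space if the vectors are SAE latents instead of raw activations.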

[AI-5] Act As a Real Researcher: A Suite of Benchmarks Evaluating Frontier LLMs and Agentic Harnesses in Research Lifecycle

链接: https://arxiv.org/abs/2606.07462
作者: Jiayu Wang,Weijiang Lv,Bowen Fu,Jing Fu,Jiayi Song,Lingyu Zhang,Lanxuan Xue,Luodi Chen,Zepeng Xin,Kaiyu Li,Xiangyong Cao
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As foundation models advance and agent scaffolding becomes increasingly sophisticated, agents have demonstrated remarkable proficiency in complex, long-horizon coding tasks and even autonomous experiment execution. Despite their evolution from research assistants into autonomous research agents, these systems still exhibit significant limitations in field sensitivity, research ethics, and nuanced scientific judgment. Consequently, frontier agents remain unable to fully replace human researchers. To bridge this gap, we conceptualize the AARR (Act As a Real Researcher) benchmark series. Unlike existing benchmarks that primarily assess macro-level execution capabilities, AARR focuses on whether agents can emulate the professionalism, thoroughness, and nuanced reasoning that characterize human researchers in granular research scenarios. In this work, we propose AARRI-Bench (Act As a Real Research Intern), the first benchmark in this series. We conduct extensive experiments across frontier models and agentic systems, revealing that even the best-performing configuration (Mini-SWE-Agent with Claude Opus 4.7) achieves only 68.3% success rate, frequently overlooking subtle yet critical details that are obvious to real human researchers. Our results indicate that developing researcher-like AI requires further exploration of research behavior, rather than merely complex scaffolding. Our data is released at this https URL.

[AI-6] Socratic-SWE: Self-Evolving Coding Agents via Trace-Derived Agent Skills

链接: https://arxiv.org/abs/2606.07412
作者: Chuan Xiao,Zhengbo Jiao,Shaobo Wang,Wei Wang,Bing Zhao,Hu Wei,Linfeng Zhang,Lin Qu
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 21 pages, 5 figures. Under review

点击查看摘要

Abstract:LLM-driven software engineering agents have become a central testbed for real-world language-model capability, yet their training remains limited by the availability of high-quality SWE tasks. Existing synthetic data methods typically create tasks through fixed mutation or bug-injection procedures, making the resulting distributions largely independent of the agent’s own weaknesses and training progress. We introduce Socratic-SWE, a closed-loop self-evolution framework that reuses the agent’s historical solving traces as a source of training signal. Rather than treating traces only as evidence for reward computation, Socratic-SWE distills them into structured agent skills that summarize recurring failures and effective repair patterns. These skills then guide the generation of targeted repair tasks in real repositories. Candidate tasks are checked through execution-based validation and scored with a solver-gradient alignment reward, so that the retained tasks are both verifiable and useful for improving the Solver. The updated Solver produces new traces, enabling the task curriculum to adapt over successive rounds. Across SWE-bench Verified, SWE-bench Lite, SWE-bench Pro, and Terminal-Bench 2.0, Socratic-SWE consistently improves over self-evolving baselines under the same compute budget, reaching 50.40% on SWE-bench Verified after three iterations. These results suggest that solving traces can serve as a scalable substrate for self-evolving SWE agents.

[AI-7] A Comprehensive Anatomy of Human and DeepSeek-R1 LLM Mathematical Reasoning

链接: https://arxiv.org/abs/2606.07410
作者: Yuxiang Chen,Jun Wang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The emergence of “Aha moments” in large language models, particularly DeepSeek-R1-0120, has raised the question of whether these systems genuinely reason or merely imitate the appearance of reasoning. We conduct a comprehensive empirical comparison between model and human reasoning across all 30 problems from AIME 2025, exhaustively annotating 10,247 reasoning steps into five functional categories: Analysis, Inference, Branch, Backtrace, and Reflection. We find a clear structural difference. Human solutions maintain a compact alternation between analysis and deduction, whereas DeepSeek-R1 frequently revisits intermediate results, performs shallow and often unnecessary verification, and loops through local checks without meaningful logical progress. We describe this as topological mimicry: reproducing the surface form of reasoning without its functional role. Despite this, we identify two signals of genuine reasoning. First, successful traces exhibit stable use of branching and backtracking, while failed traces either underuse or overuse exploratory actions. Second, reflection is only effective when placed within deductive inference; reflections trapped in analysis loops focus on local numerical details while missing global logical errors. These findings suggest that current long-CoT models may be rewarded more for the appearance of reasoning than for genuine deductive progress. We discuss directions for improving evaluation and training, including measuring cross-trace stability, penalising “spinning-wheel” traces, encouraging deeper logical correction, and reallocating inference-time compute toward deduction and backtracking. Overall, reasoning quality depends not simply on how much reflection occurs, but on whether reflection appears consistently and at the appropriate logical scale.

[AI-8] Online Pandora’s Box for Contextual LLM Cascading

链接: https://arxiv.org/abs/2606.07392
作者: Alexandre Belloni,Yan Chen,Yehua Wei
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Econometrics (econ.EM); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Motivated by Large Language Model (LLM) cascading, we propose an online contextual Pandora’s Box model for adaptively querying and selecting LLM APIs. In each period, a decision-maker observes a request context and faces a two-phase decision problem. In the query phase, the decision-maker sequentially queries APIs, where each query reveals a generated output and the decision-maker incurs an (output-dependent) cost. In the selection phase, the decision-maker selects one of the generated outputs to deploy and observes only the downstream reward of the deployed output. This output-mediated feedback structure differs from classical online contextual Pandora’s Box models, in which opening a box directly reveals its reward. Rather than estimating the full conditional output and cost distributions of each API, we directly model the reservation index and develop a learning approach for the query phase. Specifically, we impose a parametric structure on the contextual reservation index functions induced by the classical Weitzman’s policy. Our policy combines generalized method of moments (GMM) type estimation of these reservation indices with UCB-style confidence bounds for both these indices and the shared output-level reward evaluator. Under regularity conditions, we prove that the resulting policy achieves dimension-dependent \widetilde O(\sqrt T) cumulative regret over a horizon of T periods.
Cite as: arXiv:2606.07392 [cs.AI].
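For readers unfamiliar with the reservation index in Weitzman's policy, it is the value sigma at which the expected gain from opening a box, E[max(X - sigma, 0)], equals the cost of opening it. The sketch below computes this classical index from reward samples by bisection. It only illustrates the definition: the paper models the index parametrically as a function of context instead of estimating reward distributions, and the function name, the empirical-sample input, and the fixed (not output-dependent) cost are assumptions.

```python
def reservation_index(samples, cost, tol=1e-9):
    """Solve mean(max(x - sigma, 0)) == cost for sigma by bisection.

    samples: observed rewards standing in for the box's reward distribution.
    cost: positive cost of opening the box (assumed fixed here).
    """
    if cost <= 0:
        raise ValueError("cost must be positive")
    if not samples:
        raise ValueError("need at least one sample")

    def expected_gain(sigma):
        return sum(max(x - sigma, 0.0) for x in samples) / len(samples)

    # expected_gain is continuous and non-increasing in sigma.
    # At min(samples) - cost it is at least cost; at max(samples) it is 0.
    lo, hi = min(samples) - cost, max(samples)
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if expected_gain(mid) > cost:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

Under Weitzman's rule, boxes are opened in decreasing order of this index and the search stops once the best reward seen so far exceeds the largest index among unopened boxes; in the cascading setting each "box" corresponds to querying one LLM API.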

[AI-9] A robust PPG foundation model using multimodal physiological supervision

链接: https://arxiv.org/abs/2606.07365
作者: Eloy Geenjaar,Vince Calhoun,Scott Daly,Gouthaman KV,Lie Lu,Trisha Mittal,Daniel P. Darcy
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Photoplethysmography (PPG), a non-invasive measure of changes in blood volume, is widely used in both wearable devices and clinical settings. Recent PPG foundation models either use open-source ICU datasets with pretraining paradigms that require curated data and thus complicate generalization to field-like data, or use closed-source field-like PPG data. In contrast, we propose a PPG foundation model that does not require high-quality or field-like pretraining data, and instead leverages accompanying electrocardiogram and respiratory signals in ICU datasets to select contrastive samples during pretraining. Our approach allows the model to retain and learn from noisy PPG segments, improving robustness at inference. Our model, pretrained on 3x fewer subjects than existing state-of-the-art approaches, achieves performance improvements on 14 out of 15 diverse downstream tasks, including field-like daily activity and heart rate prediction. Our results demonstrate that multimodal supervision can integrate complementary physiological information to improve the robustness of PPG foundation models and enhance their generalization to consumer-grade data.

[AI-10] SleepExplain: Explainable Non-Rapid Eye Movement and Rapid Eye Movement Sleep Stage Classification from EEG Signal

Link: https://arxiv.org/abs/2606.07351
Authors: Rafsan Jany,Md. Hamjajul Ashmafee,Iqram Hussain,Md Azam Hossain
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 6 pages, 7 figures, 2022 25th International Conference on Computer and Information Technology (ICCIT)

Click to view abstract

Abstract:Classification of sleep stages is one of the most important diagnostic approaches for a variety of sleep-related disorders. Electroencephalography (EEG) is regarded as a powerful tool for examining the association between neurological effects and sleep phases since it correctly identifies sleep-related neurological alterations. During Non-Rapid Eye Movement (NREM) and Rapid Eye Movement (REM) sleep phases, a number of nerve and bodily functions are affected, so both phases play an important role in these functions. This work aims to classify NREM and REM sleep stages from sleep EEG data and presents a novel SleepExplain model, an explainable NREM and REM sleep stage classifier that explains its predictions. In this work, sleep stages were classified using Random Forest, XGBoost, and Gradient Boosting ensemble classification models. Overall, we obtained an accuracy of 92.54% (Random Forest), 94.25% (Gradient Boosting), and 94.30% (XGBoost). For the explainable classification model, we used a game-theoretic approach, SHAP (SHapley Additive exPlanations), to offer a convincing explanation for the prediction.

[AI-11] Off-Policy Evaluation with Strategic Agents via Local Disclosure

Link: https://arxiv.org/abs/2606.07308
Authors: Kiet Q. H. Vo,Abbavaram Gowtham Reddy,Julian Rodemann,Siu Lun Chau,Krikamol Muandet
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:We study off-policy evaluation (OPE) under strategic behavior where decision subjects (or agents) respond to a decision maker’s policy by strategically modifying their covariates. Such behavior induces a policy-dependent covariate shift, breaking the standard assumption in existing methods that covariates are exogenous to the policy. Related work addresses this challenge by imposing strong assumptions such as repeated interactions or full knowledge of agents’ response behavior, substantially limiting its applicability to OPE. In contrast, we consider a one-shot OPE setting where the decision maker has only partial knowledge of the agents’ response behavior. Our key insight is that disclosing local information through post-hoc explanations reveals agents’ pre-strategic covariates prior to adaptation, mitigating the information loss induced by strategic behavior. Leveraging this structure, we estimate a statistical model for the agents’ responses and construct a doubly robust estimator for policy value. By assuming that the agents’ cost sensitivity follows a conditional log-normal distribution, we establish consistency of the proposed estimator and validate our approach empirically. More broadly, our results highlight how interaction design can mitigate information asymmetry by revealing otherwise hidden structure in agents’ strategic responses.

[AI-12] DuMate-DeepResearch: An Auditable Multi-Agent System with Recursive Search and Rubric-Grounded Reasoning

Link: https://arxiv.org/abs/2606.07299
Authors: Lingyong Yan,Can Xu,Yukun Zhao,Wenxuan Li,Qingyang Chen,Jiulong Wu,Wenli Song,Xiangnan Li,Weixian Shi,Yiqun Chen,Xuchen Ma,Yuchen Li,Jiashu Zhao,Shuaiqiang Wang,Jianmin Wu,Dawei Yin
Categories: Artificial Intelligence (cs.AI)
Comments: Technical report by the DuMate Team. 26 pages, 6 figures, 4 tables

Click to view abstract

Abstract:Deep Research (DR) has emerged as a new agentic paradigm to tackle complex, open-ended research tasks, demanding systems that can iteratively frame problems, acquire evidence, verify sources, and synthesize long-form reports. In practice, however, current DR systems are constrained by four interrelated limitations: long-horizon planning over an underspecified scope, the bottleneck of decomposing and scheduling such tasks within a single agent, hallucination risk in long-form synthesis, and limited process auditability. This technical report presents DuMate-DeepResearch, a multi-agent DR framework built on the Qianfan Agent Foundry. The framework decouples the Agent Core, which handles task understanding, planning, and scheduling, from an extensible Tool Ecosystem for retrieval, evidence acquisition, and report rendering, making every intermediate decision and tool invocation explicitly traceable. Building on this infrastructure, DuMate-DeepResearch further introduces three mechanisms: (i) a graph-based dynamic planning strategy expands the research roadmap coarse-to-fine and continuously revises it through reflection, re-planning, backtracking, and parallel branching; (ii) a recursive two-level execution design delegates each complex search sub-task to an inner Search Agent that runs its own planning loop, isolating noisy retrieval and stabilizing long-horizon execution; (iii) a rubric-based test-time optimization mechanism dynamically generates task-specific quality criteria and uses them as live reasoning scaffolds for evidence-grounded synthesis and adaptive stopping. Across two deep research benchmarks, DuMate-DeepResearch establishes new state-of-the-art results: the best overall score (58.03%) on DeepResearch Bench, and the best overall score (61.95%) on DeepResearch Bench II while ranking first in information recall and analysis.

[AI-13] Where Rectified Flows Leak: Characterising Membership Signals Along the Interpolation Path ICML2026

Link: https://arxiv.org/abs/2606.07271
Authors: Thomas Sesmat,Gabriel Meseguer-Brocal,Geoffroy Peeters
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments: ICML 2026 article, 9 main pages and 25 with annexes, 11 figures

Click to view abstract

Abstract:Understanding what generative models retain from training data remains challenging, with implications for copyright and privacy. Beyond verbatim reproduction, models can encode subtler traces of their training data that never surface in their outputs yet remain exploitable. We study this regime for Rectified Flows, which are increasingly used in deployed generative systems. We analyse the interpolation path X_\lambda = (1-\lambda)X_0 + \lambda X_1 that defines the Rectified Flow training. We show that a gap exists between the reconstruction of train and test data that follows a bell-shaped curve over \lambda, which accumulates during training, while the validation metrics remain stable. The signal has a maximum whose location we derive in closed form under Gaussian assumptions. We validate these predictions on both audio and images and show that the bell-shaped structure is universal, while the peak prediction holds when our assumptions are satisfied. As a proof of concept, we exploit this specific \lambda-resolved structure to perform a Membership Inference Attack, distinguishing members of the training set from non-members.
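As a minimal illustration of the interpolation path X_\lambda = (1-\lambda)X_0 + \lambda X_1 quoted in the abstract, the sketch below computes a point on the path for plain Python lists. The velocity target X_1 - X_0 is the standard Rectified Flow regression target and is not stated in the abstract; the function names are hypothetical, and nothing here reproduces the paper's membership signal or attack.

```python
def interpolate(x0, x1, lam):
    """Point on the straight path between noise x0 and data x1 at position lam in [0, 1]."""
    return [(1.0 - lam) * a + lam * b for a, b in zip(x0, x1)]


def velocity_target(x0, x1):
    """Standard Rectified Flow regression target: the constant direction x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


# Hypothetical two-dimensional example.
x0 = [0.0, 0.0]   # noise sample
x1 = [2.0, 4.0]   # data sample
x_quarter = interpolate(x0, x1, 0.25)
v = velocity_target(x0, x1)
```

A \lambda-resolved analysis such as the one described would then evaluate a model's error separately at each value of `lam` rather than averaging over the whole path.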

[AI-14] TOPSIS-RAD: Ranking According to Desires

Link: https://arxiv.org/abs/2606.07253
Authors: Leonardo Fernandes Costa,Helder Gomes Costa,Diogo Lima,Brunno Rodrigues
Categories: Artificial Intelligence (cs.AI); Econometrics (econ.EM)
Comments: 21 pages, 15 tables and 6 figures. The numerical computation of the data that appear in the toy examples was supported by the Visual TOPSIS RAD that is available at this https URL. The data of the toy examples are also available at this URL and can be loaded in the app as the template “Article”

Click to view abstract

Abstract:Traditional TOPSIS derives its reference points, the Positive Ideal Solution (PIS) and Negative Ideal Solution (NIS), from the observed alternative set, making rankings susceptible to misalignment with decision-maker (DM) requirements, sensitivity to outlier performances, and rank reversal. This paper proposes TOPSIS-RAD, which addresses these issues by incorporating two arrays of DM-defined reference levels. Vetoed Performance Levels (VPL) exclude non-viable alternatives before normalisation, preventing them from distorting the ranking frontiers. Desired Performance Levels (DPL) cap performances at the DM’s desired level before normalisation, anchoring the PIS in explicit aspirations rather than dataset extremes. Three toy examples demonstrate each mechanism: VPL reshapes normalisation boundaries by removing a non-viable alternative; fixed DPL frontiers stabilise rankings by limiting the influence of performances well above the desired level. The method preserves the familiar distance-based structure of TOPSIS while grounding the ranking in stable, DM-specified boundaries. Limitations and future research directions are also discussed.
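The two mechanisms described above (veto, then cap, then a distance-based ranking) can be sketched as follows. This is a rough illustration and not the authors' exact procedure: the abstract does not specify the normalisation, so the sketch assumes benefit-type criteria only, a linear rescaling of each criterion onto the fixed interval [VPL, DPL], and Euclidean distances to fixed ideal points. The function name and all numbers are hypothetical.

```python
import math


def topsis_rad_sketch(alternatives, weights, vpl, dpl):
    """Return a closeness score per viable alternative (higher is better).

    alternatives: dict name -> list of performances (benefit criteria, assumed).
    vpl / dpl: vetoed and desired performance levels per criterion, with dpl > vpl.
    """
    scores = {}
    for name, perf in alternatives.items():
        # Veto step: drop alternatives below the vetoed level on any criterion.
        if any(x < v for x, v in zip(perf, vpl)):
            continue
        # Cap step: performance above the desired level earns no extra credit.
        capped = [min(x, d) for x, d in zip(perf, dpl)]
        # Assumed normalisation: rescale onto [0, 1] using the fixed DM levels.
        norm = [(x - v) / (d - v) for x, v, d in zip(capped, vpl, dpl)]
        # Fixed reference points: PIS is 1 and NIS is 0 on every criterion.
        d_pos = math.sqrt(sum((w * (n - 1.0)) ** 2 for w, n in zip(weights, norm)))
        d_neg = math.sqrt(sum((w * n) ** 2 for w, n in zip(weights, norm)))
        scores[name] = d_neg / (d_pos + d_neg)
    return scores


# Hypothetical toy data: C is vetoed, and A's 10 is capped at the desired level 8.
alternatives = {"A": [10.0, 5.0], "B": [6.0, 9.0], "C": [2.0, 9.0]}
scores = topsis_rad_sketch(alternatives, weights=[0.5, 0.5], vpl=[3.0, 3.0], dpl=[8.0, 8.0])
ranking = sorted(scores, key=scores.get, reverse=True)
```

Because the reference points depend only on the DM-specified levels, adding or removing an alternative does not move them in this sketch, which is the stabilising effect the abstract attributes to fixed frontiers.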

[AI-15] AI Sovereignty: A Qualitative Model of Strategic Competition as AI Becomes an Instrument of National Power

Link: https://arxiv.org/abs/2606.07245
Authors: Timothy Clancy,Asmeret Naugle
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: Main article: 19 pages, 10 figures. Supplementary: 19 pages, 7 figures, 7 tables. To be presented at the 2026 International System Dynamics Conference (ISDC), July 20-24, TU Delft, Delft, Netherlands

Click to view abstract

Abstract:AI sovereignty is the extent to which a nation independently controls its artificial intelligence (AI) technologies. The race toward ever-more-sophisticated frontier AI models is of increasing strategic importance, with nations considering how AI might improve their economic situations, competitive advantage, and overall national power. However, the costs of AI sovereignty are enormous, and we lack definitions and conceptual models to navigate evolving AI sovereignty dynamics. We address this gap with definitions relevant to AI sovereignty, along with a first-of-its-kind qualitative model that incorporates micro, meso, and macro contributors. Model-based qualitative forecasts highlight competitive dynamics and evolving potential for AI-driven national power. The model identifies key leverage points that nations can use to enhance their own growth or degrade an adversary’s, including consideration of accelerators, electricity, water, data sets and skilled workforce. These leverage points can be activated at strategic and operational levels through both direct kinetic actions, such as Iran’s targeting of data centers with drones, and indirect non-kinetic effects including cyber, space, information, economic coercion and diplomacy. If our assumptions and hypotheses are valid, this strategic competition may come to define how nations improve their economic situations, competitive advantage, and overall national power in the 21st Century.

[AI-16] An Abstract Architecture for Explainable Autonomy in Hazardous Environments

Link: https://arxiv.org/abs/2606.07211
Authors: Matt Luckcuck,Hazel M Taylor,Marie Farrell
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Originally published 20th of October 2022 at the Second International Workshop on Requirements Engineering for Explainable Systems (RE4ES), which was hosted by the International Requirements Engineering Conference 2022

Click to view abstract

Abstract:Autonomous robotic systems are being proposed for use in hazardous environments, often to reduce the risks to human workers. In the immediate future, it is likely that human workers will continue to use and direct these autonomous robots, much like other computerised tools but with more sophisticated decision-making. Therefore, one important area on which to focus engineering effort is ensuring that these users trust the system. Recent literature suggests that explainability is closely related to how trustworthy a system is. Like safety and security properties, explainability should be designed into a system, instead of being added afterwards. This paper presents an abstract architecture that supports an autonomous system explaining its behaviour (explainable autonomy), providing a design template for implementing explainable autonomous systems. We present a worked example of how our architecture could be applied in the civil nuclear industry, where both workers and regulators need to trust the system’s decision-making capabilities.

[AI-17] RETROSPECT: RETROsynthesis via Sequential Prediction and Chemically Transformed-ranking ICML2026

Link: https://arxiv.org/abs/2606.07181
Authors: Raja Sekhar Pappala,Shreyas Vinaya Sathyanarayana,Ronit Kumar Choudhary,Arjun Verma,Deepak Warrier
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Molecular Networks (q-bio.MN)
Comments: Accepted at the AI for Science workshop (ICML 2026)

Click to view abstract

Abstract:Single-step retrosynthesis needs both accurate first-ranked suggestions and candidate lists that are rich enough for downstream selection. We study this as a proposal-selection decomposition. Our system, RETROSPECT, combines a single Transformer proposal model, which we call the ChemAlign Transformer, with a LambdaMART reranker over structural, reaction-template, upstream-score, and optional DFT-derived descriptors. The generator is trained with hybrid root-aligned and random SMILES augmentation, Pre-LayerNorm, tied embeddings, exponential moving average weights, and a differentiable atom-balance auxiliary loss. On the full USPTO-50K test set of 5,007 reactions, the generator reaches 55.00% top-1 and 86.18% top-10 exact-match accuracy with 99.86% top-1 validity. On the merged candidate-pool benchmark used for reranking, which contains 5,007 test products and about 111 candidates per product, a LambdaMART model trained on the structural feature set reaches 59.4% top-1 with 0.7171 mean reciprocal rank. Feature ablations show that upstream proposal score and template-frequency statistics provide most of the reranking signal, while DFT and reaction-center DFT features provide smaller and less consistent gains. These results support a modular view of retrosynthesis: stronger single-model proposal and learned candidate selection are complementary, and the proposal model can serve as a drop-in component for ensemble systems such as RetroChimera (Maziarz et al., 2024).

[AI-18] Think Fast: Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models

Link: https://arxiv.org/abs/2606.07157
Authors: Dewi Gould,Francis Rhys Ward,Anders Cairns Woodruff,Rauno Arike,Josh Hills,Alex Serrano,Ida Caspary,Jason Ross Brown,Jo J. Jiao,Patrick Leask,Twm Stone,Ram Potham,Ionut Gabriel Stan,Harry Mayne,Simeon Hellsten,Shubhorup Biswas,Ariana Azarbal,William L. Anderson,Elle Najt,Ryan Greenblatt,Julian Stastny
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Many efforts to ensure frontier AI models are safe rely on monitoring their chain-of-thought (CoT) reasoning. If models become able to perform sufficiently complex reasoning internally, without explicit thinking tokens, this would undermine such oversight. We measure how well frontier models reason without CoT across a suite of over 30,000 questions spanning 43 benchmarks in domains including math, coding, puzzles, causality, theory-of-mind, and strategic reasoning. To compare models against humans, we estimate the 50%-task-completion time horizon (TH): the human time required for tasks a model completes with 50% success rate. We complement this with a 50% reasoning token horizon: the minimum number of o3-mini reasoning tokens needed for tasks a model solves with 50% success rate. We find that the no-CoT 50% TH of frontier models has been doubling roughly every year over the past six years, with GPT-5.5’s TH reaching over 3 minutes and reasoning token horizon exceeding 1,500 tokens. Our median estimates predict that frontier no-CoT THs could exceed 7 minutes by 2028, and 25 minutes by 2030, though these projections carry substantial uncertainty. We recommend frontier developers track this explicitly.

[AI-19] REMEDI: A Benchmark for Retention and Unlearning Evaluation in Multi-label Clinical Disease Inference

Link: https://arxiv.org/abs/2606.07141
Authors: Anurag Sharma,Sai Teja Chunchu,Prasenjit Mitra,Sandipan Sikdar,Koustav Rudra
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Under review

Click to view abstract

Abstract:Language models trained for clinical disease inference are trained on patient data, which may include sensitive and private information, and data owners may request the removal of their data from a trained model due to privacy or copyright concerns. However, exactly unlearning patient-specific data is intractable, and retraining with minor data removal is resource-intensive. While there exist several machine unlearning methods that can be used, their utility is generally restricted to non-medical domains. Moreover, the existing benchmarks for evaluating such unlearning methods primarily utilize synthetically curated datasets, which are not truly representative of real-world systems. Hence, the effectiveness of these unlearning methods in the medical domain is largely unclear. To this end, we introduce REMEDI, an extensive benchmark for machine unlearning tailored to multi-label and multi-class clinical disease inference, where label correlations, longitudinal structure, and safety constraints make unlearning particularly challenging. Unlike the existing benchmarks, REMEDI considers: (1) a relevant application domain (medical), (2) comprehensive unlearning setups involving diverse sets of forget instances, (3) challenging unlearning scenarios including multi-label and multi-class classification tasks, and (4) evaluation metrics involving performance both in terms of utility and extent of unlearning achieved. REMEDI is developed using the MIMIC-III clinical database that contains comprehensive clinical data of patients. Experiments with existing unlearning methods indicate that there exists a trade-off between utility and unlearning performance. They are also largely unsuited to multi-label classification tasks. To facilitate reproducibility, we make our benchmark publicly available.

[AI-20] DIFFRACT: Neuralized Utility Maximization for Wireless Networks by Differentiable Programming

Link: https://arxiv.org/abs/2606.07114
Authors: Chee Wei Tan,Siya Chen
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Comments: IEEE INFOCOM 2026

Click to view abstract

Abstract:Next-generation wireless networks, including satellite-to-Open RAN systems, demand agile and intelligent resource management capable of handling dynamic multi-user interference under stochastic quality of service constraints. This paper introduces DIFFRACT, a neuralized utility maximization framework that leverages differentiable programming to integrate deep learning with optimization in wireless networks. Central to our approach is the exploitation of the mathematical structure of standard interference functions, which are foundational in wireless power control. By developing a duality theory for these functions, we map iterative interference management algorithms into differentiable neural network architectures via algorithm unrolling. This enables distributed, end-to-end gradient-based learning at the network edge, supporting real-time adaptation to interference in both terrestrial and non-terrestrial environments. DIFFRACT allows for scalable and robust utility maximization by modeling complex channel dynamics and leveraging the expressiveness of differentiable models. Experimental results confirm the framework’s theoretical soundness and practical effectiveness for next-generation wireless systems.

[AI-21] Beyond Post-hoc Explanation: Toward Glassbox AI via Probabilistic Mediation

Link: https://arxiv.org/abs/2606.07113
Authors: Manuele Leonelli
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large language models are rapidly becoming infrastructural components in high-stakes institutional settings, including public administration, legal reasoning, and healthcare, where opacity is not merely inconvenient but institutionally and legally untenable. Existing approaches to explainability are predominantly post-hoc, offering unstable, non-contestable accounts that have no formal relationship to the reasoning process that produced the output. We argue that the problem is not the absence of explanation but the absence of structured reasoning in the first place. This paper makes the case for a fundamentally different architecture, which we call the Glassbox Framework, in which Bayesian networks serve as transparent, ante-hoc mediation layers for generative models. Bayesian networks encode domain knowledge, causal assumptions, and probabilistic dependencies before inference occurs, enabling auditable reasoning traces, uncertainty quantification, and contestable outputs. We characterise the architecture of this framework and ground it in a benefit eligibility scenario, identifying the foundational challenges spanning semantic alignment, dynamic model construction, probabilistic grounding, and human governance that must be solved to realise it at scale. By shifting from post-hoc explanation to ante-hoc probabilistic mediation, this work outlines a principled path toward AI systems that are not only powerful but fundamentally accountable.

[AI-22] DyCon: Dynamic Reasoning Control via Evolving Difficulty Modeling ICML2026

Link: https://arxiv.org/abs/2606.07108
Authors: Tengyao Tu,Yulin Li,Hui-Ling Zhen,Libo Qin,Zhoujun Wei,Jinghua Piao,Zhuotao Tian,Yong Li,Min Zhang
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted at ICML 2026

Click to view abstract

Abstract:Recent advances in Large Reasoning Models (LRMs) demonstrate remarkable performance improvements by iteratively reflecting, exploring, and executing complex tasks, yet suffer from inefficiencies due to redundant reasoning, known as “overthinking”. Existing methods to mitigate this issue either rely on static difficulty estimates or require task-specific training, and thus fail to adapt to the dynamic complexity during reasoning. In this work, we empirically show that the problem difficulty evolves dynamically throughout the reasoning process and is linearly encoded in the LRM’s step-level embeddings. Building on this insight, we propose DyCon, a training-free framework that leverages latent step-level representations to explicitly model the evolving task difficulty, enabling the dynamic control of reasoning depth to mitigate the overthinking issue. Extensive experiments conducted on four models ranging from 4B to 32B, and across twelve benchmarks in math reasoning, general question answering, and coding tasks demonstrate that DyCon significantly enhances reasoning efficiency by reducing redundant steps without sacrificing accuracy or generalization. Project page and code are available at this https URL.

[AI-23] MetaConfigurator: AI-Assisted RDF Authoring from JSON Data

Link: https://arxiv.org/abs/2606.07094
Authors: Felix Neubauer,Mahdi Jafarkhani,Kenichi Endo,Jürgen Pleiss,Benjamin Uekermann
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Submitted as post-proceedings for the deRSE26 conference

Click to view abstract

Abstract:Scientific workflows increasingly generate structured JSON data that is easy to exchange but difficult to interpret consistently across systems due to lacking semantic interoperability. While JSON Schema ensures structural validation, it provides no native support for Linked Data semantics. This paper presents an RDF Authoring View extending the open-source JSON Schema editor MetaConfigurator, enabling researchers to transform existing JSON, YAML, or CSV data into RDF using AI-assisted RML mappings, refine triples, execute SPARQL queries, visualize knowledge graphs, and export RDF serializations within a single integrated web interface. This workflow is supported by ontology-aware IRI auto-completion, bidirectional synchronization between JSON-LD text views and RDF triple tables, and AI-assisted SPARQL query generation from natural language hints. We demonstrate the workflow using laboratory data from metal-organic framework (MOF) synthesis experiments. Protocol data describing reagents, procedure steps, and quantities is converted from JSON to ontology-based JSON-LD via RML mappings. We then refine the semantic representation, query relationships between experimental conditions and outcomes, and explore the resulting knowledge graph interactively. This integrated environment bridges conventional structured data management with Semantic Web technologies while preserving experimental context and lowering technical barriers through AI assistance.
MSC classes: 68T30 (Primary), 68U35 (Secondary); ACM classes: I.2.4; H.3.5; H.5.2; D.2.12

[AI-24] On the Geometry of On-Policy Distillation

Link: https://arxiv.org/abs/2606.07082
Authors: Zhennan Shen,Yanshu Li,Qingyu Yin,Chak Tou Leong,Zhilin Wang,Yanxu Chen,Rongduo Han,Sunbowen Lee,Yi R. Fung
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 17 pages, 8 figures

Click to view abstract

Abstract:On-policy distillation (OPD) is increasingly used to improve large language model reasoning, but its training dynamics remain poorly understood. We characterize the trajectory of OPD updates in parameter space and compare it with supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR). A suite of parameter-space diagnostics consistently places OPD in a relaxed off-principal regime: compared with SFT, its updates affect fewer weights and avoid principal directions more strongly, while compared with RLVR, they remain less tightly constrained. Beyond this static localization, OPD exhibits subspace locking: its cumulative updates rapidly enter a narrow low-dimensional channel. Constraining training to the update subspace formed early in training preserves OPD performance but substantially degrades SFT, indicating that the locked subspace is functionally sufficient for OPD. Control experiments further show that sparsifying the update tokens and shifting rollout generation off-policy preserve the rank dynamics, whereas mixing the OPD objective with RLVR changes them. Overall, these results suggest that OPD is not merely an intermediate point between SFT and RLVR, but induces its own update geometry in parameter space.

[AI-25] dots.tts Technical Report

Link: https://arxiv.org/abs/2606.07080
Authors: Shi Lian,Changtao Li,Bohan Li,Hankun Wang,Da Zheng,Junfeng Tian,Yufeng Ma,Colin Zhang,Kai Yu
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments:

Click to view abstract

Abstract:We present dots.tts, a 2B-parameter continuous autoregressive text-to-speech (TTS) foundation model that models speech in a continuous latent space. Compared with existing continuous autoregressive models, our key innovations are threefold. First, we train an AudioVAE with multiple objectives to build a semantically structured and prediction-friendly continuous speech space. Second, we use full-history conditioning in the flow-matching head to preserve long-range consistency and reduce drift during generation. Third, we apply reward-free self-corrective post-training to the flow-matching head to further improve robustness and acoustic quality. After being trained on a large-scale multilingual corpus, dots.tts achieves the best average performance on Seed-TTS-Eval, with WERs of 0.94%/1.30%/6.60% and SIM scores of 81.0/77.1/79.5 on the zh/en/zh-hard test sets, respectively. Across other benchmarks, dots.tts also consistently demonstrates open-source state-of-the-art performance, exhibiting strong generation stability, voice cloning ability, and emotional expressiveness. For efficient inference, we further apply CFG-aware MeanFlow distillation, enabling low-latency speech generation with first-packet latencies of 85/54 ms in output streaming and dual-streaming modes, respectively. To facilitate reproducible research and practical deployment, we release the training and inference code, together with the pretrained, post-trained, and MeanFlow-distilled checkpoints, under the Apache 2.0 license.

[AI-26] SlimSearcher: Training Efficiency-Aware Web Agents via Adaptive Reward Gating

Link: https://arxiv.org/abs/2606.07074
Authors: Zequn Xie,Junjie Wang,Dan Yang,Jie Feng,Yue Shen,Jian Wang,Jinjie Gu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 17 pages, 8 figures

Click to view abstract

Abstract:Deep research agents have demonstrated remarkable capabilities in complex information-seeking tasks, yet this power comes at a steep computational cost. Driven by accuracy-focused training paradigms, current models adopt brute-force strategies characterized by blind tool dependency and performative reasoning, generating long, redundant trajectories that are far from necessary for resolving these tasks, leading to wasteful tool calls and excessive token consumption. To overcome this efficiency trap, we propose SlimSearcher, a principled framework that pushes the Pareto frontier between accuracy and computational cost across both Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). In the SFT stage, SlimSearcher employs Pareto-efficient filtration to distill trajectories that are both successful and economical, guiding the model toward inherently efficiency-aware search behaviors. During RL, we introduce Adaptive Reward Gating, a dynamic reward-shaping mechanism that evaluates relative tool and token efficiency within a sampled cohort. By cascading these adaptive efficiency metrics with a strict correctness gate, our approach effectively avoids the brevity bias associated with absolute penalties and mitigates reward hacking. Extensive experiments on long-horizon benchmarks, including GAIA, BrowseComp, and XBenchDeepSearch, demonstrate that SlimSearcher reduces average tool-call rounds by 17%-58% while maintaining or improving accuracy.

[AI-27] Front-to-Attractors: Modifying the Front-to-Front Heuristic in Bidirectional Search

Link: https://arxiv.org/abs/2606.07047
Authors: Alvin Zou,Muhammad Suhail Saleem,Maxim Likhachev
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Heuristics play a central role in the performance of bidirectional search algorithms, which commonly rely on two main classes. Front-to-end (F2E) heuristics estimate the distance from a state s to the target of the search (the goal for forward search or the start for backward search). In contrast, front-to-front (F2F) heuristics estimate the distance from s to the opposite search frontier using a pairwise function h(s, s’), where s’ ranges over frontier states. Although F2F heuristics are typically more informative and therefore reduce the number of node expansions, their reliance on extensive pairwise evaluations incurs substantial computational overhead. To address this limitation, we introduce a new heuristic class, front-to-attractors (F2A), that preserves much of the informativeness of F2F while dramatically reducing its computational cost. Rather than evaluating distances to all states on the opposite frontier, F2A estimates the distance from s to a small, dynamically maintained set of attractors in the opposite search direction. These attractors serve as a surrogate for the full frontier, enabling rich heuristic guidance at a fraction of the computational expense while maintaining the optimality guarantees offered by F2F. We evaluate F2A across multiple domains and show that it reduces the number of pairwise evaluations by up to 11.2x compared to F2F, while achieving 4.8x fewer node expansions than F2E on average.
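The cost difference between the heuristic classes described above can be illustrated with a small sketch. It assumes a grid domain with Manhattan distance as the pairwise function h(s, s') and treats both the opposite frontier and the attractor set as plain lists; how attractors are chosen and maintained, and how the paper preserves optimality, is not given in the abstract. A naive subset of the frontier, as used here, does not by itself carry that guarantee. All names and coordinates are hypothetical.

```python
def manhattan(a, b):
    """Pairwise heuristic h(s, s') on a 2D grid (assumed domain)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def front_to_end(state, target):
    """F2E: one heuristic evaluation against the fixed search target."""
    return manhattan(state, target)


def front_to_set(state, candidates):
    """Minimum pairwise estimate over a candidate set, plus the evaluation count.

    With the full opposite frontier this is an F2F-style estimate; with a small
    attractor set it mimics the cheaper F2A-style lookup.
    """
    values = [manhattan(state, c) for c in candidates]
    return min(values), len(values)


# Hypothetical example.
state = (0, 0)
goal = (7, 7)
opposite_frontier = [(3, 4), (5, 1), (2, 2), (6, 6)]
attractors = [(2, 2), (6, 6)]   # hand-picked surrogate for the frontier

f2e = front_to_end(state, goal)
f2f, f2f_evals = front_to_set(state, opposite_frontier)
f2a, f2a_evals = front_to_set(state, attractors)
```

In this toy case the attractor lookup returns the same estimate as the full frontier scan with half the pairwise evaluations; on real frontiers with thousands of states the gap in evaluation count is what matters.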

[AI-28] StainFlow: Entity-Stain Tracking and Evidence Linking for Process Rewards in GUI Agents

Link: https://arxiv.org/abs/2606.07027
Authors: Haojie Hao,Longkun Hao,Yihang Lou,Yan Bai,Zhenyang Li,Zhichao Yang,Dongshuo Huang,Hongyu Lin,Lanqing Hong,Jiakai Wang,Xianglong Liu
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Reinforcement Learning (RL) has become a promising approach for improving GUI Agents in long-horizon, stochastic digital environments, but trajectory-level success feedback is too sparse to provide reliable credit assignment for intermediate exploration steps. To mitigate this issue, recent studies introduce Process Reward Models (PRMs), which provide finer-grained training feedback through global milestone verification or local step-level evaluation. However, these methods still suffer from two level-specific limitations: global milestone decomposition is subjective and singular, making it difficult to accommodate the multiple valid execution paths in real GUI tasks, while fixed local judging windows may miss long-range key evidence or dilute the decision signal with irrelevant frames. Inspired by stain-tracing mechanisms in network flow analysis, we propose StainFlow, an entity-stain-flow process reward model for GUI Agents. To reduce the subjectivity of global partitioning, we introduce the Global Entity Stain Tracking module, which extracts visually verifiable task entities and tracks how their stain concentrations and states evolve along the trajectory, allowing task phases to be objectively separated by changes in the entity evidence flow. To improve the accuracy of local verification, we introduce the Local Stain Evidence Linking module. Centered on the triggering entities of each candidate key node, it retrieves relevant steps based on their stain concentrations and state changes, and dynamically constructs high-density evidence windows for verifying true key nodes. Extensive experiments on AndroidWorld and OGRBench show that StainFlow relatively improves online RL success by 3.2% and trajectory completion judgment accuracy by 1.8%.

[AI-29] Towards Unified Song Generation and Singing Voice Conversion with Accompaniment Co-Generation

Link: https://arxiv.org/abs/2606.07015
Authors: Ziyu Zhang,Chunyu Qiang,Xiaopeng Wang,Yuxin Guo,Kang Yin,Wenjie Tian,Jingbin Hu,Tianlun Zuo,Zhao Guo,Teng Ma,Yuzhe Liang,Chen Zhang,Lei Xie
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:While song generation and singing voice conversion (SVC) have evolved significantly, they have long been developed in isolation: the former lacks zero-shot speaker cloning, while the latter overlooks vocal-accompaniment synergy. To bridge this gap, we propose UniSinger, the first end-to-end framework unifying speaker-cloning song generation and accompaniment co-generation SVC. Building on the multimodal diffusion transformer, we construct a unified speaker embedding space that transfers speaker representations from SVC to song generation, enabling fine-grained cross-task timbre control. To mitigate multi-task optimization conflicts, we design a curriculum learning strategy using task-specific modality masking to guide the model to gradually master the generative mechanisms among semantic content, vocal timbre, and accompaniment. Experiments show state-of-the-art performance on both tasks and complementary benefits between them, offering new possibilities for intelligent music production.

[AI-30] A Geometric View for Understanding Concept Learning and Neuron Interpretation in Sparse Autoencoders

Link: https://arxiv.org/abs/2606.07007
Authors: Chenhao Zhang,Chris Lin,Su-In Lee
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:We propose a unified mathematical framework for a geometric understanding of concept learning and neuron interpretation in sparse autoencoders (SAEs). While SAEs improve interpretability of neural networks by learning sparse feature representations, principled definitions of “concept” and “learning” remain unclear. We formalize concepts as sets of data points and cast concept learning as a set-alignment problem between human-defined and model-induced concepts. This formulation distinguishes three increasingly strong notions of learning (detection, separation, and approximation) and yields geometric conditions, error bounds, and capacity constraints for when concepts can be represented by individual neurons or multi-neuron units. It also provides a set-theoretic account for common SAE phenomena, including feature splitting, feature absorption, feature families, and hierarchical concepts. Finally, we connect concept learning and neuron interpretation through formal concept analysis, showing that the two directions need not agree and that their many-to-many structure can be organized by concept lattices. Experiments on synthetic data with ReLU and Top-K SAEs illustrate the theory and reveal the effects of SAE size and sparsity on concept learning.

[AI-31] DataEvolver: Automatic Data Preparation for Large Language Models through Multi-Level Self-Evolving

Link: https://arxiv.org/abs/2606.07001
Authors: Chao Deng,Shaolei Zhang,Ju Fan,Xiaoyong Du
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:High-quality training data is essential to large language models (LLMs) and typically requires extensive and costly manual curation. Existing automatic data preparation methods rely on predefined pipelines or customized human instructions, which limits their adaptability to diverse data distributions and lacks principled guidance from high-quality examples. In this paper, we introduce DataEvolver, the first self-evolving data preparation system that automatically constructs pipelines to transform raw data into high-quality data. DataEvolver employs a multi-level mechanism to ensure both pipeline executability and effectiveness. At the operator level, it incrementally expands the operator set to construct a logical plan while resolving dependency conflicts. At the pipeline level, it instantiates logical plans into executable code and iteratively refines pipeline orchestration through a feedback loop that reduces the distribution gap between prepared data and high-quality examples. Experiments on seven benchmarks show that DataEvolver substantially improves data quality and achieves an average 10% gain in downstream LLM performance compared with training on original data, highlighting new opportunities for the iterative co-evolution of LLMs and data.

[AI-32] Teaching the Way Not the Answer: Privileged Tutoring Distillation for Multimodal Policy Optimization

Link: https://arxiv.org/abs/2606.07000
Authors: Shizhe Xiang,Ke An,Wenlong Yu,Yue Liu,Jian Luan,Pei Fu,Qilong Wang
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Recent post-training methods, particularly Reinforcement Learning with Verifiable Rewards (RLVR), have significantly enhanced the reasoning ability of Large Vision-Language Models (LVLMs). However, the sparse nature of verifiable rewards provides little token-level supervision for failed rollouts, often leading to inefficient exploration in complex multimodal reasoning tasks. Although policy distillation can offer dense guidance, external teacher based methods introduce substantial computational overhead, while answer conditioned tuning methods may expose answer-level information and induce shortcut-like generation behavior. To address these limitations, we propose PTD-PO, a Privileged Tutoring Distillation Policy Optimization framework for RLVR that provides dense guidance without exposing the answer to the student policy. Specifically, PTD-PO constructs structured privileged hints from spatial attention guidance and intermediate textual reasoning steps, and uses them through in-context learning to produce step-wise token-distribution supervision. The student is still optimized under the original answer-free context, and its failed rollouts are aligned with the hint-augmented reference model at the token-distribution level. To further stabilize distillation under the distribution shift between guided and unguided contexts, we introduce a Top-K Jensen-Shannon divergence objective that focuses alignment on informative token probabilities while reducing memory overhead. Experiments on LVLMs ranging from 2B to 8B parameters show that PTD-PO consistently outperforms RLVR and distillation baselines, mitigates entropy collapse, and improves complex multimodal reasoning performance.
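One plausible reading of a Top-K Jensen-Shannon objective is sketched below, under stated assumptions: the K tokens are chosen by the reference (hint-augmented) distribution, both distributions are renormalised over those tokens, and the standard Jensen-Shannon divergence is computed on the result. The abstract does not give the exact formulation, so the selection rule, the renormalisation, and the function name `top_k_js` are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def _normalise(values: Sequence[float]) -> List[float]:
    total = sum(values)
    if total <= 0.0:
        raise ValueError("selected probabilities must have positive mass")
    return [v / total for v in values]


def _kl(p: Sequence[float], m: Sequence[float]) -> float:
    """KL(p || m) in nats; terms with p_i = 0 contribute nothing."""
    return sum(a * math.log(a / b) for a, b in zip(p, m) if a > 0.0)


def top_k_js(reference: Sequence[float], student: Sequence[float], k: int) -> float:
    """Jensen-Shannon divergence restricted to the reference's top-k tokens."""
    # Python's sort is stable, so ties keep their original index order.
    top = sorted(range(len(reference)), key=lambda i: reference[i], reverse=True)[:k]
    p = _normalise([reference[i] for i in top])
    q = _normalise([student[i] for i in top])
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)


# Toy vocabulary of four tokens (illustrative probabilities only).
same = top_k_js([0.5, 0.5, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0], k=2)
disjoint = top_k_js([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], k=2)
print(same, disjoint)
```

Restricting the sum to K entries is what saves memory relative to a full-vocabulary divergence. The Jensen-Shannon divergence is bounded by log 2 in nats, which is one reason it tends to be better behaved than an unbounded KL term when the two contexts disagree sharply.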

[AI-33] Exploring Agentic Tool-Calling Decisions via Uncertainty-Aligned Reinforcement Learning

Link: https://arxiv.org/abs/2606.06976
Authors: Yijin Zhou,Linqian Zeng,Xiaoya Lu,Wenyuan Xie,Dongrui Liu,Junchi Yan,Jing Shao
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large language model (LLM)-based agents often make suboptimal tool-use decisions, including unsupported tool invocation and hallucinated direct responses, which may accumulate errors throughout multi-step interactions. Existing approaches mainly improve these behaviors through inference-time correction or coarse-grained reward signals based on decision outcomes and structured checklists, leaving the uncertainty characteristics of agent decisions underexplored. We observe that decision-oriented reinforcement learning tends to weaken the uncertainty separation between correct and incorrect actions, resulting in overconfident mistakes and weaker exploration signals. Therefore, we propose TRUST, which incorporates uncertainty quantification into reward design as a repulsive force for maintaining uncertainty separation, and labels lightweight key-turn annotations for unified post-training of multi-turn trajectories. Experimental results across diverse tool-use benchmarks show that TRUST consistently enhances both decision quality and agent performance while maintaining more reliable uncertainty estimates during optimization.

[AI-34] Accounting for Context: Shaping Moral Credences for Value Alignment

Link: https://arxiv.org/abs/2606.06972
Authors: Jazon Szabo,Sanjay Modgil
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Ensuring that agent behaviours are aligned with human moral values inevitably raises the problem of how to account for the plurality of moral perspectives that societies – and even individuals – typically adopt. Work on moral uncertainty proposes mechanisms to fairly and democratically aggregate evaluations of actions across different moral theories. However, this paper argues that one needs to account for contextual factors when aggregating moral evaluations. For example, consequentialist perspectives assume an ability to accurately determine how an agent’s actions change the world; an assumption that often does not hold in real world settings. We, therefore, formalise agent decision making under moral uncertainty, while also accounting for these kinds of contextual factors. We thereby show that a seemingly commonsensical property – the weak Pareto principle – is violated. We argue that this apparent problem is, in fact, a variation of Simpson’s paradox, and hence reveals the limitations of aggregation mechanisms that ignore the impact of contextual factors.

[AI-35] Quantum-Inspired Trace-Augmented Evidence Selection for Reasoning over Structured Hypothesis Spaces

Link: https://arxiv.org/abs/2606.06941
Authors: Laura Wynter,Nirvik Sahoo,Paul Griffin
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large language models (LLMs) now solve a wide range of expert-level exams at or above human level, yet remain brittle on specialised, evidence-intensive domains such as law. On these tasks, errors arise not only from gaps in world knowledge but also from subtle distinctions between pieces of evidence and inconsistent use of supporting evidence. The most common aggregator over sampled chain-of-thought (CoT) traces, majority vote, returns the most popular answer regardless of whether its evidence is actually strongest. We propose to treat the selection of CoT reasoning fragments into a set of evidence as an explicit combinatorial optimisation problem, allowing well-supported but minority hypotheses to override noisy majorities, and to evaluate the approach on legal-reasoning benchmarks that are particularly sensitive to evidence quality. We introduce EP-HUBO (Evidence Pool Higher-Order Binary Optimisation), which generates multiple CoT traces with a small local model, parses fragments into per-hypothesis evidence pools, solves a higher-order unconstrained binary optimisation per pool with quality-derived weights (relevance, specificity, distinctiveness), and delegates a single adjudication call per question to a frontier model. We evaluate EP-HUBO on two evidence-intensive legal benchmarks using both simulated annealing on classical hardware and the Dirac-3 photonic entropy-quantum machine from Quantum Computing Inc. HUBO-style optimisation gives a principled way to aggregate reasoning fragments while preserving minority-but-correct hypotheses, and is most valuable in low-contamination domains where frontier models have not already absorbed the benchmark material.
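For readers unfamiliar with the term, a higher-order unconstrained binary optimisation picks a 0/1 vector that maximises (or minimises) a polynomial with terms of degree three or more. The sketch below solves a three-fragment toy instance by exhaustive enumeration. The weights, the redundancy penalty, and the third-order bonus are hypothetical numbers, and the brute-force solver stands in for the simulated annealing and photonic hardware mentioned in the abstract; it is not the paper's objective.

```python
from itertools import product
from typing import Dict, Sequence, Tuple

Bits = Tuple[int, ...]


def hubo_value(x: Bits,
               linear: Sequence[float],
               pairwise: Dict[Tuple[int, int], float],
               cubic: Dict[Tuple[int, int, int], float]) -> float:
    """Evaluate a degree-3 polynomial over binary variables x."""
    value = sum(w * x[i] for i, w in enumerate(linear))
    value += sum(c * x[i] * x[j] for (i, j), c in pairwise.items())
    value += sum(c * x[i] * x[j] * x[k] for (i, j, k), c in cubic.items())
    return value


def solve_by_enumeration(linear: Sequence[float],
                         pairwise: Dict[Tuple[int, int], float],
                         cubic: Dict[Tuple[int, int, int], float]) -> Bits:
    """Return the maximising assignment; only feasible for a few variables."""
    candidates = product((0, 1), repeat=len(linear))
    return max(candidates, key=lambda x: hubo_value(x, linear, pairwise, cubic))


# Hypothetical evidence pool: fragments 0 and 1 are near-duplicates, so
# selecting both is penalised; using all three earns a small bonus.
linear = [0.9, 0.8, 0.3]          # per-fragment quality weights
pairwise = {(0, 1): -1.0}         # redundancy penalty
cubic = {(0, 1, 2): 0.1}          # third-order interaction term
best = solve_by_enumeration(linear, pairwise, cubic)
print(best)
```

Enumeration scales as 2^n in the number of fragments, which is why larger pools call for heuristic solvers such as the ones the abstract names.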

[AI-36] Declarative Skills for AI Agents in Knowledge-Grounded Tool-Use Workflows

Link: https://arxiv.org/abs/2606.06923
Authors: M. Danish Lim,I. Danial Bin Sharudin,Wen Han Chen,Cedric Lim,Laura Wynter
Subjects: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:

Click to view abstract

Abstract:We study orchestration mechanisms for tool-using AI agents in realistic customer-service workflows over an unstructured knowledge base. We argue that declarative agents (AI agents equipped with natural-language skill files appended to the system prompt) are an effective orchestration paradigm. Concretely, we compare (i) a DeclarativeAgent that reads three domain-specific skill files at inference time and decides its own control flow, (ii) an ImperativeAgent based on a programmatic state machine with explicit phases, and (iii) an unscaffolded baseline agent modeled after the τ-Knowledge benchmark agent. Our ImperativeAgent is motivated by externalised-control inference as in Recursive Language Models and graph-based orchestration frameworks. We formalise the three agents as policy classes within a decentralised partially-observable Markov decision process and analyse their information-theoretic and structural properties; we then test the predicted differences empirically on five language models and two retrieval regimes. Our results show that retrieval quality is a dominant bottleneck for AI agents: when evidence is incomplete or skewed, all agents degrade substantially, and skill files cannot recover lost performance. Under high-quality retrieval, however, declarative skills consistently improve accuracy on procedural tasks and reduce orchestration errors, while the imperative state machine’s brittleness does not reliably improve task success or compliance.

[AI-37] The Fine-Tuning Trap: Evaluating Negative Transfer and the Role of PEFT in Sub-1B Mathematical Reasoning

Link: https://arxiv.org/abs/2606.06920
Authors: Rahul Nair,Chun Tao
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 8 pages, 6 figures, 2 tables

Click to view abstract

Abstract:Deploying Small Language Models (SLMs) on edge devices requires efficient fine-tuning strategies that adapt models to new tasks without degrading their general capabilities. In this study, we benchmark five sub-1B models (135M-1B) on mathematical reasoning tasks and uncover a critical vulnerability: Full Fine-Tuning (Full FT) actively harms performance in models under 300M parameters, often dropping accuracy below zero-shot baselines. This “negative transfer” makes Parameter-Efficient Fine-Tuning (PEFT) not just an efficiency preference, but a stability requirement. We find that while Low-Rank Adaptation (LoRA) and Weight-Decomposed LoRA (DoRA) perform comparably, their strengths vary by task; DoRA excels in complex reasoning (GSM8K), while LoRA dominates pattern matching (OrcaMath). In particular, Full FT is outperformed by LoRA on aligned models (Qwen2.5-0.5B) and even by simple 5-shot In-Context Learning on the smallest architectures (SmolLM2-135M). Based on these findings, we recommend defaulting to PEFT for all aligned sub-1B models and caution against Full FT for any architecture smaller than 500M parameters to prevent catastrophic forgetting. Reproduction of this work can be found at this https URL.

[AI-38] Workflow-to-Skill: Skill Creation via Routing-Workflow-Semantics-Attachments Decomposition

Link: https://arxiv.org/abs/2606.06893
Authors: Yuyang Zhang,Xinyuan Han,Xudong Jiang,Run Wang
Subjects: Artificial Intelligence (cs.AI)
Comments: 10 pages, 2 figures

Click to view abstract

Abstract:Large language model agents increasingly rely on Skills to encode procedural knowledge, yet high-quality Skills remain costly to hand-write. This paper studies automatic Skill construction from heterogeneous interaction evidence, including demonstrations, agent trajectories, tool traces, and execution logs. We argue that trace-to-skill construction is not a simple summarization task, because traces are fragmented, redundant, and may miss rare but safety-critical behaviors. To address this, we introduce RWSA, a workflow-oriented intermediate representation that decomposes Skills into Workflow structure, execution Semantics, and runtime Attachments, capturing task decomposition, control flow, verification, safety, rollback, and state management. Building on RWSA, we propose W2S, a framework that segments traces, induces local Skill drafts, aligns shared structures, reconciles branches, and compresses redundancy while preserving evidence and confidence annotations. Experiments on 70 Skills show that W2S improves behavioral replay consistency by 10.5% over summarization- and prompting-based baselines, highlighting the need to treat traces as executable runtime specifications rather than compressible text.

[AI-39] Neuro-Symbolic Learning for Long-Horizon Task Planning Under Complex Logical Constraints

Link: https://arxiv.org/abs/2606.06877
Authors: Qiwei Du,Zitong Zhan,Shaoshu Su,Bowen Li,Yi Du,Zhipeng Zhao,Taimeng Fu,Sebastian Scherer,Jiaoyang Li,Chen Wang
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Task planning often suffers from severe efficiency bottlenecks when robots must reason over long-horizon action sequences under complex logical constraints, including object affordances, spatial relationships, and sequential action dependencies. Recent neuro-symbolic methods improve planning efficiency by learning object-importance scores to prune task-irrelevant objects, but they typically rely on fixed offline supervision generated from full search spaces. This creates a train-test mismatch: at deployment, the planner operates in pruned search spaces induced by the model’s own imperfect predictions, leading to exposure bias and degraded planning performance. To address this challenge, we formulate object-importance learning for task planning as an imperative learning-based bilevel optimization problem. The upper level optimizes a neural scorer, while the lower level solves a symbolic planning problem in the score-pruned search space. To stabilize this learning process, we introduce a 3R strategy into the lower-level planning, using parallel Repair, Restart, and Rollback recovery to provide reliable and adaptive feedback for upper-level learning. Experiments on three challenging benchmarks demonstrate state-of-the-art performance, including an 80.04% reduction in failure rate and a 57.14% reduction in planning time. We further validate the framework on a quadruped-based mobile manipulator in simulation and the real world, demonstrating its potential for efficient and deployable neuro-symbolic task planning.

[AI-40] Evidence-Based Intelligent Diagnostic and Therapeutic Visualization System with Large Language Models: Multi-Turn Interaction and Multimodal Treatment Plan Generation

Link: https://arxiv.org/abs/2606.06869
Authors: Yunhan Wang,Yuda Wang,Zhiying Tu,Mingqiang Song,Li Song,Kun Li,Dianhui Chu,Bolin Zhang
Subjects: Artificial Intelligence (cs.AI)
Comments: 29 pages, 9 figures, 5 tables, including supporting information

Click to view abstract

Abstract:Aim: Existing AI-assisted traditional Chinese medicine diagnostic tools suffer from opaque reasoning processes, passive interaction, and limited treatment plan presentation. This study proposes a knowledge-enhanced visual diagnostic system to improve the transparency and interpretability of syndrome differentiation and treatment. Methods: The system is built upon a Neo4j knowledge graph comprising 241 syndromes, 1,263 symptoms, and 2,485 relations. It incorporates a four-stage symptom matching pipeline (exact, semantic, fuzzy, and large language model verification), an information gain-driven proactive questioning strategy optimized with genetic algorithms, and a multimodal treatment presentation integrating artificial intelligence-generated illustrations, three-dimensional meridian-acupoint models, and evidence-based literature. Results: Knowledge graph constraints reduced non-standard outputs by 32%. Case studies validated the effectiveness of the interactive workflow across patient self-assessment, clinician-assisted diagnosis, and traditional Chinese medicine education. Automated paired-comparison evaluation across 30 cases further demonstrated significant improvements in diagnostic trust (Cohen’s d = 1.82, p < 0.001), reduced cognitive load (improvements in four of five dimensions), and higher credibility of evidence-based references (4.21 vs. 2.95). Conclusions: The proposed system enhances the transparency of traditional Chinese medicine diagnostic reasoning and the interpretability of treatment plans through knowledge graph-driven visualization and multimodal interaction, offering a practical solution for trustworthy artificial intelligence-assisted traditional Chinese medicine applications.

[AI-41] Modeling Nonlinear Feature Interactions with Product-Unit Residual Networks ICCS2026

Link: https://arxiv.org/abs/2606.06861
Authors: Ziyuan Li,Uwe Jaekel,Babette Dellen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at ICCS 2026

Click to view abstract

Abstract:Understanding nonlinear feature interactions is crucial in science and engineering, yet standard multilayer perceptrons (MLPs) often capture such interactions only implicitly, leading to entangled representations that can impair robustness and interpretability. We investigate product-unit residual networks (PURe) that integrate multiplicative product units with residual connections to explicitly model cross-feature couplings while stabilizing optimization. We conduct a systematic evaluation on an interaction-driven synthetic benchmark and two real-world datasets, assessing predictive accuracy, robustness to Gaussian feature noise, and performance under limited training data, and we compare real- and complex-valued variants under a matched parameter budget. Beyond accuracy, SHapley Additive exPlanations (SHAP)-based interaction analyses show that PURe learns more concentrated and structurally coherent interaction patterns than MLP baselines. Overall, PURe achieves competitive or improved performance, better robustness and sample efficiency in low-data regimes, and enhanced interaction-level interpretability.

[AI-42] LLM Agent-Assisted Reverse Engineering with Quantitative Readability Metrics

Link: https://arxiv.org/abs/2606.06838
Authors: Neil Archibald,Ruben Thijssen
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Automatic decompilers produce functionally correct but often unreadable C code. This paper addresses one stage of the reverse engineering workflow: improving the readability of decompiled code using LLM agents guided by quantitative metrics. We present a three-phase research evolution. Phase 1 (tool-driven steering via Ghidra MCP) suffered from incomplete coverage and inconsistent improvements due to lack of quantitative guidance. Phase 2 (structural similarity validation alone) revealed that agents optimize for metrics in unintended ways, producing structurally equivalent but less readable code. Our contribution is the Quantitative Readability Score (QRS) framework, a composite metric combining a structural similarity gate with three independent readability sub-metrics (Lexical Surprisal, Structural Simplicity, and Idiomatic Quality). We demonstrate that QRS-guided refinement enables LLM agents to make targeted readability improvements without sacrificing correctness. We provide a discussion of the broader reverse engineering workflow (binary lifting, decompilation cleanup, and achieving functional equivalence) as context; however, it remains out of scope.

[AI-43] Hearing the Unspoken: Language Model Priors for Acoustic Adversarial Attacks

Link: https://arxiv.org/abs/2606.06833
Authors: Jiani Xie,Andrew C. Cullen,Paul Montague,Benjamin I. P. Rubinstein
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:

Click to view abstract

Abstract:Automatic Speech Recognition (ASR) systems operating in real-time settings must process acoustic input under strict temporal constraints, where transcription decisions are inherently made on incomplete information. This causal constraint serves as an information bottleneck on attackers, significantly limiting attack performance. Our new Semantic Gambit attack breaks this causal limitation by augmenting the adversary with predictive context derived from a Large Language Model in real-time. Our experiments show that this form of augmentation can elevate the corpus-level Word Error Rate to 35.6% – a three-fold increase over the current state-of-the-art. Ultimately, this work reveals how common, low-latency LLM tooling can be exploited to systematically subvert real-time ASR pipelines.

[AI-44] PandaAI: A Practical Agent CQ2 for Neuro-symbolic Data Analysis And Integrated Decision-Making in Quantitative Finance

Link: https://arxiv.org/abs/2606.06823
Authors: Yuqi Li,Siyuan Liu,Bingjun Liu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:While deep learning has excelled in various domains, its application to sequential decision-making in finance remains challenging due to the low Signal-to-Noise Ratio (SNR) and non-stationarity of financial data. Leveraging the reasoning capabilities of Large Language Models (LLMs), we propose PandaAI, a closed-loop neuro-symbolic LLM agent with market regime modeling and constrained alpha generation, which bridges general LLM reasoning with financial rigor and suppresses the financial toxicity of LLM-generated outputs. To bridge the gap between general linguistic capability and financial rigor, we fine-tune a domain-specific LLM. Furthermore, we integrate this LLM into a modular architecture and form a closed-loop system. Unlike traditional models that optimize isolated prediction metrics, PandaAI is designed as a neuro-symbolic agent that navigates the complex, real-world financial environment with explicit risk awareness. Extensive experiments on CSI 300 stock data show that PandaAI achieves an 18.2% higher Rank IC and 25.7% lower maximum drawdown than state-of-the-art time-series models. Our constrained LLM generation and dual-channel adaptation method provide a general paradigm for LLM deployment in high-stakes sequential decision-making scenarios.

[AI-45] SCALE: Scalable Cross-Attention Learning with Extrapolation for Agentic Workflow Scheduling

Link: https://arxiv.org/abs/2606.06820
Authors: Zhifei Xu,Jierui Lan,Zixuan Liang,Aiji Liang,Jinxi He
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Submitted to Computer Networks

Click to view abstract

Abstract:Agentic Large Language Model (LLM) systems decompose complex tasks into workflow Directed Acyclic Graphs (DAGs) whose primitives must be scheduled on heterogeneous clusters. Existing deep reinforcement learning (DRL) schedulers are tied to a fixed cluster size and require retraining whenever the number of servers changes. We propose SCALE (Scalable Cross-Attention Learning with Extrapolation), a DRL scheduler that generalizes to unseen cluster scales without fine-tuning. SCALE employs a cross-attention pointer network where task features query against server features, so the architecture accepts any number of servers by construction. We observe, however, that a permutation-invariant architecture alone does not guarantee good performance at new scales: the attention feature undergoes distribution shift as the server count grows. To counter this, we introduce Structured Representation Regularization (SRR): a decorrelation loss combined with a KL penalty toward the standard normal, which keeps feature statistics stable regardless of input size. Trained on 16 nodes and tested directly on 32 and 48 nodes, SCALE reduces average response time by 8.9% at N=48 relative to the same architecture without SRR, confirming that explicit regularization is necessary to close the scale-generalization gap.

[AI-46] Lane Change Trajectory Planning for Personalized Driving Comfort and Mobility Efficiency

Link: https://arxiv.org/abs/2606.06805
Authors: Haoxuan Dong,Dongjun Li,Ziyou Song
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: Accepted by the IEEE Intelligent Vehicles Symposium (IEEE IV 2026), Detroit, MI, United States, June 22-25, 2026

Abstract:Lane changing entails simultaneous longitudinal and lateral motions that affect driving comfort and mobility efficiency. Because these motions are tightly coupled and subject to substantial inter-vehicle variability, trajectory planning for lane-change maneuvers is characterized by a highly personalized nature. This study proposes a neural network-driven planner that integrates a third-order polynomial trajectory generator with a learning module that infers optimal trajectory parameters across diverse driving conditions. Using a shared backbone with dual heads, one head ensures all-condition operational guarantees, while the other captures driver-specific preferences for comfort or mobility efficiency. A head-gated switching mechanism, realized through a statistical gate based on error-winner logistic regression, adaptively selects the appropriate head under varying driving conditions, which enables context-aware lane-change trajectory planning. Representative cases and Monte Carlo simulations show that the proposed planner achieves personalized comfort and mobility during lane changes, while the baseline ensures feasible trajectories under driving conditions where personalized data are insufficient or inaccessible.
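As a rough illustration of what a third-order polynomial trajectory generator computes, the sketch below builds only the lateral profile of a lane change. It assumes zero lateral offset and zero lateral velocity at the start, and full lane offset with zero lateral velocity at the end; the function name `cubic_lane_change` and the choice of lane width and duration as the parameters are assumptions for illustration, not details taken from the paper.

```python
def cubic_lane_change(width, duration):
    """Lateral offset y(t) = a2*t^2 + a3*t^3 for a lane change.

    Boundary conditions (assumed): y(0) = 0, y'(0) = 0, y(T) = width, y'(T) = 0.
    Solving them gives a2 = 3*width/T^2 and a3 = -2*width/T^3.
    """
    a2 = 3.0 * width / duration ** 2
    a3 = -2.0 * width / duration ** 3

    def lateral(t):
        return a2 * t ** 2 + a3 * t ** 3

    # y''(t) = 2*a2 + 6*a3*t is linear in t, so its magnitude peaks at t = 0 and t = T.
    peak_lateral_accel = 6.0 * width / duration ** 2
    return lateral, peak_lateral_accel
```

Under these assumptions a shorter duration finishes the maneuver sooner but raises peak lateral acceleration quadratically, which is one simple way to see the comfort versus mobility trade-off the abstract refers to.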

[AI-47] AdMem: Advanced Memory for Task-solving Agents

Link: https://arxiv.org/abs/2606.06787
Authors: Runzhe Wang,Huilin Lu,Shengjie Liu,Li Dong,Jason Zhu
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) show promise as tool-using agents but remain limited in long-horizon tasks that require remembering, organizing, and reusing knowledge. Prior memory approaches aim to resolve the situation, but mainly focus on storing factual information. Recent work on procedural memory improves task reuse, yet often reduces to replaying past successes without addressing failure cases or online scalability. We introduce a unified and automatic memory framework that integrates semantic, episodic, and procedural memory in a bi-level design combining short-term and long-term stores. A multi-agent architecture with actor, memory, and critic agents enables automatic memory generation, reward annotation, and adaptive retrieval. Long-term memory is managed through reward-based evaluation, merging, and pruning, ensuring scalability and continual improvement. Experiments across various environments show that our approach improves robustness and success on long multi-turn tasks compared to existing baselines. This work highlights the importance of comprehensive, adaptive memory for advancing LLM-based agents.

[AI-48] What Your Posts Reveal: A Benchmark and Agentic Framework for User-Level Privacy Leakage on Social Media

Link: https://arxiv.org/abs/2606.06784
Authors: Zifan Peng,Yini Huang,Aiwen Lu,Qiming Ye,Peixian Zhang,Jingyi Zheng,Yule Liu,Xuechao Wang,Xinlei He,Jiaheng Wei
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:Public social media posts can reveal private information through weak cues scattered across text, images, or metadata. Such leakage is often cumulative and cross-post: cues that appear harmless in isolation may jointly expose a user’s home, workplace, or routine. However, current research lacks a unified benchmark for user-level multimodal privacy leakage and an evaluation metric that captures exposure severity beyond binary accuracy. To address these gaps, we propose SopriBench, a synthetic benchmark guided by leakage patterns abstracted from a private reference corpus of Rednote and Instagram accounts, covering 50 user profiles and 1,569 images with attributes, contextual sensitivity, granularity, leakage type, inference difficulty, and supporting evidence. We further introduce the Privacy Exposure Score (PES), which weights value granularity by contextual sensitivity. Inspired by abductive reasoning, we introduce Argus, a training-free agentic framework for cumulative leakage inference. Argus forms hypotheses from accumulated evidence, verifies supporting evidence, and aggregates cross-post cues into privacy profiles, achieving 0.55 PES, a 25% improvement over the strongest baseline, with the largest gain on cross-post leakage.

[AI-49] AxisGuide: Grounding Robot Action Coordinate System in RGB Observations for Robust Visuomotor Manipulation

Link: https://arxiv.org/abs/2606.06761
Authors: Jiyun Jang,Yujin Sung,Woosung Joung,Daewon Chae,Sangwon Lee,Sohwi Kim,Jinkyu Kim,Jungbeom Lee
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Accepted to Robotics: Science and Systems (RSS) 2026

Abstract:Visuomotor manipulation policies trained via large-scale behavior cloning have achieved strong semantic scene understanding, yet often fail to reliably execute correct low-level actions under distribution shifts. For example, even in a simple pickup task with identical scene layouts, camera viewpoints, and illumination, performance can degrade substantially when the object is placed at unseen locations. We argue that this gap arises from insufficient action understanding, namely the inability to interpret the robot’s base-frame action coordinate system in image space. To address this issue, we introduce AxisGuide, a lightweight guidance method that bridges semantic scene understanding and action-coordinate interpretation. Using camera parameters and end-effector poses, AxisGuide renders the robot base-frame axes in each camera view and augments RGB observations with a small set of cue channels that explicitly visualize the meaning of the +x, +y, and +z motions in image space. Extensive evaluations in both the LIBERO simulation and real-world environments demonstrate that AxisGuide yields substantial performance gains and improved generalization, highlighting the effectiveness of explicit action-coordinate cues for learning reliable and transferable generalist visuomotor policies.
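The rendering step described here can be pictured with a standard pinhole projection. The sketch below assumes base-to-camera extrinsics `R` (3x3 rotation) and `t` (translation), intrinsics `(fx, fy, cx, cy)`, and an axis length of 0.1 in base-frame units; the function names and the choice to anchor the axes at the end-effector position are assumptions, and the paper's actual cue channels may be built differently.

```python
def project(point_base, R, t, fx, fy, cx, cy):
    """Map a 3D point in the robot base frame to pixel coordinates (pinhole model)."""
    x, y, z = (sum(R[i][j] * point_base[j] for j in range(3)) + t[i]
               for i in range(3))
    if z <= 0:
        raise ValueError("point is behind the camera")
    return (fx * x / z + cx, fy * y / z + cy)

def axis_segments(ee_pos, R, t, intrinsics, length=0.1):
    """Return image-space segments showing where +x, +y and +z motions point."""
    fx, fy, cx, cy = intrinsics
    origin = project(ee_pos, R, t, fx, fy, cx, cy)
    segments = {}
    for name, axis in (("+x", (1, 0, 0)), ("+y", (0, 1, 0)), ("+z", (0, 0, 1))):
        tip = [p + length * a for p, a in zip(ee_pos, axis)]
        segments[name] = (origin, project(tip, R, t, fx, fy, cx, cy))
    return segments
```

Each returned segment could then be rasterized into its own extra image channel, so the policy sees directly how a +x, +y, or +z action moves in the current camera view.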

[AI-50] A Geometric Account of Activation Steering through Angle-Norm Decomposition

Link: https://arxiv.org/abs/2606.06735
Authors: Georgii Aparin,Tatiana Gaintseva
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Linear activation steering has gained popularity as a simple and empirically effective way to control language model behavior. More recently, spherical steering paradigms have been proposed to address limitations of additive interventions, often motivated by the assumption that hidden-state norm does not carry concept-relevant information. In this work, we revisit this assumption through a controlled empirical study designed to disentangle the roles of angular and radial components. We show that steering methods differ mainly in how they couple two geometric effects: changing a token’s angular alignment with a concept direction and changing its hidden-state norm. Across seven language models, we find that concepts are represented primarily in angular structure, supporting the motivation for spherical methods, but that norm remains important for the stability and downstream effects of steering. Our results explain why interventions with similar concept-level effects can behave differently, and suggest that activation steering should be parameterized by interpretable angular and radial components of the intervention, rather than by a single additive coefficient that entangles these two effects.
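The coupling between angle and norm that the abstract describes can be seen in a toy example. The sketch below assumes a hidden state `h`, a concept direction `v`, and an additive intervention `h + alpha * v`; the helper names are hypothetical and the two-dimensional vectors are for illustration only, not drawn from the paper's experiments.

```python
import math

def norm(vec):
    return math.sqrt(sum(x * x for x in vec))

def angle_norm(h, v):
    """Return (angle between h and v in radians, norm of h)."""
    r = norm(h)
    cos = sum(a * b for a, b in zip(h, v)) / (r * norm(v))
    return math.acos(max(-1.0, min(1.0, cos))), r

def additive_steer(h, v, alpha):
    """Classic additive steering: changes both angle and norm at once."""
    return [a + alpha * b for a, b in zip(h, v)]

def rescale_to_norm(h, r):
    """Keep the direction of h but set its norm to r (angle-only intervention)."""
    n = norm(h)
    return [x * r / n for x in h]
```

For `h = [0, 2]` and `v = [1, 0]`, adding `2 * v` rotates the state from 90 to 45 degrees relative to `v` and also grows its norm from 2 to about 2.83; rescaling the result back to norm 2 keeps the new angle while undoing the radial change, which separates the two effects a single additive coefficient entangles.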

[AI-51] SCOUT: Semantic scene COverage via Uncertainty-guided Traversal ICRA

Link: https://arxiv.org/abs/2606.06721
Authors: Junyu Mao,Sara Ayoubi,Vishnu D. Sharma,Ilija Hadžić,Matthew Andrews
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 2026 ICRA Workshop on Uncertainty in Open World Robotics

Abstract:Robots that operate over extended periods should not merely visit space; they should progressively understand it. Yet most 3D scene graph pipelines treat perception as a post-processing stage over a fixed dataset, decoupling scene representation from the decisions that determine what is observed in the first place. We present SCOUT, an online semantic exploration framework that closes this loop by coupling active traversal with probabilistic scene graph construction. Given a prior 2D occupancy map and posed RGB-D observations, SCOUT incrementally builds an uncertainty-aware 3D scene graph whose nodes maintain fused geometry and posterior beliefs over open-vocabulary object labels, while edges encode structural relations such as on, inside, belong, and next to. These beliefs are fed back to an uncertainty-guided traversal planner, which selects viewpoints by balancing expected semantic certainty gain, geometric coverage gain, and travel cost. In this way, the robot revisits ambiguous objects when additional evidence matters and expands into unseen free space when the scene remains incomplete. The resulting system treats semantic scene completeness as an operational objective rather than a passive by-product of semantic mapping, moving toward autonomous agents that can patrol, update, and reason about evolving indoor environments with minimal human intervention.
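A viewpoint selection rule of the kind described here can be sketched as a weighted score. The version below assumes that the entropy of each visible object's label posterior stands in for expected semantic certainty gain, that coverage gain is a fraction between 0 and 1, and that the three weights are free parameters; the dictionary keys and function names are hypothetical, and the paper's actual objective may differ.

```python
import math

def entropy(posterior):
    """Shannon entropy (nats) of a label posterior; higher means more ambiguous."""
    return -sum(p * math.log(p) for p in posterior if p > 0.0)

def viewpoint_score(view, w_sem=1.0, w_cov=1.0, w_cost=0.5):
    """Trade off semantic gain and coverage gain against travel cost."""
    semantic_gain = sum(entropy(p) for p in view["visible_posteriors"])
    return (w_sem * semantic_gain
            + w_cov * view["coverage_gain"]
            - w_cost * view["travel_cost"])

def select_viewpoint(views, **weights):
    """Pick the candidate viewpoint with the highest score."""
    return max(views, key=lambda v: viewpoint_score(v, **weights))
```

With this scoring, a view of an ambiguous object (posterior near uniform) beats a view of an already confident one at equal cost, while a cheap view that opens a large unseen area can beat both.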

[AI-52] MSAIC-Net: A Multi-Scale Attention and Imbalance-Aware Contrastive Network for ECG-Based Myocardial Substrate Abnormality Detection

Link: https://arxiv.org/abs/2606.06718
Authors: Canyu Lei,Fenglin Zhang,Derek Bivona,Cristiane Singulane,Jonathan Pan,Kenneth Bilchick,Amit R. Patel,Jianxin Xie
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments:

Abstract:Myocardial substrate abnormalities, such as myocardial scar and myocardial infarction (MI), are associated with adverse cardiovascular outcomes. Electrocardiography (ECG) provides a low-cost and widely available tool for detecting these abnormalities, but ECG-based detection remains challenging due to heterogeneous lead-dependent manifestations, high-dimensional multi-lead signals, class imbalance, and the limited interpretability of deep learning models. We propose a multi-scale attention-enhanced convolutional network (MSAIC-Net) for ECG-based myocardial substrate abnormality detection. MSAIC-Net employs parallel atrous convolutional branches to extract ECG features across multiple temporal receptive fields, enabling the model to capture both local and longer-range temporal patterns. Channel attention is then used to adaptively reweight informative lead-wise and feature-channel representations. To address class imbalance and improve feature separability, we introduce a novel imbalance-aware supervised contrastive learning strategy that encourages samples from the same class to form compact representations while increasing separation between abnormal and normal samples. Lead-wise permutation importance is further incorporated to quantify the contribution of each ECG lead and improve model interpretability. The proposed method was evaluated on two complementary datasets: a low-data institutional cohort from the University of Virginia (UVA) Health System for myocardial scar classification and the large-scale public PTB-XL dataset from PhysioNet for MI identification. Experimental results show that MSAIC-Net outperforms baseline models, with particularly pronounced improvements in the low-data UVA cohort. Overall, the proposed framework provides an effective and interpretable approach for ECG-based detection of myocardial substrate abnormalities.

[AI-53] ShallowBench: Benchmarking Generative Drug Design Models on Shallow-Pocket Targets

Link: https://arxiv.org/abs/2606.06717
Authors: Saket Reddy,Shiwei Liu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM); Quantitative Methods (q-bio.QM)
Comments:

Abstract:While generative AI models have demonstrated remarkable success in structure-based drug design, they predominantly rely on deep binding pockets and struggle to sample effective ligands for challenging low-pocketability targets, such as the historically “undruggable” oncology targets KRAS and MYC. To address this gap, we introduce ShallowBench, a strictly curated benchmark of 5,780 shallow-pocket targets extracted from CrossDocked2020. By computing the difference between an Alpha Shape “lid” volume and the underlying protein atom voxel volume, we successfully isolated targets with low concavity while ensuring sufficient surface area for binding. Evaluating various state-of-the-art generative models reveals weaker predicted binding affinity on these low-concavity interfaces. ShallowBench therefore provides a rigorous benchmark for generative biology models and highlights the necessity of new architectural innovations or loss functions capable of navigating these challenging targets.

[AI-54] The Geography of Algorithmic Judgment: LLM Intermediaries Place Identity and Racial Steering in Housing Search

Link: https://arxiv.org/abs/2606.06694
Authors: Hana Samad,Trung Lam,Christoph Mügge-Durum,Michael Akinwumi
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 13 pages with supplemental tables and figures, AIES '26 Submission

Abstract:Large language models (LLMs) are rapidly assuming an intermediary role in housing search through the integration of listing platforms within conversational interfaces, mediating access to information, search, and recommendations within urban settings. We expand on prior work on racial steering in LLMs by conducting a behavioral audit of seven open-weight and closed-source LLMs across four U.S. cities, testing location recommendations across three iterative prompting conditions that progressively add lifestyle preference context and reflect fair housing paired-testing methodologies. We find that steering is an emergent behavior of the model’s interpretive license rather than primarily a static property. Steering results from the interaction of a user’s identity, preference articulation, and the spatial logic that a model has internalized about learned representations of place, preference, and opportunity in a given city, and how different types of users relate to it. While steering was present, it was not uniform in direction or magnitude across evaluated conditions. Preference-conditioned testing often increased or reconfigured the number of models that exhibited steering behaviors relative to baseline conditions, suggesting that LLMs may interpret what the same housing preference means differently depending on the racial identity of the user. Our findings also demonstrate that the city is not a neutral testing unit for LLM evaluation in place-based sectors, and results from one local market cannot be assumed to generalize to another. Local and domain expertise will be required in the housing sector to ensure that legal and institutional commitments to fair housing are not undermined while adopting AI tools that mediate spatial access.

[AI-55] AEGIS: A Backup Reflex for Physical AI

Link: https://arxiv.org/abs/2606.06660
Authors: Josef Chen
Subjects: Artificial Intelligence (cs.AI); Performance (cs.PF); Robotics (cs.RO)
Comments:

Abstract:Long-horizon robot manipulation tends to fail gradually: one bad step degrades the state, and the policy spirals into a basin from which it cannot recover. The failure is often visible before it happens. We introduce AEGIS (Activation-probe Early-warning, Gated Inference Switching), a selective escalation method that uses a lightweight probe on a weak policy’s frozen activations to detect high-risk steps while there is still time to act. When the probe flags a step, control switches to a stronger separate policy, but only for the steps that need it. On LIBERO-Spatial, AEGIS recovers 10.1% of the trajectories the weak policy alone loses, versus 4.6% for budget-matched blind escalation and 5.1% for a random-trigger placebo. These gains are significant under one-sided exact paired McNemar tests with Holm-Bonferroni adjustment over three pre-registered contrasts: +5.4pp over blind escalation, p=8.5e-6; +5.0pp over random triggering, p=1.0e-4; paired-trajectory bootstrap CIs exclude zero. AEGIS activates the stronger policy on only 38% of steps, so the lever is timing rather than compute. The probe clears its precondition with an early-window AUROC of 0.764, 95% CI [0.70, 0.84], read from the weak-policy path over the first 30% of trajectory steps before any handoff. We pre-register the full analysis plan, including a conditional recovered-task-rate estimand and explicit kill criteria, and confirm the result on 700 common-random-number episodes per arm, with nA-fail=646.
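The selective escalation loop can be sketched in a few lines. The version below assumes the probe is a logistic regression over the weak policy's activations and that escalation happens whenever the predicted risk crosses a fixed threshold; `probe_risk`, `gated_step`, the policy call signatures, and the threshold value are all hypothetical stand-ins rather than the paper's interfaces.

```python
import math

def probe_risk(activations, weights, bias):
    """Logistic probe: maps frozen weak-policy activations to a risk in (0, 1)."""
    z = sum(w * a for w, a in zip(weights, activations)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def gated_step(obs, weak_policy, strong_policy, weights, bias, threshold=0.5):
    """Run the weak policy, and hand this step to the strong policy only if flagged.

    weak_policy(obs) is assumed to return (action, activations);
    strong_policy(obs) is assumed to return an action.
    Returns (action, escalated).
    """
    weak_action, activations = weak_policy(obs)
    if probe_risk(activations, weights, bias) >= threshold:
        return strong_policy(obs), True
    return weak_action, False
```

Because the strong policy is only called on flagged steps, the fraction of escalated steps (38% in the abstract) is controlled by the probe and threshold rather than by a fixed compute budget.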

[AI-56] A Study of Parallel Continuous Local Search

Link: https://arxiv.org/abs/2606.06656
Authors: Cody J Christopher,Charles Gretton
Subjects: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments:

Abstract:We study parallel Continuous Local Search (CLS) as a solution approach for Boolean satisfiability problems with symmetric pseudo-Boolean (PB) constraints. Here, the n-variable PB-satisfiability problem is relaxed to a continuous optimisation problem with a differentiable objective function on an n-dimensional hypercube. For satisfiable instances, the global minimisers of this optimisation problem correspond to satisfying assignments of the SAT problem at hand. We present several novel findings via empirical experiments: (i) redundant constraints can inhibit rather than accelerate convergence; (ii) CLS shows promise as a sub-solver in hybridised settings, quickly completing partial assignments; and (iii) local search rapidly converges to a stable distribution of solution quality (i.e., degree of satisfaction), due to saddle-dense objectives where additional solver steps yield diminishing returns. Our findings inform practical uses of CLS for SAT on modern accelerator hardware.

[AI-57] Accelerated Fourier SAT (AFSAT): Fully Realising a GPU-based Symmetric Pseudo-Boolean SAT Solver

Link: https://arxiv.org/abs/2606.06641
Authors: Cody J Christopher,Charles Gretton
Subjects: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments:

Abstract:We present Accelerated Fourier SAT (AFSAT), a GPU-accelerated solver for pseudo-Boolean satisfiability based on continuous local search (CLS). AFSAT realises the proof-of-concept approach, FastFourierSAT, into a fully-engineered solver supporting any heterogeneous mixture of symmetric constraint types and lengths within a single problem instance. Using the JAX compiler, AFSAT leverages pure function composition, automatic vectorisation, automatic differentiation, and just-in-time (JIT) compilation to perform massively parallel CLS across batches of candidate assignments. We demonstrate substantially improved numerical stability, runtime performance, and memory efficiency over the proof-of-concept. We achieve this by way of identifying and addressing various limitations that arise from memory latency and floating-point representation, as well as leveraging automatic parallelisation and compact representations. The inherent representational and stability limitations of floating point are partially addressed by a tailored discrete Fourier transform implementation. We achieve near-linear throughput when scaling to multiple accelerators via JAX array sharding.

[AI-58] ChronoForest: Closed-Loop Multi-Tree Diffusion Planning for Efficient Bridge Search and Route Composition

Link: https://arxiv.org/abs/2606.06618
Authors: Jungmin Seo,Jaesik Park
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 40 pages, 4 figures, 7 tables, 3 algorithms

Abstract:How can we plan long-horizon routes that reach designated goals, visit required waypoints, and remain short when only short-horizon offline trajectories are available? This problem matters in offline navigation because collecting sufficiently rich long-horizon data is difficult, yet real agents must still solve long-range tasks with route-level efficiency rather than mere feasibility. The difficulty is twofold: at the microscopic level, composing many short-horizon segments creates a trade-off between search cost and path quality, while at the macroscopic level, waypoint ordering requires comparing pairwise travel costs among start, goal, and waypoint anchors that are unknown before planning and increasingly unreliable when estimated only from long-range temporal distance. In this paper, we propose ChronoForest, a closed-loop planning system that couples local bridge search and online route re-solving through an anchor-chaining tree diffusion planner and an online multi-tree orchestrator. ChronoForest uses temporal distance for short-range guidance and node evaluation, while using search-time bridge evidence to validate long-range anchor connectivity and repeatedly re-solve the route. On OGBench AntMaze-Stitch, ChronoForest achieves 99.8%, 99.3%, and 99.5% success on the medium, large, and giant splits and improves giant-stitch success by up to 34.5 points over prior reported diffusion-based results. On Hamiltonian route-composition benchmarks, online re-solving corrects poor temporal orderings and improves route quality while remaining substantially cheaper than exhaustive planning.

[AI-59] FIGMA: Towards FIne-Grained Music retrievAl ACL2026

Link: https://arxiv.org/abs/2606.06615
Authors: Nishit Anand,Ashish Seth,Sreyan Ghosh,Dinesh Manocha,Ramani Duraiswami
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Accepted to ACL 2026. Project Website: this https URL

Abstract:Retrieving music using natural language descriptions has improved with contrastive audio-text models such as CLAP, but current systems remain limited to coarse semantic queries. When descriptions specify fine-grained musical attributes such as tempo, key, chord progression, or rhythmic structure, existing models often fail to retrieve the correct audio. We show that this limitation stems from the contrastive learning objective itself: despite being trained on long captions, CLAP-based models effectively utilize only the first few tokens, discarding much of the information encoded in detailed prompts. Then, we propose FIGMA (FIne-Grained Music RetrievAl), a multi-view contrastive architecture that addresses this limitation by jointly optimizing global audio-text alignment and frame-level, token-wise alignment. This design enables FIGMA to capture both high-level semantic context and fine-grained musical attributes within a unified representation space. Moreover, we formalize the task of Fine-Grained Music Retrieval and construct Fine-Grained Music Caption dataset (FGMCaps), a large-scale dataset of 380K music-caption pairs for training along with a 10K test set, both annotated with tempo, key, chord progression, beat count, as well as genre and mood. Extensive experiments demonstrate that FIGMA consistently outperforms existing CLAP-based music retrieval models across multiple music retrieval benchmarks, including out-of-domain evaluations, with relative improvements of up to 73.3%.

[AI-60] Generative Models Erode Human Temporal Learning Through Market Selection ICML2026

Link: https://arxiv.org/abs/2606.06572
Authors: Wenjun Cao
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); General Economics (econ.GN)
Comments: Accepted at ICML 2026

Abstract:We argue that modern generative models create structural risks for knowledge and cultural production at current, sub-AGI capability levels. We define Human Temporal Learning (HTL) as path-dependent knowledge accumulation through sustained engagement with problems over time. Generative outputs increasingly resemble HTL-intensive work in surface features, so verifying whether a given output reflects genuine human learning grows costly relative to its expected benefit. Once verification loses economic justification, evaluators reward outputs regardless of production mode, and producers who invested years of learning compete on price against outputs that cost almost nothing to generate. We call this pathway value collapse and formalize it through a costly-inspection framework. Cross-domain evidence from academic publishing, legal practice, content platforms, and software security maps onto four stages of verification erosion. Alignment success is orthogonal. Better-aligned models narrow observable gaps between human and AI outputs, making source verification harder and intensifying competitive pressure against HTL-intensive work even when individual AI outputs improve.

[AI-61] MalTree: Tracing Malware Evolution from Embeddings at Scale ICML2026

Link: https://arxiv.org/abs/2606.06570
Authors: Akash Amalan,Georgios Smaragdakis,Tom J. Viering
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 33 pages, accepted at ICML 2026

Abstract:Malware detection remains largely reactive: machine learning models trained on known samples degrade as threats evolve. Understanding evolutionary relationships among malware families can inform proactive defense, but traditional reverse engineering can take months to years to uncover such lineage relationships. We propose MalTree, a framework that applies bioinformatics inspired phylogenetic techniques (UPGMA and Neighbor-Joining) at scale to model malware evolution automatically using structural, behavioral, and image-based features. We introduce temporal validation using VirusTotal timestamps to assess whether inferred trees reflect actual evolutionary order. MalTree achieves 87% temporal consistency, indicating that inferred evolutionary relationships closely align with real-world emergence timelines. Our analysis shows that some families mutate over 10 times faster than others, suggesting that detection strategies should be tailored to family-specific evolutionary tempos. Case studies, including the Mirai botnet, confirm that inferred relationships from our phylogenetic tree align with documented threat intelligence. Our framework provides a foundation for shifting malware analysis from sample-by-sample classification toward lineage-aware evolutionary modeling.

[AI-62] NTILC: Neural Tool Invocation via Learned Compression

Link: https://arxiv.org/abs/2606.06566
Authors: Andrew Krikorian,Yayuan Li,Jason J. Corso
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 10 Pages, 4 Figures, 5 Tables, 1 Algorithm

Abstract:Agentic tool-calling language models depend on large registries of callable APIs, functions, and local actions. Placing full tool specifications directly in the prompt incurs a cost that scales linearly with the size of the tool registry, rapidly consuming the context budget. As the registry grows, this leads to higher latency and degrades selection accuracy, particularly due to interference from irrelevant tools. We overcome these limitations by introducing NTILC, a neural tool selection and invocation framework that replaces in-context registry look-up with learned latent retrieval. NTILC maps both user intent and tool specifications into a shared embedding space, enabling tool selection via external retrieval rather than in-context lookup. The language model is conditioned only on the selected tool schema, allowing for precise, constrained argument generation. Central to our approach is a signature-aware composite objective, which augments semantic similarity with constraints derived from tool signatures (e.g., argument schema, type compatibility, and return types). By combining Circle Loss with a Functional Margin Loss, the model enforces separation between tools that are semantically similar but incompatible under their execution signatures. We evaluate NTILC on public tool-selection and function-calling datasets and report context token usage, retrieval accuracy, and selection latency metrics. Across these settings, NTILC reduces context window consumption by over 95% and inference latency by up to 74% compared to long-context ICT baselines.
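The retrieval-then-condition flow in this abstract can be sketched with plain cosine similarity. The example assumes that intent and tool embeddings already live in one shared space and uses toy two-dimensional vectors; `select_tool`, `build_prompt`, and the registry layout are hypothetical, and the signature-aware training objective (Circle Loss plus Functional Margin Loss) is not shown.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def select_tool(intent_embedding, tool_registry):
    """Pick the tool whose embedding is closest to the user-intent embedding."""
    return max(tool_registry, key=lambda tool: cosine(intent_embedding, tool["embedding"]))

def build_prompt(user_request, tool):
    """Only the selected tool's schema enters the context, not the whole registry."""
    return f"Tool schema: {tool['schema']}\nRequest: {user_request}"
```

Since the prompt carries one schema regardless of registry size, context usage stays roughly constant as tools are added, which is the effect behind the reported reduction in context window consumption.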

[AI-63] WAV: Multi-Resolution Block Residual Routing for Deep Decoder-Only Transformers

Link: https://arxiv.org/abs/2606.06564
Authors: Kehan Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 6 pages, 4 figures, 3 tables

Abstract:Residual connections are central to training deep Transformers, but standard PreNorm residual streams aggregate sublayer updates with fixed unit weights. Recent Attention Residuals replace this fixed accumulation with content-dependent depth-wise routing, and Block Attention Residuals make the mechanism efficient by routing over block-level residual summaries. However, a single block summary stores only the low-frequency total residual displacement inside a block, discarding directional structure such as attention-vs-MLP imbalance and early-vs-late block dynamics. We propose WAV v1, a lightweight multi-resolution residual routing method for decoder-only Transformers. Instead of representing each block only by its accumulated residual sum, WAV v1 augments every block with two directional detail bases: a phase basis that contrasts attention and MLP updates, and a split basis that contrasts early and late sublayer updates. These bases are routed together with standard block summaries through the same depth-wise softmax mixer, while negative detail-source initialization and detached RMS matching stabilize training. On character-level TinyStories and Text8 language modeling, WAV v1 shows a clear depth-dependent benefit. Although it is not consistently beneficial at 12 layers, it becomes competitive at 24 layers and outperforms all baselines at 48 layers. At 48 layers, WAV v1 reduces validation loss relative to Block AttnRes from 0.4960 to 0.4738 on TinyStories and from 0.9363 to 0.9305 on Text8, with negligible additional parameters. These results suggest that directional residual details, not only block-level sums, are important for scaling residual routing in deeper Transformers.

[AI-64] AI-Driven Test Case Generation from Natural Language Requirements: A Survey of Techniques and Research Gaps

Link: https://arxiv.org/abs/2606.06563
Authors: Orimoloye Folorunsho,Hassan Reza
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 22 pages, 7 figures, 4 tables

Abstract:Software testing is critical for verifying that systems meet specified requirements, yet remains among the most time-consuming and expensive activities in development. Requirements-based test generation allows test cases to be derived early from requirements artifacts, but generating them directly from natural language is challenging due to inherent ambiguity and imprecision. Recent advances in AI, natural language processing (NLP), and large language models (LLMs) have made automating this pipeline increasingly feasible, while introducing new risks including hallucination, reduced traceability, and inconsistent evaluation. This survey addresses four research questions: what AI and NLP techniques have been proposed for generating test cases from natural language requirements; what tools and frameworks support these approaches; how generated test cases are evaluated; and what research gaps remain. Following Kitchenham and Charters’ systematic review guidelines, we searched major scholarly databases spanning 2000-2025 and, after applying strict inclusion criteria, identified 21 primary studies. The literature is organized into three evolutionary eras, revealing that no existing approach simultaneously satisfies six key quality dimensions: automation, ambiguity handling, domain applicability, traceability, evaluation thoroughness, and hallucination control. The survey makes three main contributions: a three-era evolutionary synthesis of AI-based test generation; a six-criteria gap analysis showing no current approach fully addresses all quality dimensions; and four actionable research guidelines targeting hallucination, traceability, complexity sensitivity, and compliance.

[AI-65] IRAF: Interference-Resilient Adaptive Fusion for Noise-Robust End-to-End Full-Duplex Spoken Dialogue Systems

Link: https://arxiv.org/abs/2606.06559
Authors: Tao Zhong, Jiajun Deng, Nikita Kuzmin, Yinke Zhu, Tianxiang Cao, Tristan Tsoi, Zhili Tan, Simon Lui, Xunying Liu
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)

Abstract:Full-duplex spoken dialogue models allow voice agents to listen and speak concurrently, enabling natural interaction with real-time overlap. However, end-to-end dual-channel models that jointly encode user and agent streams may degrade in realistic acoustic environments: interfering speakers leaking into the user microphone can be encoded as part of the user query, corrupting the LLM’s conditioning and causing unstable turn-taking and reduced response quality. We propose Interference-Resilient Adaptive Fusion (IRAF), a lightweight, streaming-compatible module that modulates the contribution of user audio to the LLM frame by frame. IRAF predicts a scalar reliability gate from target-speaker and user audio embeddings and rescales user representations before fusion with agent embeddings. Experiments on MS-MARCO and InstructS2S-200K show consistent gains in response quality and full-duplex interaction under interfering-speaker conditions.

[AI-66] Multi-Scale Feature Attention Network for Polymer Classification using THz Dual-Comb Spectroscopy

Link: https://arxiv.org/abs/2606.06554
Authors: Roshni Mahtani, Ilán Carretero, Laura Monroy, Aldo Moreno-Oyervides, Oscar Elías Bonilla-Manrique, Rocío del Amor
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted in EUSIPCO’26

Abstract:Reliable polymer identification is essential for ensuring the quality and safety of recycled plastics, yet conventional sorting and spectroscopic techniques often struggle to deliver robust discrimination. Terahertz Dual-Comb Spectroscopy (THz-DCS) offers a promising alternative, providing rapid, high-resolution, and non-destructive measurements. In this work, we leverage THz-DCS to classify 12 types of polymers, including pure polymers, multilayer films, commercial blends, and biopolymers. To handle the complexity of these spectral signals, we propose the Multi-Scale Feature Attention Network (MSFAN), a novel deep learning architecture tailored for THz-DCS data. The framework integrates feature gating for signal recalibration and multi-scale parallel convolutions to capture diverse frequency patterns. These features are further refined through cross-feature attention and attention pooling, enabling the model to intrinsically highlight the most informative THz regions. MSFAN consistently outperforms state-of-the-art models, reaching a classification accuracy of 85.2%. This study demonstrates the potential of combining THz-DCS with deep learning techniques for effective, scalable, and interpretable polymer classification.

[AI-67] Geometric Second-Order Feature Correlation Learning for Self-Supervised Speech Emotion Recognition

Link: https://arxiv.org/abs/2606.06550
Authors: Shuanglin Li, Ruxiao Qian, Siyang Song
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)

Abstract:Self-supervised learning (SSL) yields powerful, context-rich representations for speech emotion recognition (SER), yet aggregating these representations into holistic descriptors remains a bottleneck. Conventional first-order aggregation implicitly assumes feature independence, which overlooks the latent Riemannian geometry and discards higher-order relationships essential to the representational power of the backbone. To address this problem, this paper proposes a novel Second-Order Correlation (SOC) layer. Instead of treating features in isolation, SOC models feature correlations as covariance descriptors to capture synergistic co-occurrence patterns, which serve as discriminative signatures for robust emotion recognition. By mapping these descriptors from the Riemannian manifold to a Euclidean tangent space through Log-Euclidean mapping (LEM), the proposed method preserves geometric integrity while enabling direct linear discriminative learning. Extensive experiments on the ESD and RAVDESS datasets demonstrate that SOC recovers discriminative information lost in first-order pooling and effectively aggregates high-dimensional SSL features.
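
The second-order pooling idea in this abstract can be illustrated with a generic covariance descriptor followed by a Log-Euclidean map. This is a minimal sketch, not the paper's SOC layer: the function name `log_euclidean_descriptor`, the regularizer `eps`, and the upper-triangle vectorization are assumptions made for illustration.

```python
import numpy as np

def log_euclidean_descriptor(frames: np.ndarray, eps: float = 1e-4) -> np.ndarray:
    """Map (T, D) frame-level features to a Euclidean vector of length D*(D+1)/2.

    Hypothetical helper: covariance pooling followed by a matrix logarithm.
    """
    # Second-order statistics: covariance of the D feature dimensions over time.
    centered = frames - frames.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / max(frames.shape[0] - 1, 1)
    # Regularize so the matrix is strictly positive definite (assumed eps).
    cov = cov + eps * np.eye(cov.shape[0])
    # Matrix logarithm via eigendecomposition: V diag(log lambda) V^T.
    eigvals, eigvecs = np.linalg.eigh(cov)
    log_cov = (eigvecs * np.log(eigvals)) @ eigvecs.T
    # Vectorize the upper triangle; off-diagonal entries are scaled by sqrt(2)
    # so the vector norm equals the Frobenius norm of log_cov.
    rows, cols = np.triu_indices(log_cov.shape[0])
    scale = np.where(rows == cols, 1.0, np.sqrt(2.0))
    return log_cov[rows, cols] * scale
```

The resulting vector lives in a flat space, so a plain linear classifier can be applied to it, which is the property the abstract attributes to the tangent-space mapping.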

[AI-68] FAIR-Calib: Frontier-Aware Instability-Reweighted Calibration for Post-Training Quantization of Diffusion Large Language Models ICML2026

Link: https://arxiv.org/abs/2606.06547
Authors: Haoyu Huang, Linlin Yang, Sheng Xu, Boyu Liu, Guodong Guo, Zhongqian Fu, Hang Zhou, Baochang Zhang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted as a poster at the 43rd International Conference on Machine Learning (ICML 2026)

Abstract:Diffusion Large Language Models (dLLMs) refine tokens iteratively but commit them irreversibly, leading to a “stability lag” where early decisions remain fragile even after being written. We reveal that Post-Training Quantization (PTQ) error easily flips these borderline decisions at the write frontier, which are then permanently locked in and amplified. To address this, we propose Frontier-Aware Instability-Reweighted Calibration (FAIR-Calib), a two-stage PTQ framework for dLLMs. Stage I probes a full-precision teacher to estimate a position prior that combines frontier hits and masked-stage reliability. Stage II performs off-policy, layer-wise calibration by minimizing a reweighted hidden-state MSE, effectively prioritizing the protection of fragile frontier states without requiring expensive end-to-end diffusion rollouts. We further theoretically justify our weighted objective as a surrogate for output KL divergence. Empirically, FAIR-Calib consistently outperforms state-of-the-art baselines on LLaDA and Dream (W4A4), significantly reducing frontier decision flips and suppressing post-commit mismatches across diverse benchmarks.

[AI-69] Queen-Bee Agents: A BeeSpec-Centered Architecture for Governed Enterprise MCP Orchestration

Link: https://arxiv.org/abs/2606.06545
Authors: Dutao Zhang, Liaotian
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Technical report. Prototype-level systems evidence; 59 enterprise-style tasks

Abstract:Enterprise agent systems increasingly need to connect large language models to private tools, internal knowledge, and Model Context Protocol (MCP) interfaces. In this setting, raw task capability is insufficient: organizations also require policy enforcement, tenant-scoped isolation, and execution that remains within explicit operational boundaries. We present Queen-Bee, a governed multi-agent architecture in which a Queen control plane retrieves capabilities, plans task-scoped execution, and compiles a structured BeeSpec that is executed by specialized Bee agents under constrained tool access. We implement a working prototype with tenant-scoped MCP connectors, audit-backed execution-time governance, retrieval-driven weak incubation, and multiple provisioning backends. We evaluate the system on 59 enterprise-style tasks spanning governance-sensitive requests, retrieval-driven provisioning, scoped local execution, and chemistry workflow integration. The retrieval-driven Queen-Bee variant achieves a task success rate of 0.964, zero governance failures, and substantially better scoped execution quality than both a static Queen-Bee baseline and a permissive single-agent baseline. We further show a multi-Bee chemistry workflow with explicit approval gating and a concrete top-3 shortlist grounded in real upstream evidence and screening artifacts. Additional comparisons with hybrid retrieval and LLM-guided provisioning show that richer provisioning backends are viable but do not outperform the lightweight structured retriever on the current small, highly structured capability registry. The results provide prototype-level systems evidence rather than a production deployment study, and suggest that enterprise agent platforms should be evaluated not only by capability, but also by governed provisioning, isolation behavior, scoped execution quality, and artifact-aware workflow coordination.

[AI-70] CARVE-Q: Quantum-Proposed Classically Certified Interactive Driving Repair

Link: https://arxiv.org/abs/2606.06531
Authors: Yifan Wang
Categories: Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
Comments: 9 pages, 3 figures

Abstract:The critical question after a correct driving veto is not only whether a maneuver is unsafe, but whether the blocked interaction admits a lawful, auditable, and responsibility-bounded repair. Prediction and game-theoretic planners can suggest plausible cooperation, yet they do not return a proof that the repair respects hard rules, right-of-way, cost allocation, and ego fallback. We introduce CARVE, Certified Affordable Repair of Vetoed maneuvers via Envelopes, a certificate architecture for prediction-free interactive repair. Given a vetoed maneuver, CARVE constructs a finite repair lattice and emits a structured certificate recording the binding rule, selected joint repair, right-of-way-scaled cooperation envelope, responsibility-weighted cost split, and ego-only fallback. This certificate view reveals the algorithmic bottleneck: multi-owner repair induces a product lattice $M = \prod_j |\mathcal{A}_j|$. We therefore introduce CARVE-Q, a verifier-shielded quantum-AI search layer that applies quantum minimum finding only to this black-box lattice while leaving all safety authority classical. In the conservative verifier-oracle model, exact classical minimum finding requires $\Theta(M)$ queries in the worst case, whereas Dürr-Høyer/Grover minimum finding uses $O(\sqrt{M})$ oracle queries with high probability. We prove verifier-shielded certificate soundness, priority non-elicitation, black-box query separation, and finite-precision reversible-oracle constructibility. We then demonstrate state-vector minimum finding on CARVE repair oracles up to 65,536 assignments and validate certificate preservation on Lanelet2-grounded INTERACTION replay with 100% right-of-way respect, 100% blame consistency, and zero priority false positives. The result is a trust-bounded quantum-AI pattern for certified autonomy: quantum proposes; CARVE certifies.
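
To make the query separation in this abstract concrete, the sketch below computes the product-lattice size M from per-owner action-set sizes and compares the Θ(M) and O(√M) scalings. The action-set sizes (four owners with 16 options each) are hypothetical, chosen only because they reproduce the 65,536 assignments mentioned above; constant factors and success probabilities are ignored.

```python
import math

def lattice_size(action_counts):
    """M = product of the per-owner action-set sizes |A_j|."""
    return math.prod(action_counts)

def query_scaling(action_counts):
    """Return (M, sqrt(M)): the classical and quantum query scalings.

    Constants are omitted, so these are orders of growth, not exact counts.
    """
    m = lattice_size(action_counts)
    return m, math.sqrt(m)

# Hypothetical example: 4 owners, 16 candidate repairs each.
classical, quantum = query_scaling([16, 16, 16, 16])
print(classical, quantum)
```

Adding one more owner with 16 options multiplies M by 16 but √M only by 4, which is why the gap widens as more parties join the repair.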

[AI-71] Attack Selection in Agentic AI Control Evaluations Meaningfully Decreases Safety

Link: https://arxiv.org/abs/2606.06529
Authors: Catherine Ge-Wang, Tyler Crosse, Benjamin Hadad IV, Joachim Schaeffer, Ram Potham, Tyler Tracy
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:An attacker that strategically chooses when to attack is much harder to catch than one that attacks indiscriminately. AI control is a safety framework for deploying capable but untrusted AI agents under the oversight of a weaker, trusted monitor and a limited human audit budget. Control evaluations stress-test these protocols by pitting a red-team attack policy against the blue-team monitor, but current evaluations typically assume attackers that do not strategically select when to attack. We study this capability, attack selection, in agentic settings by decomposing attack decisions into a start policy, which decides when an attacker should attack, and a stop policy, which decides when an attacker should abort an ongoing attack. Across two agentic settings, BashArena and LinuxArena, both policies substantially lower measured empirical safety without changing the underlying attack capability. At a 1% audit budget, our start policy reduces safety by 20pp on both BashArena and LinuxArena, and our stop policy reduces safety by 20pp on BashArena and 28pp on LinuxArena. These reductions should be interpreted as upper bounds on the effect of attack selection. Existing control evaluations may therefore yield overly optimistic safety estimates against selective attackers. We recommend that future evaluations, system cards, and safety cases elicit attack selection to produce more realistic safety estimates.

[AI-72] CrowdMath: A Dataset of Crowdsourced Mathematical Research Discussions

Link: https://arxiv.org/abs/2606.06526
Authors: Sherin Muckatira, Jesse Geneson, Slava Gerovitch, Pavel Etingof, Mikhail Gronas, Anna Rumshisky
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 16 pages, 4 figures

Abstract:Large language models have made substantial progress on mathematical reasoning, but existing benchmarks typically evaluate well-specified problems with final answers, step-by-step solutions, or complete proofs. They do not capture collaborative open-problem solving: a setting in which participants propose partial arguments, identify gaps or errors in prior steps, repair flawed reasoning, and gradually synthesize incremental contributions into a proof. We introduce CrowdMath, a dataset of 164 expert-annotated progress chains from the MIT PRIMES–Art of Problem Solving (AoPS) CrowdMath program (2016-2025), a collaborative research initiative whose discussions have led to peer-reviewed publications. Each chain traces a multi-participant forum discussion from an open-problem statement to a completed proof. Posts are labeled by their functional roles in the evolving solution process, including partial progress, proof completion, erroneous reasoning, and error identification. We define evaluation tasks and benchmark six frontier models. Models achieve 83-88% accuracy on next-post prediction, suggesting that they can follow the local flow of mathematical discussion. However, they struggle to identify the functional significance of individual contributions with the best model achieving only 0.42 macro-F1 on post-role classification. CrowdMath exposes a gap between solving well-specified mathematical problems and understanding collaborative mathematical progress as it unfolds.

[AI-73] Agentic Large Language Models for Automated Structural Analysis of 3D Frame Systems

Link: https://arxiv.org/abs/2606.06525
Authors: Ziheng Geng, Ian Franklin, Santiago Martinez, Jiachen Liu, Yunhe Zhao, Minghui Cheng
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI)

Abstract:Large language models (LLMs) have emerged as powerful foundation models with strong reasoning capabilities across domains. Beyond reactive text generation, agentic LLMs enable autonomous workflow execution through modular task decomposition and coordinated tool use. In structural engineering, recent efforts have developed agentic LLMs for automated analysis of plane frames. However, their extension to 3D frames remains underexplored due to challenges in irregular geometric representation, topological consistency, and long-horizon reasoning. This paper proposes an agentic LLM framework for automated structural analysis of 3D frames from natural language inputs. Irregular 3D frames are represented by projection onto a 2D plan, where orthogonal gridlines define spatial coordinates and a matrix of number of stories encodes vertical extrusion of each grid cell. Building on this representation, the framework establishes a multi-agent pipeline: a problem analysis agent parses input into structured JSON; a floor decomposition agent derives the spatial layout of each floor; the 3D geometry is assembled by node, girder, slab, and column agents; support and load agents assign boundary and loading conditions, and code translation agents generate executable SAP2000 script. Evaluated on ten representative 3D frames, the proposed framework achieves an average accuracy of 90% across repeated trials, demonstrating consistent and reliable performance.
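
The plan-projection representation described in this abstract (orthogonal gridlines plus a matrix of story counts per grid cell) can be sketched as follows. This is an illustrative reading, not the paper's implementation: the function name `frame_nodes`, the uniform `story_height`, and the rule that every corner of an occupied cell gets a node at each level are assumptions.

```python
def frame_nodes(x_lines, y_lines, stories, story_height):
    """Return sorted (x, y, z) node coordinates for an irregular 3D frame.

    stories[i][j] is the number of stories above the grid cell bounded by
    x_lines[i]..x_lines[i+1] and y_lines[j]..y_lines[j+1] (hypothetical layout).
    """
    nodes = set()
    for i in range(len(x_lines) - 1):
        for j in range(len(y_lines) - 1):
            if stories[i][j] == 0:
                continue  # empty cell: nothing is extruded here
            # Level 0 is the base; each story adds one more level of nodes.
            for level in range(stories[i][j] + 1):
                z = level * story_height
                for xi in (i, i + 1):
                    for yj in (j, j + 1):
                        nodes.add((x_lines[xi], y_lines[yj], z))
    return sorted(nodes)

# Two cells along x: the first has 2 stories, the second has 1.
nodes = frame_nodes([0, 5, 10], [0, 4], [[2], [1]], 3)
print(len(nodes))
```

Using a set means corners shared by neighboring cells are created once, which is one simple way to keep the topology consistent when cells have different heights.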

[AI-74] Lean4Agent: Formal Modeling and Verification for Agent Workflow and Trajectory

Link: https://arxiv.org/abs/2606.06523
Authors: Ruida Wang, Jerry Huang, Pengcheng Wang, Xuanqing Liu, Luyang Kong, Tong Zhang
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Software Engineering (cs.SE)

Abstract:Equipping Large Language Models (LLMs) to execute reliable multi-step workflows has become a central challenge in artificial intelligence. Despite recent advances in LLMs’ agentic capabilities, most agent systems still lack formal methods for specifying, verifying, and debugging their workflows and execution trajectories. This challenge mirrors a long-standing problem in mathematics, where the ambiguity of natural languages (NLs) motivates the development of formal languages (FLs). Inspired by this paradigm, we propose Lean4Agent, to the best of our knowledge the first framework that uses Lean4, a dependent-type FL, to model and verify agent behavior. Lean4Agent launches FormalAgentLib, an extensible Lean4 library for formally modeling and verifying agent workflows’ semantic consistency under explicit assumptions, and enabling localization of execution-time failures revealed by trajectories. Building on FormalAgentLib, we further develop LeanEvolve, which applies results in FormalAgentLib to revise workflows and enhance their capability. Extensive experiments on a hard problem subset of SWE-Bench-Verified and a subset of ELAIP-Bench across 5 leading LLMs indicate that the verification-passing workflows outperform the failing ones by an average of 11.94%, and LeanEvolve further improves SWE performance by 7.47% on average. Furthermore, Lean4Agent establishes a foundation for a new field of using expressive dependent-type FL to formally model and verify agent behavior.

[AI-75] P-Cast Precision in FP8 Attention: Sink-Induced Collapse and the Optimality of S=2^8

Link: https://arxiv.org/abs/2606.06521
Authors: Reed Lau
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Performance (cs.PF)
Comments: 8 pages, 3 figures, 3 tables, 1 algorithm. Technical note on FP8 E4M3 P-cast precision

Abstract:FP8 (E4M3) acceleration for attention computation offers significant throughput gains, but the 3-bit mantissa introduces precision challenges when the softmax probability matrix P is cast to FP8 before the P*V matrix multiplication. We analyze two implementation choices that affect output precision under the Attention Sink phenomenon: (1) the KV block iteration order, and (2) the static scaling factor applied to P before casting. We show that forward KV iteration causes “P-collapse” – to leading order, a fraction Phi(Delta + delta_k - 6.93 - ln S) of non-sink P values underflow to zero, where the small shift delta_k ~ 1 (for k_sink = 4) is the expected within-sink-block score maximum – and that reverse iteration removes it, with a zero-underflow guarantee when reverse is combined with S = 256. We further give a constructive characterization of S = 256 = 2^8 as the static scale that simultaneously satisfies (i) bit-exact IEEE 754 scaling, (ii) the lower envelope of a sawtooth function dp(S) over the E4M3 number line (dp = 2^-4, the minimum worst-case quantization step), and (iii) the maximum normal-range coverage among bit-exact (2^k) scales (a non-bit-exact scale such as 448 attains slightly higher coverage). Both optimizations are already deployed in FlashAttention-3/4 on engineering grounds; our contribution is a quantitative account of why these choices are good and a closed-form threshold Delta_c = 6.93 + ln S - delta_k for predicting kernel-level precision loss. Kernel-faithful experiments (Q, K, V in FP32 to isolate the P-cast effect) show 3-10x MSE improvement at moderate sink strengths, and paired tests confirm both fixes saturate to the same precision floor when combined.
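
The closed-form threshold quoted in this abstract, Delta_c = 6.93 + ln S - delta_k, is easy to evaluate numerically. The sketch below does so and also evaluates the leading-order underflow fraction. Two points are assumptions rather than statements from the abstract: `Phi` is taken to be the standard normal CDF, and the function names are made up for illustration. The constant 6.93 is used exactly as quoted; it is numerically close to 10 ln 2.

```python
import math

def collapse_threshold(scale, delta_k=1.0):
    """Delta_c = 6.93 + ln S - delta_k, as quoted in the abstract."""
    return 6.93 + math.log(scale) - delta_k

def underflow_fraction(delta, scale, delta_k=1.0):
    """Leading-order fraction Phi(Delta + delta_k - 6.93 - ln S).

    Assumption: Phi is the standard normal CDF.
    """
    x = delta + delta_k - 6.93 - math.log(scale)
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

for s in (1.0, 256.0):
    print(s, round(collapse_threshold(s), 3))
```

A larger static scale S raises the threshold by ln S, so a stronger sink is needed before non-sink probabilities start underflowing; at Delta equal to Delta_c the predicted fraction is one half under the stated assumption.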

[AI-76] SafeGene: Reusable Adapters for Transferable Safety Alignment

Link: https://arxiv.org/abs/2606.06519
Authors: Yanghan Wang, Zhiqiang Kou, Fu Feng, Jing Wang, Xin Geng
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Open-weight LLMs are increasingly fine-tuned into customized assistants, but downstream fine-tuning can weaken safety alignment and make models more vulnerable to malicious prompts, even when the training data is not intentionally harmful. This creates a recurring safety recovery problem as target models are repeatedly updated with new task data or user interactions. We propose SafeGene, a reusable safety-adapter module designed for cross-task reuse within each architecture-compatible model family. Rather than treating safety recovery as a model-specific repair step, SafeGene treats safety capability as an independent, reusable adapter representation decoupled from task-specific updates. This representation is obtained from aligned–degraded model discrepancies, refined into task-transferable safety vectors through data-aware layer selection, and expressed in each downstream task-adapted model via few-shot layer-wise coefficient recalibration. Experiments across multiple model families, downstream tasks, and safety judges show that SafeGene-enhanced models reduce harmful response rates while maintaining downstream performance, outperforming representative safe adaptation methods in safety–utility trade-off.

[AI-77] DiBS: Diffusion-Informed Branch Selection

Link: https://arxiv.org/abs/2606.06518
Authors: Bo Liu, Yuan Xie, Yuan Gao, Xiaolong Luo, Peng Ye, Tao Chen, Fujun Han
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 12 pages, 6 figures, 3 tables

Abstract:Sudoku is a representative constraint satisfaction problem that requires global structural reasoning under strict discrete constraints. Existing work on solving Sudoku follows two dominant approaches: traditional heuristic solvers and deep learning solvers. However, they suffer from two complementary limitations: learning-based solvers lack hard correctness guarantees, while complete symbolic solvers are still prone to long-tail search. To address these shortcomings, we propose a novel diffusion model-guided approach, termed DiBS, for branch selection during search. Specifically, DiBS keeps the symbolic solver complete and uses the diffusion model as a branch-ordering guide. The core method ranks candidate values given the current partial assignment and a lightweight consistency signal. Furthermore, we provide a theoretical analysis of how and why it works. Experiments on the challenging Royle 17-clue Sudoku benchmark show that DiBS substantially reduces search cost relative to strong heuristic baselines, especially in nodes, backtracks, and long-tail percentiles. In addition, these results confirm that learned global guidance is effective on hard instances where branch-order mistakes are most expensive. All code is available at this https URL.

[AI-78] DxPTA: An Architecture Design Space Exploration with Optical Dataflow-guided Strategy for HW/SW Co-Design of Photonic Transformer Accelerators

Link: https://arxiv.org/abs/2606.06515
Authors: Rachmad Vidya Wicaksana Putra, Solomon Micheal Serunjogi, Mahmoud Rasras, Muhammad Shafique
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments: 8 pages, 12 figures

Abstract:Transformer-based networks have emerged as prominent AI models with state-of-the-art performance, which potentially pave the way toward artificial general intelligence (AGI). However, their large sizes still hinder their efficient implementation, highlighting the need for alternative solutions that enable energy-efficient acceleration. Recent state-of-the-art works have proposed photonic transformer accelerators (PTAs) with significant speedup and energy efficiency improvements over conventional electronic accelerators. However, their PTA architectures are developed without considering the application constraints (e.g., area, power, energy, and latency). Moreover, their manual design approach requires substantial design time to determine a suitable architecture for the targeted application, so it does not scale. To address these limitations, we propose DxPTA, a novel design space exploration methodology for enabling efficient hardware/software co-design of the appropriate PTA architecture that meets all constraints. It is achieved by (1) identifying the PTA architecture parameters based on the coherent optical dataflow; (2) analyzing the impact/significance of the parameters; and (3) leveraging this analysis for devising a constraint-aware architecture search algorithm. Experimental results show that DxPTA can find the appropriate PTA architectures for different transformer-based models (i.e., DeiT-T/S/B and BERT-B/L). It achieves up to 26mm^2 area, 4.8W power, 39mJ energy, and 6ms latency, for constraints of 50mm^2 area, 5W power, 50mJ energy, and 10ms latency; with 15.2x faster searching time than the exhaustive approach. These results demonstrate the potential of the DxPTA methodology for enabling efficient PTA designs for diverse AGI-based applications.

[AI-79] Detecting and Mitigating Bias by Treating Fairness as a Symmetry Operation

Link: https://arxiv.org/abs/2606.06514
Authors: Nishit Singh
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages, 7 figures

Abstract:Machine learning systems deployed in high stakes socioeconomic settings routinely display bias. We formalize bias as a symmetry breaking operation: a classifier is fair if its outputs remain invariant under the counterfactual operation of switching a sensitive attribute, with merit features held fixed. We implement loss based regularization as a symmetry restoring mechanism and evaluate the framework on four synthetic datasets with varying levels of noise, correlation, and bias. The framework achieves upwards of 90% violation reduction, with accuracy costs around 5%. This framework does not require causal graph knowledge, is computationally lightweight, and generalizes to any sensitive attribute definable as a bit-flip, making it suitable for contexts where local sources of discrimination remain absent from mainstream benchmarks.
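
The symmetry view in this abstract (outputs should not change when only the sensitive bit is flipped) suggests a simple penalty term. The sketch below is one possible form, not the paper's: the logistic model, the squared-difference penalty, the weight `lam`, and all function names are assumptions, since the abstract only says that loss-based regularization is used.

```python
import math

def predict(weights, bias, merit, sensitive_bit):
    """Logistic score from merit features plus one sensitive bit (0 or 1)."""
    z = bias + weights["sensitive"] * sensitive_bit
    z += sum(w * x for w, x in zip(weights["merit"], merit))
    return 1.0 / (1.0 + math.exp(-z))

def symmetry_penalty(weights, bias, rows):
    """Mean squared change in output when only the sensitive bit is flipped."""
    total = 0.0
    for merit, bit in rows:
        original = predict(weights, bias, merit, bit)
        flipped = predict(weights, bias, merit, 1 - bit)
        total += (original - flipped) ** 2
    return total / len(rows)

def regularized_loss(weights, bias, rows, labels, lam):
    """Binary cross-entropy plus lam times the symmetry penalty (assumed form)."""
    eps = 1e-12
    bce = 0.0
    for (merit, bit), y in zip(rows, labels):
        p = predict(weights, bias, merit, bit)
        bce -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return bce / len(rows) + lam * symmetry_penalty(weights, bias, rows)
```

A model whose sensitive weight is zero has zero penalty, so minimizing the regularized loss pushes the classifier toward invariance under the bit flip while the cross-entropy term keeps it accurate on the merit features.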

[AI-80] FP8 is All You Need (Part 1): Debunking Hardware FP64 as the HPC Holy Grail

Link: https://arxiv.org/abs/2606.06510
Authors: Satoshi Matsuoka
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
Comments: There is a companion Part (2) paper focusing on Ozaki-style FFT

Abstract:Conventional HPC dogma holds that native hardware FP64 silicon is the irreducible foundation of scientific computing – the “holy grail” of double-precision simulation. This paper argues the dogma is wrong: on AI-optimised GPUs of the B300 generation and beyond, abundant FP8 tensor throughput combined with the Chinese Remainder Theorem-based Ozaki Scheme II recovers memory-roof execution at full FP64 accuracy across the canonical HPC kernel spectrum. NVIDIA’s Blackwell Ultra (B300) collapses native FP64 to ~1.3 TFLOPS – a 31x regression from the B200 – rendering even memory-bound kernels (SpMV, GEMV, stencils) compute-bound. We make four contributions. First, a unified analytic model, the Tensor-Memory Equilibrium (TME) model, augmenting the Roofline with a compute multiplier alpha, a bandwidth multiplier beta, and a reconstruction latency gamma. Second, we identify register-level fusion as the mechanism driving beta -> 1, making emulation essentially free behind the memory wall. Third, we project that Ozaki II vaults emulated FP64 from the ~1 TFLOPS native floor to ~500 TFLOPS (B300) and ~400 TFLOPS (Rubin R200), exceeding even B200’s native FP64 ceiling by over an order of magnitude in the compute-bound regime while matching the memory roof in the bandwidth-bound regime. Fourth, against an H100 baseline, Ozaki II matches or exceeds H100 on every workload studied, versus the up-to-50x regression that B300 native FP64 imposes. Combined with a companion FFT analysis (Kulisch fixed-point reconstruction on the surviving INT32 pipe) and FP32+Kahan reductions reported in the companion Part (2) paper, every surveyed kernel class on B300 reaches the memory roof at full FP64. The evidence supports the title’s claim: FP8, with Ozaki II and Kulisch escape routes, is all one needs for production HPC; native FP64 silicon is no longer the holy grail it has been taken to be.

[AI-81] On the Stability and Realizability of Recurrent Polynomial Surrogate Ternary Logic Gate Networks

Link: https://arxiv.org/abs/2605.24649
Authors: Sai Sandeep Damera, Ryan Matheu, Aniruddh G. Puranic, John S. Baras, Calin Belta
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Systems and Control (eess.SY)
Comments: 9 pages, 3 figures. This work has been submitted to the IEEE for possible publication

Abstract:Recurrent Neural Networks (RNNs) can learn to predict Signal Temporal Logic (STL) verdicts online from partial trajectories, but deploying them as runtime monitors in safety-critical systems demands more than predictive accuracy. Standard RNN architectures offer no structural guarantee that outputs degrade gracefully under sensor degradation; a dropped input can silently flip a verdict from safe to unsafe. We introduce the Recurrent Differentiable Ternary Logic Gate Network (R-DTLGN), a recurrent architecture that operates over Kleene’s three-valued logic $\{-1, 0, +1\}$, where 0 explicitly represents unknown. The R-DTLGN trains through continuous polynomial surrogates and hardens to a discrete ternary logic circuit at inference. We analyze the hardened circuit through two gate vocabularies derived from two orderings on the ternary domain: numerically monotone gates ensure stable recurrent dynamics, while information-monotone gates, when present, guarantee principled abstention (unknown inputs never produce wrong outputs) and monotonicity in input certainty (more information can only improve the verdict). We show that the recurrent connections required by bounded STL operators use exclusively AND and OR, which belong to both vocabularies, linking the monitoring task to the architecture’s guarantees. A realizability bound derived from the STL formula’s temporal operators directly sizes the network’s hidden state, replacing hyperparameter search with a formula-driven specification. We evaluate on STL specifications over D4RL PointMaze navigation data, testing prediction accuracy, degradation under predicate dropout, and the accuracy-versus-safety tradeoff between two label construction pipelines. The R-DTLGN is, to our knowledge, the first recurrent architecture that couples learned temporal prediction with formal degradation guarantees rooted in three-valued logic.
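
The two gate orderings mentioned in this abstract can be checked by brute force on Kleene's strong three-valued connectives, where AND is `min`, OR is `max`, and NOT is negation over the values -1, 0, +1. The sketch below is a generic illustration, not the R-DTLGN itself; the helper names `refines` and `is_information_monotone` are made up for this example.

```python
from itertools import product

VALUES = (-1, 0, 1)  # false, unknown, true

def k_and(a, b):
    return min(a, b)

def k_or(a, b):
    return max(a, b)

def k_not(a):
    return -a

def refines(x, y):
    """True if y carries at least the information of x (x is unknown or equal)."""
    return x == 0 or x == y

def is_information_monotone(gate):
    """Resolving unknown inputs must never flip an already definite output."""
    for a, b, a2, b2 in product(VALUES, repeat=4):
        if refines(a, a2) and refines(b, b2):
            if not refines(gate(a, b), gate(a2, b2)):
                return False
    return True

def optimistic_gate(a, b):
    """Counterexample: treats an unknown first input as true."""
    return 1 if a == 0 else a

print(is_information_monotone(k_and), is_information_monotone(k_or))
print(is_information_monotone(optimistic_gate))
```

AND and OR pass the check, which matches the abstract's statement that they belong to the information-monotone vocabulary, while a gate that guesses a value for an unknown input fails it.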

[AI-82] A Temporal Spatial Minimax Rate for Smoothly-Varying Distributions in Wasserstein Space

Link: https://arxiv.org/abs/2606.07325
Authors: Munsik Kim
Categories: Statistics Theory (math.ST); Artificial Intelligence (cs.AI); Information Theory (cs.IT)

Abstract:We study the minimax rate of estimating a future value $\mu_{t_n+h}$ of a curve $t \mapsto \mu_t$ in the 2-Wasserstein space $\mathcal{P}_2(\mathbb{R}^d)$ from finitely many noisy snapshots of its past, under an adiabatic bound $|\nabla_t^k v| \le \varepsilon$ on the $k$-th covariant derivative of the velocity field. Our central result is a unified temporal-spatial minimax lower bound: over regular, locally transport-rich subclasses, every estimator incurs $W_2$-risk with $M$-exponent $\gamma_d(k+1)/(k+1+\gamma_d)$, $\gamma_d = \min(1/d, 1/2)$ ($M$ the total sample size). It follows from a temporal-to-spatial reduction: the smoothness budget defines a reachable $W_2$-ball into which a transport packing is embedded along the time axis, and the information of the entire snapshot experiment is controlled by a Fano argument – the spatial packing is classical, but its smoothness-admissible temporal embedding and the full-window analysis are new. The bound interpolates a dimension-free extrapolation floor of order $\varepsilon h^{k+1}$ – the irreducible cost of an unobserved future, present even with the exact past – and the spatial estimation curse $M^{-\gamma_d}$, recovering the static distribution-estimation rate as $k \to \infty$. We state the lower bound in a design-dependent form – with a design-weighted effective sample size – valid for arbitrary observation times, and obtain the closed-form exponent in the dense (equispaced) regime. The matching upper bound is established at $k = 0$ (rate $M^{-1/(d+1)}$, $d \ge 3$) and, in a translation submodel, for all $k$; for $k \ge 1$ a covariant estimator attains the rate conditionally on two estimates (a comparison-geometry bias bound and an optimal-transport map-estimation rate), leaving the unconditional general-$k$ upper bound as an open problem. Numerical experiments on synthetic curved and flat families corroborate the predicted exponents.
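
The exponent stated in this abstract, gamma_d (k+1) / (k+1+gamma_d) with gamma_d = min(1/d, 1/2), can be evaluated exactly with rational arithmetic, which makes the two limiting cases mentioned in the text easy to confirm. The function names below are chosen for illustration only.

```python
from fractions import Fraction

def gamma_d(d):
    """Spatial exponent gamma_d = min(1/d, 1/2)."""
    return min(Fraction(1, d), Fraction(1, 2))

def rate_exponent(k, d):
    """M-exponent gamma_d (k+1) / (k+1+gamma_d) from the abstract."""
    g = gamma_d(d)
    return g * (k + 1) / (k + 1 + g)

# k = 0 reproduces the exponent 1/(d+1) quoted for d >= 3.
for d in (3, 4, 5):
    print(d, rate_exponent(0, d))

# As k grows, the exponent approaches the static rate gamma_d.
print(float(rate_exponent(1000, 3)), float(gamma_d(3)))
```

At k = 0 and d at least 3 the formula reduces to (1/d) / (1 + 1/d) = 1/(d+1), matching the rate given for the matching upper bound, and for large k the exponent tends to gamma_d, the static distribution-estimation rate.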

[AI-83] SpectCount: Spectrotemporal Counting via Synthetic Signals Improves Large Audio Language Models

Link: https://arxiv.org/abs/2606.06907
Authors: Seonuk Kim,Yonghyeon Jun,Ju Yeon Kang,Jimin Hong,Yoonhyeong Lee,Nam Soo Kim
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments: 5 pages, 5 figures

Abstract:Large audio language models (LALMs) extend large language models with an audio encoder and large-scale audio data. However, the scarcity of high-quality annotated audio data remains a fundamental bottleneck for scaling. Through probing signal detectability analysis, we identify fine-grained spectrotemporal perceptual weaknesses in a foundation LALM. To address these challenges, we propose Spectrotemporal Counting (SpectCount), a data-efficient fine-tuning approach based on fully synthetic audio signals generated on-the-fly, without relying on real-world audio, annotations, or pretrained generative models. SpectCount not only resolves the observed weaknesses but also improves performance on diverse auditory benchmarks spanning sound, music, and speech, unseen during fine-tuning. These results suggest that weakness-targeted synthetic signals provide a data-efficient path toward enhanced auditory understanding capabilities in LALMs.

[AI-84] Generalization in Deep Neural Networks: Minimax Rates for Gradient Methods

Link: https://arxiv.org/abs/2606.06772
Authors: Junyu Zhou,Puyu Wang,Yunwen Lei,Marius Kloft,Yiming Ying
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 37 pages

Abstract:Understanding the generalization performance of over-parameterized neural networks has become a central topic in deep learning theory. While recent advances, particularly works under the Neural Tangent Kernel (NTK) regime, have shed light on the behavior of shallow architectures, the statistical generalization properties of deep neural networks (DNNs), especially in regression tasks, remain far less understood. In this paper, we make significant progress toward closing this gap by providing a comprehensive generalization analysis of DNNs trained using gradient-based methods. First, we establish, for the first time, a crucial connection between the learning dynamics of a DNN with smooth activation functions trained via gradient-based methods and those of kernel methods, showing that gradient-based methods on over-parameterized DNNs can fully inherit the favorable learning dynamics of their kernel counterparts. Building on this connection and the well-established optimality of kernel methods, we derive the first known minimax-optimal rates for the excess population risk of both gradient descent (GD) and stochastic gradient descent (SGD), under the assumption that network width scales polynomially with the sample size. Our results demonstrate that, with sufficient width, DNNs trained by GD or SGD can achieve generalization performance comparable to kernel-based methods.

[AI-85] Optimal Rates for Generalization of Gradient Descent Methods with Deep Neural Networks

Link: https://arxiv.org/abs/2606.06764
Authors: Junyu Zhou,Puyu Wang,Yunwen Lei,Yiming Ying,Ding-Xuan Zhou
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 39 pages, 1 table

Abstract:Recent progress has been made in understanding the statistical generalization performance of gradient descent methods for overparameterized neural networks within the neural tangent kernel (NTK) regime. However, most of the existing work on regression problems is limited to shallow network architectures, leaving a notable gap in the theory of deep neural networks. This paper addresses this gap by presenting a comprehensive generalization analysis for deep ReLU networks trained using gradient descent (GD) and stochastic gradient descent (SGD). Specifically, we establish the first known minimax-optimal rates of excess population risk for both GD and SGD with deep ReLU networks, under the assumption that the network width scales polynomially with respect to the network depth and training sample size. Our results demonstrate that with sufficient width, gradient descent methods for deep ReLU networks can achieve optimal generalization rates on par with kernel methods.

[AI-86] Coordinated optimization of departure sequencing and section-track allocation in railway short-term concentrated departure scenarios based on qubo and hybrid quantum algorithms

Link: https://arxiv.org/abs/2606.06543
Authors: Xiaobin Li,Yanbin Gao,Weiguang Wang,Xuechen Liang
Subjects: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
Comments:

Abstract:This study examines the coordinated optimization of departure sequencing and section-track allocation in railway short-term concentrated departure scenarios. A quadratic unconstrained binary optimization (QUBO) model is formulated to represent departure-position assignment and section-track selection within a unified binary framework. Because the quality of a dispatching scheme depends on time-dependent operational interactions that cannot be fully captured by a static combinatorial model, a simulation-based evaluation layer is introduced to assess section occupation, intermediate-station waiting, platform-capacity pressure, running-time fluctuations, and delay propagation. Within this layered framework, conventional heuristics, quantum-inspired algorithms, and hybrid algorithms are compared on the same decision structure. The results show that the QUBO model can generate feasible candidate schemes after decoding, while the simulation layer clearly differentiates the operational performance of the competing algorithms under both normal and disturbed conditions. In the tested scenarios, QPSO-QAOA performs best under normal conditions, and the quantum-enhanced methods reduce comprehensive cost by 4.28%–26.26% and total delay by 4.37%–24.25% on average under dynamic conditions relative to their conventional counterparts. These findings suggest that the integration of QUBO-based modeling and simulation-based evaluation provides a useful methodological framework for railway short-term concentrated departure scheduling, although validation with real operational data remains necessary.
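
To make the QUBO idea concrete, here is a minimal sketch of how a departure-position assignment can be written as a quadratic unconstrained binary problem. It is not the paper's model: the cost matrix, the penalty weight, and all function names are hypothetical, section-track selection is left out, and the tiny instance is solved by brute force rather than by any quantum or quantum-inspired method.

```python
import itertools

# Hypothetical delay cost of putting train i in departure position p.
COST = [[1, 5, 9],
        [6, 2, 7],
        [8, 4, 3]]
N = len(COST)
# Assumed penalty weight: larger than the best feasible total cost,
# so breaking a one-hot constraint never pays off.
PENALTY = 20


def var(i, p):
    """Flat index of binary variable x[i, p] (train i takes position p)."""
    return i * N + p


def build_qubo():
    """Return (Q, offset) with energy(x) = sum_{a<=b} Q[a][b]*x_a*x_b + offset."""
    size = N * N
    Q = [[0] * size for _ in range(size)]
    for i in range(N):
        for p in range(N):
            # Linear cost plus the -PENALTY term contributed by each of the
            # two one-hot constraints (x*x == x for binary variables).
            Q[var(i, p)][var(i, p)] += COST[i][p] - 2 * PENALTY
    for k in range(N):
        for a, b in itertools.combinations(range(N), 2):
            Q[var(k, a)][var(k, b)] += 2 * PENALTY  # train k in two positions
            Q[var(a, k)][var(b, k)] += 2 * PENALTY  # position k used by two trains
    return Q, 2 * N * PENALTY


def energy(x, Q, offset):
    """Evaluate the QUBO objective for a 0/1 tuple x."""
    n = len(x)
    return offset + sum(Q[a][b] * x[a] * x[b]
                        for a in range(n) for b in range(a, n))


def solve_brute_force():
    """Enumerate all 2^(N*N) bit strings; only viable for toy sizes."""
    Q, offset = build_qubo()
    best = min(itertools.product((0, 1), repeat=N * N),
               key=lambda x: energy(x, Q, offset))
    return best, energy(best, Q, offset)


def decode(x):
    """Position assigned to each train, read off the one-hot rows."""
    return [max(range(N), key=lambda p: x[var(i, p)]) for i in range(N)]


if __name__ == "__main__":
    solution, value = solve_brute_force()
    print("positions per train:", decode(solution), "cost:", value)
```

The one-hot constraints (each train gets exactly one position, each position exactly one train) are folded into the objective as squared penalties, which is what makes the problem "unconstrained". Decoding the minimum-energy bit string back into a schedule corresponds to the "feasible candidate schemes after decoding" step the abstract mentions.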

[AI-87] Attention Consistent Longitudinal Medical Visual Question Answering Guided by Vision Foundation Models CVPR2026

Link: https://arxiv.org/abs/2606.06534
Authors: Jialin Wu,Qianru Zhang,Georges El Fakhri,Xiaofeng Liu
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)
Comments: Accepted to CVPR 2026 Workshop PHAROS-AIF-MIH

Abstract:Longitudinal medical visual question answering (VQA) requires reasoning about anatomical differences between an image of a current time point and an image of a referred time point. We propose an attention-guided encoder-decoder for this task with chest X-rays. Instead of conventional direct contrast, we propose to include a lightweight affine registration module to reduce nuisance motion by co-registering the current image to the reference image with a small registration regularizer. The registered image pair is fed into the image encoder, followed by a frozen DINO-based mask generator and a trainable adaptive mask generator to produce masks applied to the original image pairs. The masked image pairs are again fed into the image encoder and concatenated with text features as the input to a multimodal transformer-based decoder to generate final answers. To facilitate learning stabilization and clarify the change signal, inspired by DINO-v3, we include additional auxiliary objectives, including a mask rebuilding loss, a pairwise Gram-style consistency loss, and a KoLeo uniformity loss, which enhances the geometry of the representation. On the Medical-Diff-VQA benchmark, the model delivers strong BLEU, ROUGE-L, CIDEr, and METEOR scores while offering intrinsic interpretability through the shared saliency mask. These results support saliency-conditioned generation with mild pre-alignment as a principled framework for longitudinal reasoning in medical VQA. Our training strategy also illustrates the potential of a paradigm in utilizing image foundation models in biomedicine: optimizing both supervised and unsupervised learning objectives simultaneously.

[AI-88] Which Anatomy Matters Under Limited Labels? A Data-Efficient Anatomy-Aware Benchmark for Cardiac Pathology Prediction ICML2026

Link: https://arxiv.org/abs/2606.06509
Authors: Himanshu Singh
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Tissues and Organs (q-bio.TO)
Comments: Accepted at ICML 2026 Workshop GlobalSouthML (Seoul, South Korea; PMLR 306, 2026)

Abstract:Numerous medical imaging problems must be solved under limited labels and constrained compute, yet it remains unclear whether performance gains are driven mainly by more expressive models or by better representation of clinically meaningful anatomy. We study this question through a low-data anatomy-aware benchmark for 5-class cardiac pathology prediction on the public ACDC MRI dataset. Using segmentation-derived patient descriptors from the right ventricle, myocardium, and left ventricle, we compare anatomy-specific and multi-structure representations across linear, kernel, and tree-based classifiers. We find that under limited label settings, representation dominates complexity. These results suggest that in resource-constrained healthcare settings, identifying and representing the most informative anatomy may matter more than the increasing complexity of the model alone.

[AI-89] Autonomous heterogeneous catalyst discovery with a self-evolving multi-agent digital twin

Link: https://arxiv.org/abs/2606.05050
Authors: Zhilong Song,Zongmin Zhang,Lixue Cheng
Subjects: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph)
Comments:

Abstract:Theoretical heterogeneous catalysis promises rapid catalyst discovery, yet computational and machine-learning predictions often deviate from experiment and stay confined to narrow material families, for want of a faithful, condition-aware catalytic simulator. We present CatDT (Catalysis Digital Twin), a self-evolving multi-agent system that builds an autonomous digital twin of a working catalyst, unifying gas-solid and liquid-solid modeling. From only a bulk crystal and a natural-language reaction description, eight specialized agents and 27 scientific tools predict stable facets, reconstruct working surfaces, enumerate and rank reaction pathways, locate transition states, and compute kinetics in 5-30 min on a single GPU. Two innovations address the hardest steps: UniMech finds dominant pathways for novel materials at over $10^3\times$ lower cost than exhaustive enumeration by fusing agent-guided proposals with energy-cached graph search, and a memory-augmented reinforcement loop raises barrier-calculation success from 41% to 84% across 600 catalytic surfaces. Across seven gas-solid benchmarks – stepped metals, single-atom catalysts, ordered intermetallics, vacancy-rich 2D sulfides and carbides, and a strong-metal–support-interaction (SMSI) interface – every CatDT prediction lies within 0.5-2 times experiment over four orders of magnitude. For propane dehydrogenation, CatDT independently discovers non-precious candidates rivaling the Pt-based industrial benchmark, with a proposed Ni@ZrO$_2$ SMSI overlayer reaching a simulated TOF of $1.63~\text{s}^{-1}$ at $\sim$100% selectivity. More broadly, the decisive factor for a faithful catalyst digital twin – or any multi-stage scientific simulator – is not raw LLM capability but the engineered harness around it: deterministic tools, persistent memory, and verified self-improvement that compound across models, tools, and runs.

Machine Learning

[LG-0] Accelerated Decentralized Stochastic Gradient Descent for Strongly Convex Optimization

Link: https://arxiv.org/abs/2606.07496
Authors: Ming Sun,Kun Yuan
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:

Abstract:Decentralized stochastic optimization is a fundamental paradigm for large-scale learning over networks, where agents communicate only with their neighbors and no central coordinator is required. For strongly convex problems, communication efficiency is mainly determined by the condition number $\kappa=L/\mu$ and the network spectral gap $1-\beta$. Although deterministic decentralized methods can simultaneously achieve accelerated $\sqrt{\kappa}$ and $1/\sqrt{1-\beta}$ dependences, no existing stochastic method attains both improvements at once. In this paper, we propose Multi-Gossip Accelerated DSGD (MG-ADSGD), a decentralized stochastic algorithm that combines Nesterov-type primal–dual extrapolation with multi-round fast gossip averaging. The key idea is to couple the gossip depth with the mini-batch size so that additional communication rounds simultaneously improve consensus accuracy and reduce gradient variance. We show that MG-ADSGD achieves the communication complexity $$\widetilde{\mathcal{O}}\!\left( \frac{\sigma^2}{\mu n\epsilon}\log\frac{1}{\epsilon} + \sqrt{\frac{\kappa}{1-\beta}}\log\frac{1}{\epsilon} \right),$$ where $\epsilon$ denotes the target accuracy, $n$ is the number of nodes, and $\sigma^2$ is the gradient variance. To the best of our knowledge, this bound yields the best currently available communication complexity for decentralized stochastic strongly convex optimization, up to logarithmic factors that are independent of $\epsilon$.

[LG-1] Second-Order Path Kernel Interpolation Formulas in Machine Learning

Link: https://arxiv.org/abs/2606.07495
Authors: Jin Guo,Roy Y. He,Jean-Michel Morel
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Understanding how training data shape neural network predictions is a central problem in modern learning theory. In 2020, Pedro Domingos proposed an interpolation formula valid for every model learned by deterministic gradient descent. It expresses the model’s prediction as an integral, along the optimization path, of a data-dependent kernel that aligns the model’s gradients at the test and training data. Such a first-order characterization remains valid for models trained with batch-based stochastic optimization. In this paper, we develop second-order forms of these interpolation formulas. We show that the leading path-kernel interpolation is supplemented by a curvature-weighted interpolation term. For stochastic gradient descent, an additional sampling-induced component appears, coupling the curvature of the prediction with the covariance of mini-batch gradient noise. We also extend the representation to stochastic gradient descent with momentum, where the interpolation structure is preserved but with the weights modified by a memory-related factor. Moreover, we establish a concentration estimate for the terminal prediction, identifying the fluctuation scale around the expected second-order representation. Together, these results provide a refinement of the path-kernel interpretation of neural network prediction.

[LG-2] CoMetaPNS: Continually Meta-learning Personalized Neural Surrogates for Cardiac Electrophysiology Simulations

Link: https://arxiv.org/abs/2606.07488
Authors: Ryan Missel,Xiajun Jiang,Linwei Wang
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Personalized virtual heart simulations face challenges in model personalization and computational cost. While neural surrogates offer state-of-the-art solutions, they typically address either efficient personalization or training generalizable models. Recent work reframes this by learning the process of personalizing a surrogate using limited subject-specific context data, through few-shot generative modeling with set-conditioned surrogates and meta-learned amortized inference. These methods, however, assume a static and diverse training distribution with known task identifiers. When new data becomes available, they require costly retraining with all prior data to avoid catastrophic forgetting - a phenomenon where the model forgets earlier tasks when trained on new ones. This is a major limitation in clinical settings where often unlabeled data arrives sequentially and full retraining is infeasible. This paper presents a new continual meta-learning framework to achieve personalized neural surrogates able to not only continually integrate information but also identify whether incoming data stems from a known or unknown dynamics source. By leveraging a continual Bayesian Gaussian Mixture Model over a memory buffer, our framework can infer the identifiers and relationships of data over time - required for effective meta-learning. Empirical results on synthetic cardiac data demonstrate superior simulation forecasting, computational scalability, and resilience to catastrophic forgetting compared to existing baselines.

[LG-3] Network Recovery from Cascade Data: A Debiased Jacobian-Based Machine Learning Approach

Link: https://arxiv.org/abs/2606.07483
Authors: Lei Huang
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:Many important outcomes unfold as dynamic cascades, including product adoption, disease spread, financial distress, and information diffusion. A central challenge is to recover the hidden influence network behind these cascades. Existing methods typically assume a specific diffusion model, and their performance degrades substantially when that assumption is misspecified. We propose CascadeNet, a Jacobian-based machine learning framework for network recovery that does not require specifying a diffusion mechanism. The key idea is that the underlying influence structure can be characterized by the Jacobian of the one-step transition function. CascadeNet first constructs a flexible estimator of the transition function, and further applies Neyman-orthogonal debiasing via the Riesz representer, so that the debiased Jacobian is $\sqrt{n}$-consistent and asymptotically normal, enabling formal inference on the network structure. We validate CascadeNet in both a simulation exercise and a real-world empirical application. In simulations, where the data-generating process is known, CascadeNet achieves the highest network recovery accuracy across nine common data-generating processes. In an empirical application to COVID-19 transmission across Spain’s 52 provinces, CascadeNet recovers transmission networks that are significantly correlated with the true inter-province mobility network, whereas networks recovered by baseline methods show no significant alignment with the ground truth.

[LG-4] Drifting Models for Surrogate Flow Modeling

Link: https://arxiv.org/abs/2606.07481
Authors: Chris R. Jung,Markus Dörr,Natalie Jüngling,Jennifer Niessner,Adam T. Müller,Nicolaj C. Stache
Subjects: Machine Learning (cs.LG)
Comments: Accepted to the 2nd International Symposium AI and Fluid Mechanics 2026

Abstract:While Computational Fluid Dynamics (CFD) provides high-fidelity flow fields for optimizing indoor environments, its computational cost limits rapid exploration. To solve this problem generative surrogates offer better distribution modeling than deterministic networks, but iterative sampling is slow. To enable high-quality, single-pass generation, we adapt the novel generative drifting framework to fluid mechanics. We introduce a conditional architecture that performs drifting in a learned VAE latent space and uses label-aware masking to align generated samples with their boundary conditions. Our label-conditioned model matches iterative diffusion in accuracy and flow consistency while running two orders of magnitude faster. Additionally, we propose a spatial-conditioning variant that establishes a promising path towards generalization to unseen geometries. Ultimately, conditional drifting serves as a highly efficient alternative to diffusion based approaches, unlocking real-time CFD surrogates where inference speed is critical.

[LG-5] Unsupervised Continual Clustering via Forward-Backward Knowledge Distillation KDD2026 ECML

Link: https://arxiv.org/abs/2606.07474
Authors: Mohammadreza Sadeghi,Sareh Soleimani,Zihan Wang,Narges Armanfard
Subjects: Machine Learning (cs.LG)
Comments: Accepted at ECML PKDD 2026 (Research Track). arXiv admin note: substantial text overlap with arXiv:2405.19234

Abstract:Unsupervised Continual Learning (UCL) aims to enable neural networks to learn sequential tasks without labels or access to past data. A major challenge in this setting is Catastrophic Forgetting, where models forget previously learned tasks upon learning new ones. This challenge is amplified in UCL due to the absence of labels to guide learning and memory retention. Existing mitigation strategies, such as knowledge distillation and replay buffers, often raise memory and privacy concerns. Moreover, current UCL methods largely overlook clustering-specific objectives. To fill this gap, we introduce Unsupervised Continual Clustering (UCC) and propose Forward-Backward Knowledge Distillation for Continual Clustering (FBCC). FBCC employs a continual teacher network with a clustering projector and lightweight task-specific students. Through a dual-phase forward-backward distillation process, the teacher learns new clusters while preserving previously discovered cluster structure without storing past data. FBCC represents a pioneering approach to UCC, demonstrating improved clustering performance across sequential tasks. Experiments on four benchmark datasets demonstrate that FBCC consistently outperforms existing continual learning baselines in clustering accuracy while significantly reducing catastrophic forgetting.

[LG-6] Time series Foundation Models based on Physics-Informed Synthetic Histories for Cold-Start Photovoltaic Forecasting ICML

Link: https://arxiv.org/abs/2606.07457
Authors: Lorenzo Longarini,Alessandro Rongoni,Simone Silenzi,Emanuele Frontoni,Riccardo Rosati
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
Comments: To be published in the 2nd ICML Workshop on Foundation Models for Structured Data

Abstract:At commissioning time, Photovoltaic (PV) operators must forecast production before target-site observations are available, limiting the direct use of standard supervised forecasters. This cold-start setting is addressed with a zero-shot pipeline that generates a synthetic production history from plant metadata and meteorological covariates, enabling time-series foundation models (TSFMs) to forecast through inference-time conditioning. Five TSFMs are benchmarked against classical baselines under strict Cold-Start Baseline, Real Feedback, and Self-Forecast Feedback strategies. The evaluation spans 440 PV sites across four datasets and diverse climate regimes. Covariate-aware foundation models outperform baselines by approximately $1.7$-$2\times$: TabPFN-TS achieves the lowest error under Real Feedback (MAE $0.514$, RMSE $0.721$ kWh kWp$^{-1}$ d$^{-1}$), while Chronos-2 is most robust under Self-Forecast Feedback. Performance is largely insensitive to the synthetic-history source, indicating that accuracy is driven more by the availability of plausible temporal context than by the specific generator.

[LG-7] Discovering Multiscale Deep Formulas in Complex Systems via Neural-Guided Lambda Calculus

Link: https://arxiv.org/abs/2606.07426
Authors: Hanqiao Yu,Shusen Yang,Xuebin Ren,Cong Zhao
Subjects: Machine Learning (cs.LG)
Comments: 35 pages, 5 figures; Supplementary Information available as an ancillary file (79 pages)

Abstract:A fundamental problem in science is identifying underlying patterns of complex systems in the form of concise mathematical formulas. Current Artificial Intelligence (AI)-based methods have shown strong performance in single-scale systems, yet remain limited in identifying scale-specific formulas in multiscale complex systems. We present Deflex, an end-to-end AI method to automatically extract multiscale formulas with potentially different forms, including invariants and distributions, from complex systems. Deflex consists of two subsystems named Deflexformer and Deflexpressor. Deflexpressor is a lambda-calculus symbolic regression model for higher-order formulas. Deflexformer is a decomposable deep energy model for learning unified representations across scales. Deflexpressor generates synthetic data to pre-train Deflexformer, which then guides formula discovery by decoupling multiscale latent relationships. Across six representative complex systems with diverse behaviors, Deflex achieves up to 7-fold higher efficiency than the state-of-the-art methods while enabling automated multiscale discovery. Our work could be a useful tool for scientific discovery across disciplines.

[LG-8] Video-Based Prediction of In-Flight Particle Characteristics in Atmospheric Plasma Spraying KDD2026 ECML

Link: https://arxiv.org/abs/2606.07416
Authors: Abhijeet Praveen,Sareh Soleimani,Cormac Cureton,Aman Sidhu,Kintak Raymond Yu,Cristian Cojocaru,Narges Armanfard
Subjects: Machine Learning (cs.LG)
Comments: Accepted at ECML PKDD 2026 (Applied Data Science Track)

Abstract:Atmospheric plasma spraying (APS) is a widely used coating process in which in-flight particle temperature and velocity strongly influence coating quality. However, these particle characteristics are difficult to monitor continuously during operation, motivating the development of non-invasive data-driven diagnostic methods. In this work, we investigate the predictive potential of high-speed video observations of the plasma plume for estimating in-flight particle characteristics in APS. We introduce three different video-derived feature representations and evaluate them using Tabular Prior-Data Fitted Networks (TabPFN), convolutional neural networks (CNN), and classical regression baselines including Random Forest, Gradient Boosting, Support Vector Regression, and XGBoost. Experiments are conducted using grouped leave-one-out cross-validation on 126 labeled pre- and post-spray video recordings from 63 APS spray runs. Across the engineered feature experiments, TabPFN achieves the most consistent performance for temperature prediction, reaching R2 = 0.86 using the combined feature representation. CNN models particularly perform stronger for velocity prediction, achieving R2 of 0.81. In addition, we evaluate models operating directly on raw video frames using pretrained CNNs and find that the highest performance is achieved by a pretrained CNN with a regression head with R2 of 0.90 and 0.82 for temperature and velocity, respectively. The results demonstrate that video-derived plume information provides a promising and scalable foundation for non-invasive APS diagnostics and real-time process monitoring.

[LG-9] Sparsely gated tiny linear experts

Link: https://arxiv.org/abs/2606.07414
Authors: Simon Schug
Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: Code available at this https URL

Abstract:Sparsity allows scaling model parameters without proportionally increasing computational cost. While mixture of experts (MoE) models are made increasingly sparse, individual experts typically remain large and dense. Here, we demonstrate that further increasing sparsity by shrinking each expert to consist of a single neuron and selecting a tiny fraction of many available neurons can improve compute efficiency and interpretability. Counterintuitively, the key to achieving both is removing the nonlinearity typically applied to the experts, resulting in a network of sparsely gated linear neurons (sgatlin). In an isoflop comparison, we find that replacing all transformer feedforward layers with sgatlin improves perplexity in language models across different compute budgets. At the same time, the sparsity and linearity of the resulting feedforward circuits present new opportunities for model interpretability. In a small-scale case study, we demonstrate that feedforward circuits in sgatlin can be interpreted without having to train additional replacement models. We find that they form semantically structured clusters and are causally implicated in factual recall. Our findings paint a possible path towards compute-efficient and interpretable transformer feedforward layers.

[LG-10] Reversible Foundations: Training a 120B Sparse MoE through State-Preserving Scaling KR

Link: https://arxiv.org/abs/2606.07404
Authors: Rohan Shravan
Subjects: Machine Learning (cs.LG)
Comments: 58 pages, 9 figures, 37 tables. Code: this https URL. Released models: this http URL {2B, 5B-MoE, 9B-MoE, 120B-MoE}. Companion work: arXiv:2605.29379 (BrahmicTokenizer-131K), arXiv:2605.29459 (Kronecker Embeddings)

Abstract:This paper reports on training a hundred-billion-parameter sparse mixture of experts on a single eight-GPU node, end to end. LightningLM 0.1V is a recurrence-backbone language model family grown in four stages from a small dense seed, through a 5B and a 9B mixture of experts, to a 120B model with 460 routed experts under top-12 routing. Each larger model is grown from the trained weights of the smaller one; active parameters rise monotonically from 1.78B at the dense seed to 5.93B at 120B (about 5% of the 118.67B stored). The full lineage runs on single nodes, the larger stages at 8K context, reaching a released training loss of 1.78 at 120B scale. This is a systems and experience report. It is organized around three disciplines. Reversibility: a reversible recurrence stack reconstructs activations in the backward pass instead of storing them, holding activation memory flat as the model grows. State-preserving growth: each expansion (dense to MoE, shallow to deep, few experts to many) is given as a reproducible principle paired with the failure that results from getting it wrong; several failures are silent. Single-node economics: the 120B trains through TQP, a strategy of quantized base expert weights and trained low-rank adapters that carries optimizer state on 2.26B adapter parameters rather than 100B+ resident in routed experts, cutting expert-path optimizer state by a factor of ~45. What is new is the integration of known primitives, not any primitive in isolation: one grown lineage running end to end on a single node, documented at practitioner level, with per-domain held-out loss as evidence that targeted capabilities (multilingual Indic competence, code) were learned by construction. Model family, tokenizer, and training code are released.
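
The two ratios quoted in this abstract can be reproduced from its own figures. The sketch below only redoes that arithmetic. It assumes optimizer state scales in proportion to the number of trained parameters, and it takes "100B+" at its lower bound of 100B, so the computed reduction is a floor rather than the paper's exact number. The variable names are illustrative.

```python
# Figures quoted in the abstract (billions of parameters).
ACTIVE_PARAMS_B = 5.93      # active parameters per token at 120B scale
STORED_PARAMS_B = 118.67    # total stored parameters
ADAPTER_PARAMS_B = 2.26     # trained low-rank adapter parameters
# Assumption: "100B+" routed-expert parameters taken at the 100B lower bound.
ROUTED_EXPERT_PARAMS_B = 100.0

# Share of stored parameters that are active for a given token.
active_fraction = ACTIVE_PARAMS_B / STORED_PARAMS_B

# Assumption: optimizer state is proportional to the trained parameter count,
# so the reduction factor is just the ratio of the two parameter counts.
optimizer_state_reduction = ROUTED_EXPERT_PARAMS_B / ADAPTER_PARAMS_B

print(f"active fraction: {active_fraction:.1%}")
print(f"optimizer-state reduction (lower bound): {optimizer_state_reduction:.1f}x")
```

Under these assumptions the active fraction comes out at about 5% and the reduction factor at a little over 44, consistent with the abstract's "about 5%" and "~45" once the "+" in "100B+" is taken into account.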

[LG-11] Generative Modeling of Discrete Latent Structures via Dynamic Policy Gradients ICML2026

Link: https://arxiv.org/abs/2606.07400
Authors: Stefan Ivanovic,Ge Liu,Mohammed El-Kebir
Subjects: Machine Learning (cs.LG)
Comments: ICML 2026

Abstract:Many scientific problems require inferring unobserved mechanistic latent states from indirect observations. While classical approaches, including expectation maximization, do not scale to combinatorially large spaces, deep learning approaches such as variational autoencoders typically form artificial latent states rather than reconstructing the mechanistic ground-truth states. Here, we introduce GReinSS, a policy learning framework that uses dynamically rescaled rewards to learn latent state distributions that maximize the observed data likelihood. We show that GReinSS accurately reconstructs simulated latent sets and latent graphs, outperforming alternative policy learning and generative modeling baselines. Additionally, GReinSS reconstructs isoforms from real short-read RNA sequencing data that better match isoforms detected by orthogonal long-read sequencing than the standard RSEM algorithm. Overall, GReinSS is a principled and practically effective approach for generative modeling and inference of combinatorial latent states from indirect observations.

[LG-12] Making the Most of Limited Data: Score-Aware Training for Text-to-Music Generation

Link: https://arxiv.org/abs/2606.07387
Authors: Yun-Chen Cheng, Tzu-Hung Huang, Chih-Pin Tan
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:State-of-the-art text-to-music generation systems rely on massive proprietary datasets and industrial-scale compute, making it impossible to disentangle architectural contributions from resource advantages. We propose score-aware training, which treats audio-caption alignment score as a direct supervision signal throughout the pipeline. Rather than discarding low-scoring segments, we repurpose them via a CLAP-conditioned Beta noise timestep schedule that routes them to high-noise training regimes, acting as an effective implicit regularizer. Complementarily, segment-level filtering removes the most misaligned examples, and a two-stage caption procedure bridges the distribution gap between verbose training captions and concise inference prompts. A REPA auxiliary loss further transfers structured semantic knowledge from pretrained CLAP and MuQ encoders without additional data. Our 450M-parameter FluxAudio-based system, submitted to the ICME 2026 ATTM Grand Challenge Efficiency Track, ranked 2nd across both tracks in the objective evaluation and 3rd in the Efficiency Track in the final MOS evaluation.

[LG-13] RhinoVLA Technical Report

Link: https://arxiv.org/abs/2606.07383
Authors: Huixi Intelligence: Chen Zhang, Chenyang Zhou, Guanglei Ding, Guanghui He, Haibin Gao, Jiajia Chen, Jianyong Zhang, Lianyi Yu, Ningyi Xu, Ping Xu, Qingchen Li, Yingjun Hu, Yijia Zhang, Yuxi Liu
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Comments:

Abstract:Vision-Language-Action (VLA) models have shown strong potential for robotic manipulation, but real-time deployment on edge hardware remains challenging. In this work, we identify VLM visual and context tokens as a major source of deployment latency: for GEMM-dominated projection operators, computation grows linearly with the number of input tokens when model dimensions are fixed. Motivated by this observation, we propose RhinoVLA, a deployment-oriented VLA model co-designed with the Huixi R1 edge SoC. RhinoVLA adopts a token-efficient Qwen3-VL backbone and a continuous Action Expert, reducing the VLM-side token and computation burden while preserving pretrained multimodal capability. To support cross-robot learning, RhinoVLA further introduces a unified interface that combines a View Registry, a 72D physical state-action slot space, and robot-instance LoRA, allowing heterogeneous robot observations and action schemas to be aligned under a shared policy. On the deployment side, RhinoVLA is optimized through hardware-aware compilation, mixed-precision execution, and parallel visual encoding. Experiments show that RhinoVLA achieves downstream performance comparable to π0.5 at a similar parameter scale, while reaching 11.69 Hz end-to-end inference on Huixi R1, meeting the 10 Hz real-time closed-loop control target. The project will be open-sourced at this https URL.
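
The linear-in-tokens claim and the real-time budget can be checked with simple arithmetic. The following is a minimal sketch, not code from the report: the function names and the example dimensions are hypothetical, and the FLOP count assumes one multiply and one add per inner-product term of a dense projection.

```python
def projection_flops(num_tokens: int, d_in: int, d_out: int) -> int:
    """FLOPs for a dense projection (GEMM) of shape [num_tokens, d_in] x [d_in, d_out].

    Counts one multiply and one add per inner-product term.
    """
    return 2 * num_tokens * d_in * d_out


def control_period_ms(rate_hz: float) -> float:
    """Time available per control step, in milliseconds, at a given rate."""
    return 1000.0 / rate_hz


if __name__ == "__main__":
    # Hypothetical model dimensions, chosen only for illustration.
    d_in, d_out = 2048, 2048
    for n in (256, 512, 1024):
        # With d_in and d_out fixed, doubling the token count doubles the FLOPs.
        print(n, projection_flops(n, d_in, d_out))
    # Rates quoted in the abstract: 11.69 Hz achieved against a 10 Hz target.
    print(control_period_ms(11.69), "ms per step;", control_period_ms(10.0), "ms budget")
```

Under this cost model, halving the number of visual and context tokens halves the projection cost, which is the motivation the abstract gives for a token-efficient backbone.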

[LG-14] Covariance Shrinkage via Stochastic Interpolation

Link: https://arxiv.org/abs/2606.07382
Authors: Mathieu Chalvidal, Florentin Coeurdoux, Eric Vanden-Eijnden
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 18 pages

Abstract:We recast classical shrinkage of high-dimensional covariance estimators as empirical risk minimization over a parametric stochastic interpolant between a source and a target distribution. This formalism recovers known shrinkage estimators as special cases and reveals three distinct mechanisms for reducing statistical risk: (i) Scheduling: the interpolant schedule determines the class of admissible covariances, and hence the achievable risk. (ii) Flow maps and couplings: whereas naive constructions amount to assuming independence between the distributions, specific coupling structures (e.g., solutions of optimal transport problems) can lower the empirical risk. Moreover, non-linear flow maps realizing such couplings free the interpolant covariance from the eigenbasis of the empirical estimate, enabling eigenvector regularization. (iii) Early stopping: estimators defined by integrating a regressed vector field afford an additional bias-variance trade-off through approximation of the true interpolant distribution. We then propose a neural estimator of the interpolant, together with an upper bound on its quadratic risk in terms of the interpolant approximation error, and validate both on synthetic experiments. Finally, we apply the estimator to real neuroimaging data, demonstrating the additional regularization power this approach offers in practice.
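
For readers unfamiliar with classical shrinkage, a common textbook form is linear shrinkage of the sample covariance toward a scaled identity. The sketch below illustrates that form only; whether it is one of the special cases the paper recovers is an assumption, and the function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def sample_covariance(rows: Matrix) -> Matrix:
    """Unbiased sample covariance of data given as a list of observations."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [
        [
            sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
            for j in range(p)
        ]
        for i in range(p)
    ]


def linear_shrinkage(cov: Matrix, alpha: float) -> Matrix:
    """Return (1 - alpha) * cov + alpha * mu * I, with mu = trace(cov) / p.

    alpha = 0 keeps the empirical estimate; alpha = 1 returns the scaled identity.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    p = len(cov)
    mu = sum(cov[i][i] for i in range(p)) / p
    return [
        [(1 - alpha) * cov[i][j] + (alpha * mu if i == j else 0.0) for j in range(p)]
        for i in range(p)
    ]
```

This form preserves the trace and leaves the eigenvectors of the empirical estimate unchanged, which is the restriction the abstract says non-linear flow maps can lift.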

[LG-15] Self-evolving LLM agents with in-distribution Optimization ICML2026

Link: https://arxiv.org/abs/2606.07367
Authors: Yudi Zhang, Meng Fang, Zhenfang Chen, Mykola Pechenizkiy
Subjects: Machine Learning (cs.LG)
Comments: ICML 2026

Abstract:Large Language Models (LLMs) have recently emerged as powerful controllers for interactive agents in complex environments, yet training them to perform reliable long-horizon decision making remains a fundamental challenge. A key difficulty lies in credit assignment: agents often receive delayed rewards only at the end of episodes. In this paper, we propose Q-Evolve, a self-evolving framework for LLM agents that unifies automatic process-reward labeling and policy learning within a principled in-distribution reinforcement learning paradigm. In each evolving iteration, our method learns an in-distribution critic from a hybrid off-policy dataset that combines expert demonstrations with agent-generated trajectories, stabilizing Bellman backups in sparse-reward settings via a weighted Implicit Q-Learning objective. The learned value function is then used to derive step-wise process rewards through advantage estimation, enabling dense and reliable supervision without environment backtracking or human annotation. Leveraging these signals, we perform behavior-proximal policy optimization that evolves the agent over the data used for process reward labeling, allowing iterative self-improvement without exacerbating distribution shift. We evaluate our method on AlfWorld, WebShop, and ScienceWorld, showing Q-Evolve outperforms strong baselines in sample efficiency, robustness, and overall task performance. Our results demonstrate that stable agent self-evolution is achievable through the co-evolution of process-level supervision and policy, both grounded within a shared in-distribution learning loop.
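
Two ingredients named in the abstract can be sketched in a few lines: the expectile loss that standard Implicit Q-Learning uses to fit a value function, and step-wise rewards obtained as advantages. This is an illustrative sketch only. The paper's specific weighting of the IQL objective is not described in the abstract and is not reproduced here, and the default `tau` and the function names are hypothetical.

```python
from typing import List, Sequence


def expectile_loss(diff: float, tau: float = 0.7) -> float:
    """Asymmetric squared loss from standard Implicit Q-Learning.

    diff = Q(s, a) - V(s). Positive errors are weighted by tau and
    negative errors by 1 - tau, so tau > 0.5 pushes V toward an upper expectile.
    """
    weight = tau if diff > 0 else 1.0 - tau
    return weight * diff * diff


def step_process_rewards(q_values: Sequence[float], v_values: Sequence[float]) -> List[float]:
    """Per-step advantages A_t = Q(s_t, a_t) - V(s_t), used as dense step rewards."""
    if len(q_values) != len(v_values):
        raise ValueError("q_values and v_values must have the same length")
    return [q - v for q, v in zip(q_values, v_values)]
```

A positive advantage marks a step that did better than the critic expected from that state, which gives a training signal at every step even when the environment only rewards the end of an episode.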

[LG-16] Breaking the Ice: Analyzing Cold Start Latency in vLLM

Link: https://arxiv.org/abs/2606.07362
Authors: Huzaifa Shaaban Kabakibo, Animesh Trivedi, Lin Wang
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:As scalable inference services become popular, the cold start latency of an inference engine becomes important. Today, vLLM has evolved into the de facto inference engine of choice for many inference workloads. Although popular, due to its complexity and rapid evolution, there has not been a systematic study of its startup latency. With major architectural innovations such as the V1 API and the introduction of this http URL, this paper presents the first detailed performance characterization of vLLM startup latency. We break down the startup process into six foundational steps and demonstrate that it is predominantly CPU bound. Each step exhibits consistent and interpretable scaling trends with respect to model-level and system-level parameters, enabling fine-grained attribution of latency sources. Building on these insights, we develop a lightweight analytical model that accurately predicts vLLM startup latency for a given hardware configuration, providing actionable guidance for resource planning in large-scale inference environments. All benchmarking datasets, analysis tools, and prediction scripts are open sourced at this https URL.

[LG-17] TabSwift: An Efficient Tabular Foundation Model with Row-Wise Attention ICML2026

Link: https://arxiv.org/abs/2606.07345
Authors: Si-Yang Liu, Han-Jia Ye
Subjects: Machine Learning (cs.LG)
Comments: Accepted to ICML 2026, spotlight

Abstract:Tabular foundation models, exemplified by TabPFN, perform prediction via in-context learning, inferring test labels directly from labeled training examples. They have demonstrated competitive performance, particularly on small-to-medium datasets. However, recent tabular foundation models often improve accuracy with increasingly complex architectures, incurring higher inference cost and limiting practical deployment. In this work, we revisit the original TabPFN design and show that a lightweight row-wise attention-only backbone can remain highly competitive with two simple enhancements: a gated attention stabilization mechanism and a small set of learnable register tokens that provide global context and improve pretraining quality. The resulting model, TabSwift, supports both classification and regression, and is competitive with stronger tabular foundation models (e.g., TabPFN v2 and TabICL) while being more efficient at inference. For latency-sensitive serving, we further introduce an adaptive layer-wise early-exit mechanism that dynamically adjusts inference depth per sample. Overall, TabSwift enables efficient and anytime tabular in-context learning for practical deployments.

[LG-18] How Far Can Chord-Symbol Time-Series Adaptation Carry Genre Identity? Capabilities and Boundaries in Multi-Genre Chord-Symbol Modeling

Link: https://arxiv.org/abs/2606.07334
Authors: Jinju Lee
Subjects: Sound (cs.SD); Machine Learning (cs.LG)
Comments: 16 pages, 4 figures

Abstract:Harmony is a compact symbolic layer where mathematical pitch relations, acoustic consonance, and musical convention meet. This report treats chord-symbol sequences not as a complete representation of music, but as an interpretable, controllable time series for genre-local harmonic modeling. Starting from a frozen pop-jazz Music Transformer checkpoint, I evaluate how far small adaptation interfaces can extend the model to eleven target genres: blues, bossa nova, Bach chorales, country, electronic, folk, funk, gospel, hip-hop, R&B/soul, and rock. The main evaluation compares LoRA, IA3, BitFit, prefix tuning, and full fine-tuning over 11 genres and 3 seeds, a complete 165-cell grid. All five methods improve over the frozen base on held-out chord prediction, with macro gains from +2.89 to +3.61 points; LoRA and IA3 score highest, but Wilcoxon tests with Holm and Benjamini-Hochberg correction do not support a decisive winner. A matched-data-size control sharpens this: when genres are sub-sampled to a common corpus size, IA3 stays on top but LoRA’s full-data edge disappears and it falls to last, indicating the small gaps are partly data-driven. A control-token baseline is also strong, and wrong-genre adapters often beat the frozen base, suggesting much of the effect comes from lightweight conditioning over a reusable harmonic base rather than one particular adapter family. Additional diagnostics (rank sweeps, wrong-genre rotation, a base-checkpoint ablation, chord-only genre classification, generated-output statistics, real-song evaluation, and duplicate analysis) support a bounded conclusion: chord-symbol adaptation reliably improves genre-local harmonic prediction, but chord symbols alone do not carry complete genre identity. The report therefore avoids claims about perceived genre authenticity or full musical quality, which require controlled listener or musician evaluation.
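
The abstract relies on Holm correction to decide whether any adaptation method is a decisive winner. As a reminder of how that correction works, here is a minimal sketch of the Holm step-down procedure. It is not the report's analysis code, and any p-values passed to it in an example are hypothetical.

```python
from typing import List, Sequence


def holm_reject(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Holm step-down procedure.

    Returns a list of booleans in the original order, True where the
    null hypothesis is rejected at family-wise error rate alpha.
    """
    m = len(p_values)
    # Visit hypotheses from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (rank = k - 1) is compared with alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # the first failure stops the procedure; larger p-values are retained
    return reject
```

Because the thresholds tighten with the number of comparisons, small gaps between methods such as those described in the abstract can fail to reach significance after correction.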

[LG-19] Bootstrap Theory of Representational Emergence: Explanatory Insufficiency as a Driver of Representation Learning and World Models

Link: https://arxiv.org/abs/2606.07303
Authors: Jacques Raynal, Pierre Slangen, Elsa Raynal, Jacques Margerit
Subjects: Machine Learning (cs.LG)
Comments: 24 pages, 25 references. Theoretical framework relating representation learning, representational emergence, and world models

Abstract:Representation learning is central to modern machine learning, enabling transitions from handcrafted features to learned embeddings, latent spaces, foundation models, world models, and digital twins. Yet most research examines how representations are optimized after a representational framework has been selected, while less attention is given to when a new level of representation becomes necessary. We introduce the Bootstrap Theory of Representational Emergence (TBER), a framework describing how new representations arise when existing ones become explanatorily insufficient. In this view, representational innovation is not only driven by more data, larger models, or greater computational power, but also by persistent explanatory gaps: situations in which a representation can still describe observations but can no longer make their organization or transformations intelligible. TBER identifies explanatory insufficiency as a positive signal for representational transition. A representation becomes insufficient not because it is necessarily false, but because its explanatory domain has been exceeded. The bootstrap dynamic follows a recursive sequence: observations reveal anomalies; anomalies expose insufficiencies; insufficiencies motivate new representations; and these new representations generate further observations and possible new this http URL formalize this process through five stages: stabilized observation, anomaly detection, recognition of explanatory insufficiency, representational emergence, and provisional stabilization. We discuss applications to representation learning, latent spaces, foundation models, world models, digital twins, adaptive biological systems, and scientific discovery. TBER suggests that future AI systems may benefit from mechanisms for detecting the explanatory limits of their own internal representations.

[LG-20] TargetSEC: Plug-and-Play In-the-Wild Speech Emotion Conversion via Arousal-Conditioned Latent Style Diffusion

Link: https://arxiv.org/abs/2606.07293
Authors: Constantin Alexander Auga
Subjects: Sound (cs.SD); Machine Learning (cs.LG)
Comments: 5 pages, 2 figures, 2 tables, preprint

Abstract:Speech Emotion Conversion (SEC) aims to transform the emotion of a source utterance into a target emotion while preserving content and speaker identity. SEC on in-the-wild data is challenging due to the non-parallel nature of training data and complex real-world acoustics. Existing fixed-duration approaches either struggle to shift the emotion effectively (high quality, low conversion) or degrade speech naturalness (low quality, high conversion). We propose TargetSEC, an embedding-driven latent diffusion framework that generates emotion-focused style embeddings conditioned on speaker identity and continuous emotion. Unlike methods that diffuse over spectrograms, TargetSEC operates in a compact latent space. Experiments on the MSP-Podcast dataset show that TargetSEC outperforms current non-duration baselines in conversion accuracy while maintaining high speech quality, and achieves performance comparable to duration-prediction systems without explicit temporal modeling.

[LG-21] Trio: Learning Time-Series Forecasting with Temporal-Spatial-Sample Attention and Structural Causal Priors

Link: https://arxiv.org/abs/2606.07291
Authors: Tao Chen, Yexu Zhou, Zhi Gong, Hengwei He, Hongda Li, Zhewei Chen, Dongjing Wang, Xin Zhang, Decheng Liu, Chunlei Peng, Zheng Chen, Wenyue Ding
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Multivariate time-series forecasting requires models to reason over temporal dynamics, cross-variable dependencies, and historical input-output correspondences. Recent Prior-Data Fitted Networks (PFNs) suggest that synthetic tasks can be useful for learning transferable inference behavior. However, directly transferring this paradigm to time-series forecasting remains difficult, since temporal order, dynamic lags, and recurring historical patterns are not naturally captured by ordinary tabular priors. Motivated by this observation, we propose Trio, a sample-aware time-series forecasting architecture based on Temporal-Spatial-Sample attention. Temporal attention captures within-window dynamics, spatial attention models inter-variable dependencies, and sample attention retrieves relevant historical lookback-future pairs to guide the current prediction. Rather than claiming a fully general PFN-style forecaster, our goal is to study how historical input-output examples can be explicitly organized and reused within a forecasting model. We further introduce a Time-Series Structural Causal Model (TS-SCM) generator to create structured synthetic forecasting tasks with dynamic lags, cross-variable interactions, noise, feedback, and distributional drift. Experiments on synthetic, industrial, and public benchmarks show that the proposed architecture improves forecasting performance. Exploratory zero-shot experiments further suggest that TS-SCM-generated tasks may provide useful structural priors, while fully general PFN-style time-series forecasting remains an open problem.

[LG-22] The Capacity of Information-Theoretic Secure Aggregation in Federated Learning

Link: https://arxiv.org/abs/2606.07277
Authors: Lanxin Yi, Jinbao Zhu, Kai Wan, Xiaohu Tang
Subjects: Information Theory (cs.IT); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Abstract:Secure aggregation allows a server to aggregate users’ local updates while preserving update privacy. Existing information-theoretic problems typically assume that correlated random keys are provided by a trusted third party (TTP) or generated via prescribed groupwise structures, while the communication cost for establishing such correlated keys is often ignored. Consequently, the fundamental limits under general key-distribution mechanisms remain unknown. In this paper, we study the T-colluding information-theoretic secure aggregation problem with N users under a general two-phase framework consisting of a key distribution phase and an update aggregation phase. Unlike prior work, we model key distribution through user-to-user communication and allow arbitrary user-generated key-distribution mechanisms, eliminating TTP or prescribed structures. This enables a joint characterization of three resources: randomness for security, key-distribution communication, and aggregation communication. We completely characterize the capacity region among these three resources by constructing a novel secure aggregation scheme together with a matching information-theoretic converse. In particular, we develop an explicit deterministic capacity-achieving construction over any finite field of size at least N, whereas most existing schemes either rely on TTP or employ randomized or existential constructions over sufficiently large finite fields. We further show that the optimal performance can be achieved using only pairwise shared keys, enabling implementation via Diffie–Hellman key exchange. Compared with Google’s seminal secure aggregation scheme, the proposed scheme requires fewer random masking keys while preserving the same aggregation communication overhead.
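
The role of pairwise shared keys can be illustrated with the familiar pairwise-masking idea, in which each pair of users shares a key that one adds and the other subtracts, so that all masks cancel in the sum. The sketch below shows that idea only. It is not the paper's capacity-achieving construction, it ignores collusion and dropouts, and the modulus, user count, and function names are hypothetical. Python's `random` module is used for reproducibility and is not cryptographically secure.

```python
import random
from itertools import combinations
from typing import Dict, List, Tuple


def pairwise_keys(num_users: int, modulus: int, rng: random.Random) -> Dict[Tuple[int, int], int]:
    """One shared key per unordered user pair (i, j) with i < j."""
    return {(i, j): rng.randrange(modulus) for i, j in combinations(range(num_users), 2)}


def mask_update(user: int, update: int, keys: Dict[Tuple[int, int], int],
                num_users: int, modulus: int) -> int:
    """Add the shared key toward higher-indexed peers, subtract it toward lower-indexed ones."""
    masked = update % modulus
    for other in range(num_users):
        if other == user:
            continue
        key = keys[(min(user, other), max(user, other))]
        masked = (masked + key) % modulus if user < other else (masked - key) % modulus
    return masked


def aggregate(masked_updates: List[int], modulus: int) -> int:
    """Server-side sum; every pairwise key appears once with + and once with -."""
    return sum(masked_updates) % modulus


if __name__ == "__main__":
    rng = random.Random(0)
    q, updates = 101, [3, 10, 25, 7]  # hypothetical prime modulus and user updates
    keys = pairwise_keys(len(updates), q, rng)
    masked = [mask_update(u, x, keys, len(updates), q) for u, x in enumerate(updates)]
    print(aggregate(masked, q) == sum(updates) % q)
```

The server sees only masked values, yet recovers the exact sum modulo the field size, because each key contributes zero in total.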

[LG-23] A Held-Out Transition-Pair Falsifier for Long-Horizon Non-Abelian State Tracking

Link: https://arxiv.org/abs/2606.07254
Authors: Jeonghoon Lee
Subjects: Machine Learning (cs.LG); Formal Languages and Automata Theory (cs.FL)
Comments: Technical preprint, 24 pages. 7 figures

Abstract:State tracking exposes a sharp limitation of sequence models: the relevant signal is often not a summary of observed tokens, but an ordered latent state that evolves through non-commutative transformations. We introduce a held-out transition-pair falsifier for finite non-Abelian group tracking. The protocol forbids selected ordered generator pairs during training and requires the same local patterns during evaluation, blocking one direct local-transition memorization pathway. In a controlled S_3 \times S_3 benchmark, a projected recurrent state model trained only on length-8 sequences produces error-free final-state predictions (perfect 250/250 per horizon) through evaluation horizons up to 1,048,576 tokens across five seeds. Matched native-readout baselines, including bag, GRU, and a single-configuration structured state-space model, remain near floor under the same protocol. Projection-matched GRU, structured SSM, and bag baselines equipped with analogous finite-group prototype readouts also remain near chance under the same split. Mechanism diagnostics show that hard projection coincides with low homomorphism error, low state-consistency drift, and non-trivial commutator separation, while softened projection collapses final-state accuracy. Clean-split audits verify zero verbatim reduced-word overlap and zero structural-template overlap between training and evaluation partitions. The evidence is scoped to this controlled finite-group falsifier rather than to a general architecture ranking. Within that regime, explicit projected non-commutative state composition acts as a useful inductive bias for long-horizon hidden-state tracking.
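
To make the task concrete, the sketch below composes elements of S_3 x S_3 in order and checks whether a sequence contains a forbidden ordered pair of adjacent generators. It illustrates the kind of ground-truth computation and split filter the abstract describes. The generator choice, the held-out pairs, and all names are hypothetical and not taken from the paper.

```python
from typing import Sequence, Set, Tuple

Perm = Tuple[int, ...]
Elem = Tuple[Perm, Perm]  # an element of S_3 x S_3

IDENTITY_PERM: Perm = (0, 1, 2)
SWAP: Perm = (1, 0, 2)   # transposition of the first two points
CYCLE: Perm = (1, 2, 0)  # a 3-cycle
IDENTITY: Elem = (IDENTITY_PERM, IDENTITY_PERM)


def compose_perm(p: Perm, q: Perm) -> Perm:
    """Return the permutation 'apply q first, then p'."""
    return tuple(p[q[i]] for i in range(len(q)))


def compose(a: Elem, b: Elem) -> Elem:
    """Componentwise composition in the direct product."""
    return (compose_perm(a[0], b[0]), compose_perm(a[1], b[1]))


def track_state(sequence: Sequence[Elem]) -> Elem:
    """Final state after applying the elements in order, starting from the identity."""
    state = IDENTITY
    for g in sequence:
        state = compose(g, state)
    return state


def contains_held_out_pair(sequence: Sequence[Elem], held_out: Set[Tuple[Elem, Elem]]) -> bool:
    """True if any adjacent ordered pair in the sequence is in the held-out set."""
    return any((a, b) in held_out for a, b in zip(sequence, sequence[1:]))
```

Because the group is non-Abelian, swapping two adjacent generators generally changes the final state, so a model cannot solve the task from an unordered summary of its inputs.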

[LG-24] Generative Molecular Morphing for Flexible-Size Design via Unbalanced Optimal Transport

Link: https://arxiv.org/abs/2606.07239
Authors: Malte Franke, Stefan P. Schmid, Zarko Ivkovic, Kjell Jorner, Andreas Krause
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:The success of generative molecular design hinges on a model’s steerability toward high-reward samples. Because many molecular properties are intrinsically linked to molecular size, accurately capturing the joint distribution of properties and the number of atoms is essential. However, current diffusion and flow-based models fix the number of atoms, which ultimately limits their ability to navigate this complex relationship. To address this, we introduce Morph, a flexible-size generative model for conditional and unconditional 3D molecular design based on geometric graphs. By dynamically adapting size, Morph can seamlessly integrate existing structural priors, like scaffolds, and significantly enhances property steering. We show that Morph matches current fixed-size state-of-the-art models while offering the benefit of unparalleled sampling flexibility. We demonstrate out-of-distribution generation in regimes where previous models fail, paving the way for enhanced generative modeling for molecular design.

[LG-25] Entropy as a Structural Prior: How a Log-Barrier on DiT Belief Space Drives Musical Diversity and Development

Link: https://arxiv.org/abs/2606.07207
Authors: Zixi Li, Youzhen Li
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Confidence-based loss weighting is usually avoided in generative models because it accelerates errors when the model is confidently wrong, but this intuition breaks down in supervised diffusion training. We introduce the Eisbach log-barrier, a parameter-free weight derived from the entropy of the DiT output’s spatial energy distribution: high entropy damps the gradient, while low entropy preserves it. Applied to LoRA fine-tuning of Stable Audio 3 Medium on MusicCaps, it unexpectedly yields stronger thematic development, clearer acoustic differentiation, and higher textural diversity than unweighted training, the opposite of mode collapse. This works because in supervised diffusion the gradient direction is locked to ground truth, so confidence only scales the step size, and because temporal entropy downweights flat samples while preserving high-contrast ones. The result is an online, self-referential data curriculum that emerges purely from the forward pass, with analyzed noise-level dynamics and testable predictions.

[LG-26] Towards Tight Bounds for Streaming Attention

Link: https://arxiv.org/abs/2606.07205
Authors: Justin Y. Chen, Ying Feng, Piotr Indyk, Michael Kapralov, Ekaterina Kochetkova, Boris Prokhorov
Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Comments:

Abstract:The attention mechanism is a cornerstone of modern transformer architectures. However, its expressive power comes at the cost of quadratic runtime and linear space usage. In particular, the classical transformer architecture explicitly stores all previously seen input elements (tokens) in order to generate the next one. The problem of implementing a transformer in limited space, known as KV cache compression, has received much interest over the past few years, spurring the development of powerful heuristics. Recent works of Haris et al, COLT’25 and Kochetkova et al, NeurIPS’25, formalized KV cache compression as the streaming attention approximation problem, providing both upper bounds (based on discrepancy theory) and information theoretic lower bounds. However, those papers left open a significant gap between the upper and lower bounds. For example, the space usage of their algorithms increases with the precision parameter, but the lower bound does not get stronger. In this work, we revisit the streaming attention approximation problem and provide nearly tight bounds on its space complexity. On the algorithmic side, we achieve the result through a surprisingly tight interplay between three distinct methods for kernel density estimation: discrepancy-based coreset constructions (e.g., Charikar-Kapralov-Waingarten’24), the polynomial method (e.g., Greengard-Rokhlin’87, Alman-Song’23), and space partitioning (e.g., Andoni-Laarhoven-Razenshteyn-Waingarten’17, Charikar-Kapralov-Nouri-Siminelakis’20). On the lower bound side, our main technical contribution is a new technique for using the INDEX problem with a large amount of side information that we hope will prove useful in other high dimensional geometric estimation problems. 

[LG-27] Structure-Preserving Correction Learning for Sparse Bayesian Inference in Brain Source Imaging

Link: https://arxiv.org/abs/2606.07196
Authors: Marco Morik, Xiao Ruiting, Shinichi Nakajima, Stefan Haufe, Ismail Huseynov
Subjects: Machine Learning (cs.LG)
Comments: preprint

Abstract:Classical sparse Type-II Bayesian methods for M/EEG brain imaging support joint estimation of source and noise hyperparameters, but rely on fixed iterative update rules. Although these updates are principled and interpretable, their dynamics cannot be adapted from data. We propose to learn the update mechanism itself while preserving the underlying Bayesian structure by unfolding a classical joint hyperparameter-learning solver into a trainable neural architecture whose layers mirror the original iterations. The resulting framework is initialized to recover the classical solver exactly before training and is enriched through progressively more expressive correction-learning mechanisms, ranging from learnable biases to adaptive MLP and attention-based contextual refinements. In this way, training does not replace Bayesian inference with a black-box predictor, but instead learns structured correction terms while retaining the interpretability and model-based character of the original update dynamics. Structured correction learning therefore aims to improve empirical reconstruction performance without replacing the original model-based inference mechanism. Experimental results show that the learned correction variants improve reconstruction performance and convergence behavior over the baseline unfolded solver while preserving its algorithmic transparency.

[LG-28] No-Harm Physics-Informed Inverse Learning with Residual-Calibrated Uncertainty

Link: https://arxiv.org/abs/2606.07153
Authors: Ronald Katende
Subjects: Numerical Analysis (math.NA); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 25 pages, 10 Tables, 12 Figures

Abstract:Physics-informed learning is increasingly used for partial differential equation (PDE)-governed inverse problems, but its reliability remains difficult to certify. This paper develops a no-harm certification-and-selection framework for physics-informed inverse learning. A learned reconstruction is accepted only when its residual-calibrated radius is no worse than the baseline radius, namely when $R_{\mathrm{learn}} \le R_{\mathrm{base}} + \varepsilon_{\mathrm{safe}}$; otherwise, the method returns the baseline. The certificate combines data, physics, boundary or initial-condition, and optimization residuals. Under a conditional stability estimate, these residuals yield an a posteriori reconstruction-error bound and a deterministic uncertainty radius. A high-probability certificate is also derived for physics residuals estimated from independent random collocation points. Numerical tests on Poisson source recovery, inverse heat reconstruction, limited-angle tomography, elliptic coefficient identification, and stochastic residual validation show that the selector accepts certified improvements, rejects shifted, hallucinated, or unfinished candidates, and becomes conservative in strongly ill-posed regimes. The framework is therefore a certification-and-selection layer, not another reconstruction architecture.
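
The acceptance rule stated in the abstract is a single comparison of two radii. A minimal sketch follows, assuming the two radii have already been computed from the residuals. The `Candidate` type and the function name are hypothetical, and how the radii are derived is left to the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A reconstruction together with its residual-calibrated uncertainty radius."""
    name: str
    radius: float


def select_no_harm(learned: Candidate, baseline: Candidate, eps_safe: float = 0.0) -> Candidate:
    """Accept the learned reconstruction only if R_learn <= R_base + eps_safe.

    Otherwise fall back to the baseline, so the selected radius never
    exceeds the baseline radius by more than eps_safe.
    """
    if eps_safe < 0:
        raise ValueError("eps_safe must be non-negative")
    return learned if learned.radius <= baseline.radius + eps_safe else baseline


if __name__ == "__main__":
    base = Candidate("baseline", radius=0.20)
    good = Candidate("learned", radius=0.12)
    bad = Candidate("learned-shifted", radius=0.35)
    print(select_no_harm(good, base).name)  # accepted: the radius is smaller
    print(select_no_harm(bad, base).name)   # rejected: falls back to the baseline
```

The selector never returns a candidate whose certified radius is worse than the baseline by more than the tolerance, which is the sense in which the method does no harm.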

[LG-29] Geodesics of Dynamic Graphs for Regime Change Detection

Link: https://arxiv.org/abs/2606.07151
Authors: William Cappelletti, Étienne Voutaz, Pascal Frossard
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Traditional change point detection in dynamic networks assumes abrupt transitions between stationary states, overlooking scenarios of continuous evolution which arise in most real-world applications, such as social networks or physical systems. We address this gap by formally defining regimes as periods of coherent dynamics in temporal graphs, which we characterize as trajectories along geodesics in a suitably defined graph space. This original perspective allows us to define regime changes as significant drifts in dynamics, either toward new trajectories or with pace changes. We leverage graph regression methods to measure the cumulative distance of sequences of observed graphs from the estimated geodesics between their endpoints, in the relevant graph space, which we can combine with change point detection algorithms. We present experiments on dynamic networks, with changing trajectories and varying speeds, in which we outperform state of the art change point detection models. Then, we analyse mobility data during the Covid-19 pandemic, and show that our assumptions on regular network evolution lead to change points that are more aligned to external events compared to the outcomes of baseline methods. Our work is the first to model and detect changes between evolving regimes in graph space, providing a realistic and powerful tool for analyzing complex temporal graph data.

[LG-30] Decision-Aware Evaluation of Physics-Informed Surrogates

Link: https://arxiv.org/abs/2606.07146
Authors: Daniel Cieślak, Andrzej Czyżewski
Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
Comments: 12 pages, 5 figures, 9 tables. Code and data available at this https URL

Abstract:Physics-informed machine learning is often assessed by curve error, although engineering use depends on downstream decisions: ranking candidates, avoiding infeasible designs and limiting regret. We introduce pinn-gym, an open benchmark for material-conditioned lattice design that couples a transparent reduced-order crush-and-impact oracle with five printable polymer cards, dimensionless force-response targets and a protocol spanning curve fidelity, physical admissibility, top-k retrieval and mass regret. Across per-material, pooled and cross-material settings, low nRMSE is frequently insufficient to identify useful design selections. Physics-informed losses alter trade-offs rather than monotonically improving all metrics, and dimensionless conditioning improves comparability without making transfer symmetric. The benchmark is not a certified material model; within the released oracle, candidate generator and material cards, pinn-gym provides a reproducible testbed for evaluating PIML surrogates as decision systems rather than curve predictors alone.
MSC classes: 68T07, 68T05, 65D15, 74S05. ACM classes: I.2.6; I.2.8; J.2.

[LG-31] Explaining Unsupervised Disease Staging in Huntington's Disease: Insights into Model Representations and Clusters ALT

Link: https://arxiv.org/abs/2606.07135
Authors: Lubna Mahmoud Abu Zohair, Hind Zantout
Subjects: Machine Learning (cs.LG)
Comments: Accepted for oral presentation and as a full-length paper at the International Conference on AI in Healthcare 2026 (26-28 August 2026, Imperial College London) and will be published by Springer in the Lecture Notes in Computer Science (LNCS) series

Abstract:Huntington’s disease (HD) is a progressive neurodegenerative disorder that affects motor, cognitive, and behavioral functions, where accurate characterization of disease progression remains essential to improve patient outcome and quality of life. Unsupervised machine learning (ML) approaches have demonstrated the ability to uncover disease progression trajectories and meaningful latent stages from longitudinal data; however, their limited interpretability restricts clinical trust and translation. We extend a previously proposed ML-based disease staging framework by applying an explainability analysis to the extracted feature representations and discovered disease stages. Applied to the Enroll-HD dataset, we first project the learned representations into a lower-dimensional space to intuitively assess whether the resulting clusters align with the progression of established clinical measures. We then use saliency maps to identify the clinical features that most strongly contribute to the learned embeddings over time. Finally, we train a surrogate classifier and apply SHAP to quantify feature importance for cluster assignments and to analyze which clinical variables drive transitions between disease stages. The explainability analysis indicates that the learned embeddings capture clinically meaningful disease structure, aligning with established motor and functional severity scores and exhibiting progressive deterioration across clusters. Within this analysis, SHAP reveals a stratification of disease stages, ranging from early cognitive-motor impairment to severe functional dependency, consistent with known clinical progression patterns, while also highlighting intra-stage variability.

[LG-32] α-PFN: Fast Entropy Search via In-Context Learning ICML2026

Link: https://arxiv.org/abs/2606.07134
Authors: Herilalaina Rakotoarison, Steven Adriaensen, Tom Viering, Carl Hvarfner, Samuel Müller, Frank Hutter, Eytan Bakshy
Subjects: Machine Learning (cs.LG)
Comments: Published at ICML 2026

Abstract:Information-theoretic acquisition functions such as Entropy Search (ES) offer a principled exploration-exploitation framework for Bayesian optimization (BO). However, their practical implementation relies on complicated and slow approximations, i.e., a Monte Carlo estimation of the information gain. This complexity can introduce numerical errors and requires specialized, hand-crafted implementations. We propose a two-stage amortization strategy that learns to approximate entropy search-based acquisition functions using Prior-data Fitted Networks (PFNs) in a single forward pass. A first PFN is trained to be conditioned on information about the optima; second, the α-PFN is trained to predict the expected information gain by training on information gains measured with the first PFN. The α-PFN offers a flexible learned approximation, which replaces the complex heuristic approximations with a single forward pass per candidate, enabling rapid and extensible acquisition evaluation. Empirically, our approach is competitive with state-of-the-art entropy search implementations on synthetic and real-world benchmarks, while accelerating the different entropy search variants across all our experiments, with speed ups over 50x. Source code: this https URL.

[LG-33] A machine-learning-assisted progressive digit-randomness screening framework for detecting non-random patterns in raw numerical research data

Link: https://arxiv.org/abs/2606.07128
Authors: Zhuphua Cao
Subjects: Machine Learning (cs.LG)

Abstract:Raw numerical datasets remain less systematically examined in integrity screening than images, plagiarism, or summary-statistic inconsistencies. We developed the Fabrication-risk Digit Randomness Screening model (FDRS), a statistical and machine-learning framework for detecting non-random digit-pattern irregularities in numerical research data. FDRS integrates single- and joint-decimal-digit tests, Cramer’s V, entropy metrics, Kullback-Leibler divergence, digit-preference indices, progressive subsampling, and semi-supervised risk scoring. It was evaluated using an instrument-derived enzymatic absorbance dataset (RawData, n=253) and a blinded manually simulated irregular dataset (ErrData, n=255). RawData showed no significant deviation in single third-decimal-digit analysis, whereas ErrData showed a significant deviation. In joint third-fourth decimal digit analysis, ErrData showed higher Cramer’s V, lower normalized entropy, higher KL divergence, and a more persistent progressive-subsampling deviation signal. In internal validation, Elastic-net Logistic Regression achieved the highest AUC (0.98395) and lowest Brier score (0.048439), while Random Forest achieved the highest accuracy (0.926667) and balanced accuracy (0.935). RawData received a low ensemble risk score of 0.124627 and was classified as Grade 0; ErrData received a score of 0.740760 and was classified as Grade 3. External real-world benchmarks supported graded risk stratification: three datasets without identified public post-publication concerns were classified as Grade 0 or 1, whereas two datasets from publicly questioned or institutionally handled articles were classified as Grade 2 or 3. FDRS can prioritize raw numerical datasets for further review by integrating interpretable statistical and machine-learning features. It is an auxiliary digit-structure screening tool, not standalone evidence of fabrication or misconduct.
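
The single-digit part of such a screen can be illustrated with a minimal Python sketch. It is not the FDRS implementation: the function names, the uniform reference distribution over the ten digits, and the string-based digit extraction are assumptions made for illustration. The sketch computes a chi-square statistic, the normalized entropy, and the Kullback-Leibler divergence for the third decimal digit of a list of readings.

```python
import math
from collections import Counter


def third_decimal_digits(readings):
    """Extract the third decimal digit from readings stored as strings.

    Assumption: every reading is recorded with at least three decimals,
    e.g. "0.4312". Strings avoid float-formatting artefacts.
    """
    return [int(text.split(".")[1][2]) for text in readings]


def digit_statistics(digits):
    """Compare observed digit frequencies with a uniform distribution."""
    n = len(digits)
    counts = Counter(digits)
    expected = n / 10  # uniform reference: each digit equally likely
    chi_square = sum(
        (counts.get(d, 0) - expected) ** 2 / expected for d in range(10)
    )
    probs = [counts.get(d, 0) / n for d in range(10)]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return {
        "chi_square": chi_square,
        # 1.0 means perfectly uniform digits, 0.0 means a single digit only
        "normalized_entropy": entropy / math.log(10),
        # KL(observed || uniform) = log(10) - H(observed)
        "kl_divergence": math.log(10) - entropy,
    }
```

With ten digit categories the chi-square statistic has 9 degrees of freedom, so values above roughly 16.92 correspond to p < 0.05. As the abstract stresses, such a signal only prioritizes a dataset for review and is not evidence of fabrication.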

[LG-34] Learning Explicit Behavioral Models with Adaptive Questions and World-Model Probes

Link: https://arxiv.org/abs/2606.07127
Authors: Hikaru Shindo, Yu Deng, Teng Cao, Quentin Delfosse, Christopher Tauchmann, Jannis Blüml, Gopika Sudhakaran, Kristian Kersting
Subjects: Machine Learning (cs.LG)

Abstract:Interactive agents trained only against task return can achieve high scores while failing to represent the mechanisms that make their actions succeed. This makes brittle behavior difficult to diagnose and limits adaptation when environment dynamics change. Existing LLM reflection and policy-code repair can revise behavior from failed trajectories, but questions and world-understanding tests are usually used only after training. We introduce an Explicit Symbolic Behavioral Model (ESBM), a trainable behavioral model that couples task performance with evidence-grounded question answering and executable mechanism prediction. An ESBM represents behavior through typed predicates, weighted rules, bounded options and mechanism memory; the mechanism layer predicts symbolic events, object changes, rewards and terminal consequences under action interventions. After each rollout, adaptive questions and active world-model probes convert score failures, QA errors and transition-prediction errors into constraints for local ESBM edits. Candidate models are selected by a multi-criterion rule that jointly evaluates task score, answerability and active world-model consistency. Under the tested Atari-style protocols, ESBM learns high-scoring policies while producing explicit answers and executable mechanism predictions, indicating that adaptive questions can serve as both training pressure and reusable benchmarks for mechanistic policy learning in this setting.

[LG-35] Beyond Linear and Overcomplete Regimes: A Mean-Field Analysis of Bottleneck Autoencoders

Link: https://arxiv.org/abs/2606.07120
Authors: Santanu Das, Ramyak Bilas, Pascal Esser, Satyaki Mukherjee
Subjects: Machine Learning (cs.LG)

Abstract:Autoencoders (AEs) learn low-dimensional representations by mapping data into a latent space while minimizing reconstruction error. Despite their empirical success, theoretical understanding remains limited and largely restricted to linear models or settings without a bottleneck. In this work, we study nonlinear AEs with a fixed finite-dimensional bottleneck in the mean-field (MF) regime. We derive explicit MF learning dynamics for both encoder and decoder, providing a tractable characterization of training in the nonlinear setting. We show that, over finite time horizons, the empirical risk of finite-width networks trained with stochastic gradient descent closely tracks the MF risk trajectory with high probability. At optimality, we further establish that the finite-width risk converges to the MF optimum, demonstrating that finite networks are sufficiently expressive to approximate the infinite-width solution.

[LG-36] The discovery of the effects of women employment participation on the fertility of developing countries: A panel data approach

Link: https://arxiv.org/abs/2606.07093
Authors: Thi Kim Ngan Nguyen
Subjects: Machine Learning (cs.LG)

Abstract:Fertility in developing countries has declined significantly over the last few decades; at the same time, the role of women in the workplace has improved. To better understand the causal effect of women's labor-market participation on the total fertility rate in the developing world, this paper divides a dataset of 115 developing countries over the period 1991-2018 into four continental groups (Africa, North/South America, Asia/Pacific, Europe) and then applies a data-driven panel-data econometric procedure to mitigate omitted-variable bias. The results suggest that the fertility behavior of women in North/South America is influenced by their career choices; in other regions, other factors might be more important to women when they consider having children. In conclusion, policymakers can draw on the paper to formulate policies that give more incentives in making reproductive decisions, and further research in the field needs to consider family policies and patrilocality in developing countries as important data.

[LG-37] Residual-Controlled Multiplier Learning for Stochastic Constrained Decision-Making

Link: https://arxiv.org/abs/2606.07088
Authors: Kang Liu, Jianchen Hu, Ziyu Qu
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)

Abstract:Stochastic constrained decision-making requires optimizing performance objectives while enforcing statistical requirements such as safety or fairness. However, standard primal–dual methods struggle to update multipliers robustly under stochastic mini-batch feedback, as the noise of mini-batch gradients and constraint estimates can be directly accumulated into the multiplier memory. To address this issue, we propose Residual-Controlled Multiplier Learning (RCML), which reformulates multiplier updating as projected-pressure feedback. The central idea is to decompose the projected multiplier into an effective pressure signal for primal descent and a pressure-memory residual for finite-gain multiplier tracking. To handle heterogeneous and noisy observations, we further augment this residual-integral backbone with modular stochastic stabilization components. For the convex-affine backbone, we establish finite-gain convergence, derive a stochastic residual bound under mini-batch feedback, and show that the residual feedback law admits a local KKT-residual interpretation near regular KKT points of nonconvex problems. Experiments across optimization, allocation, and fair-ranking tasks show that RCML improves feasibility control and multiplier stability while maintaining competitive objective performance. Code is available here.
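
For context, the standard primal-dual baseline that the abstract contrasts against can be sketched as follows. This is not RCML: it is plain projected dual ascent on a toy problem chosen for illustration (minimize x^2 subject to 1 - x <= 0), and all names and step sizes are assumptions. The multiplier integrates the constraint estimate directly, which is where mini-batch noise would accumulate in a stochastic setting.

```python
def primal_dual(steps=1000, lr=0.1):
    """Projected primal-dual updates for: minimize x**2 subject to 1 - x <= 0.

    Lagrangian: L(x, lam) = x**2 + lam * (1 - x), with lam >= 0.
    The toy problem and step size are illustrative assumptions.
    """
    x, lam = 0.0, 0.0
    for _ in range(steps):
        # Primal descent on the Lagrangian: dL/dx = 2x - lam
        x -= lr * (2.0 * x - lam)
        # Dual ascent on the constraint value, projected onto lam >= 0
        lam = max(0.0, lam + lr * (1.0 - x))
    return x, lam
```

On this toy problem the iterates approach the KKT point x = 1, lam = 2. Replacing the exact constraint value `1.0 - x` with a noisy estimate would feed that noise straight into `lam`, which is the failure mode the paper sets out to control.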

[LG-38] Bias in Filter Feature Selection Evaluation: A Meta-Analysis of Datasets Baselines and Experimental Design Choices

Link: https://arxiv.org/abs/2606.07068
Authors: Malick Ebiele, Malika Bendechache, Rob Brennan
Subjects: Machine Learning (cs.LG)

Abstract:Background: Since 1990 many feature selection methods have been proposed across heterogeneous applications. To validate the usefulness of a new method, it needs to be compared against at least one baseline method from the existing literature on a feature selection task using at least one dataset. Recent developments in tabular Deep Learning (DL) and data valuation in Machine Learning (ML) suggest that the evaluation of new methods, algorithms, and models may be consciously or unconsciously biased. We hypothesise that a similar trend exists in feature selection (FS), particularly in filter feature selection (FFS). The aim of this study is therefore to examine FFS studies to identify factors that influence the evaluation and that might constitute entry points for bias, in order to recommend stronger principles for FFS evaluation. Methods: We analyse a sample of 28 high-profile FFS studies published between 1994 and 2025. The analysis provides reflections on how to examine FFS studies, highlights lessons learned throughout the process, and gives five evidence-based recommendations for future FFS evaluation. Results: Multivariate Linear Regression analysis achieved a score of R^2=0.33. This means that 33% of the variance in the performance of new methods against chosen baselines (win rate) is explained by the number of datasets (#Datasets), the number of baselines (#Baselines), and the number of new methods (#NewMethods). Discussion: R^2=0.33 is considered a medium level of explanation, which is promising given that this is the first such study. The medium level of explanation arises because win rate is influenced by additional factors such as the maturity of the feature selection domain and the type of datasets and baselines, and because of the simplicity of the regression model used to explain the relationship.

[LG-39] Hierarchical Forecast Reconciliation for Urban Rail Transit Demand Prediction under Operational Disruptions

Link: https://arxiv.org/abs/2606.07044
Authors: Dang Viet Anh Nguyen, Alma Fazlagic, Kristine Pryds Loft, Filipe Rodrigues
Subjects: Machine Learning (cs.LG)
Comments: 33 pages, 6 figures, 16 tables

Abstract:Accurate and coherent passenger demand forecasting is essential for Urban Rail Transit (URT) operations. Passenger demand has a hierarchical structure in which origin-destination (OD) flows aggregate to station-level inflows and outflows through conservation constraints. In practice, station-level and OD-level forecasts are often generated independently, producing incoherent predictions that violate these constraints and introduce inconsistencies into operational decision-making. Such issues become more severe during disruptions, when forecasting reliability is most critical. This paper presents the first hierarchical forecast reconciliation framework for joint station-level and OD-level URT demand prediction. A neural Fully Connected Reconciler (FCR) learns a non-linear mapping from incoherent base forecasts to coherent hierarchical predictions while guaranteeing exact structural consistency by construction. The method is benchmarked against OLS, WLS, and Minimum Trace (MinT) variants using Rejsekort smart-card data from the Copenhagen S-train network under one-step, multi-step, and disruption forecasting scenarios. Results show that reconciliation consistently improves OD forecasting accuracy while ensuring hierarchical coherence. Under normal conditions, FCR performs competitively with MinT-based methods. An oracle analysis indicates that perfect station-level forecasts could reduce OD prediction error by up to 34 percent, highlighting the value of improved base forecasts. Under severe disruptions, FCR outperforms classical methods, reducing OD forecasting error by up to 17.45 percent in multi-step destination-side delay scenarios. These findings establish hierarchical reconciliation as an effective mechanism for improving forecast robustness, with the largest benefits occurring under the most challenging operating conditions.
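
To make the conservation constraint concrete, here is a minimal sketch of OLS reconciliation, one of the classical baselines named in the abstract, for the simplest possible hierarchy: one station inflow that must equal the sum of its OD flows. It is not the paper's FCR model, and the function name and example numbers are assumptions. For this hierarchy, OLS reconciliation reduces to an orthogonal projection onto the set of coherent forecasts.

```python
def reconcile_ols(station_total, od_flows):
    """Project an incoherent forecast onto {total == sum(od_flows)}.

    The constraint can be written as c . y = 0 with c = (1, -1, ..., -1),
    so the orthogonal projection subtracts c * (c . y) / (c . c).
    """
    gap = station_total - sum(od_flows)   # c . y: constraint violation
    step = gap / (1 + len(od_flows))      # divide by c . c
    new_total = station_total - step
    new_flows = [flow + step for flow in od_flows]
    return new_total, new_flows
```

For example, a station forecast of 10 with OD forecasts of 4 and 3 violates the constraint by 3; the projection spreads that gap evenly and returns a total of 9 with flows of 5 and 4. Weighted variants such as WLS and MinT distribute the gap according to forecast-error variances instead of evenly.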

[LG-40] CF-JEPA: Mask-free forward prediction with asymmetric encoder utilization for time-series representation learning

Link: https://arxiv.org/abs/2606.07031
Authors: Jaehoon Lee, Sunghyun Sim
Subjects: Machine Learning (cs.LG)

Abstract:Self-supervised learning (SSL) for time-series representation learning is dominated by two paradigms: contrastive methods, which face challenges in constructing positive or negative pairs, and masking-based methods, which disrupt the temporal continuity of time-series signals. Joint-Embedding Predictive Architecture (JEPA) offers a promising alternative by predicting in representation space rather than reconstructing raw inputs. However, existing time-series JEPA variants still rely on masking and therefore inherit its continuity problem. Crop-based Forward JEPA (CF-JEPA) is proposed as an innovative mask-free framework that replaces masking with multi-horizon forward prediction: random crops serve as context views, and short-, mid-, and long-horizon future representations are predicted in the forward temporal direction, directly leveraging the inherent temporal ordering of time-series data as a learning signal. A strong asymmetry is also identified between the online encoder and the exponential moving average (EMA) target encoder, both produced from a single training run: the online encoder develops higher-rank discriminative features, while the EMA target encoder develops smoother, lower-rank temporal features. Exploiting this asymmetry, classification is routed to the online encoder and forecasting or anomaly detection to the EMA target encoder, achieving a 27% reduction in multivariate forecasting mean squared error (MSE) at no additional training cost. Across 126 University of California, Riverside (UCR) and 26 University of East Anglia (UEA) classification datasets, eight electricity transformer temperature forecasting benchmarks, and Key Performance Indicator /Yahoo anomaly detection, CF-JEPA achieves the highest average accuracy and rank on UCR and UEA among self-supervised baselines and ranks second on univariate forecasting and k-nearest neighbors-scored anomaly detection.
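
The relationship between the two encoders can be sketched in a few lines. This is a generic exponential-moving-average update plus the task routing described in the abstract, not the CF-JEPA code: parameters are represented as flat lists of floats, and the function names, task labels, and decay value are assumptions.

```python
def ema_update(target, online, decay=0.99):
    """Move each target parameter a small step toward the online parameter."""
    return [decay * t + (1.0 - decay) * o for t, o in zip(target, online)]


def select_encoder(task, online, target):
    """Route a downstream task to an encoder, following the abstract.

    Classification uses the online encoder; forecasting and anomaly
    detection use the EMA target encoder. Task labels are hypothetical.
    """
    if task == "classification":
        return online
    if task in ("forecasting", "anomaly_detection"):
        return target
    raise ValueError(f"unknown task: {task}")
```

Because the target is an average over many past online states, it changes more slowly than the online encoder, which is consistent with the smoother features the abstract reports for it. Both encoders come out of the same training run, so the routing adds no training cost.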

[LG-41] Accelerating Reproducible Research in Synthetic EHR Generation

Link: https://arxiv.org/abs/2606.06990
Authors: Jalen Jiang, Chufan Gao, Ethan Rasmussen, Stephen Z. Xie, Jimeng Sun
Subjects: Machine Learning (cs.LG)

Abstract:The generation of high-fidelity synthetic Electronic Health Records (EHR) is crucial for advancing medical research while preserving patient privacy. However, head-to-head comparison of existing generative models is hindered by disjointed codebases, incompatible data loaders, conflicting library dependencies, and inconsistent evaluation protocols. To address these gaps, we introduce a lightweight, end-to-end benchmarking framework for reproducible synthetic EHR evaluation, organized as a unified pipeline spanning data ingestion, standardized model training, and architecture-agnostic evaluation. Our current implementation targets the generation of longitudinal ICD diagnosis codes – the most commonly studied modality in this literature – and is built on the community-maintained PyHealth library. We reimplement and unify strong baselines (MedGAN, CorGAN, PromptEHR, HALO) under full ICD-9 vocabulary granularity, and add a lightweight GPT-2 baseline from the general-purpose sequence-modeling literature. We contribute a rigorous, architecture-agnostic privacy-utility evaluation suite that applies identically to GAN- and transformer-based generators, and report bootstrapped confidence intervals across all metrics. We further analyze the poor long-tailed performance of existing models and discuss the extensibility of our framework beyond diagnosis codes. By lowering the engineering barrier to running, extending, and evaluating under a single pipeline, we introduce a starting point for community-driven reproducibility and benchmarking synthetic EHR models.

[LG-42] Heterogeneous Effects of Green Finance on Urban Decarbonization: Evidence from 285 Cities in China

Link: https://arxiv.org/abs/2606.06986
Authors: Xueyang Li, Jinlei Ma
Subjects: Machine Learning (cs.LG)

Abstract:While green finance has become a key instrument for low-carbon city transitions, its actual decarbonization effects and transmission mechanisms remain unclear. This study employs econometric models and machine learning-based analysis to examine whether and how green finance reduces city-level carbon intensity. Results show that green finance significantly lowers carbon intensity, with green bonds and green investment having the strongest impacts and evident spatial spillovers. The effects vary by development level, being most pronounced in Fourth- and Fifth-tier cities. Mediation analysis reveals that green finance operates mainly through energy structure optimization, followed by industrial upgrading, foreign direct investment, and technological innovation. SHAP analysis confirms substantial differences across financial instruments, with green bonds, funds, and credit contributing most to decarbonization. Moreover, the marginal impact is stronger in cities with low technological capacity, high industrial dependency, and coal-based energy mixes. These findings provide theoretical support and policy guidance for building a multi-level, regionally differentiated green finance system to promote inclusive low-carbon transitions. Keywords: Green Finance; Carbon Intensity; Decarbonization Effect; Machine Learning; City

[LG-43] Accelerating Multi-Objective Bayesian Optimisation via Predictive-Gradient Catalysts PPSN

Link: https://arxiv.org/abs/2606.06984
Authors: Alma Rahat, Tinkle Chugh, Jonathan Fieldsend, Richard Allmendinger
Subjects: Machine Learning (cs.LG)
Comments: Parallel Problem Solving From Nature (PPSN), 2026

Abstract:This paper presents a general acceleration mechanism for multi-objective Bayesian optimisation (MOBO) that leverages Gaussian process predictive gradients as auxiliary signals. Rather than replacing existing Pareto-compliant acquisition functions, the proposed approach augments them with local stationarity information derived from surrogate-derived gradients, enabling faster convergence toward the global Pareto set under limited evaluation budgets. Two catalyst instantiations are investigated: an adaptive Multiple-Gradient Descent Algorithm-Based Catalyst (MGDA) and a predefined-weight variant that enables focused exploration when budgets are tight. Experiments on the DTLZ benchmark suite (using 2 objectives and 10 decision variables) show that predictive gradient catalysis can deliver significant acceleration compared to other acquisition functions (EHVI, AugTch, tMPoI, SAF) when surrogates are accurate, particularly for stationary problems.
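
For the two-objective case used in the experiments, the Multiple-Gradient Descent Algorithm has a closed form, sketched below. This illustrates the general MGDA min-norm step, not the paper's catalyst: in the paper the gradients would come from the Gaussian process posterior, whereas here they are plain lists, and the function name is an assumption.

```python
def mgda_two_objectives(g1, g2):
    """Return the min-norm convex combination of two gradients.

    Solves min over gamma in [0, 1] of ||gamma * g1 + (1 - gamma) * g2||^2.
    A (near-)zero result signals local Pareto stationarity.
    """
    diff = [a - b for a, b in zip(g1, g2)]
    denom = sum(d * d for d in diff)
    if denom == 0.0:
        # Identical gradients: any weighting gives the same direction.
        return list(g1), 0.5
    gamma = sum((b - a) * b for a, b in zip(g1, g2)) / denom
    gamma = min(1.0, max(0.0, gamma))  # clip to the feasible interval
    direction = [gamma * a + (1.0 - gamma) * b for a, b in zip(g1, g2)]
    return direction, gamma
```

Orthogonal gradients of equal length give equal weights, while exactly opposing gradients give a zero vector, meaning no direction improves both objectives at once. The norm of the returned direction is one way to express the local stationarity information the abstract refers to.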

[LG-44] GenPO: Generative Policy Optimization with Jacobian-free Likelihood Ratios

Link: https://arxiv.org/abs/2606.06967
Authors: Ke Hu, Shutong Ding, Panxin Tao, Jingya Wang, Ye Shi
Subjects: Machine Learning (cs.LG)

Abstract:Generative policies provide expressive and multimodal action distributions, making them attractive for reinforcement learning (RL) in complex continuous-control tasks. Among them, flow-based policies are especially appealing because they generate actions through deterministic transport maps. However, applying such generative policies to likelihood-based on-policy learning remains limited by the difficulty of evaluating the probability of executed actions. Existing flow RL methods either replace the true action-density ratio with approximate surrogates, which can introduce biased updates, or recover exact likelihoods through dummy-action augmentation, which enlarges the policy space and increases computation. In this work, we propose GenPO++, a reversible generative policy optimization framework that uses history states as auxiliary memory in a high-order reversible ODE solver, yielding exact inversion without changing the original action dimension. The resulting generative policy map has a log-determinant determined only by fixed solver coefficients, enabling exact and Jacobian-free likelihood-ratio computation. This design preserves the expressiveness of generative flow policies while avoiding both action ratio bias and dummy-action overhead. We evaluate GenPO++ on large-scale simulated control, fine-tuning, and real-world robotic manipulation tasks, where it achieves competitive or superior performance over state-of-the-art on-policy RL methods, while improving training stability and computational efficiency.

[LG-45] Uniform Stability and Generalization Error of GD and SGD on Fixed-Point Parameters

Link: https://arxiv.org/abs/2606.06934
Authors: Jonghyun Shin, Sejun Park
Subjects: Machine Learning (cs.LG)

Abstract:We analyze generalization error, uniform stability, and uniform argument stability of gradient descent (GD) and stochastic gradient descent (SGD) over discrete parameter spaces, where each update involves deterministic or stochastic rounding. We show that deterministic rounding degrades the generalization error of GD on convex, Lipschitz, and smooth loss functions, increasing the rate from O(T/n) to O(T/\sqrt{n}), and establish matching lower bounds. We further prove that uniform stability of GD becomes \Omega(T), showing that stability-based generalization bounds are vacuous in this setting. In contrast, for the same losses, stochastic gradient descent with deterministic rounding admits nontrivial uniform stability guarantees, which differ qualitatively from the real-valued case and exhibit distinct dependencies on the number of iterations and the dimension: we prove tight bounds O(T/n) for one dimension and O(T^2/n) for higher dimensions. We also show that stochastic rounding can introduce generalization error that increases with the dimension; such a phenomenon is absent in standard real-valued optimization and in the deterministic rounding case. Finally, we provide upper bounds on uniform argument stability for stochastic rounding schemes and show that these bounds are tight when the loss can be represented as a sum of coordinate-wise functions.
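
The two rounding schemes can be sketched as follows for a scalar parameter on a fixed-point grid. This is a generic illustration rather than the paper's exact setup: the round-half-up tie-breaking rule, the grid spacing, and the function names are assumptions.

```python
import math
import random


def round_deterministic(value, step):
    """Round to the nearest multiple of `step` (ties round up; assumed rule)."""
    return math.floor(value / step + 0.5) * step


def round_stochastic(value, step, rng):
    """Round up or down at random so that the result is unbiased.

    The probability of rounding up equals the fractional position of
    `value` between its two neighbouring grid points.
    """
    lower = math.floor(value / step)
    frac = value / step - lower
    return (lower + (1 if rng.random() < frac else 0)) * step


def gd_step(weight, grad, lr, step, rounder):
    """One gradient step followed by rounding back onto the grid."""
    return rounder(weight - lr * grad, step)
```

With a grid spacing of 0.25, an update that moves a weight from 0.75 to 0.70 is undone by deterministic rounding, so the parameter does not move. Stochastic rounding instead lands on 0.5 or 0.75 at random with mean 0.70, which keeps the update unbiased at the cost of added variance.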

[LG-46] From Sampled Outcomes to Capability Distributions: Rethinking Supervision for LLM Routing

Link: https://arxiv.org/abs/2606.06924
Authors: Guannan Lai, Haoran Hu, Long Chen, Zhenguo Li, Han-Jia Ye
Subjects: Machine Learning (cs.LG)

Abstract:Existing LLM routing methods typically treat a model’s single response to a query as its capability label for training routers. However, because LLM generation is inherently stochastic, such single-shot supervision provides only a noisy observation of a query-model pair’s behavior rather than a reliable capability estimate. We show that this assumption introduces systematic noise into routing supervision, making learned routing policies less reliable. To address this issue, we propose DARS (Distribution-Aware Routing Supervision), a framework that constructs routing supervision from a distributional view of model behavior. Instead of relying on a single generated response, DARS considers uncertainty from both the input side and the output side, capturing how semantically equivalent query formulations and stochastic generations affect model performance. Based on these distribution-aware observations, DARS builds more reliable supervision signals for routing. Experiments across diverse tasks show that single-shot labels can be misleading for model selection, while distribution-aware supervision provides more stable labels and improves learned routing behavior. Our results suggest that reliable LLM routing should move beyond single-response observations and be grounded in query-level model capability distributions.
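
The difference between single-shot and distribution-aware labels can be shown with a toy sketch. It is not the DARS procedure, whose exact aggregation the abstract does not specify: the plain average over paraphrases and samples, the 0/1 correctness encoding, and the function names are assumptions.

```python
def single_shot_label(outcomes):
    """Use the first sampled response to the first query formulation."""
    return outcomes[0][0]


def capability_estimate(outcomes):
    """Average correctness over query paraphrases and sampled generations.

    `outcomes[i][j]` is 1 if sample j for paraphrase i was correct, else 0.
    """
    flat = [result for row in outcomes for result in row]
    return sum(flat) / len(flat)
```

Suppose a model is sampled four times on each of two paraphrases and is correct in 2 of the 8 attempts, one of which happens to be the first. The single-shot label then marks the model as capable, while the averaged estimate is 0.25, so a router trained on the single-shot label would overrate the model on this query.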

[LG-47] TALAN: Task-Aligned Latent Adaptation Networks for Targeted Post-Training of Large Language Models

Link: https://arxiv.org/abs/2606.06902
Authors: Chengkai Zhang, Ziteng Liu, Junpu Wang, Zeyi Tao, Yang Wang, Sagar Chordia, Qin Huang
Subjects: Machine Learning (cs.LG)

Abstract:Targeted post-training aims to improve reasoning, math, and code without degrading strengths. Low-rank adapters are efficient but task-global; activation interventions are input-aware but often require separate probes, vectors, or inference-time steering. We introduce TALAN (Task-Aligned Latent Adaptation Networks), a sequence-conditioned latent side path inserted into a transformer’s residual stream and co-trained with a low-rank adapter in one SFT loop. TALAN compresses the active sequence into latent memory, remixes it into token-level perturbations, and writes them back through a controlled residual update. It is configured along six axes: insertion location, memory size, mixer, writeback rule, trainability scope, and gradient scale. Across four Qwen3-family backbones and four STEM/code benchmarks, TALAN improves matched LoRA and DoRA baselines. With LoRA, it yields a +1.41 point cross-model mean gain, positive on all four backbones and non-negative on all 16 model-benchmark cells. With DoRA, it yields a +1.85 point mean gain, positive on all backbones and on 13 of 16 cells. Paired seed checks support positive average effects but show nontrivial variance, so we treat them as sensitivity checks. Cost is small: 1% trainable parameters relative to the backbone and 1.01-1.02x inference overhead versus matched LoRA. A Llama-3.2-1B transfer probe is also positive under LoRA and rsLoRA across seven paired seeds, supporting a transfer beyond Qwen. Internal-state analyses suggest TALAN is a small complementary activation intervention. The matched adapter update is 80-1,700x larger than the TALAN perturbation, yet their directions have near-zero cosine; per-layer measurements show this small orthogonal perturbation propagates and amplifies through depth. TALAN offers a practical platform for studying steerable activation-level adaptation within standard adapter-based post-training. 

[LG-48] GRASP: Geometry-aware Residual Alignment for Scalable Pretraining Data Attribution

Link: https://arxiv.org/abs/2606.06892
Authors: Yue Min, Ruining Chen, Yujun Li
Subjects: Machine Learning (cs.LG)

Abstract:Scalable data attribution methods typically assign isolated utility scores to individual training examples. This prevalent additive assumption fundamentally fails to capture critical subset dynamics, including data redundancy and complementary coverage. In this work, we reframe attribution as subset-level counterfactual utility prediction and introduce GRASP, an interaction-aware surrogate. Grounded in a theoretical smoothness lower bound, GRASP explicitly models subset interactions through a quadratic geometric penalty. To achieve pretraining-scale efficiency without relying on hidden oracle tuning, we couple low-dimensional feature sketches with a strictly finite lower-confidence bound selection protocol. Extensive subset-retraining evaluations demonstrate that GRASP decisively outperforms existing scalable baselines. It more than doubles the task-level rank correlation for counterfactual subset fidelity while reducing upfront artifact construction costs by nearly an order of magnitude. Downstream diagnostics further show that this scoring mechanism transfers to language model curation and cross-domain vision selection, establishing a robust foundation for optimizing massive pretraining corpora.

[LG-49] Data-Constrained Language Model Pretraining: Improved Regularization and Scaling Laws

Link: https://arxiv.org/abs/2606.06888
Authors: Zhiwei Xu, Shihao Wu, Hanseul Cho, Wei Hu, Yixin Wang
Categories: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Classical scaling laws for language model pretraining balance model size against training dataset size under a fixed compute budget, assuming abundant data and a single pass over the corpus. As training compute grows faster than the supply of natural language data, pretraining is likely to enter a data-constrained, compute-rich regime where models train for multiple epochs over a finite dataset. We study data-constrained pretraining along two axes, regularization and scaling. For regularization, we study masked-input regularization (MIR), an auxiliary next-token prediction loss on randomly masked inputs. MIR tests whether the random masking central to diffusion language models can benefit autoregressive pretraining without architectural changes or inference overhead. Across 72M to 1.4B parameter models, we find that MIR added on top of strong weight decay improves validation loss over autoregressive strong-weight-decay-only models, with downstream gains at 1.4B. For scaling, we propose SoftQ, a scaling law that couples model size and data size to capture their interaction under repeated data. Classical alternatives such as the Chinchilla law use an additive form that decouples these terms, making them misspecified in the data-constrained regime. We find that SoftQ fits data-constrained experiments substantially better than these alternatives, and estimates MIR’s gains as equivalent to roughly 1.3 times as much unique training data. We release our code at this https URL.
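
For reference, the additive Chinchilla-style form that the abstract contrasts with SoftQ is conventionally written as below. The symbols $E$, $A$, $B$, $\alpha$, and $\beta$ are the usual fitted constants and are not taken from this abstract; $N$ denotes model parameters and $D$ training tokens. The SoftQ form itself is not stated in the abstract, so it is not reproduced here.

```latex
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

The model-size and data-size terms enter as separate summands, which is the decoupling the abstract describes as misspecified once data is repeated over multiple epochs.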

[LG-50] GlucoFM-Bench: Benchmarking Time-Series Foundation Models for Blood Glucose Forecasting

Link: https://arxiv.org/abs/2606.06881
Authors: Baiying Lu, Zhaohui Liang, Ryan Pontius, Shengpu Tang, Temiloluwa Prioleau
Categories: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Blood glucose forecasting models are foundational for modern diabetes management systems, as reliable short-term predictions can enable proactive interventions, support automated insulin delivery, and reduce the risk of hypo- and hyperglycemic events. From a modeling perspective, glucose forecasting poses unique challenges due to heterogeneous physiological dynamics across diabetes populations. Traditional machine learning and deep learning models have been extensively evaluated for glucose prediction, yet recent time-series foundation models (TSFMs) remain much less studied in this setting. To bridge this gap, we present GlucoFM-Bench, a comprehensive benchmark evaluating state-of-the-art TSFMs alongside supervised deep learning models for blood glucose forecasting. We assess eight representative architectures, including pre-trained TSFMs, time-series large language models, and task-specific deep learning models, across 15 publicly available diabetes-relevant datasets comprising 1,117 individuals with type 1 diabetes, type 2 diabetes, prediabetes, and no diabetes. Models are evaluated under zero-shot, few-shot, and full-shot protocols, with systematic variation in context length and prediction horizon. Across datasets, pre-trained TSFMs, especially Chronos-2 and TimesFM, show strong zero-shot and few-shot transfer, with the best zero-shot model performing within 5% of the best full-shot supervised model. Yet, when task-specific data are abundant, a lightweight LSTM remains strongest, outperforming TSFMs by 4–21% under full-shot training. Stratified analyses reveal persistent challenges in T1D cohorts and hypo-/hyperglycemic ranges, highlighting the need for evaluation beyond aggregate error metrics. Together, GlucoFM-Bench provides a standardized and reproducible foundation for evaluating, comparing, and improving foundation models for blood glucose forecasting.

[LG-51] Evidence-Grounded Ensemble Diagnosis of 802.11 Packet Captures: A Multi-Stage Pipeline with Deterministic Reliability Scoring

Link: https://arxiv.org/abs/2606.06871
Authors: Jerome Henry, Swadhin Pradhan, Miroslav Popovic
Categories: Machine Learning (cs.LG)
*Comments: 37 pages, 9 figures, 9 tables

Click to view abstract

Abstract:Diagnosing 802.11 packet captures requires expert protocol knowledge, is slow, inconsistent across engineers, and unscalable. LLM-based approaches sound plausible but fabricate protocol events absent from captures (especially truncated traces), produce uncalibrated confidence scores, and suffer evaluation bias when golden references are co-produced by the model under test. We introduce PROBE (Protocol Reasoning Over evidence-Based Ensembles), a multi-stage pipeline addressing all three failures. It integrates (i) deterministic PCAP-to-text normalization with frame-level verifiability, (ii) multi-run, multi-candidate ensembles with optional cross-model second opinion and progressive obfuscation, (iii) a verdict-aware evidence framework treating absence of failure evidence as contributing evidence, and (iv) a fully deterministic composite reliability score from evidence validity, run-to-run stability, and cross-model agreement without LLM self-assessment. On 87 enterprise Wi-Fi captures (104 capture-reviewer pairs), single-pass LLM analysis raises weighted evidence F1 from 0.871 (expert baseline) to 0.912 but misses critical frames in 35% of cases. Naive ensemble voting drops below baseline (0.842) as majority voting amplifies conservative verdicts: 50% of confirmed failures are misclassified as ‘no issue’ or ‘insufficient evidence.’ Adding evidence-grounded reconciliation achieves 0.957 F1, a 96% auto-accept rate, and a worst-case floor above 0.70. LLM self-reported confidence clusters at 0.95 regardless of difficulty (71% report exactly 0.95), confirming it is uninformative. We also introduce a model-agnostic evaluation framework using per-field assertion matching, eliminating circular bias from model-co-produced golden references. 
Comments: 37 pages, 9 figures, 9 tables Subjects: Machine Learning (cs.LG) ACM classes: C.2.1; I.2.1; C.4 Cite as: arXiv:2606.06871 [cs.LG] (or arXiv:2606.06871v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2606.06871

[LG-52] Product units in gated recurrent units improve nuclear-mass prediction ICCS2026

Link: https://arxiv.org/abs/2606.06866
Authors: Ziyuan Li, Paulo S.A. Freitas, John W. Clark, Babette Dellen
Categories: Machine Learning (cs.LG); Nuclear Theory (nucl-th)
*Comments: Accepted at ICCS 2026

Click to view abstract

Abstract:The prediction of masses of atomic nuclei using machine learning can complement theoretical models and advance the exploration of poorly known domains of the nuclear chart. We propose a machine learning technique based on gated recurrent units (GRU), which have demonstrated competitive performance in nuclear-mass prediction by exploiting long-term dependencies. By integrating multiplicative interactions and product-unit transformations within recurrent units, we report significant improvements in nuclear-mass prediction. Computations are performed in the complex domain to jointly capture amplitude and phase dynamics. For interpolation and temporal-extrapolation tasks based on the atomic mass evaluation (AME2016 and AME2020), the complex additive-multiplicative product-unit gated recurrent unit (AM-PU-GRU) model consistently achieves the lowest prediction errors, with an interpolation RMSE of $0.227 \pm 0.004$ MeV and an extrapolation RMSE of $0.179 \pm 0.015$ MeV. These results surpass other state-of-the-art machine learning models and also outperform the real-valued GRU baseline and product-unit ablation variants, while remaining robust to different theoretical priors, including WS4 and SEMF. Our findings establish complex-valued product-unit recurrent networks as a new benchmark for sequence-based nuclear-mass prediction.

[LG-53] The Geometry of Last-Layer Model Stealing

Link: https://arxiv.org/abs/2606.06854
Authors: Snigdha Chandan Khilar
Categories: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:This paper uses geometry to explain how a machine learning model can be stolen using an already existing well-known method. The author has shown the exact conditions required to perfectly copy the final layer of a transformer network. When looking deeper into the hidden layers the author has explained clear limits. The author has also demonstrated that a hidden network cannot be fully reverse engineered just by looking at the final results. The research clearly maps out what can and cannot be stolen from a model.

[LG-54] Learning Fair Demand Models

Link: https://arxiv.org/abs/2606.06830
Authors: Adam N. Elmachtoub, Hyemi Kim, Jonathan Y. Tan
Categories: Computers and Society (cs.CY); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Data-driven pricing is increasingly prevalent in sectors such as airlines, lending, insurance, and retail. By learning demand models from customer features and setting prices accordingly, these systems may generate discriminatory outcomes that raise fairness concerns. This leads to fundamental questions - how and where should systems incorporate fairness considerations in the pricing pipeline, and how does it ultimately affect societal outcomes? To answer these, we study a stylized model where a seller has a two-stage decision pipeline comprising linear demand model estimation followed by price optimization. The seller considers fairness notions in training loss, price, and demand, under both parity-wise and Rawlsian perspectives. We show that equalizing training loss across consumer groups leads to multiple solutions, which in turn can result in undesirable outcomes despite being a standard approach in fair machine learning. Focusing instead on fairness applied directly to prices or demand, we compare two strategies that enforce fairness in either the demand estimation stage or the price optimization stage. For parity-wise fairness, we characterize when each strategy yields higher social welfare under small fairness levels. We show that when market sizes and prices in the dataset are similar, imposing price fairness in the estimation stage is more beneficial to consumers, whereas imposing demand fairness in the optimization stage yields better consumer outcomes. For Rawlsian fairness, the two strategies coincide exactly. Lastly, we extend our model to alternate demand functions and conduct a case study using real-world vaccine pricing data.
Subjects: Computers and Society (cs.CY); Machine Learning (cs.LG) Cite as: arXiv:2606.06830 [cs.CY] (or arXiv:2606.06830v1 [cs.CY] for this version) https://doi.org/10.48550/arXiv.2606.06830

[LG-55] Architecture Shapes Transfer Specificity in Implicit Neural Representations

Link: https://arxiv.org/abs/2606.06827
Authors: D Yang Eng
Categories: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Transfer in coordinate networks is often measured by warm-start gain, but whether that gain reflects source-specific structure or generic weight reuse is less clear. We study this question across three implicit neural representation (INR) families, SIREN, ReLU MLPs, and Fourier-feature MLPs, using controlled analytic tests, a 2D lid-driven-cavity Navier–Stokes benchmark, and 1D PDE reference-solution suites for heat, viscous Burgers, and focusing cubic NLS. The analytic tests use independent-seed random controls, while the PDE benchmarks use alternate same-family source controls and auxiliary ablations. Across settings, transfer magnitude and transfer specificity separate clearly. In a 10-seed controlled 1D geometric test, Fourier Features show the largest structured transfer ($33.1\times$), followed by SIREN ($23.0\times$) and ReLU ($10.7\times$), but ReLU is far more selective: random-control transfer is $0.41\times$ for ReLU versus $14.24\times$ for SIREN. On a controlled two-parameter 1D family, the ranking changes: ReLU gives the clearest structured-versus-control separation at default settings, whereas Fourier Features improve only after bandwidth retuning. In Navier–Stokes and the broader 1D PDE suite, no single architecture dominates every equation, yet the same pattern remains: SIREN often reuses weights broadly, whereas ReLU and, in some equations, Fourier Features are more source-selective. Static diagnostics remain weak, and the heuristic scaling law $A_{\text{transfer}} \propto 1/\Delta t^2$ is rejected in the implemented 1D audit. These results position transfer specificity as a useful diagnostic for coordinate networks and suggest that architecture selection in scientific machine learning should be evaluated under explicit control conditions, not by transfer magnitude alone.
Subjects: Machine Learning (cs.LG) Cite as: arXiv:2606.06827 [cs.LG] (or arXiv:2606.06827v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2606.06827

[LG-56] Terastal: Layer-Variant-based Scheduling for Real-Time Multi-DNN Workloads on Heterogeneous Accelerators

Link: https://arxiv.org/abs/2606.06818
Authors: Sing-Yao Wu, Fengshuo Song, Eli Bozorgzadeh
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*Comments: 8 pages, 6 figures. Accepted by RTCSA 2026. Author accepted manuscript

Click to view abstract

Abstract:Heterogeneous DNN accelerators improve soft real-time multi-DNN execution by mapping each layer to its preferred accelerator to reduce latency. However, under skewed workloads, large layer-latency differences across accelerators limit scheduling flexibility and increase deadline misses. To address this challenge, we introduce layer variants, customized layer implementations that reduce latency gaps on non-preferred accelerators. We then present Terastal, a soft real-time framework for layer-variant design and scheduling on heterogeneous DNN accelerators. Terastal combines offline heterogeneity-aware virtual budget assignment and layer-variant design, and online scheduling to jointly optimize accelerator mapping and variant selection under timing and accuracy constraints. Experimental results show that Terastal reduces deadline miss rate per model by 40.58%, 30.53%, and 36.27% compared with FCFS, EDF, and DREAM, respectively, while incurring only 2.24% average normalized accuracy loss across models with variants.

[LG-57] AMD-FCG: An Enhanced Function Call Graph Dataset with Integrated Topological Features for Malware Detection and Classification

Link: https://arxiv.org/abs/2606.06815
Authors: Parthajit Borah, Sakshi Singh, D.K. Bhattacharyya, J.K. Kalita
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:As malware illustrates a complex structure and behavior, detection of these has been a significant challenge in the domain of cybersecurity along with related services in daily life. So, it becomes crucial to have a reliable and adaptive solution to address the issue. Among the several detection methods developed over the years, one of the most reliable ones is studying and analyzing the structural and behavioral patterns of malware. These patterns of sophisticated malware can be obtained with the help of Function Call Graphs (FCGs). However, to effectively cover numerous groups of families of malware, it is required to have a sufficiently large dataset for the system to operate on. In order to ensure accuracy and robustness of the system, the dataset should comprise samples of different malwares and a benign application for secure execution of the detection process. This paper introduces AMD-FCG, an enhanced Function Call Graph dataset integrated with topological features of malwares. The framework enhances the detection procedure, streamlining the workflow for cybersecurity professionals and also eliminating the need for dynamic analysis and extensive processing. Therefore, it can be used to develop and deploy more efficient and innovative malware detection systems.

[LG-58] Interpreting Learning Under Competing Models: Joint and Stepwise Approaches for Dynamic Cognitive Diagnosis

Link: https://arxiv.org/abs/2606.06804
Authors: Yawen Ma, Sahoko Ishida, Kate Cain, Gabriel Wallin
Categories: Machine Learning (cs.LG); Applications (stat.AP)
*Comments:

Click to view abstract

Abstract:Digital learning environments record learners’ responses to individual items, making it possible to study the development of specific skills rather than overall scores. Drawing conclusions about learning from these data requires a model that links responses to latent skills and tracks how mastery changes over time. When the skills measured by each item are unknown, the analyst must decide whether to estimate this structure, the Q-matrix, jointly with the learning process, or to establish it first and study learning afterwards. We show that this decision can change substantive conclusions about how learners develop. Using dynamic cognitive diagnostic models, we analyse data from two reading games measuring vocabulary and comprehension from Grade 2 to Grade 3, with item-text embeddings providing prior information for the unknown Q-matrix. A joint analysis and a bias-corrected stepwise analysis agree that most learners move toward mastering both skills, but disagree about how many remain only partially proficient at Grade 3, changing how reading progress would be reported. A simulation study identifies when the two analyses diverge and shows that joint analysis is more reliable when the item-skill structure is uncertain and the item pool changes between grades. We provide R code for both analyses.

[LG-59] Learning All-Terrain Locomotion for a Planetary Rover with Actively Articulated Suspension

Link: https://arxiv.org/abs/2606.06790
Authors: Arthur Bouton, Tristan D. Hasseler, Michael Paton, Travis Brown, Jacob Levy, William Reid, Joshua Martin, Hari Nayar
Categories: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
*Comments: 21 pages, 26 figures

Click to view abstract

Abstract:This paper presents ERNEST, a four-wheeled planetary rover concept equipped with a two-degree-of-freedom Active Gimbal Suspension that combines yaw and roll actuation to enable wheel reconfiguration, steering, and active load redistribution. A single neural network controller, trained to track a desired path across challenging terrain, fully unlocks the capabilities of this actuated suspension system for autonomous obstacle negotiation. A reinforcement learning framework is developed using the high-fidelity DARTS simulation engine, which combines rigid-contact dynamics and Bekker-Wong terramechanics, enabling the emergence of locomotion strategies adapted to loose-soil conditions. To obtain a single unified controller across heterogeneous terrains, a policy consolidation strategy merges the experience of terrain-specialized agents into one neural network, eliminating the need for explicit terrain classification and controller switching. The resulting controller operates on a combination of proprioceptive and exteroceptive feedback, including sparse stereo-derived terrain elevation, chassis attitude, joint states, and force-torque measurements. Zero-shot transfer to the physical rover is achieved through domain randomization, sensor noise injection, and model-to-real system identification. Experimental results demonstrate autonomous traversal of rock fields, a bump trap, a wheel-high step, sand ripples, and sandy slopes. On a 20° sandy slope, the learned controller reduces the cost of transport by 37% on dry sand despite the additional actuation, and achieves superior performance on wet sand where the passive suspension becomes completely immobilized.

[LG-60] Federated Foundation Models over Vehicular Networks

Link: https://arxiv.org/abs/2606.06786
Authors: Kasra Borazjani, Fardis Nadimi, Payam Abdisarabshali, Owen Palinski, Allan Salihovic, Dinh Nguyen, Minghui Liwang, Seyyedali Hosseinalipour
Categories: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*Comments: 8 pages, 4 figures

Click to view abstract

Abstract:This paper presents a forward-looking vision for integrating the emerging multi-modal multi-task federated foundation models (M3T FedFMs) into vehicular networks, with the goal of unifying the expressive power of multi-modal multi-task foundation models (M3T FMs) with the privacy-preserving and distributed learning capabilities of federated learning (FL). Given the largely underexplored nature of this research direction, we first introduce the fundamental training/fine-tuning principles of M3T FedFMs. We then discuss a range of their representative use cases in vehicular networks, illustrating the significant potential of M3T FedFMs to enable next-generation vehicular intelligence. Afterwards, we identify key constraints inherent to vehicular environments that challenge the practical deployment of M3T FedFMs, and articulate a set of forward-looking research directions to address these challenges. Furthermore, through a case study conducted on a real-world vehicular dataset (i.e., Waymo Open Dataset), we demonstrate the promise of M3T FedFMs for vehicular networks and release our implementation to facilitate reproducibility and stimulate research in this emerging area (repository: this https URL)

[LG-61] The Sharp Phase Transition of Tyler's M-Estimator for Robust Subspace Recovery

Link: https://arxiv.org/abs/2606.06782
Authors: Gilad Lerman, Teng Zhang
Categories: Information Theory (cs.IT); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*Comments:

Click to view abstract

Abstract:Robust Subspace Recovery (RSR) aims to identify an underlying $d$-dimensional subspace from a dataset heavily corrupted by outliers. Complexity-theoretic results establish a threshold for the problem’s computational hardness based on the dimension-scaled signal-to-noise ratio (DS-SNR): the problem is SSE-hard when the DS-SNR is strictly less than 1, and solvable via practical algorithms when it is greater than 1 under general position assumptions. However, the exact behavior of practical algorithms at the critical boundary DS-SNR $= 1$ has remained unknown. This work resolves the behavior of Tyler’s M-estimator (TME) at this critical boundary, consequently establishing a sharp phase transition. Specifically, we prove that TME converges exactly to the true subspace for DS-SNR $\geq 1$ under a new stability condition, which is less restrictive than the general position assumptions used in prior literature. Our analysis utilizes a decomposition of the TME iterates within a majorization-minimization framework.
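
As background for the "TME iterates" mentioned above, the standard fixed-point iteration for Tyler's M-estimator can be written as follows. The notation is an assumption made for illustration, not the paper's: $x_1, \dots, x_n \in \mathbb{R}^D$ are the centered data points, $D$ is the ambient dimension, and $\Sigma_k$ is the $k$-th iterate. Any regularization or rank-deficient handling the paper may use at the boundary is not shown.

```latex
\tilde{\Sigma}_{k+1} = \frac{D}{n} \sum_{i=1}^{n}
    \frac{x_i x_i^{\top}}{x_i^{\top} \Sigma_k^{-1} x_i},
\qquad
\Sigma_{k+1} = \frac{\tilde{\Sigma}_{k+1}}{\operatorname{tr}\bigl(\tilde{\Sigma}_{k+1}\bigr)}
```

In subspace-recovery use, the estimated $d$-dimensional subspace is typically read off from the top $d$ eigenvectors of the limiting matrix.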

[LG-62] A Rolling-Window Framework for Churn Prediction and Behavioral Driver Identification

Link: https://arxiv.org/abs/2606.06776
Authors: Muhammad Jawad Mufti, Omar Hammad, Haitham Saleh, Muqaddas Gull
Categories: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Customer churn prediction is a central task in customer analytics, particularly in non-contractual, pay-per-use service environments where disengagement is not explicitly observed and must be inferred from behavioral inactivity. Existing churn prediction approaches often rely on simplified temporal assumptions or single-point representations of customer behavior, which limit their ability to support continuous risk assessment, interpretability, and realistic deployment over time. This study proposes a temporally explicit churn prediction framework that models customer behavior using rolling behavioral windows, enabling repeated and instance-level churn risk estimation as customer activity evolves. Customer behavior is summarized within a fixed 30-day observation window, followed by a 30-day future churn evaluation window, ensuring a clear temporal separation between behavioral evidence and churn outcomes. The framework integrates feature-based and sequence-based learning approaches within a unified temporal design. The proposed approach is evaluated on a large-scale, real-world dataset from a non-contractual service platform. Empirical results demonstrate strong and stable predictive performance, with accuracy reaching 87.6% and ROC-AUC of 0.94 for the feature-based model, while the sequence-based model achieves recall as high as 96.1% by capturing temporal disengagement patterns. Evaluation on future unseen data confirms meaningful robustness under temporal shift, with accuracy remaining above 83% and ROC-AUC exceeding 0.91 without model retraining. Overall, the findings highlight that carefully designed temporal framing, rather than model complexity alone, is critical for achieving robust, interpretable, and deployment-ready churn prediction. The study provides a practical foundation for churn-oriented decision support in dynamic service environments.
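
The window layout described above (a 30-day observation window followed by a 30-day churn evaluation window, repeated as time advances) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function name `rolling_instances`, the `step_days` parameter, the event-count feature, and the rule "churned means no activity in the evaluation window" are all assumptions made for the example.

```python
from datetime import date, timedelta

OBS_DAYS = 30   # observation window length stated in the abstract
EVAL_DAYS = 30  # future churn-evaluation window length stated in the abstract


def rolling_instances(events, start, end, step_days=7):
    """Yield (customer, cutoff, n_events, churned) tuples.

    events: dict mapping a customer id to a list of activity dates.
    A customer with activity in [cutoff - 30d, cutoff) is labeled churned
    when no activity falls in [cutoff, cutoff + 30d).
    step_days and the event-count feature are illustrative assumptions.
    """
    cutoff = start
    while cutoff + timedelta(days=EVAL_DAYS) <= end:
        obs_start = cutoff - timedelta(days=OBS_DAYS)
        eval_end = cutoff + timedelta(days=EVAL_DAYS)
        for customer, dates in events.items():
            observed = [d for d in dates if obs_start <= d < cutoff]
            if not observed:
                continue  # no behavioral evidence in this window, skip
            future = [d for d in dates if cutoff <= d < eval_end]
            yield customer, cutoff, len(observed), len(future) == 0
        cutoff += timedelta(days=step_days)


# Hypothetical toy data: customer "a" stays active, customer "b" goes quiet.
events = {
    "a": [date(2024, 1, 5), date(2024, 1, 20), date(2024, 2, 10)],
    "b": [date(2024, 1, 10)],
}
rows = list(
    rolling_instances(events, date(2024, 1, 31), date(2024, 3, 1), step_days=30)
)
```

Because features only use dates before the cutoff and labels only use dates at or after it, the two windows never overlap, which is the temporal separation the abstract emphasizes.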

[LG-63] Performance Variation in Deep Reinforcement Learning

Link: https://arxiv.org/abs/2606.06746
Authors: Haruto Tanaka, A. Rupam Mahmood
Categories: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Deep reinforcement learning (RL) algorithms often suffer from low run-to-run robustness, manifesting as significant performance variation across independent runs of identically configured agents. Although this issue poses a spectrum of challenges across research and practice, relatively few studies develop methods to evaluate it; RL research instead often reports uncertainty in the estimated mean performance. In this paper, we outline the limitations of conventional uncertainty and variation estimates, particularly their misalignment with purpose and the risk of underreporting. We then propose an alternative percentile-based statistic and visualization method, min-max IPR and run-wise percentile highlighting, respectively. These percentile-based tools are easy to interpret and rely on standard properties of sample percentiles, providing rich information about run-to-run performance variation. We demonstrate this through three case studies. First, we show that LayerNorm and penultimate-layer normalizations narrow performance variation in PPO, whereas the variation is mostly unchanged in SAC. Second, we compare PPO, SAC, TD-MPC, and TD-MPC2, and show TD-MPC exhibits the least variation while being the most data efficient among the four. Finally, in a comparison of DQN and Rainbow on five Atari environments, we show that both algorithms exhibit similar levels of performance variation.

[LG-64] TorchKM: A GPU-Oriented Library for Kernel Learning and Model Selection

Link: https://arxiv.org/abs/2606.06742
Authors: Yikai Zhang, Gaoxiang Jia, Jie Ding, Boxiang Wang
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments: 14 pages, 2 figures

Click to view abstract

Abstract:TorchKM is an open-source library for kernel machines, including support vector machines, kernel logistic regression, and kernel quantile regression, with GPU acceleration. The library features a scikit-learn-style API and is designed to exploit GPU-friendly linear algebra, accelerating the full training and model-selection pipeline through intelligent reuse of matrix operations. Benchmarks show competitive predictive performance together with substantial speedups over standard baselines. Code and documentation are available at this https URL, and the package can be easily installed via PyPI.

[LG-65] Synthics: Synthetic Physics-like Datasets for Machine Learning

Link: https://arxiv.org/abs/2606.06724
Authors: Jari Vepsäläinen
Categories: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Representative data is fundamental in machine learning, as limited data hinders generalisation. Collecting sufficient real-world samples is often infeasible. Synthetic data generation offers a practical solution, but only if the generated data faithfully reflects the structure of real observations. In this paper, a method for generating synthetic regression datasets that structurally resemble physics equations from a given equation corpus is presented. The approach uses a Bayesian Probabilistic Context-Free Grammar to capture the underlying algebraic structure of the corpus, from which novel equations are sampled. To ensure the generated inputs lie within a physically meaningful domain, the applicability domain is characterised for each equation through non-intrusive probing, also recovering inter-variable constraints. Input sampling further mimics realistic experimental conditions by drawing from random sub-ranges of the valid domain with mixed uniform and truncated normal distributions. The generated data is statistically validated against the Feynman equation corpus using Kolmogorov-Smirnov tests. The generated equations match the corpus on all of the eight studied structural features, compared to only two for an unsmoothed purely probabilistic grammar, demonstrating that the Bayesian prior is essential for structural fidelity given the size of the corpus. In a downstream hyperparameter-tuning task, a gradient-boosted regressor tuned on the synthetic data picks, on average, the 6th-best configuration out of 20 on real data, matching the result of tuning on real data itself and substantially outperforming random expression trees (10th) and noise (19th).

[LG-66] Flatland: The Adventures of Gradient Descent with Large Step Sizes ICML2026

Link: https://arxiv.org/abs/2606.06722
Authors: Leonardo Galli, Curtis Fox, Wiebke Bartolomaeus, Mark Schmidt, Holger Rauhut
Categories: Machine Learning (cs.LG)
*Comments: Accepted for the International Conference on Machine Learning (ICML 2026)

Click to view abstract

Abstract:The training of neural networks often entails objective functions that are not globally $L$-smooth. For these functions, it is both theoretically and practically difficult to reply to the question: what is the largest possible step size that ensures the convergence of gradient descent (GD)? We address this longstanding open question in deep learning by providing a unifying definition of “large” step sizes that requires only local Lipschitz (or even Hölder) continuity of the gradient. We design first-order adaptive methods that provably yield large step sizes and show that they operate at the edge of stability (EoS) right from the start of the training. In particular, the loss decreases nonmonotonically and the product between the step size and sharpness, i.e., the largest eigenvalue of the Hessian, stays above the EoS threshold of 2 throughout training. Using our method, we are also able to minimize the sharpness all the way down to its global minimum. Contrary to expectation, we find that encountering globally-flat regions too early in the training may both slow down convergence and jeopardize the generalization ability of the network. Exploiting a self-stabilization argument, we allow GD to enter slightly sharper valleys and turn unsuccessful training runs into very successful ones.
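
The threshold of 2 mentioned above comes from the classical stability limit of gradient descent on a quadratic: with sharpness $\lambda$ and step size $\eta$, each step multiplies the iterate by $(1 - \eta\lambda)$, which shrinks only when $\eta\lambda < 2$. The toy below illustrates just that textbook fact; it does not reproduce the paper's adaptive methods, and the function name and constants are chosen here for illustration.

```python
def gd_on_quadratic(sharpness, step_size, x0=1.0, n_steps=50):
    """Run gradient descent on f(x) = 0.5 * sharpness * x**2.

    Returns |x| after n_steps updates.
    """
    x = x0
    for _ in range(n_steps):
        x -= step_size * sharpness * x  # the gradient of f is sharpness * x
    return abs(x)


below = gd_on_quadratic(sharpness=4.0, step_size=0.45)  # product 1.8 < 2: contracts
above = gd_on_quadratic(sharpness=4.0, step_size=0.55)  # product 2.2 > 2: diverges
```

On a pure quadratic, crossing the threshold means divergence; the abstract's point is that on non-quadratic neural network losses, training can stay above it while the loss still decreases nonmonotonically.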

[LG-67] Towards Serverless Semi-Decentralized Federated Learning with Heterogeneous Optimizers

Link: https://arxiv.org/abs/2606.06687
Authors: Su Wang, Mung Chiang, H. Vincent Poor
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Networking and Internet Architecture (cs.NI); Systems and Control (eess.SY)
*Comments: Under review at IEEE/ACM Transactions on Networking

Click to view abstract

Abstract:We investigate cluster formation, involving the number and composition of clusters, in decentralized federated learning (FL) with heterogeneous machine learning (ML) optimizers. While clustering in centralized FL has enabled scalability and resource savings, its value and development in fully decentralized environments have yet to be explored. Optimizing cluster formation in such environments is challenging, especially due to the complex coupling between network graph structures, local data heterogeneity, and different local ML model optimizers. To address these challenges, we propose serverless semi-decentralized FL (SSD-FL), a methodology requiring no persistent server infrastructure. In SSD-FL, cluster formation occurs via a lightweight, one-time device-to-device (D2D) initialization phase, after which actual ML model training (alongside consensus and convergence processes) is fully serverless. Functionally, SSD-FL segments global rounds into intra-cluster and inter-cluster regimes, ensuring global convergence and consensus through novel “effective loss functions” that integrate device-specific ML optimizers with network graph-based regularization. Next, SSD-FL leverages the consensus gap via the Cheeger inequality to develop an iterative clustering algorithm evaluated against our derived convergence and consensus bounds, which incorporate a unique scoring metric to quantify data and optimizer heterogeneity across devices. Finally, experimental evaluation against three categories of decentralized FL methodologies validate that SSD-FL improves both convergence speeds and communication efficiency across various network graphs, datasets, and local optimizer regimes.
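
For readers unfamiliar with the Cheeger inequality invoked above, one common statement for a connected graph $G$ is given below. The abstract does not say which variant the paper uses, so this form is an assumption: $\lambda_2$ is the second-smallest eigenvalue of the normalized graph Laplacian and $h(G)$ is the conductance (Cheeger constant) of $G$.

```latex
\frac{\lambda_2}{2} \;\le\; h(G) \;\le\; \sqrt{2\,\lambda_2}
```

The inequality ties a spectral quantity, which governs how fast consensus averaging mixes, to a cut quantity, which describes how well the graph separates into clusters.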

[LG-68] Spatiotemporal Imputation with Graph-Informed Flow Matching ICML2026

Link: https://arxiv.org/abs/2606.06682
Authors: Zepeng Zhang, Aref Einizade, Jhony H. Giraldo, Olga Fink
Categories: Machine Learning (cs.LG)
Comments: Accepted at ICML 2026

Abstract:Missing data is a common challenge in spatiotemporal systems, arising in applications such as air quality monitoring and urban traffic management. Traditional machine learning approaches, like recurrent and graph neural networks, rely on iterative propagation, which tends to accumulate errors over time and space. Recent diffusion-based methods mitigate error propagation but require iterative sampling and often depend on problem-agnostic Gaussian priors, limiting both efficiency and effectiveness. To address these limitations, we propose GiFlow, a Graph-Informed Flow Matching framework for spatiotemporal imputation. GiFlow replaces the typical Gaussian prior with a graph-informed prior constructed via spatiotemporal filtering of observable signals, which better aligns the source distribution to the target and thereby simplifies the generation trajectory. The flow field is parameterized by a hybrid vector field model that integrates spatial attention, temporal attention, and spatiotemporal propagation, enabling joint modeling of spatial and temporal dependencies. Extensive experiments on both synthetic and real-world datasets demonstrate that the proposed GiFlow outperforms the state-of-the-art approaches in spatiotemporal imputation. The code is available at this https URL.

[LG-69] Uncertainty-Aware LLM-Guided Policy Shaping for Sparse-Reward Reinforcement Learning

Link: https://arxiv.org/abs/2606.06673
Authors: Ujjwal Bhatta, Utsabi Dangol, Sumaly Bajracharya, Rodrigue Rizk, KC Santosh
Categories: Machine Learning (cs.LG)
Comments: Accepted to the 2026 IEEE Conference on Artificial Intelligence (IEEE CAI). 6 pages, 3 figures. Code available at: this https URL

Abstract:Sparse rewards and heterogeneous task sequences remain persistent challenges in Reinforcement Learning (RL), often resulting in slow convergence, weak generalization, and inefficient exploration. We propose Uncertainty-Aware LLM-Guided Policy Shaping (ULPS), a novel framework that integrates a calibrated Large Language Model (LLM) into the RL training loop to provide structured, uncertainty-modulated behavioral guidance. ULPS employs an A*-based oracle to synthesize optimal symbolic trajectories, which are used to fine-tune a BERT-based language model. During training, this model supplies action suggestions whose influence is conditioned on epistemic uncertainty estimated via Monte Carlo (MC) dropout. An entropy-based blending mechanism adaptively balances LLM guidance and the learned policy (via Proximal Policy Optimization, PPO), allowing the agent to prioritize reliable priors while preserving adaptability. We evaluate ULPS on the MiniGridUnlockPickup benchmark and observe consistent improvements in success rate, reward efficiency, and sample complexity over unguided, uncalibrated, and standard RL baselines. ULPS achieves more than 9% improvement in execution accuracy after fine-tuning, requires fewer environment interactions, and yields higher reward AUC. Our results demonstrate that integrating symbolic A* trajectories, pretrained language priors, and uncertainty-aware control offers a principled and effective approach to multi-task reinforcement learning in sparse-reward domains, with potential extensibility to partially observable and multi-agent settings.

[LG-70] Explainable Runtime Dependency Tracking for AI-RAN Conflict Monitoring

Link: https://arxiv.org/abs/2606.06663
Authors: Christie Djidjev, Nicholas Kaminski
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Future AI-integrated Radio Access Networks (AI-RAN) will combine open programmability with learning-enabled xApps, rApps, and control functions that act on shared parameters and key performance indicators (KPIs). For conflict monitoring, it is not enough to know which applications are deployed; the system must also know whether the parameter–KPI dependencies assumed by runtime diagnosis remain valid under the current operating regime. This paper studies a lightweight monitoring primitive for that purpose: tracking an interpretable dependency representation from streaming telemetry events. We represent active dependencies by a Boolean matrix and use Boolean matrix multiplication to check whether recent parameter-activity and KPI-response events are consistent with the current estimate. We propose a sliding-window inference procedure that reuses the estimate when it remains consistent and recomputes it when recent observations indicate structural change. The tracker is intended as an explainable signal for conflict diagnosis and slow-loop model refresh, not as an autonomous mitigation mechanism. Experiments on controlled Boolean event streams show efficient and accurate tracking under dependency changes and Boolean observation noise.
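
As a rough illustration of the consistency check described in this abstract, the sketch below treats the dependency estimate as a Boolean matrix `D` (rows are KPIs, columns are parameters) and tests whether every event in a window agrees with the Boolean matrix-vector product. The names `bool_matvec` and `is_consistent`, the event encoding, and the exact-match criterion are assumptions made for this example; the paper's own procedure may differ.

```python
def bool_matvec(D, x):
    """Boolean product: KPI k responds if any active parameter it depends on is set."""
    return [any(d and xi for d, xi in zip(row, x)) for row in D]


def is_consistent(D, window):
    """True if every (parameter activity, KPI response) pair matches the estimate D."""
    return all(bool_matvec(D, x) == list(y) for x, y in window)


# Hypothetical estimate: KPI 0 depends on parameters 0 and 2, KPI 1 on parameter 1.
D = [
    [True, False, True],
    [False, True, False],
]

# Hypothetical sliding window of (parameter activity, observed KPI response) events.
window = [
    ([True, False, False], [True, False]),
    ([False, True, False], [False, True]),
]

reuse_estimate = is_consistent(D, window)

# An event where parameter 2 is active but KPI 0 does not respond contradicts D,
# which in a tracker of this kind would trigger recomputing the estimate.
drifted = window + [([False, False, True], [False, False])]
needs_recompute = not is_consistent(D, drifted)
```

In this toy setup the estimate is reused for the first window and flagged for recomputation once the contradicting event arrives.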

[LG-71] Capturing non-Markovian dynamics in non-equilibrium stochastic systems using flow matching

Link: https://arxiv.org/abs/2606.06658
Authors: Bhargav Sriram Siddani, John B. Bell, Alejandro L. Garcia, Ishan Srivastava
Categories: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech); Computational Physics (physics.comp-ph)
Comments: 5 pages, 1 figure, Accepted to 2026 Conference on Physics and AI (PAI26)

Abstract:Hydrodynamic models of stochastic particle systems represented by coarse-grained stochastic partial differential equations (SPDE), such as the regularized Dean-Kawasaki (DK) equation, do not accurately capture the short-time system dynamics that is dominated by non-Markovian effects, and low particle density regimes where the distributions are highly non-Gaussian. We develop a generative flow matching method that directly models the probability distribution of fluxes from particle simulations that explicitly incorporates non-Markovian and non-Gaussian effects. As a demonstration, we use this method to simulate the Kramers first passage time problem for a system of non-interacting Brownian particles. We show the model accurately captures the short-time behavior and provides better predictions of the statistical moments of the number density when compared against the solution of the Markovian baseline, regularized DK equation.

[LG-72] The Identity Trap in EEG Foundation Models: A Diagnostic Audit

Link: https://arxiv.org/abs/2606.06647
Authors: Jun-You Lin, Ying Choon Wu, Tzyy-Ping Jung
Categories: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Comments: 28 pages, 6 figures, 8 tables. Code available at this https URL

Abstract:Objective. EEG foundation models (FMs) report strong accuracy on clinical resting-state EEG. However, high accuracy under subject-disjoint cross-validation remains ambiguous: it can reflect a genuine clinical biomarker, or subject-identity features that correlate with the label. We name this the Identity Trap and ask whether it can be diagnosed at the representation level before fine-tuning. Approach. We propose FMScope, a frozen-representation protocol packaging five diagnostics: variance decomposition, subject-axis erasure, aperiodic 1/f ablation, layer-wise label probing, and within-subject direction consistency. We apply it to three pretrained FMs (LaBraM, CBraMod, REVE) across four datasets in a 2x2 layout: subject relation of label x presence of a consensus cross-subject EEG marker. Main results. (i) The Identity Trap is universal: frozen subject-variance is 13-89x a random null in 12/12 pairs, rising in all 12 under fine-tuning (+10 to +63 pp). This dominance is a removable linear axis: erasing it improves label decoding where the label varies within subject (+6 to +12 pp in primary cells; +4 to +27 pp across external cohorts). (ii) Aperiodic 1/f is one subject carrier: removing it drops the subject probe by 9-19 pp on LaBraM and CBraMod. REVE saturates subject identity without measurable aperiodic dependence. (iii) Fine-tuning amplifies label-variance only in cells with a literature-established cross-subject marker. Significance. The Identity Trap is a physically-grounded instance of shortcut learning: the preferred cue has a measurable physiological component, and subject-disjoint splitting alone cannot rule it out. FMScope separates gains reflecting a biological marker from those reflecting subject identity.

[LG-73] Principles and Practice of Deep Representation Learning: or a Mathematical Theory of Memory

Link: https://arxiv.org/abs/2606.06624
Authors: Sam Buchanan, Druv Pai, Peng Wang, Yi Ma
Categories: Machine Learning (cs.LG)
Comments: version 2; TeX source and supplementary material at this https URL

Abstract:In the current era of deep learning and especially generative models, there is significant investment in training very large generative models. Thus far, such models have been “black boxes” that are difficult to understand in the sense that they have opaque internal mechanisms, leading to difficulties in interpretability, reliability, and control. Naturally, this lack of understanding has led to both hype and fear. This book is an attempt to “open the black box” and understand the mechanisms of large deep networks, through the perspective of representation learning, which is a major factor - arguably the single most important one - in the empirical power of deep learning models. A brief outline of this book is as follows. Chapter 1 will summarize the threads that underlie the whole text. Chapters 2, 3, 4, 5, and 6 will explain the design principles of modern neural network architectures through optimization and information theory, reducing the process of architecture development (long having been described as a sort of “alchemy”) to undergraduate-level linear algebra and calculus exercises once the underlying principles are introduced. Chapters 7 and 8 will discuss applications of these principles to solve problems in more paradigmatic ways, obtaining new methods and models which are efficient, interpretable, and controllable by design, and yet no less - sometimes even more - powerful than the black-box models they resemble. Chapter 9 will discuss potential future directions for deep learning, the role of representation learning, as well as some open problems.

[LG-74] Gaussian Process Latent Factor Regression for Low-Data High-Dimensional Output Problems

Link: https://arxiv.org/abs/2606.06576
Authors: Edward T. Stevenson, Eric T. Wolf, Mei Ting Mak, N. J. Mayne, Miles Cranmer
Categories: Machine Learning (cs.LG); Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (stat.ML)
Comments: 9 pages content + 22 pages appendix/references. Supporting code at this https URL

Abstract:In the sciences, regression tasks often require predicting high-dimensional outputs from few training examples. Multi-output Gaussian processes excel in low-data regimes but typically struggle with high-dimensional outputs. Compress-then-predict pipelines such as PCA-GP (principal component analysis plus Gaussian process regression) handle high dimensionality, but rely on bases optimized for reconstruction rather than prediction. To address this gap, we propose a model that represents each output as a linear-Gaussian decoding of a low-dimensional latent state drawn from a Gaussian process prior. By analytically marginalizing the decoder weights, we couple compression and prediction in a single objective that scales to high-dimensional outputs. We refer to this model as Gaussian process latent factor regression (GPLFR). We demonstrate GPLFR by building the first spatially resolved emulator of global climate models for rocky exoplanets.

[LG-75] Skip a Layer or Loop It? Learning Program-of-Layers in LLMs ICML2026

Link: https://arxiv.org/abs/2606.06574
Authors: Ziyue Li, Yang Li, Tianyi Zhou
Categories: Machine Learning (cs.LG)
Comments: Accepted at ICML 2026. Substantially extends arXiv:2507.07996. Code: this https URL

Abstract:Large language models (LLMs) perform inference through a fixed-depth, fixed-order, non-recurrent execution of all layers. We reveal the wide existence of training-free, flexible, dynamic program-of-layers (PoLar), where pretrained layers can be packed as modules and then skipped or looped to form a customized program for each input. For most inputs, substantially shorter program executions can achieve the same or better accuracy, while incorrect predictions of the original LLM can be corrected by alternative programs with fewer layers. These observations indicate that inference admits multiple valid latent computations beyond the standard forward pass. To efficiently achieve PoLar in practice, we propose a lightweight PoLar prediction network, which learns to generate execution programs that dynamically skip or repeat pretrained layers for each input. Experiments on mathematical reasoning benchmarks demonstrate that PoLar consistently improves accuracy over standard inference and prior dynamic-depth methods, often while executing fewer layers, and that these gains persist under out-of-distribution evaluation. Our results suggest that fixed-depth execution captures only a narrow subset of an LLM’s latent reasoning capacity.

[LG-76] Are you sure? A Comprehensive and Comprehensible Survey of Uncertainty Quantification in Symbolic Regression

Link: https://arxiv.org/abs/2606.06567
Authors: Julia Reuter, Fabricio Olivetti de Franca
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Symbolic regression (SR) is a class of methods that systematically explore the space of mathematical functions to discover models that accurately capture the underlying relationships in a dataset. Despite recent advances in the field, a lack of support for uncertainty quantification (UQ) limits its adoption in real-world decision processes. In regression analysis, UQ provides important information about the model reliability, which can both help to avoid overfitting by accounting for uncertainty in the data, and provide insights for decision-making. This survey is the first to clearly address this issue, with the objective of introducing essential UQ concepts and reviewing the current literature on UQ in SR, which can be broadly organized into three research directions: frequentist, Bayesian, and model selection. Despite its importance, UQ in SR is still underexplored, which motivates further research into reliable UQ methods for SR.

[LG-77] Depth over Fidelity in Fixed-Budget Noisy Evolution Strategies ICML2026

Link: https://arxiv.org/abs/2606.06555
Authors: Sichen Wang, Zhipeng Lu
Categories: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
Comments: Accepted at the 43rd International Conference on Machine Learning (ICML 2026). 28 pages, 16 figures, 7 tables, including appendices

Abstract:Noisy evolution strategies under fixed evaluation budgets face a depth-fidelity trade-off: spending evaluations to denoise intra-generation rankings reduces the number of distribution updates the optimizer can execute. We argue for depth over fidelity and propose probabilistic elite membership (PEM), which replaces hard rank-based weights in evolution strategies with conditional expected rank weights that integrate over ranking uncertainty. PEM preserves the conditional mean update while reducing conditional update dispersion, a Rao-Blackwellization of the noisy rank-based step. We instantiate PEM via residual bootstrapping (RB-PEM) with capped per-generation overhead, complemented by an adaptive probe-and-switch mechanism for low-noise regimes. Across the COCO bbob-noisy suite and external tasks including RL policy search and hyperparameter optimization, RB-PEM achieves consistent gains in high-misranking, budget-constrained settings.

[LG-78] Elmes*: Automated Construction of Fine-Grained Evaluation Rubrics for Large Language Models in Long-Tail Educational Scenarios

Link: https://arxiv.org/abs/2606.06546
Authors: Tao Liu, Ye Lu, Ruohua Zhang, Siyu Song, Wentao Liu, Aimin Zhou, Hao Hao
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Evaluating large language models (LLMs) for education requires measuring how models teach, not only what they know. Existing benchmarks emphasize domain-general correctness or depend on manually designed rubrics that scale poorly to long-tail pedagogical scenarios. We introduce Elmes*, an end-to-end framework for constructing, refining, and applying fine-grained scenario-specific rubrics. Elmes* combines a declarative multi-agent engine for teacher–student–judge interactions with SceneGen, a self-evolving module that co-optimizes evaluation criteria and test data from expert-defined pedagogical dimensions. Using Elmes*, we build Edu-330, covering 330 scenarios across 11 subjects, 3 grade bands, and 10 task types, with over 1,000 second-level indicators. Experiments on Edu-330 and four expert-authored gold-standard scenarios show that educational capability is multidimensional: top-tier LLMs differ mainly in creativity and values integration, knowledge-strong models may fail at Socratic scaffolding, and the education-specialized InnoSpark achieves the best human-evaluated average score. LLM judges preserve human-comparable rankings with much lower scoring variance, but exhibit judge-specific biases such as self-preference. Ablations show that expert-scored few-shot anchoring improves human–LLM alignment, while reasoning enforcement and greedy decoding are model-dependent. Elmes* thus provides scalable diagnostic infrastructure for pedagogically grounded LLM evaluation.

[LG-79] Architecturally Significant MLOps Guidelines for ML Model Integration and Deployment: a Gray Literature Review

Link: https://arxiv.org/abs/2606.06535
Authors: Faezeh Amou Najafabad, Markus Haug, Keerthiga Rajenthiram, Justus Bogner, Ilias Gerostathopoulos
Categories: Software Engineering (cs.SE); Machine Learning (cs.LG)
Comments: ECSA2026

Abstract:Context. Despite the growing adoption of Machine Learning Operations (MLOps), teams often approach MLOps projects in an ad hoc manner due to the lack of consolidated architectural guidance. The community would benefit from a reference that synthesizes knowledge to inform the architectural design of MLOps systems, especially regarding the integration and deployment of ML models. Objective. In response, our goal is to provide a comprehensive overview of architecturally significant guidelines for the integration and deployment of ML models in MLOps systems. Method. We conduct a gray literature review of 103 web sources to analyze state-of-practice knowledge on MLOps model integration and deployment. We then apply thematic analysis to synthesize these practices into recommended guidelines. Results. We contribute a collection of 25 architecturally significant MLOps guidelines for model integration and deployment, organized into five categories, and describe their impact on the overall system architecture. Conclusion. Our results serve as an overview of state-of-practice MLOps guidelines to support researchers and practitioners with the integration and deployment of ML models in their MLOps systems.

[LG-80] Ablation Study of Block Size Weight Precision and Scale Precision in NVFP4 Inference for Low-Power Edge-Efficient Neural Networks

Link: https://arxiv.org/abs/2606.06527
Authors: Ovishake Sen, Venkata Nithin Kamineni, Daniel Lobo, Swarup Bhunia, Rickard Ewetz, Baibhab Chatterjee
Categories: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Comments: 6 Pages

Abstract:Energy-efficient edge inference requires reducing arithmetic cost, memory traffic, and hardware overhead. This paper presents an ablation-focused study of NVFP4 LUT-based inference for edge-efficient neural networks. The proposed NVLUT framework combines 4-bit NVFP4 activations, two-level scaling, LUT-based mantissa computation, voltage-scaled storage, and selective ECC protection. Multiplication is decomposed into sign, exponent, and mantissa paths, where sign uses XOR logic, exponent uses integer addition, and mantissa multiplication is replaced by compact LUT access. NVFP4 activations use FP4 data with an FP8 block scale and an FP32 tensor scale. Across six edge-efficient models, block-size ablation shows that B = 16 offers a practical accuracy/storage trade-off, requiring only 4.5078 bits per input for N = 4096. Weight-precision ablation shows that FP8 and FP16 weights provide only modest gains over FP4 weights under the same NVFP4 activation path. Compared with pure unscaled FP4, NVFP4 without retraining recovers substantial accuracy by restoring activation dynamic range, while NVFP4 with retraining achieves the best accuracy across models. Hardware analysis shows that NVLUT achieves up to 26.85x energy reduction over traditional LUTs with ECC plus voltage scaling and up to 22.85x under mixed-voltage operation. Area is reduced by up to 2.21x and 1.52x, respectively. These results demonstrate that NVFP4 two-level scaling with selective reliability protection enables robust, low-energy edge inference.
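
The 4.5078 bits-per-input figure quoted above can be reproduced with simple storage accounting. The sketch below assumes 4 data bits per activation, one 8-bit (FP8) scale shared by each block of B values, and one 32-bit (FP32) scale shared by all N values of the tensor. The function name `bits_per_input` and this accounting are an interpretation of the abstract, not code from the paper.

```python
def bits_per_input(block_size, n_inputs, data_bits=4,
                   block_scale_bits=8, tensor_scale_bits=32):
    """Average storage per activation: data bits plus amortized scale overhead."""
    return (data_bits
            + block_scale_bits / block_size    # FP8 scale shared by one block
            + tensor_scale_bits / n_inputs)    # FP32 scale shared by the tensor


# Sweep a few block sizes for N = 4096 to see the storage side of the trade-off.
for b in (8, 16, 32, 64):
    print(f"B={b:>2}: {bits_per_input(b, 4096):.4f} bits per input")
```

Under these assumptions, B = 16 with N = 4096 gives 4 + 8/16 + 32/4096 = 4.5078125 bits, which matches the rounded value in the abstract. Smaller blocks cost more scale overhead, larger blocks cost less.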

[LG-81] Amortized Neural Optimization for Pre-Layout Signal Integrity Design Space Exploration using Differentiable Surrogates

Link: https://arxiv.org/abs/2606.07463
Authors: Julian Withöft, Werner John, Emre Ecik, Ralf Brüning, Jürgen Götze
Categories: Signal Processing (eess.SP); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments: 16 pages, 20 figures, 8 tables

Abstract:Pre-layout design space exploration (DSE) for high-speed signal integrity (SI) analysis is often limited by the computational cost of simulations and iterative optimization algorithms within modern electronic design automation (EDA) workflows. While machine learning surrogate models accelerate the simulation step, optimizing designs still requires utilizing iterative black-box search methods. This iterative nature scales poorly, making multi-corner sweeps computationally expensive. As a solution, this paper proposes amortized neural optimization (ANO) for pre-layout SI design. ANO entirely eliminates iterative black-box inference by utilizing fully differentiable neural network surrogate models. ANO extracts analytical gradients from the surrogate to train a global optimization policy. Instead of solving the optimization problem repeatedly at inference, the optimization process is learned offline and therefore amortized. Once the ANO policy is trained, it maps different channel contexts directly to near-optimal design parameters in a single deterministic forward pass. The efficiency and accuracy of the ANO framework are demonstrated based on three complex SI design scenarios, including DDR5 decision feedback equalization (DFE), 9-dimensional SerDes Tx/Rx co-equalization, and DDR3 DQS differential pair routing to optimize eye diagram metrics under intra-pair skew constraints. By trading roughly 10% in optimality compared to instance-specific black-box algorithms, it realizes speedups of three to four orders of magnitude. For a large-scale 320,000-instance multi-corner SerDes sweep optimization, ANO collapses what would have taken days of computation using iterative search algorithms into a single batched forward pass that completes in milliseconds. This transforms computationally expensive SI optimization into real-time and interactive pre-layout DSE.

[LG-82] The Proxy Benders Decomposition

Link: https://arxiv.org/abs/2606.07403
Authors: Changkun Guan, El Mehdi Er Raqabi, Mathieu Tanneau, Pascal Van Hentenryck
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments:

Abstract:Benders decomposition is a fundamental framework for solving large-scale mixed-integer optimization problems with complicating variables that, when fixed, yield significantly easier subproblems. However, classical Benders decomposition repeatedly solves highly similar subproblems and often exhibits zigzagging behavior across iterations, leading to slow convergence in large-scale settings. Motivated by the repetitive structure and parametric nature of Benders subproblems, this paper introduces the proxy Benders decomposition (Proxy-BD), a new decomposition framework in which subproblem optimization is replaced by certified optimization proxies rather than repeated exact solves. The proposed proxy follows a self-supervised predict-project-and-complete mechanism that produces dual-feasible solutions for generating provably valid Benders cuts. The framework preserves the theoretical validity of the decomposition independently of prediction quality through a projection-and-completion certification layer. A formal characterization of proxy-induced cuts is established, and the framework naturally extends to modern decomposition schemes, including branch-and-Benders-cut algorithms. Computational experiments on large-scale facility location and network design problems demonstrate that Proxy-BD substantially reduces the computational effort of subproblems while maintaining near-optimal solution quality. On large-scale uncapacitated facility location instances up to 2000x2000, Proxy-BD achieves median optimality gaps below 0.5%, yields up to 161x median speedups, and reduces the number of generated cuts by more than 240x on the largest instances. The computational gains consistently increase with recourse complexity, indicating that proxy-based inference scales substantially more favorably than repeated exact subproblem optimization in large-scale decomposition settings. 

[LG-83] Automatic Debiased and Invariant Counterfactual Generation under General Interventions

Link: https://arxiv.org/abs/2606.07399
Authors: Raphael C Kim, Jingsen Zhu, Ramin Zabih, Michele Santacatterina
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Abstract:Generative models for counterfactual outcomes have great potential to support decision-making under complex interventions, but existing approaches are limited by unstable estimation, poor generalization across environments, and bias from nuisance model misspecification. We introduce ADIGen, a framework for automatic, debiased, and invariant counterfactual generation under general interventions, including high-dimensional interventions and outcomes. ADIGen combines Riesz regression to avoid unstable density-ratio estimation, causal invariance to improve generalization under distribution shift, and orthogonal statistical learning to obtain doubly robust guarantees against nuisance model misspecification. We provide excess-risk bounds showing that ADIGen controls counterfactual risk under general interventions, with a product-bias nuisance remainder and an invariant risk bound across environments.

[LG-84] Unified Geometry-Guided ML-FTLE for Tracking Transient Chaos from Scalar Time Series

Link: https://arxiv.org/abs/2606.07385
Authors: S. V. Manivelan, Andrei Velichko, I. Manimehan
Categories: Chaotic Dynamics (nlin.CD); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
Comments: Preprint; 9 figures; submitted for peer review

Abstract:Detecting transient chaos from scalar observations without governing equations represents a fundamental challenge in nonlinear dynamics. We propose a geometry-guided machine learning framework that unifies predictive trajectory divergence with macroscopic attractor morphology to track abrupt regime shifts. The methodology extracts a local instability scale via out-of-sample k-nearest neighbor forecast errors to establish the ML-FTLE estimator, subsequently mapping this temporal divergence onto a structural closeness matrix derived from a minimal dictionary of Poincare occupancy grids. By employing partial least squares regression, we extract a latent geometric component calibrated directly to the empirical finite-time Lyapunov spectrum, yielding the Poincare-based geometric-guided FTLE. Validation against analytical QR-FTLE baselines confirms that fusing topological state spaces with predictive divergence systematically improves continuous transition tracking. The Structural Similarity Index optimally resolves gradual damping, while Hausdorff Distance exhibits extreme resilience during abrupt phase-space collapses. Furthermore, macroscopic spatial discretization acts as a robust topological regularizer against additive Gaussian noise, preserving deterministic signatures even at moderate signal thresholds. This equation-free framework provides a highly accurate, noise-resilient diagnostic for monitoring structural transitions in complex non-stationary systems.

[LG-85] On the conditional equivalence of phase retrieval algorithms

Link: https://arxiv.org/abs/2606.07257
Authors: Jakob Schroeder, Andreas Döpp
Categories: Optics (physics.optics); Machine Learning (cs.LG)
Comments:

Abstract:Phase retrieval - recovering a complex-valued field from intensity measurements - is typically solved using variants of the Gerchberg-Saxton (GS) algorithm, understood as alternating projections between measurement planes. Meanwhile, modern computational imaging increasingly relies on gradient-based optimization and automatic differentiation. Here we show that these two approaches are mathematically identical: the GS magnitude replacement step is exactly a unit gradient descent step on an amplitude least-squares loss. This equivalence enables seamless integration of classical phase retrieval with differentiable physics pipelines. We further identify two complementary probabilistic interpretations of this equivalence: globally, the amplitude loss is the negative log-likelihood under Gaussian amplitude noise; locally, each projection step arises as a Bayesian update with the propagated field as prior. The local view provides qualitative guidance for relaxation in iterative phase retrieval.
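
The equivalence claimed in this abstract can be checked numerically on a toy problem. The sketch below assumes a unitary DFT as the propagator F, the loss L(x) = 1/2 * sum((|Fx| - a)^2), and a gradient taken with respect to the real and imaginary parts of the field. These conventions, the helper names, and the test vectors are choices made for this example; the paper's exact formulation may differ.

```python
import cmath
import math


def dft(x, inverse=False):
    """Unitary discrete Fourier transform (O(n^2), for illustration only)."""
    n = len(x)
    sign = 1 if inverse else -1
    return [
        sum(x[k] * cmath.exp(sign * 2j * math.pi * j * k / n) for k in range(n))
        / math.sqrt(n)
        for j in range(n)
    ]


def gs_magnitude_replacement(x, a):
    """Propagate, swap in the measured amplitudes a, keep phases, propagate back."""
    z = dft(x)
    z_new = [a_i * z_i / abs(z_i) for a_i, z_i in zip(a, z)]
    return dft(z_new, inverse=True)


def unit_gradient_step(x, a):
    """One gradient step of size 1 on L(x) = 1/2 * sum((|Fx| - a)^2)."""
    z = dft(x)
    # Gradient with respect to z in (Re, Im) coordinates: (|z| - a) * z / |z|.
    g_z = [(abs(z_i) - a_i) * z_i / abs(z_i) for a_i, z_i in zip(a, z)]
    # Chain rule through the unitary F: the gradient with respect to x is F^H g_z.
    g_x = dft(g_z, inverse=True)
    return [x_i - g_i for x_i, g_i in zip(x, g_x)]


x0 = [1 + 2j, -0.5 + 0.3j, 0.7 - 1.1j, 2.0 + 0.0j]  # arbitrary starting field
a = [1.0, 0.5, 2.0, 1.5]                            # arbitrary measured amplitudes

gs = gs_magnitude_replacement(x0, a)
gd = unit_gradient_step(x0, a)
max_diff = max(abs(p - q) for p, q in zip(gs, gd))
print(f"max difference between the two updates: {max_diff:.2e}")
```

The two updates agree to floating-point precision because x - F^H((|z| - a) z/|z|) simplifies to F^H(a z/|z|) when F is unitary and z = Fx.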

[LG-86] Deep Single-Index Fréchet Regression

Link: https://arxiv.org/abs/2606.06957
Authors: Muqing Cui, Yidong Zhou, Su I Iao, Hans-Georg Müller
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Abstract:Predicting outputs that are located in non-Euclidean spaces, such as probability distributions, networks, and symmetric positive-definite matrices, is becoming increasingly important in modern data analysis, particularly when inputs are high-dimensional. We propose DeSI (Deep Single-Index Fréchet Regression), a semiparametric framework for regression with metric space-valued outputs and multivariate inputs that assumes a single-index structure for the conditional Fréchet mean. DeSI estimates an interpretable index direction, which quantifies the relative importance of inputs, using a deep neural network, and performs Fréchet regression along the resulting one-dimensional index in the target metric space. This structure mitigates the curse of dimensionality while retaining interpretability, which stands in contrast to standard deep neural networks. We establish theoretical guarantees for DeSI, including uniform approximation and convergence rates, and demonstrate its strong predictive performance through simulations on distributions, networks, and symmetric positive-definite matrices, as well as an application to compositional mood data from New Jersey.

[LG-87] Stability beyond Bounded Differences: Sharp Generalization Bounds under Finite L_p Moments

Link: https://arxiv.org/abs/2606.06855
Authors: Qianqian Lei, Soham Bonnerjee, Yuefeng Han, Wei Biao Wu
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)

Abstract:While algorithmic stability is a central tool for understanding generalization of learning algorithms, existing high-probability guarantees typically rely on uniform boundedness or sub-Gaussian/sub-Weibull tail assumptions, which can be overly restrictive for modern settings with heavy-tailed or unbounded losses. We develop a stability-based framework that requires only a finite L_p moment condition. Our first contribution is sharp concentration inequalities for functions of independent random variables under L_p constraints, extending McDiarmid’s bounded-differences techniques beyond the classical regime. Leveraging these results, we derive sharp high-probability generalization bounds across a range of learning paradigms, including empirical risk minimization, transductive regression, and meta-learning. These guarantees show that L_p stability suffices for robust generalization even when boundedness fails, substantially weakening the standard assumptions in the stability literature.
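
Note (background, not from the paper): the classical bounded-differences inequality of McDiarmid, which the abstract describes extending, can be stated as follows for independent random variables X_1, ..., X_n. The symbols f, c_i and t are standard textbook notation.

```latex
\[
\sup_{x_1,\dots,x_n,\,x_i'}
\bigl| f(x_1,\dots,x_i,\dots,x_n) - f(x_1,\dots,x_i',\dots,x_n) \bigr| \le c_i
\quad (i = 1,\dots,n)
\]
\[
\Longrightarrow\quad
\Pr\bigl( f(X_1,\dots,X_n) - \mathbb{E}\, f(X_1,\dots,X_n) \ge t \bigr)
\;\le\; \exp\!\Bigl( -\frac{2 t^2}{\sum_{i=1}^{n} c_i^2} \Bigr),
\qquad t > 0 .
\]
```

The uniform bounds c_i are the boundedness assumption that the abstract proposes to replace with a finite L_p moment condition.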

[LG-88] SEAM: Shortcut-Aware Real-Time Detection of Scripted vs. Spontaneous Speech for Interview Guardrails INTERSPEECH2026

Link: https://arxiv.org/abs/2606.06837
Authors: Vsevolod (V.) Kovalev, Pranay Manocha
Categories: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
Comments: Accepted to Interspeech 2026

Abstract:Scripted vs spontaneous speech detection is appealing for interview guardrails, but benchmark performance can be inflated by shortcuts tied to corpus identity, channel conditions, and recording artifacts rather than speaking style itself. We present SEAM, a shortcut-aware framework for real-time scriptedness detection that combines uniform preprocessing, seam-aware sampling, non-speech augmentation, and a compact DistilHuBERT backbone. With 8s windows, the model achieves 0.971 ± 0.004 ROC-AUC on an external interview-domain evaluation set. Removing the shortcut-prevention components improves internal held-out metrics but sharply reduces external performance, indicating shortcut learning. Post-training quantization reduces the model footprint to 41.8MB with little loss in external performance. The results demonstrate that robust real-time scriptedness detection depends not only on the backbone, but on shortcut-aware data design and evaluation. We release code and model checkpoints.

[LG-89] The Effect of Training Task Diversity on In-Context Learning through the Lens of Low-Dimensional Subspaces

Link: https://arxiv.org/abs/2606.06814
Authors: Soo Min Kwon, Alec S. Xu, Can Yaras, Dogyoon Song, Laura Balzano, Qing Qu
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Applications (stat.AP)

Abstract:The transformer’s emergent ability to perform in-context learning (ICL) has sparked a wide range of studies designed to understand its underlying mechanisms. Existing works often study how training task diversity, defined either as the number of ICL training task vectors or as the number of function classes from which the task vectors are drawn, shapes both the learning dynamics and generalization capabilities of ICL. While both definitions have uncovered many interesting phenomena, many observations under the latter definition remain theoretically unexplained. This paper presents a minimal analytical model under which these phenomena provably emerge from the properties of the training data. By modeling the training task vectors as a mixture of low-rank Gaussians, we show how training task diversity, defined by the number of non-overlapping columns between subspaces that parameterize the covariance matrices, improves both the generalization and optimization trajectory of ICL with linear attention. In particular, we show that our model can explain (i) why training with task diversity shortens the ICL plateau and (ii) why ICL appears to achieve out-of-distribution generalization. We conclude by empirically demonstrating how our results extend to nonlinear transformers and nonlinear function classes. Overall, our work presents a tractable framework to unify existing observations.

[LG-90] Empirical Transfer Operators and Finite-Sample Change Detection for Noisy Expanding Interval Maps

Link: https://arxiv.org/abs/2606.06785
Authors: Aparna Rajput
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Dynamical Systems (math.DS)
Comments: 27 pages, 2 tables, 1 figure

Abstract:We study finite-sample change detection for one-dimensional noisy dynamical systems using partition-based empirical approximations of stationary behaviour. Given observations from an interval-valued process, we partition the state space, estimate a finite transition matrix from observed transitions between partition elements, and apply a small Doeblin-type regularisation to ensure a unique stationary distribution. From an initial reference segment, we compute a baseline empirical stationary distribution \(\widehat\pi_{0,\rho}\). For each later sliding window, we compute \(\widehat\pi_{t,\rho}\) and define the score \[ S_t=\|\widehat\pi_{t,\rho}-\widehat\pi_{0,\rho}\|_1. \] Large values of \(S_t\) indicate a change in stationary behaviour relative to the baseline. The statistic detects changes in invariant density or stationary law, but not all possible changes in transition dynamics. Under explicit assumptions on empirical transition concentration, finite-state stationary distribution stability, partition approximation, regularisation bias, and noise stability, we derive a finite-sample bound for the empirical stationary density. The bound separates sampling error, regularisation bias, partition approximation error, and noise bias. We then obtain a single-window false-alarm guarantee and a sufficient detection condition when the invariant density changes by more than the estimation error. We illustrate the method on synthetic noisy beta-map change-point experiments.
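
Note (illustrative sketch, not the author's code): the pipeline described in the abstract can be mimicked in a few lines. Everything specific here is an assumption made for the example: equal-width bins on [0, 1], a Doeblin-type regularisation of the form (1 - rho) * P + rho * uniform, power iteration for the stationary distribution, and a noisy beta-map simulator with Gaussian noise.

```python
import numpy as np

def transition_matrix(x, n_bins):
    """Row-normalised transition counts between equal-width bins on [0, 1]."""
    idx = np.minimum((np.asarray(x) * n_bins).astype(int), n_bins - 1)
    counts = np.zeros((n_bins, n_bins))
    np.add.at(counts, (idx[:-1], idx[1:]), 1.0)
    rows = counts.sum(axis=1, keepdims=True)
    # Bins that were never visited get a uniform row so P stays stochastic.
    return np.where(rows > 0, counts / np.maximum(rows, 1.0), 1.0 / n_bins)

def stationary(P, rho, tol=1e-13, max_iter=10_000):
    """Stationary distribution of the regularised chain (1 - rho) * P + rho * uniform."""
    k = P.shape[0]
    P_rho = (1.0 - rho) * P + rho / k
    pi = np.full(k, 1.0 / k)
    for _ in range(max_iter):
        nxt = pi @ P_rho
        converged = np.abs(nxt - pi).sum() < tol
        pi = nxt
        if converged:
            break
    return pi

def score(pi_t, pi_0):
    """L1 distance between window and baseline stationary distributions."""
    return float(np.abs(pi_t - pi_0).sum())

def simulate(beta, n, rng, x0=0.3, sigma=0.01):
    """Noisy beta map x -> (beta * x + noise) mod 1 (hypothetical test system)."""
    x = np.empty(n)
    x[0] = x0
    for i in range(1, n):
        x[i] = (beta * x[i - 1] + sigma * rng.normal()) % 1.0
    return x

rng = np.random.default_rng(0)
n_bins, rho = 16, 0.01
pi0 = stationary(transition_matrix(simulate(2.0, 4000, rng), n_bins), rho)
pi_same = stationary(transition_matrix(simulate(2.0, 4000, rng), n_bins), rho)
pi_changed = stationary(transition_matrix(simulate(1.3, 4000, rng), n_bins), rho)

score_same = score(pi_same, pi0)
score_changed = score(pi_changed, pi0)
print(score_same, score_changed)
```

The window drawn from the same dynamics gives a small score that reflects sampling error only, while the window with a different beta gives a larger one. Changing beta from 2 to 3 would be a poor test case, since both maps have a uniform invariant density, which is the limitation the abstract points out.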

[LG-91] Reactivity-Informed Machine Learning for Performance Prediction and Design Space Exploration of Alkali-Activated Slag

Link: https://arxiv.org/abs/2606.06765
Authors: Qiyao He, Zhanzhao Li, Kai Gong
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Comments: 68 pages, 14 figures, 2 tables

Abstract:Establishing quantitative relationships among mix design, raw material properties, curing conditions, and performance remains a long-standing challenge in cementitious materials, particularly for alkali-activated materials with variable precursor and activator chemistry. Here, we curated the largest literature-derived alkali-activated slag (AAS) dataset to date, comprising over 3100 compressive strength records, 155 chemically distinct ground granulated blast-furnace slags (GGBSs), and 24 attributes incorporating precursor chemistry, fineness, and reactivity. Multiple machine learning (ML) algorithms were benchmarked across progressively enriched feature scenarios, demonstrating that integrating GGBS compositions, fineness, curing conditions, and specimen geometry improves predictive performance. The average metal oxide dissociation energy (AMODE), a physically interpretable representation of precursor reactivity, provides a compact alternative descriptor to explicit oxide compositions while enabling comparable predictive performance. Model interpretation revealed physically consistent trends from heterogeneous data, including non-monotonic effects of Na2O dosage and silicate modulus, reduced predicted strength at higher water content and larger specimen size, and coupled oxide-level effects more coherently represented by AMODE than by individual oxide contents. Statistically constrained design space exploration reveals reactivity-dependent trade-offs among strength, embodied CO2 emissions, and cost. The design maps identify high-strength regions with substantially lower CO2 emissions than OPC-based references at similar cost. Overall, this work demonstrates how reactivity-informed ML can extract physically meaningful trends from heterogeneous AAS data and guide source-dependent binder design. The curated dataset is publicly accessible to support advances in cement and concrete research.

[LG-92] Probabilistic learning to perform pre-onset individualised prediction of disease severity: application to Veno Occlusive Disease

Link: https://arxiv.org/abs/2606.06516
Authors: Dalia Chakrabarty, Kane Warrior, Chuqiao Zhang, Akash Bhojgaria, Joydeep Chakrabartty
Categories: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)

Abstract:We advance a new probabilistic supervised learning approach that permits reliable, automated, and early individualised prediction of the severity with which a disease will develop in a prospective patient. The prediction capacity is illustrated via the pre-transplant prediction of the score of severity of Veno Occlusive Disease (or VOD) in the digital twin (DT) of the considered prospective patient, where this score parametrises the severity with which VOD will develop in this patient, after they undergo their Bone Marrow Transplant. The learning of the relationship between the pre-transplant variables, and a severity score variable is undertaken by modelling this relationship as a (random) function that is treated as a sample function of an adequately-chosen stochastic process. The parameters of this underlying process are learnt using a training dataset that is generated using the real-time evolution of retrospective patients in a cohort, with this training dataset subsequently augmented in size by a probabilistic inverse learning of the score of prospective patients. The augmented training set, then permits the learning of the function that capacitates - at the pre-transplant stage - automated prediction of the score of the severity of VOD that characterises the DT of a physical patient in their unique pre-transplant state. This score is subsequently fed back to the real prospective patient as the severity with which VOD will develop in them, after this patient undergoes their transplant. Such a score then permits the treating Haematologist-Oncologists to decide on the treatment regimen, which in this illustration reduces to deciding on treating the patient with Defibrotide. An AI facility is developed to undertake such automated prediction, with the physician inputting the data on the pre-transplant state that characterises the DT of the prospective patient under consideration.

[LG-93] Deployed trusted-node quantum key distribution over 300 km with a multi-core fiber access link

Link: https://arxiv.org/abs/2606.06107
Authors: Martin Clason, Joakim Argillander, Didrik Bergström, Daniel Spegel-Lexne, Giulio Foletto, Ashraf El Hassan, Mohamed Bourennane, Onur Günlü, Katia Gallo, Rui Lin, Guilherme B. Xavier
Categories: Quantum Physics (quant-ph); Information Theory (cs.IT); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Optics (physics.optics)
Comments: 11 pages, 4 figures

Abstract:Quantum key distribution (QKD) is increasingly considered for deployment in realistic communication networks, where long distances, heterogeneous fiber infrastructure, and coexistence with classical traffic present substantial challenges. Here, we demonstrate trusted-node QKD between Linköping University and the Stockholm hub of the Swedish national quantum communication infrastructure over 270 km of deployed single-mode fiber, extended by a 33 km multi-core fiber (MCF) segment emulating a metropolitan access link, for a total distance of 303 km. The two sub-links use commercial QKD systems whose receivers are interfaced with external superconducting nanowire single-photon detectors, enabling operation at losses beyond those supported by standard internal gated-mode detectors. We operate the link while actively switching the QKD channel between two MCF cores, with co-propagating Ethernet traffic and injected broadband optical noise in the other cores. The results demonstrate the integration of commercial QKD into demanding, dynamically reconfigurable fiber infrastructure relevant to future hybrid quantum-classical networks. Finally, using the generated secret keys, we illustrate how limited and time-varying QKD throughput affects one-time-pad-protected image transmission: image fidelity depends strongly on the available QKD-generated key budget and the choice of compression algorithm, highlighting application-level challenges for QKD-based encryption in realistic scenarios.
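
Note (illustrative sketch, not the authors' setup): a one-time pad consumes one key byte per payload byte, so a limited key budget caps how much data can be sent, and compression reduces the key needed. In the example below, `secrets.token_bytes` is only a stand-in for QKD-generated key material, the "image" is a synthetic byte pattern, and lossless `zlib` compression is an assumption (the abstract does not say which compression algorithms were compared).

```python
import secrets
import zlib

def otp_xor(data: bytes, key: bytes) -> bytes:
    """One-time-pad XOR; refuses to run if the key budget is too small."""
    if len(key) < len(data):
        raise ValueError("key budget too small for one-time-pad encryption")
    return bytes(d ^ k for d, k in zip(data, key))

image = bytes(range(256)) * 16          # stand-in for 4096 raw image bytes
key = secrets.token_bytes(2048)         # stand-in for the available key budget

compressed = zlib.compress(image)       # smaller payload needs less key
ciphertext = otp_xor(compressed, key)   # sender side
recovered = zlib.decompress(otp_xor(ciphertext, key))  # receiver side

print(len(image), len(compressed), recovered == image)
```

The raw 4096-byte payload would not fit within the 2048-byte key budget, while the compressed payload does. Each key byte must be used only once for the one-time-pad guarantee to hold.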