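Today's arXiv Papers | 2026-06-15

This post lists the latest papers fetched from arXiv.org on 2026-06-15. It is updated automatically and grouped into six major areas: NLP, CV, ML, AI, IR, and MA.

Note: the daily paper data comes from arXiv.org and is refreshed automatically at around 12:30 each day.

Tip: if the list has not been updated on a given day, either arXiv published no new papers that day or the script failed. Script failures are fixed the same day whenever possible.

Natural Language Processing: 77 papers (Computation and Language (cs.CL))
Artificial Intelligence: 151 papers (Artificial Intelligence (cs.AI))
Computer Vision: 83 papers (Computer Vision and Pattern Recognition (cs.CV))
Machine Learning: 167 papers (Machine Learning (cs.LG))
Multiagent Systems: 10 papers (Multiagent Systems (cs.MA))
Information Retrieval: 13 papers (Information Retrieval (cs.IR))
Human-Computer Interaction: 15 papers (Human-Computer Interaction (cs.HC))

Multiagent Systems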

[MA-0] Learning Coordinated Preference for Multi-Objective Multi-Agent Reinforcement Learning

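[Quick read]: This paper addresses coordinated decision making in cooperative multi-objective multi-agent reinforcement learning (MOMARL), focusing on the internal coordination problem that arises when agents with different observations, roles, and contributions face multiple, potentially conflicting objectives. The core challenge is achieving complementary trade-offs among agents during multi-objective optimization so that overall team performance improves. The key to the solution is Preference Coordinated Multi-agent Policy Optimization (PCMA), which learns coordinated agent-specific preferences, keeping each agent's objective weighting distinct while promoting team-level optimization. The theoretical analysis shows that, under suitable conditions, preference diversity can induce team improvement through a first-order improvement decomposition. Experiments on multiple cooperative multi-objective environments and a practical traffic-control scenario show that PCMA improves both performance and trade-off coordination.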

Link: https://arxiv.org/abs/2606.14693
Authors: Pengxin Wang,Lihao Guo,Yi Xie,Bo Liu,Siyang Cao,Jingdi Chen
Affiliations: University of Arizona
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments:

Abstract:Cooperative multi-objective multi-agent reinforcement learning (MOMARL) models team decision making under multiple, potentially conflicting objectives. In this setting, conflicts arise not only across objectives but also across agents with different observations, roles, and contributions. We propose Preference Coordinated Multi-agent Policy Optimization (PCMA), which learns coordinated agent-specific preferences to enable complementary trade-offs among agents. Theoretically, we formulate cooperative MOMARL as a team-optimal game and show that, under suitable conditions, preference diversity can induce team improvement through a first-order improvement decomposition. Experiments on multiple cooperative MOMA environments and a practical traffic-control scenario show that PCMA improves both performance and trade-off coordination.

[MA-1] Contract-Based Compositional Shielding for Safe Multi-Agent Reinforcement Learning

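[Quick read]: This paper addresses safe coordination in multi-agent reinforcement learning (MARL), where global safety cannot be enforced by any single agent unilaterally because the admissibility of one agent's action depends on the dynamics of the other agents. Decentralised shields can enforce safety at runtime, but purely factorised permissions often exclude team-optimal behaviour that is safe only through coordination. The key to the solution is a decentralised coordination mechanism built on the safety fragment of Linear Temporal Logic ($\mathsf{LTL}_\mathsf{safe}$): all agents share a global safety specification $\phi$ and select among tuples of local $\mathsf{LTL}_\mathsf{safe}$ obligations whose conjunction implies $\phi$. Each agent may treat the other agents' local obligations as assumptions because the whole contract tuple is certified jointly and can be projected into local action masks. At learning time, a non-stationary multi-armed bandit chooses, from a library of local $\mathsf{LTL}_\mathsf{safe}$ obligations, the tuple that optimises team reward, recovering team-optimal safe behaviour without giving up end-to-end safety. The approach is evaluated on 6 environments and 15 algorithmic variants.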

Link: https://arxiv.org/abs/2606.14130
Authors: Omar Adalat,Edwin Hamel-De le Court,Francesco Belardinelli
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:

Abstract:Safe coordination problems surface in multi-agent reinforcement learning when global safety cannot be enforced by any agent unilaterally: the admissibility of one agent’s action may depend on the dynamics of other agents. Decentralised shields can enforce safety at runtime, but purely factorised permissions often exclude optimal team behaviour that is safe only through coordination. We study deterministic safety guarantees for agents trained and deployed under decentralised execution, recovering team-optimal safe behaviour without centralised runtime control. Agents have a shared global specification $\phi$ in the safety fragment of Linear Temporal Logic ($\mathsf{LTL}_\mathsf{safe}$), and select among tuples of local $\mathsf{LTL}_\mathsf{safe}$ obligations whose conjunction implies the global specification $\phi$. Each agent may rely on the other agents’ local obligations as assumptions because the whole contract tuple is certified simultaneously and allows projection into local action masks. At learning time, a non-stationary multi-armed bandit chooses among a library of local $\mathsf{LTL}_\mathsf{safe}$ obligations to select the tuple that optimises team reward, all without forgoing end-to-end safety. We evaluate the approach across 6 environments and 15 algorithmic variants.
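The abstract does not say which non-stationary bandit algorithm is used, so the following is only a minimal sketch of one common choice, discounted UCB, with each arm standing in for one certified contract tuple. The class name `DiscountedUCB` and the parameters `gamma` and `c` are assumptions made for illustration, not the authors' implementation.

```python
import math


class DiscountedUCB:
    """Discounted UCB: old rewards fade by a factor gamma at every step,
    so the bandit can track a best arm that changes over time.
    Each arm is assumed to represent one pre-certified contract tuple."""

    def __init__(self, n_arms, gamma=0.95, c=1.0):
        self.gamma = gamma              # discount applied to past statistics
        self.c = c                      # exploration strength
        self.counts = [0.0] * n_arms    # discounted number of pulls per arm
        self.sums = [0.0] * n_arms      # discounted sum of rewards per arm

    def select(self):
        # Try every arm at least once before using the UCB scores.
        for arm, n in enumerate(self.counts):
            if n <= 0.0:
                return arm
        total = sum(self.counts)

        def score(arm):
            mean = self.sums[arm] / self.counts[arm]
            bonus = self.c * math.sqrt(math.log(max(total, 1.0)) / self.counts[arm])
            return mean + bonus

        return max(range(len(self.counts)), key=score)

    def update(self, arm, reward):
        # Decay all statistics, then credit the arm that was just played.
        self.counts = [self.gamma * n for n in self.counts]
        self.sums = [self.gamma * s for s in self.sums]
        self.counts[arm] += 1.0
        self.sums[arm] += reward
```

Because every arm is assumed to be a tuple whose conjunction already implies $\phi$, exploration in this sketch only affects team reward, never safety, which matches the separation the abstract describes.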

[MA-2] Naive Visual Memory is Not Enough: A Failure-Mode Study of GUI Agents ICML2026

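[Quick read]: This paper addresses reliability problems in graphical user interface (GUI) agents caused by how visual memory is used during complex tasks. Existing methods store and retrieve full screenshots from past interactions to enrich context, but their effect on different failure types is unclear. The paper introduces a taxonomy of four failures (cognitive failure, visual state misunderstanding, hidden operation blindness, and grounding error), each mapped to a distinct stage of the perception-reasoning-action pipeline. It finds that prepending full-image memory reduces state-level failures but worsens action-level ones, increasing hidden operation blindness and grounding error. In response, the paper proposes Action-Grounded Visual Memory (AGMem), whose core idea is to store image crops of the local GUI region closely related to a successful action or a recovery, rather than full screenshots. On OSWorld, AGMem improves task success rate by 33.3% over full-image memory, supporting it as an effective visual memory representation.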

Link: https://arxiv.org/abs/2606.14106
Authors: Seoyoung Choi,Minseok Ko,Hyunseok Lee,Kunwoong Kim,Woomin Song,Chanseok Jeon,Jinwoo Shin
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 5 figures, ICML 2026 Workshop

Abstract:Graphical User Interface (GUI) agents are increasingly used to automate complex computer tasks across applications, websites, and operating systems. To improve their reliability, recent work has introduced experiential memory, where agents retrieve prior trajectories to guide decision-making in similar states. More recent approaches further extend this idea to visual memory by storing and retrieving screenshots from past interactions, providing agents with richer contextual information than text-only memories. However, the effect of visual memory in GUI agents remains insufficiently understood: it is unclear which failures visual memory mitigates, or which failures it exacerbates. To systematically analyze the effect of visual memory, we introduce a taxonomy of four GUI agent failures (i.e., cognitive failure, visual state misunderstanding, hidden operation blindness, and grounding error) that map to distinct stages of the perception-reasoning-action pipeline. We find that prepending full-image memory has a divergent effect on the failure distribution: it reduces state-level failures but worsens action-level ones, and increases hidden operation blindness and grounding error. Motivated by this finding, we propose Action-Grounded Visual Memory (AGMem), an action-grounded memory framework for GUI agents. The core idea of AGMem is to store image crops that capture the local GUI region closely related to a successful action or a recovery, rather than storing full screenshots. Experiments on OSWorld show that AGMem improves task success rates by 33.3% over full-image memory. These results demonstrate that AGMem is an effective representation for visual memory in GUI agents.
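The abstract does not describe how AGMem picks the crop region, so the sketch below only illustrates the general idea of storing a local crop around an action instead of a full screenshot. The function name `action_crop_box`, the fixed 256x256 crop size, and the centre-on-click rule are all assumptions.

```python
def action_crop_box(x, y, screen_w, screen_h, crop_w=256, crop_h=256):
    """Return (left, top, right, bottom) for a crop centred on the action
    point (x, y), shifted as needed so it stays inside the screen."""
    # The crop can never be larger than the screen itself.
    crop_w = min(crop_w, screen_w)
    crop_h = min(crop_h, screen_h)
    # Centre on the action, then clamp the top-left corner to valid bounds.
    left = min(max(x - crop_w // 2, 0), screen_w - crop_w)
    top = min(max(y - crop_h // 2, 0), screen_h - crop_h)
    return left, top, left + crop_w, top + crop_h
```

For a click at the centre of a 1920x1080 screen, `action_crop_box(960, 540, 1920, 1080)` gives `(832, 412, 1088, 668)`; the returned box can then be passed to whatever image library performs the crop.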

[MA-3] When Plausible Is Not Realistic: Evaluating Human Mobility in LLM-Based Urban Simulation

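[Quick read]: This paper asks whether LLM-based generative agents in urban simulators reproduce empirically realistic human mobility, or merely produce mobility narratives that are semantically plausible. The core of the solution is a validation framework that evaluates generated mobility against real-world data along several dimensions: mobility laws, temporal rhythms, network motifs, semantic activity transitions, and behavioral mobility profiles. Using real mobility data from the Greater Paris region and Shanghai, the study evaluates two simulators, AgentSociety and CitySim. The simulators capture some high-level semantic activity distributions but struggle to reproduce core spatial and temporal constraints, including trip-length distributions, origin-destination flows, dwell times, and transition dynamics. Realistic mobility diversity is unstable across default prompting configurations and may require explicit profile-aware initialization. To support reproducible evaluation, the authors also release scalable, open infrastructure for regional-scale map generation, observability-enhanced simulation, mobility-metric computation, and traffic simulation. The work underlines the need for rigorous empirical validation of LLM-based urban simulators and provides practical tools for building more realistic and reproducible ones.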

Link: https://arxiv.org/abs/2606.13835
Authors: Gustavo H. Santos,Aline Carneiro Viana,Thiago H. Silva
Affiliations: UTFPR; Inria; University of Toronto
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 14 pages, 10 figures

Abstract:LLM-based generative agents are increasingly used in urban simulators, yet it remains unclear whether they reproduce empirically realistic human mobility patterns or merely generate plausible mobility narratives. We introduce a validation framework for evaluating the mobility of generative agents of LLM-based urban simulators against real-world mobility data. For this, we use mobility laws, temporal rhythms, network motifs, semantic activity transitions, and behavioral mobility profiles. Using datasets from the Greater Paris region and Shanghai, we evaluate AgentSociety and CitySim across multiple dimensions of mobility realism. Our analysis reveals a substantial gap between narrative plausibility and empirical mobility realism. Although the simulators capture some high-level semantic activity distributions, they struggle to reproduce core spatial and temporal constraints, including realistic trip-length distributions, origin-destination flows, dwell times, and transition dynamics. We further observe that realistic mobility diversity is unstable across default prompting configurations and may require explicit profile-aware initialization. To support reproducible evaluation, we also contribute scalable and open LLM-driven infrastructure for regional-scale map generation, observability-enhanced simulation, mobility-metric computation, and traffic simulation. Our findings highlight the need for rigorous empirical validation of LLM-based urban simulators and provide practical tools for building more realistic and reproducible urban simulation systems.

[MA-4] Safety-Contract Graph Multi-Agent Reinforcement Learning for Autonomous Network Security Response

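[Quick read]: This paper addresses the deployability gap of autonomous network-security response systems: reward-only multi-agent reinforcement learning (MARL) can improve security reward while ignoring key operational constraints such as Mean Time to Recover (MTTR), false-positive response, and firewall change-management disruption, which makes the resulting policies non-deployable. The key to the solution is a safety-contract graph MARL framework, instantiated as ACD$^3$-GAT (Adaptive Constrained Counterfactual Decisioning with a Graph Attention Network encoder), which separates simulator observations from reusable operational budgets, constrained optimization, graph state encoding, and counterfactual action screening. ACD$^3$-GAT further adds budget context, Conditional Value at Risk (CVaR) tail-risk estimation, an opponent-belief state, and Graph Counterfactual Risk Propagation (G-CRP). In the experiments, every unconstrained method violates the downtime budget in 100% of episodes (mean cost 311-430 against a budget of 50). The constrained baseline C-MAPPO-GAT reaches a 0.3% violation rate with a mean cost of 15.5, while ACD$^3$-GAT reaches a mean cost of 48.2 with a 13.8% violation rate, placing it on the safety-contract frontier rather than at the most conservative compliance point.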

Link: https://arxiv.org/abs/2606.13832
Authors: Jose Luis Lima de Jesus Silva
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Abstract:Autonomous network-security response systems promise to reduce Security Operations Centre (SOC) reaction latency, but reward-only multi-agent reinforcement learning (MARL) can improve security reward while remaining non-deployable. We present a safety-contract graph MARL framework and instantiate it as ACD$^3$-GAT (Adaptive Constrained Counterfactual Decisioning with a Graph Attention Network encoder), an architecture that separates simulator observations from reusable operational budgets, constrained optimization, graph state encoding, and counterfactual action screening. We evaluate the method in CAGE Challenge 4, where agents operate under budgets for Mean Time to Recover (MTTR), false-positive response, and firewall change-management disruption. Across the benchmark, every unconstrained method violates the SOC downtime budget in 100% of evaluated episodes, with mean downtime proxy costs of 311-430 against a budget of 50. This complements prior CAGE Challenge 4 findings by showing that reward-only learning lacks operational discipline. Constrained MAPPO-GAT (C-MAPPO-GAT) isolates Lagrangian operational-cost control and budget-aware screening, while ACD$^3$-GAT adds budget context, CVaR tail-risk estimation, opponent-belief state, and Graph Counterfactual Risk Propagation (G-CRP). The replicated comparison includes three 200-episode seeds for IPPO, MAPPO-GAT, C-MAPPO-GAT, and ACD$^3$-GAT. C-MAPPO-GAT reduces downtime violation from 100% to 0.3% and mean downtime cost from 355.4 to 15.5 relative to MAPPO-GAT. ACD$^3$-GAT reduces mean downtime cost to 48.2 with a 13.8% violation rate, placing it on the safety-contract frontier rather than at the most conservative compliance point. Topology-seed and coupled adaptive Red-process stress tests preserve this contrast and show lower worst adaptive degradation for safety-constrained policies than reward-only MAPPO-GAT.
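The abstract mentions Lagrangian operational-cost control without giving the update rule. As a rough illustration, the textbook dual-ascent form of a Lagrangian budget constraint can be sketched as follows; the function names, the learning rate `lr`, and the use of a single scalar multiplier are assumptions, not details taken from the paper.

```python
def update_multiplier(lmbda, episode_cost, budget, lr=0.01):
    """One dual-ascent step on the Lagrange multiplier.

    The multiplier grows when the episode cost exceeds the budget and
    shrinks otherwise; it is clipped at zero so it never rewards cost."""
    return max(0.0, lmbda + lr * (episode_cost - budget))


def penalized_reward(reward, cost, lmbda):
    """Reward seen by the policy: task reward minus the priced cost."""
    return reward - lmbda * cost
```

With the numbers quoted in the abstract, an unconstrained episode cost of 355.4 against a budget of 50 pushes the multiplier up, while a cost of 15.5 lets it decay back toward zero.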

[MA-5] Large Language Models as Supervised Extraction Assistants: Lowering the Barrier to Documentation Standard Adoption in Agent-Based Modelling

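[Quick read]: This paper addresses limited reproducibility and transparency in Agent-Based Modelling (ABM), where documentation is effortful and often treated as supplementary. Although standards such as RAT-RS exist for reporting data use, their adoption remains low. The paper explores using Large Language Models (LLMs) to facilitate and partially automate documentation, focusing on the underused Rigour and Transparency Reporting Standard (RAT-RS) and using four LLMs to extract reports from a published ABM paper. It assesses consistency and performance across question types and finds that LLMs produce coherent outputs and are more reliable on descriptive than on explanatory or evaluative tasks, while still showing notable limitations. The paper therefore offers practical heuristics for when LLM-assisted documentation can be relied on and when human oversight is needed, and calls for systematic community-level exploration to improve rigour and standard adoption in ABM reporting.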

Link: https://arxiv.org/abs/2606.13749
Authors: Peer-Olaf Siebers,Christopher Frantz
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA)
Comments: 17 pages, accepted for publication at the Social Simulation Conference 2026, see this https URL

Abstract:Agent-Based Modelling (ABM) relies on clear documentation to ensure credibility and transparency. Although standards exist for documenting models (e.g. ODD), processes (e.g. TRACE, EABSS), and data use (e.g. RAT-RS), their adoption remains limited due to the effort required to produce documentation that is often treated as supplementary. This paper explores the use of Large Language Models (LLMs) to facilitate and partially automate such processes. We conduct a feasibility study focusing on the underused Rigour and Transparency Reporting Standard (RAT-RS), using four LLMs to extract reports from a published ABM paper. We assess consistency and performance across question types, finding that LLMs generate coherent outputs and perform more reliably on descriptive than on explanatory or evaluative tasks. While LLMs can improve reporting quality and consistency, they also exhibit notable limitations. We identify practical heuristics for when LLM-assisted documentation is reliable and when human oversight is needed and call for systematic community-level exploration to enhance rigour and adoption in ABM reporting.

[MA-6] TwinBI: An Agentic Digital Twin for Efficient Augmented Interactions with Business Intelligence Dashboards

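[Quick read]: This paper addresses the difficulty of keeping a consistent analytical state in business intelligence (BI) systems when users switch between direct dashboard manipulation and LLM-based natural-language queries. When users change filters, hierarchies, metrics, or chart context while also asking questions in natural language, the two modes often fall out of sync, and the analysis breaks or goes wrong. The key to the solution is TwinBI, an agentic digital-twin framework that couples an LLM-based agent system with an executable dashboard state. A shared analytical state is reconstructed from a unified interaction log, which unifies conversational interaction, dashboard manipulation, semantic grounding, and provenance tracking. TwinBI also exposes artifacts such as schema views, SQL, logs, and an /insights command for state-grounded analytical summaries. With the same backbone agent, TwinBI improves exact-match accuracy from 43.3% to 63.3% and partial-credit accuracy from 48.3% to 70.8%, and reduces the timeout rate from 40.0% to 10.0%. In a usability study, the integrated dashboard-and-chat workflow yielded high task accuracy, moderate workload, and favorable ratings for the state-aware interaction mechanisms. The core contribution is turning visible dashboard state into richer, actionable context, which improves both the agent's analytical reliability and user-facing support.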

Link: https://arxiv.org/abs/2606.13731
Authors: Jisoo Jang,Wen-Syan Li
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

Abstract:Business intelligence (BI) increasingly combines dashboard interaction with LLM-based assistance, but these two modes often fall out of sync during multi-step analysis. As users switch between direct dashboard manipulation and natural-language queries, it becomes difficult to preserve a consistent analytical state across filters, hierarchies, metrics, and chart context. We present TwinBI, an agentic digital-twin framework that couples an LLM-based agent system with an executable BI dashboard state. TwinBI unifies conversational interaction, dashboard manipulation, semantic grounding, and provenance tracking through a shared analytical state reconstructed from a unified interaction log. It also exposes artifacts such as schema views, SQL, logs, and an /insights command for state-grounded analytical summaries. We evaluate TwinBI in two complementary ways. In a controlled A/B benchmark with the same backbone agent, TwinBI improves exact-match accuracy from 43.3% to 63.3%, partial-credit accuracy from 48.3% to 70.8%, and substantially reduces timeout rate from 40.0% to 10.0% relative to Dashboard alone. In a usability study, participants benefited from the integrated dashboard-and-chat workflow, with high task accuracy, moderate workload, and favorable ratings for state-aware interaction mechanisms. These results suggest that TwinBI improves both agent-level analytical reliability and user-facing analytical support by turning visible dashboard state into richer actionable context. Our dataset and source code are available at: this https URL
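To make the idea of a shared analytical state reconstructed from a unified interaction log more concrete, here is a minimal replay sketch. The event schema (`source`, `type`, `field`, `value`) and the function name `reconstruct_state` are entirely hypothetical; the abstract does not describe TwinBI's actual log format.

```python
def reconstruct_state(log):
    """Replay dashboard and chat events in order to rebuild one shared state.

    Each event is assumed to be a dict with a "source" ("dashboard" or
    "chat") and a "type"; provenance records who last changed each item."""
    state = {"filters": {}, "metric": None, "chart": None, "provenance": {}}
    for event in log:
        kind, source = event["type"], event["source"]
        if kind == "set_filter":
            state["filters"][event["field"]] = event["value"]
            state["provenance"]["filter:" + event["field"]] = source
        elif kind == "clear_filter":
            state["filters"].pop(event["field"], None)
            state["provenance"]["filter:" + event["field"]] = source
        elif kind in ("set_metric", "set_chart"):
            key = kind[len("set_"):]  # "metric" or "chart"
            state[key] = event["value"]
            state["provenance"][key] = source
        else:
            raise ValueError(f"unknown event type: {kind}")
    return state
```

Because both interaction modes write to the same log in this sketch, a filter set on the dashboard and later cleared through chat yields a single consistent state, with the provenance entry showing that chat made the last change.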

[MA-7] YeasierAgent: Agentic Social Sandbox as a Canvas for Intent-Driven Creation of Platform-Agnostic Symbiotic Agent-Native Applications

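[Quick read]: This paper challenges the conventional device-coupled model of software, in which applications are tied to a specific platform and a fixed interface layout, which makes flexible cross-platform deployment and natural human-agent collaboration difficult. Its solution is YeasierAgent, an application-building paradigm based on symbiotic agents, narrative worlds, and scene-aware interaction. The key points are: platform-agnostic interactive units (agents, scenes, dialogue) replace fixed graphical layouts, enabling rapid cross-platform construction of agent-native applications; and the emotional-companionship and practical tool-execution attributes of agents are unified in a single experiential sandbox. This supports a shift from isolated, tool-specific chatbots toward cohesive, socially embedded computational environments and formalizes the category of Symbiotic Agent-Native Applications.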

Link: https://arxiv.org/abs/2606.13722
Authors: Jory He
Affiliations: Yeaier AI (易言科技)
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

Abstract:This paper introduces YeasierAgent, an application-building paradigm based on symbiotic agents, narrative worlds, and scene-aware interaction. It challenges the conventional device-coupled model of software by redefining applications as collaborative spaces among users, agents, and worlds. We present a system architecture that achieves two primary contributions: (1) enabling the rapid, cross-platform construction of agent-native applications by utilizing platform-agnostic interactive units (agents, scenes, dialogue) rather than fixed graphical layouts; and (2) unifying the emotional companionship and practical tool execution attributes of intelligent agents within a single experiential sandbox. By integrating automated generation, user-created worlds, and spatial multi-agent collaboration, YeasierAgent formalizes the category of Symbiotic Agent-Native Applications, demonstrating a shift from isolated, tool-specific chatbots toward cohesive, socially embedded computational environments.

[MA-8] WorkBench Revisited: Workplace Agents Two Years On

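[Quick read]: This paper examines the relationship between capability and safety for LLM agents on complex workplace tasks, in particular the irreversible risk posed by unintended harmful actions such as emailing the wrong person. Even as task completion rates rise, erroneous actions can still have serious consequences. The central finding is that on the WorkBench benchmark, capability and safety go together rather than trade off: frontier agents such as Claude Opus 4.8 that complete the most tasks also take the fewest unintended harmful actions. Several classes of error have been eliminated, but basic mistakes such as sending an email to the wrong recipient remain and are the main safety risk. The rise of open-weight models has sharply lowered the cost of a performance level previously available only from proprietary models, while frontier costs have stayed relatively stable. The paper releases an updated benchmark with data and code quality improvements, new model scores, and an analysis of agent progress from 2024 to 2026.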

Link: https://arxiv.org/abs/2606.13715
Authors: Olly Styles
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: 8 pages, 3 figures. Follow-up to arXiv:2405.00823

Abstract:The best agent on WorkBench in March 2024, GPT-4, completed 43% of tasks and took an unintended harmful action, such as emailing the wrong person, on 26% of them. We re-visit the benchmark in June 2026 and find that the best agent to date, Claude Opus 4.8, completes 89% and takes an unintended harmful action on 2.5%. Aside from this considerable progress in frontier agent performance, three things stand out. First, capability and safety go together on WorkBench rather than trade off, so the models that finish the most tasks also do the least unintended damage. Second, while several classes of error have been totally eliminated, frontier models still make some basic mistakes that occasionally result in irreversible harm, such as sending an email to the wrong person. Third, the rise of open-weight models has drastically lowered costs for a performance level that was previously only accessible to proprietary models, while frontier costs have stayed relatively stable. We release an updated version of the benchmark with data and code quality improvements, new model scores, and analysis of agent progress on WorkBench since 2024.

[MA-9] AGORA: Can Deliberation and Governance Gates Absorb Participation Bias in Transit Planning?

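[Quick read]: This paper addresses decision bias in transit network design caused by unbalanced public participation: feedback gathered at public hearings typically comes from self-selected attendees, so participant mix becomes an uncontrolled variable that affects the fairness and robustness of the resulting plan. The key to the solution is AGORA, a framework that holds the network, demand, and solver fixed while systematically varying meeting composition (simulated with stakeholder agents), structured deliberation, and governance gates, so that the effect of composition can be studied under controlled conditions. The findings are: (i) aggregate outcomes vary little across compositions, but representative sampling still tends to outperform skewed compositions on tail risk and fairness disparity; (ii) without structured deliberation, composition has no effect on outcomes, showing that deliberation is the mechanism through which who attends affects outcomes; and (iii) governance gates compress cross-profile variance, but their effectiveness depends on instance-specific calibration, as the low acceptance on Mumford0 shows. The study reframes participation bias from an uncontrollable input to a process-design problem: even without guaranteed representative attendance, well-designed deliberation and governance rules can substantially reduce how much outcomes depend on who is in the room.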

Link: https://arxiv.org/abs/2606.13696
Authors: Jung-Hoon Cho,Cathy Wu
Affiliations: Massachusetts Institute of Technology
Categories: Computers and Society (cs.CY); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Social and Information Networks (cs.SI)
Comments:

Abstract:Transit network design depends not only on the optimization algorithm but also on who shows up to the public hearing. Current practice often collects one-directional comments from self-selected attendees, leaving participant mix as an uncontrolled source of outcome variation. We present AGORA, a framework that holds the network, demand, and solver fixed while systematically varying meeting composition through stakeholder agents, structured deliberation, and governance gates. Across two standard benchmark networks at different scales, we find that (i) aggregate outcomes vary little across compositions, but on tail risk and fairness disparity, representative sampling still tends to outperform skewed compositions; (ii) without deliberation, composition produces no variation at all, showing that deliberation is the mechanism through which who attends affects outcomes; and (iii) governance gates compress cross-profile variance without shifting the average outcome on Mandl, but low acceptance on Mumford0 shows thresholds require instance-specific calibration. These findings reframe participation bias from an uncontrollable input to a process-design problem: even without guaranteed representative attendance, well-structured deliberation and governance criteria can substantially reduce how much outcomes depend on who is in the room.

Natural Language Processing

[NLP-0] Gaze Heads: How VLMs Look at What They Describe

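[Quick read]: This paper investigates how a vision-language model (VLM) internally handles image description, in particular how it selects and attends to specific image regions as it generates text. The core finding is a small set of attention heads in the language-model backbone, called gaze heads, whose attention tracks the image region currently being described. The heads are identified with a simple correlation score computed from a few forward passes. A single attention-mask intervention on the top-100 gaze heads (fewer than 9% of all heads) steers the model's answer to any chosen comic panel at 83.1% accuracy. The intervention also supports continuous control: switching the gaze target mid-generation makes the model wrap up its current panel description and move to the new region within a few tokens. The mechanism recurs across model sizes from 2B to 32B parameters and across several VLM architectures, and the same intervention redirects answers to chosen regions in natural images (e.g., COCO) without fine-tuning. This shows that targeted edits identified through mechanistic analysis can serve as practical inference-time levers for steering multimodal model behavior.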

Link: https://arxiv.org/abs/2606.14703
Authors: Rohit Gandikota,David Bau
Affiliations: Northeastern University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:How a vision-language model internally solves the task of describing an image is far from obvious. We find that the model develops a specific mechanism for this: a small set of attention heads in its language-model backbone, which we call gaze heads, whose attention tracks the image region the model is currently describing. We find them with a simple correlation score from a few forward passes, using comic strips as a controlled testbed where narrative order is laid out spatially. These gaze heads do not just track the image tokens being described: redirecting their attention to a chosen region forces the VLM to describe that region instead. A single attention-mask intervention on the top-100 gaze heads, fewer than 9% of all heads, steers the model’s answer to any chosen comic panel at 83.1% accuracy, while the same intervention on random heads fails to redirect the answer, and intervening on all heads destroys generation. The same lever also extends to continuous control: switching the gaze target mid-generation makes the model wrap up its current panel description and move to the new one within a few tokens. Beyond comics, the same intervention redirects answers to chosen regions in natural COCO images. The mechanism further recurs across model sizes from 2B to 32B parameters and across other VLM architectures, although some frozen-encoder families show no comparable head set. More broadly, this shows that targeted edits identified through mechanistic analysis can serve as practical inference-time levers for steering multimodal model behavior, without any retraining. Our code, interactive demo, and datasets are available at this https URL
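The abstract says gaze heads are found with a simple correlation score but does not define it. The sketch below substitutes a plainer stand-in, the fraction of generation steps at which a head's most-attended region is the one being described, purely to illustrate how heads might be ranked. The function names and the nested-list layout of `attn` are assumptions.

```python
def gaze_scores(attn, described):
    """Score each head by how often its most-attended region is the one
    being described (an agreement rate, used here as a stand-in score).

    attn[h][t][r]: attention mass head h puts on image region r at step t.
    described[t]:  index of the region the model is describing at step t.
    """
    scores = []
    for head in attn:
        hits = 0
        for step, row in enumerate(head):
            top_region = max(range(len(row)), key=row.__getitem__)
            if top_region == described[step]:
                hits += 1
        scores.append(hits / len(described))
    return scores


def top_k_heads(scores, k):
    """Indices of the k highest-scoring heads, best first."""
    return sorted(range(len(scores)), key=lambda h: scores[h], reverse=True)[:k]
```

The heads returned by `top_k_heads` would be the candidates for an attention-mask intervention of the kind the abstract describes; the intervention itself depends on the model implementation and is not sketched here.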

[NLP-1] ClinHallu: A Benchmark for Diagnosing Stage-Wise Hallucinations in Medical MLLM Reasoning

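[Quick read]: This paper addresses the loss of trust in medical multimodal large language models (MLLMs) for clinical decision support caused by hallucinations in the reasoning process. Existing medical hallucination benchmarks mainly focus on data collection and overlook where in the reasoning chain hallucinations originate. The study finds that hallucination sources vary across samples: errors may arise from visual misrecognition, incorrect medical knowledge recall, or flawed reasoning integration. To enable source-level diagnosis, the paper introduces ClinHallu, a benchmark for stage-wise hallucination diagnosis in medical MLLM reasoning. ClinHallu contains 7,031 validated instances, each with a structured reasoning trace decomposed into Visual Recognition, Knowledge Recall, and Reasoning Integration, and uses stage-replacement interventions to measure how correcting a specific stage affects the final answer. Experiments show that trace-supervised fine-tuning reduces stage-wise hallucinations. The key contribution is a fine-grained diagnostic framework for locating and mitigating reasoning failures in medical MLLMs.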

Link: https://arxiv.org/abs/2606.14697
Authors: Sicheng Yang,Hangjie Yuan,Wenjun Zhang,Jinwang Wang,Yichen Qian,Weihua Chen,Fan Wang,Lei Zhu
Affiliations: DAMO Academy, Alibaba Group; Hupan Lab; The Hong Kong University of Science and Technology (Guangzhou); Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Code and datasets: this https URL

Abstract:Building trustworthy medical multimodal large language models (MLLMs) is critical for reliable clinical decision support. Existing medical hallucination benchmarks mainly focus on data collection, but often ignore where hallucinations originate within the reasoning process. We find that hallucination sources vary across samples: errors may arise from visual misrecognition, incorrect medical knowledge recall, or flawed reasoning integration. To enable source-level hallucination diagnosis, we introduce ClinHallu, a benchmark for stage-wise hallucination diagnosis in medical MLLM reasoning. ClinHallu contains 7,031 validated instances, where each instance is augmented with a structured reasoning trace decomposed into Visual Recognition, Knowledge Recall, and Reasoning Integration. We also use stage-replacement interventions to measure how correcting specific stages affects the final answer. Beyond evaluation, we show that trace-supervised fine-tuning reduces stage-wise hallucinations. ClinHallu provides a fine-grained hallucination testbed for diagnosing and mitigating reasoning failures in medical MLLMs. The benchmark is publicly available at this https URL.

[NLP-2] Persona-Pruner: Sculpting Lightweight Models for Role-Playing ICML2026

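[Quick read]: This paper addresses the efficiency bottleneck of role-playing chatbots in real-world applications, especially ecosystems where many non-player characters (NPCs) interact simultaneously and dedicating a full generalist language model to each persona is wasteful. The authors hypothesize that a specific character identity relies on only a fraction of the model's total capacity, yet naive pruning does not distinguish redundant knowledge from essential character traits and therefore severely degrades role-playing performance. The paper proposes Persona-Pruner, a framework that isolates persona-specific sub-networks from a single persona description, pruning the model selectively rather than uniformly. The method preserves role-playing consistency and style while substantially shrinking the model: on RoleBench, it reduces the performance drop from the dense model by up to 93.8% relative to the strongest baseline (LLM-as-a-judge score), while maintaining general LLM capabilities.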

Link: https://arxiv.org/abs/2606.14695
Authors: Jinsu Kim,Jihoon Tack,Noah Lee,Jongheon Jeong
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 25 pages; ICML 2026; Code is available at this https URL

Abstract:Language Models (LMs) have shown remarkable potential as role-playing chatbots, delivering consistent, stylized interactions when given a specification of a character or user persona. However, applying these capabilities to real-world applications (e.g., ecosystems with numerous NPCs interacting simultaneously) exposes a critical inefficiency due to the excessive computational cost. In this paper, we question the necessity of dedicating a full, generalist model to a single persona, hypothesizing that a specific character identity relies on only a fraction of the model’s total capacity. We observe that naively pruning LMs often severely degrades the role-playing performance for a specific persona; it does not distinguish between redundant knowledge and essential character traits. We propose Persona-Pruner, a framework that sculpts a lightweight role-playing model by isolating persona-specific sub-networks from a single description. Our experiments consistently show that Persona-Pruner preserves role-playing performance substantially more effectively than existing state-of-the-art LLM pruning techniques, reducing the performance drop from the dense model by up to 93.8% over the strongest baseline on RoleBench in LLM-as-a-judge score, while still maintaining general LLM capabilities. Code is available at this https URL.

[NLP-3] AdaSR: Adaptive Streaming Reasoning with Hierarchical Relative Policy Optimization

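[Quick read]: This paper addresses the weak reasoning ability of large reasoning models on dynamic inputs such as audio and video streams: the conventional read-then-think paradigm relies on a static context and fits poorly when information arrives continuously. The core challenge is making flexible, adaptive reasoning decisions during streaming input, including when to think and how much computation to allocate. The key to the solution is AdaSR (Adaptive Streaming Reasoning), a framework that reasons during input streaming and performs a final deliberation once the stream is complete, learning to adjust its reasoning strategy to balance computational efficiency and accuracy. Its main innovation is Hierarchical Relative Policy Optimization (HRPO), which decomposes policy optimization into a streaming reasoning phase and a deep reasoning phase, giving finer-grained advantage assignment instead of spreading a single sequence-level advantage uniformly over all tokens. HRPO combines format, accuracy, and adaptive thinking rewards to enforce valid reasoning protocols, preserve task performance, and encourage latency-aware computation allocation. Experiments show that AdaSR achieves a better balance among reasoning accuracy, computational efficiency, and streaming latency than a supervised fine-tuning baseline.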

Link: https://arxiv.org/abs/2606.14694
Authors: Junlong Tong,Wenqi Xu,Yingqi Fan,Anhao Zhao,Xuan Lu,Yang Tan,Xiaoyu Shen
Affiliations: Eastern Institute of Technology, Ningbo; Shanghai Jiao Tong University; The Hong Kong Polytechnic University; Southeast University; Xi’an Jiaotong-Liverpool University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large reasoning models typically follow a read-then-think paradigm: they observe the complete input, reason over a static context, and then produce the answer. Yet many real-world scenarios are inherently dynamic, such as audio and video streams, where information arrives as a continuous stream and models must reason, update, and respond under partial observations. Recent streaming reasoning methods allow models to think while reading, but they largely rely on supervised imitation of pre-constructed trajectories, which limits their flexibility. In this paper, we propose AdaSR, an adaptive streaming reasoning framework that enables models to reason during input streaming and perform final deliberation once the stream is complete, learning when to think and how much computation to allocate across different stages. To optimize this hierarchical reasoning process, we introduce Hierarchical Relative Policy Optimization (HRPO), which decomposes policy optimization into streaming reasoning and deep reasoning phases, providing more fine-grained advantage assignment instead of uniformly distributing a single sequence-level advantage over all tokens. HRPO integrates format, accuracy, and adaptive thinking rewards to enforce valid reasoning protocols, preserve final task performance, and encourage latency-aware computation allocation. Experiments show that AdaSR achieves a better balance among reasoning accuracy, computational efficiency, and streaming latency compared with a supervised fine-tuning baseline. We release our code at this https URL.
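The contrast between a single sequence-level advantage and per-phase advantages can be sketched as below. This is not the HRPO objective itself, whose details are not given in the abstract; it assumes GRPO-style group normalization applied separately to a streaming-phase reward and a final-phase reward, and all function names are hypothetical.

```python
from statistics import mean, pstdev


def group_relative(rewards, eps=1e-8):
    """Group-normalized advantages: (reward - group mean) / group std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_advantages(stream_adv, final_adv, phases):
    """Give each token the advantage of the phase it belongs to,
    instead of one shared sequence-level value.

    phases: per-token labels, either "stream" or "final"."""
    return [stream_adv if p == "stream" else final_adv for p in phases]
```

In this sketch a rollout whose streaming-phase reasoning was better than the group average but whose final deliberation was worse receives a positive advantage on its streaming tokens and a negative one on its final tokens, which a single sequence-level advantage cannot express.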

[NLP-4] CORA: Analyzing and bridging thinking-answer gap in Multimodal RLVR via Consistency-Oriented Reasoning Alignment EMNLP2026

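[Quick read]: This paper addresses thinking-answer semantic inconsistency in large vision-language models (LVLMs) trained with Reinforcement Learning with Verifiable Rewards (RLVR), where the reasoning process and the final answer diverge semantically. Existing methods focus on improving the visual coverage of reasoning traces and mitigating visual hallucinations, but pay too little attention to consistency between the reasoning and the final output, so a model can produce a plausible-looking reasoning chain that does not match its answer. The paper proposes Consistency-Oriented Reasoning Alignment (CORA), a lightweight, plug-and-play approach: a dedicated consistency reward model explicitly scores semantic consistency between the reasoning process and the final answer, and Hybrid Reward Advantage Splitting (HRAS) stably coordinates the task and consistency objectives. Experiments on representative multimodal reasoning benchmarks and mainstream LVLMs show that CORA improves task performance while effectively mitigating thinking-answer inconsistency, yielding more faithful reasoning traces.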

Link: https://arxiv.org/abs/2606.14691
Authors: Jiayue Cao, Zhicong Lu, Xuehan Sun, Wei Jia, Hongling Zheng, Changyuan Tian, Zichuan Lin, Wenqian Lv, Nayu Liu
Affiliations: University of Chinese Academy of Sciences; Wuhan University; Tsinghua University; Tianjin University
Categories: Computation and Language (cs.CL)
Comments: Submitted to EMNLP 2026

Abstract:Reinforcement learning with verifiable rewards (RLVR) has successfully elicited the reasoning capabilities of large language models, motivating its extension to multimodal scenarios. Existing methods primarily focus on improving the visual coverage of reasoning traces and mitigating visual hallucinations, but underestimate the semantic inconsistency between the reasoning process and the final answer. In this paper, we delve into thinking-answer inconsistency in RLVR for large vision-language models (LVLMs), showing through analyses of rollouts collected throughout the Group Relative Policy Optimization (GRPO) training process and of post-RLVR evaluation outputs that this issue persists during training and remains present at inference. Motivated by the analysis, we propose Consistency-Oriented Reasoning Alignment (CORA), which introduces thinking-answer semantic consistency into RLVR through a lightweight plug-and-play consistency reward model, and further incorporates Hybrid Reward Advantage Splitting (HRAS) to stably coordinate task and consistency optimization. Extensive experiments across representative multimodal reasoning benchmarks and mainstream LVLMs show that CORA improves task performance while effectively mitigating thinking-answer inconsistency, leading to more faithful reasoning traces.

[NLP-5] Flood and Harvest: The Provable Necessity of Trivia for Generating Valuable Mathematics via the Lens of Language Generation in the Limit

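[Quick Read]: This paper addresses a "value bottleneck" in AI-generated formal mathematics: systems coupled to proof assistants can now generate checker-verified formal mathematics at scale, yet the gap between what a proof checker can verify and what a mathematician would value has become the binding constraint. The key to the approach is a theoretical model of nested language generation in the limit: a verifiable formal language $F$, accessed through a membership oracle (the proof checker), contains an unknown valuable language $H \subseteq F$ that is revealed only through an adversarial enumeration of a core $C \subseteq H$ (the existing literature), where $\alpha$ is the exact density of that core. The paper establishes four results. First, the verifier is not taste: the collections that admit generation with breadth are exactly those of the oracle-free model, characterized fiber-wise by Angluin's condition. Second, the verifier does buy sound coverage, i.e., covering all unseen valuable statements while asserting only valid ones, which is impossible without it; it relocates unavoidable errors from "false" to "trivial". Third, and centrally, there is a sharp dichotomy on the tight family: generators that emit only finitely many trivial statements achieve optimal coverage $\alpha/2$, whereas any infinite trivia allowance, even at vanishing rate, raises the optimum to $1 - \alpha/2$; both bounds are attainable, the transition depends on the count of trivia rather than their rate, and the gap $1 - \alpha$ is exactly the unrecorded mass. Fourth, both regimes can be instantiated in a compression model of mathematics. The conclusion is that a perfect verifier cannot substitute for human judgment of value: the unbounded stream of correct-but-worthless statements is not an engineering defect but a provable necessity, because covering unrecorded valuable mathematics requires an infinite, but asymptotically negligible, stream of certified trivia.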

Link: https://arxiv.org/abs/2606.14688
Authors: Xiaoyu Li, Andi Han, Dai Shi, Zheng Gao, Jiaojiao Jiang, Junbin Gao
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Data Structures and Algorithms (cs.DS)
Comments:

Abstract:AI systems coupled to proof assistants now generate formal mathematics at scale, and the gap between what a checker can verify and what a mathematician would value has become the binding constraint. We model the generation of valuable mathematics as nested language generation in the limit: a verifiable formal language $F$, accessed through a membership oracle (the proof checker), contains an unknown valuable language $H \in \mathcal{H}$ revealed only through an adversarial enumeration of a core $C \subseteq H$ of exact density $\alpha$ (the literature). Every output is valuable ($\in H$), trivial ($\in F \setminus H$), or a hallucination ($\notin F$). We settle four questions. First, the verifier is not taste: the collections admitting generation with breadth are exactly those of the oracle-free model, characterized fiber-wise by Angluin's condition. Second, the verifier does buy sound coverage, covering all unseen valuable statements while asserting only valid ones: possible with it, impossible without it; it relocates unavoidable errors from false to trivial. Third, and centrally, a sharp dichotomy on the tight family: generators emitting finitely many trivia achieve optimal coverage $\alpha/2$, while any infinite trivia allowance, even at vanishing rate, jumps the optimum to $1-\alpha/2$ (both tight, for cores presented as the candidate intersection), and one generator attains both ends. The transition is in trivia count, not rate; the gap $1-\alpha$ is the unrecorded mass. Fourth, both regimes instantiate in a compression model of mathematics. A perfect verifier cannot substitute for taste: the unbounded stream of correct-but-worthless statements is not an engineering accident but a provable necessity, since covering unrecorded valuable mathematics requires an infinite, but asymptotically negligible, stream of certified trivia.
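
The size of the jump in the third result follows directly from the two coverage values quoted in the abstract. The value $\alpha = 0.4$ below is an arbitrary choice for illustration and does not come from the paper.

```latex
\[
\underbrace{\Bigl(1 - \tfrac{\alpha}{2}\Bigr)}_{\text{infinitely many trivia allowed}}
\;-\;
\underbrace{\tfrac{\alpha}{2}}_{\text{finitely many trivia}}
\;=\; 1 - \alpha ,
\qquad
\text{e.g. } \alpha = 0.4:\quad 0.8 - 0.2 = 0.6 .
\]
```

The smaller the recorded core (small $\alpha$), the larger the share of coverage that is reachable only by allowing an unbounded number of certified trivia.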

[NLP-6] AgentSpec: Understanding Embodied Agent Scaffolds Through Controlled Composition

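[Quick Read]: This paper addresses the low modularity and tight coupling of current LLM agent systems: scaffolds are usually built as tightly coupled pipelines, which makes it hard to isolate each component's contribution, compare alternative designs, or understand how module interactions shape agent behavior. The key to the solution is AgentSpec, a modular specification framework that represents embodied agents as typed compositions of reusable policy components with standardized interfaces. It standardizes the interfaces among perception, memory, reasoning, reflection, action execution, and optional learning, so components can be swapped and recombined under controlled conditions. Instantiating the framework on DeliveryBench, ALFRED, MiniGrid, and RoboTHOR, the study finds that agent performance is governed by scaffold compatibility and interaction effects rather than by the strength of any single module. Specifically, structured multi-granularity memory improves long-horizon state tracking, reasoning and memory interact non-uniformly across environments, reflection trades off correction against cost, and RL-trained policies compose best when optimized with the deployment-time scaffold structure. AgentSpec provides a controlled foundation for studying, comparing, and designing composable LLM agents.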

Link: https://arxiv.org/abs/2606.14674
Authors: Jixuan Chen, Jianzhi Shen, Haoqiang Kang, Zhi Hong, Qingyi Jiang, Soham Bose, Yiming Zhang, Leon Leng, Amit Vyas, Lingjun Mao, Siru Ouyang, Kun Zhou, Lianhui Qin
Affiliations: University of California, San Diego; Johns Hopkins University; University of Washington; University of Illinois Urbana-Champaign
Categories: Computation and Language (cs.CL)
Comments:

Abstract:LLM agents are increasingly built not as single model calls, but as scaffolded systems that combine reasoning, memory, reflection, action execution, and learning. While such scaffolds often improve performance, they are often embedded in tightly coupled pipelines, making it difficult to isolate component contributions, compare alternative designs, or understand how module interactions shape agent behavior. We introduce AgentSpec, a modular specification framework that represents embodied agents as typed compositions of reusable policy components with standardized interfaces. AgentSpec standardizes the interfaces among perception, memory, reasoning, reflection, action, and optional learning, enabling components to be swapped and recombined under controlled conditions. We instantiate this framework across DeliveryBench, ALFRED, MiniGrid, and RoboTHOR, and analyze reasoning, memory, reflection, and reinforcement-learning modules across model backbones. Our results show that agent performance is governed by scaffold compatibility and interaction effects rather than isolated module strength. In particular, structured multi-granularity memory improves long-horizon state tracking, reasoning and memory interact non-uniformly across environments, reflection trades off correction and cost, and RL-trained policies compose best when optimized with deployment-time scaffold structure. AgentSpec provides a controlled foundation for studying, comparing, and designing composable LLM agents. Our code, baselines and interactive playground are publicly available at this https URL.

[NLP-7] Towards Direct Latent-Space Synthesis for Parallel Branches in LLM-Agent Workflows

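[Quick Read]: This paper addresses an efficiency bottleneck in LLM-based agent systems that run multiple parallel branches. Existing systems typically merge branch outputs by sequential text concatenation, which discards the parallel structure of the workflow and incurs redundant prefill computation. The proposed solution is Parallel-Synthesis, a plug-and-play framework whose key idea is to let the synthesizer directly consume the key-value (KV) caches produced by parallel worker agents. It has two core components: a cache mapper that calibrates the independently generated branch caches, and a fine-tuned synthesizer adapter that enables generation from this non-sequential cache interface. Training data exposes the synthesizer to parallel cache contexts, teaches aggregation across cached branches, and distills reasoning behavior from standard text-concatenation-based synthesis. Across nine downstream datasets (math, science QA, code generation, GAIA, and multi-agent database diagnosis), Parallel-Synthesis matches or outperforms text-based synthesis on seven and remains close on the other two. It also reduces time-to-first-token by 2.5x to 11x, suggesting that direct cache-based synthesis is a promising, more native and efficient interface for integrating parallel agent branches.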

Link: https://arxiv.org/abs/2606.14672
Authors: Shikun Liu, Mufei Li, Dongqi Fu, Haoyu Wang, Yinglong Xia, Hong Li, Hong Yan, Pan Li
Affiliations: Georgia Institute of Technology
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Large language models increasingly serve as execution engines for agentic systems, yet they still consume context through a sequential text interface. This creates a mismatch with modern structured agent workflows, in which independent branches explore subtasks, retrieve evidence, or generate candidate solutions before a final synthesis step. Existing systems typically merge these branches by concatenating their textual outputs, which discards the parallel structure and incurs redundant prefill computation. In this work, we introduce Parallel-Synthesis, a plug-and-play framework that enables a synthesizer to directly consume the KV caches produced by parallel worker agents. Parallel-Synthesis combines a cache mapper that calibrates independently generated branch caches with a fine-tuned synthesizer adapter that enables generation from this non-sequential cache interface. We train Parallel-Synthesis using data that exposes the synthesizer to parallel cache contexts, teaches aggregation across cached branches, and distills reasoning behavior from standard text-concatenation-based synthesis. Across nine downstream datasets spanning math, science QA, code generation, GAIA, and multi-agent database diagnosis, Parallel-Synthesis matches or outperforms text-based synthesis on seven datasets and remains close on the other two. It also reduces time-to-first-token by 2.5x-11x, suggesting that direct cache-based synthesis is a promising interface for more native and efficient synthesis over parallel agent branches.

[NLP-8] Abstracting Cross-Domain Action Sequences into Interpretable Workflows

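[Quick Read]: This paper addresses the difficulty of extracting users' actual workflows and behavior patterns from low-level, noisy interaction logs of digital applications (e.g., browser action records and MOOC learning behavior); prior deep learning methods are sensitive to noise and generalize poorly across applications. The core solution is WorkflowView, a framework that uses large language models (LLMs) to abstract raw fine-grained action sequences into high-level, semantically interpretable workflow activities. The key is to rely on the contextual understanding and inductive abilities of LLMs, which the authors evaluate on three tasks: zero-shot task description reconstruction from browser logs, few-shot student dropout prediction from MOOC logs, and anonymized, privacy-preserving analysis of AI tool integration in document workflows. The results indicate that the framework extracts meaningful behavioral insights across domains, offering an efficient and robust way to improve digital products based on real user interactions. The paper also discusses practical issues in embedding LLM inference into logging infrastructure, including computational efficiency and user privacy.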

Link: https://arxiv.org/abs/2606.14654
Authors: Gaurav Verma, Scott Counts
Affiliations: Microsoft Corporation
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: preprint; 9 pages, 5 figures

Abstract:Sequential or time-stamped interaction logs provide objective records of digital application usage, yet their granularity and noise often obscure meaningful insights into people's work. Such insights are essential for improving digital products in ways grounded in real-world user interactions. Prior research has applied deep learning models to cluster user actions into high-level activities, but these approaches are highly sensitive to noise and struggle to generalize across applications. To address this limitation, we introduce WorkflowView, a framework that uses large language models (LLMs) to abstract low-level action sequences into high-level activities. We establish the effectiveness and generality of our approach across three distinct, challenging sequential tasks and diverse domains: (a) zero-shot task description reconstruction from browser logs (achieving high semantic similarity, $\mu_{\text{sim}} = 0.91$), (b) few-shot student dropout prediction using MOOC interaction logs (reaching weighted $F_1 = 0.90$ with only five few-shot examples), and (c) anonymized, privacy-preserving analysis of AI tool integration within document workflows in Microsoft Word. Our work demonstrates that LLM-based abstraction is a robust and efficient path forward for transforming low-level behavioral data into high-level, interpretable, and actionable insights. We also discuss practical considerations for deploying LLM-based inferences within logging infrastructures, including computational efficiency and user privacy.

[NLP-9] Characterizing Cultural Localization in AI-Generated Stories ACL2026

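[Quick Read]: This paper asks whether AI-generated content is only superficially localized across cultures, focusing on how far cultural markers and narrative structure are localized in generated stories. The central challenge is to distinguish templated localization, which achieves surface localization by swapping cultural markers such as names and places into a generic narrative, from holistic localization, which also varies plots, values, and themes. The key to the solution is a quantitative measure: identify the lexical tokens that distinguish stories across nationalities, remove them, and analyze the similarity of the narratives that remain. In stories generated by five models on 125 topics for 193 nationalities, only 9-17% of the vocabulary accounts for variation across nationalities, and the residual narratives contain repeated multi-word sequences, which suggests a shared, culturally agnostic narrative template. An assessment of the markers' stereotypicality and offensiveness further finds that markers from 19 countries, mostly in the Global South, are on average offensive.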

Link: https://arxiv.org/abs/2606.14626
Authors: Shaily Bhatt, Supriti Vijay, Jeremiah Milbauer, Fernando Diaz
Affiliations: Carnegie Mellon University
Categories: Computation and Language (cs.CL)
Comments: Accepted to the 4th Workshop on Cross-Cultural Considerations in NLP (C3NLP) Co-located with ACL 2026, San Diego, USA (non-archival)

Abstract:The global use of artificial intelligence has increased interest in assessing the ability to generate culturally localized content, including stories. Cultural localization in stories often occurs through either templated localization – the use of cultural markers (e.g., names, locations) in a generic narrative – or holistic localization – the variation of plots, values, and themes, in addition to cultural markers. We propose a method to measure the degree to which content was generated through templated localization. Specifically, we identify the lexical tokens that distinguish stories across nationalities and measure the similarity of the narratives that remain after removing them. In stories generated by five models on 125 topics for 193 nationalities, our method is able to detect that only a small subset (9-17%) of the vocabulary accounts for the variation across nationalities and that the narratives that remain after removing them contain repeated multi-word sequences, suggesting the presence of a shared culturally-agnostic narrative template. Finally, we characterize the cultural markers for their stereotypicality and offensiveness, finding that markers from 19 countries, mostly located in the Global South, are on average offensive.
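
The measurement idea in the abstract (find the tokens that distinguish nationalities, remove them, compare what remains) can be sketched as follows. The abstract does not specify the selection criterion or the similarity measure, so both are stand-ins here: a token counts as distinguishing if it appears in only one nationality's story, and similarity is Jaccard overlap of word trigrams. The two stories are invented toy data.

```python
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def distinguishing_tokens(stories):
    """Tokens used in exactly one nationality's story (a simple stand-in criterion)."""
    doc_freq = Counter()
    for text in stories.values():
        doc_freq.update(set(tokenize(text)))
    return {tok for tok, df in doc_freq.items() if df == 1}

def residual(text, markers):
    """The narrative that remains once distinguishing tokens are removed."""
    return [t for t in tokenize(text) if t not in markers]

def ngram_jaccard(a, b, n=3):
    """Jaccard overlap between the word n-gram sets of two token lists."""
    def grams(tokens):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    ga, gb = grams(a), grams(b)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 0.0

stories = {
    "Japan": "Yuki walked to the market in Kyoto and bought fresh bread for her family",
    "Brazil": "Ana walked to the market in Recife and bought fresh bread for her family",
}
markers = distinguishing_tokens(stories)
residuals = {k: residual(v, markers) for k, v in stories.items()}
similarity = ngram_jaccard(residuals["Japan"], residuals["Brazil"])
vocabulary = {t for s in stories.values() for t in tokenize(s)}
marker_share = len(markers) / len(vocabulary)
```

In this toy pair the markers are a quarter of the vocabulary and the residual narratives are identical, which is the signature of templated localization.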

[NLP-10] LoSoNA: A Benchmark for Local Social Norm Adaptation in Group Conversations

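[Quick Read]: This paper addresses the limited ability of LLMs to recognize and adapt to implicit local social norms in multi-user online group chats. Such unstated local conversational conventions are common in group chats, yet the ability and willingness of LLM-based agents to follow them remains mostly unexplored. The authors propose the LoSoNA benchmark: each scenario gives the subject model a curated group-chat transcript in which the other participants demonstrate a hidden norm, followed by a final elicitor turn that forces a response revealing whether the model has inferred that norm. The key is an evaluation setup that tests whether a model can infer implicit social rules from prior conversation and apply them immediately, while comparing prompting strategies (e.g., explicitly emphasizing norm awareness) to expose the limitations and potential of current frontier and open-weight models.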

Link: https://arxiv.org/abs/2606.14600
Authors: Mateusz Winiarek, Maksymilian Bilski, Mateusz Jacniacki
Affiliations: Humalike Research
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Online group chats are social spaces with local conversational norms that are rarely stated explicitly. The ability and willingness of LLM-based agents to recognize and adapt to these norms remains mostly unexplored. We introduce LoSoNA, a benchmark for local social norm adaptation in multi-party chat. Each scenario gives a subject model a curated group-chat transcript in which non-subject participants demonstrate a hidden local norm, followed by a final elicitor turn that forces a response revealing whether the subject has inferred that norm. We evaluate eight frontier and open-weight models under four prompting conditions that vary how explicitly the model is told to treat the prior conversation as evidence for how it should answer. Naive prompting remains limited for most models; explicit norm-aware prompting helps unevenly, with Gemini 3.1 Pro reaching 84.2% and Claude Fable 5 reaching 81.6%, while several other models show small gains or regressions. LoSoNA contributes to recent calls for evaluating LLM social capabilities by testing whether models can infer local conversational norms from precedent and use them in a one-turn group-chat response.

[NLP-11] Persuasion Index: A Theory-Guided Framework for Persuasion Analysis

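[Quick Read]: This paper addresses the identification of persuasive rhetorical cues across domains, with applications ranging from detecting information manipulation and improving AI safety to public health communication. To provide a unified theoretical framework and interpretability, it proposes the Persuasion Index (PI), a taxonomy of 15 dimensions grounded in persuasion theories from psychology and communication, with a transparent implementation using 55 sub-features built from lexicons and rule-based detectors. The key points are: the taxonomy is modular and theory-driven, so individual detectors can be replaced without breaking the overall structure; and evaluation on four public datasets that differ in domain, style, and outcome measures shows that PI provides a shared feature space for interpreting rhetorical patterns associated with persuasion-related outcomes. Linear models show that PI features carry meaningful predictive signal at low computational cost; dimension-level analyses reveal recurring associations between dimensions and persuasion outcomes, along with topic- and stance-specific variation. PI is released as an open-source package and web interface for principled, auditable analysis of human and AI-mediated communication.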

Link: https://arxiv.org/abs/2606.14580
Authors: Liancheng Gong, Zhiyang Wang, Yiwei Xu, Julia Mendelsohn
Affiliations: University of Maryland, College Park; New York University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Identifying persuasive rhetorical cues is critical across domains, from detecting information manipulation and improving AI safety to advancing public health communication. We propose Persuasion Index (PI), a taxonomy of 15 dimensions grounded in persuasion theories from psychology and communication, and one transparent implementation using 55 sub-features built from lexicons and rule-based detectors. The taxonomy is modular: individual detectors can be replaced while preserving the theoretical structure. By evaluating PI on four public datasets varying in domain, style, and outcome measures, we show that PI provides a shared feature space for interpreting rhetorical patterns associated with persuasion-related outcomes. Linear models show that PI features carry meaningful predictive signal while remaining computationally lightweight. Dimension-level analyses reveal recurring associations between PI dimensions and persuasion outcomes across datasets, while also highlighting topic- and stance-specific variation. We release PI as an open-source package and web interface for principled and auditable analysis of human and AI-mediated communication.

[NLP-12] SIMMER: Benchmarking Latent Failures in LLM Executable Planning with a World Model

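[Quick Read]: This paper addresses latent failures in LLM planning for autonomous household agents, a failure type that existing benchmarks overlook. Unlike immediate failures, which surface at execution time and can be corrected, latent failures do not halt plan execution but silently compromise goal achievement, and in severe cases cause irreversible harm. To fill this gap, the paper proposes the SIMMER benchmark, built on a human-curated symbolic world model of the kitchen domain with 77 actions, 262 unique objects, and approximately 46,800 semantically realistic interactions derived from real-world cooking scripts. A state machine executor validates generated plans against the world model and detects immediate precondition violations, latent hazards, and irreversible failures. Experiments show that even frontier LLMs produce at most 17% error-free plans, and up to 56% of plans contain latent failures, most of which lead to irreversible consequences. Explicit state reasoning via counterfactual foresight simulation reduces latent failures by up to 72% and irreversible cases by up to 75%, pointing to a promising direction for more robust LLM planners.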

Link: https://arxiv.org/abs/2606.14574
Authors: Xiaoxin Lu, Ranran Haoran Zhang, Rui Zhang
Affiliations: The Pennsylvania State University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) are increasingly deployed as planners for autonomous agents in household environments. While existing benchmarks evaluate whether LLM-generated plans execute successfully, they overlook a critical type of failure: latent failures. Unlike immediate failures that trigger instant feedback at execution time and enable timely correction, latent failures do not immediately halt plan execution but silently compromise goal achievement. In severe cases, they cause irreversible harm. To address this gap, we introduce SIMMER, a benchmark for evaluating latent failures in LLM planning through a human-curated symbolic world model grounded in the kitchen domain. SIMMER defines a world model comprising 77 actions, 262 unique objects, and approximately 46,800 possible interactions that are semantically realistic, derived from real-world cooking scripts. It then leverages a state machine executor that validates plans against the world model and detects immediate precondition violations, latent hazards, and irreversible failures. Experiments across six LLMs show that even frontier models achieve at most 17% error-free plans. Moreover, up to 56% of plans contain latent failures, the majority of which lead to irreversible consequences. We further demonstrate that explicit state reasoning via counterfactual foresight simulation can reduce latent failures by up to 72% and irreversible cases by up to 75%, suggesting a promising direction for more robust LLM planners.
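
The three failure types that the executor distinguishes can be illustrated with a toy state machine. Everything below is hypothetical: the action names, facts, and hazard rules are invented for illustration and are not taken from SIMMER's world model.

```python
# Toy symbolic world model: each action has preconditions and effects over boolean facts.
ACTIONS = {
    "turn_on_stove": {"pre": {"stove_on": False}, "eff": {"stove_on": True}},
    "turn_off_stove": {"pre": {"stove_on": True}, "eff": {"stove_on": False}},
    "place_pan": {"pre": {"pan_on_stove": False}, "eff": {"pan_on_stove": True}},
    "fry_egg": {"pre": {"stove_on": True, "pan_on_stove": True}, "eff": {"egg_cooked": True}},
    "place_plastic_bowl": {"pre": {}, "eff": {"plastic_on_stove": True}},
}

# Hazard rules: (label, condition on the state, when it is checked, irreversible?)
HAZARDS = [
    ("plastic bowl melted",
     lambda s: s.get("stove_on", False) and s.get("plastic_on_stove", False),
     "any", True),
    ("stove left on", lambda s: s.get("stove_on", False), "end", False),
]

def execute(plan, initial_state=None):
    """Run a plan; immediate failures halt it, latent ones are only recorded."""
    state = dict(initial_state or {})
    report = {"immediate": [], "latent": set(), "irreversible": set()}

    def check(when):
        for label, condition, rule_when, irreversible in HAZARDS:
            if rule_when == when and condition(state):
                report["latent"].add(label)
                if irreversible:
                    report["irreversible"].add(label)

    for step, name in enumerate(plan):
        action = ACTIONS[name]
        unmet = sorted(k for k, v in action["pre"].items() if state.get(k, False) != v)
        if unmet:
            # Immediate failure: a precondition is violated, so execution stops here.
            report["immediate"].append((step, name, unmet))
            return report
        state.update(action["eff"])
        check("any")   # hazards that can arise silently mid-plan
    check("end")       # hazards that only matter in the final state
    return report

ok_looking = execute(["turn_on_stove", "place_pan", "fry_egg"])
blocked = execute(["fry_egg"])
melted = execute(["place_plastic_bowl", "turn_on_stove", "turn_off_stove"])
```

The first plan runs to completion yet leaves the stove on (latent), the second stops at once on unmet preconditions (immediate), and the third completes while having caused damage that turning the stove off cannot undo (irreversible).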

[NLP-13] BayLing-Duplex: Native Full-Duplex Speech Dialogue with a Single Autoregressive LLM

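[Quick Read]: This paper addresses the limited interactivity of existing speech language models (SpeechLMs) in real-time full-duplex speech interaction. Mainstream models such as LLaMA-Omni and GLM-4-Voice are still turn-based and rely on an external Voice Activity Detection (VAD) module to decide when the user has finished speaking, which makes natural phenomena such as overlap, hesitation, and barge-in hard to handle. The core solution is BayLing-Duplex, a native full-duplex SpeechLM that adds only a few special tokens to a standard autoregressive LLM, so the model itself decides when to listen, when to speak, and when to stop, with no auxiliary turn-taking module. The design transfers across LLMs and reuses existing training and serving stacks. Starting from the public GLM-4-Voice checkpoint, fine-tuning on only 400K full-duplex samples followed by a lightweight Direct Preference Optimization (DPO) stage yields 92% turn-taking success and 100% interruption success on InstructS2S-Eval, and improves the speech-response score from 2.17 (Moshi) to 3.39. The model also matches or surpasses its turn-based counterpart on Llama Questions, Web Questions, and Alpaca-Eval, showing that full-duplex modeling does not sacrifice response quality.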

Link: https://arxiv.org/abs/2606.14528
Authors: Qingkai Fang, Shoutao Guo, Yang Feng
Affiliations: Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS); Key Laboratory of AI Safety, Chinese Academy of Sciences; University of Chinese Academy of Sciences, Beijing, China
Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: Code: this https URL

Abstract:Real-time, full-duplex speech interaction is a key feature of next-generation spoken chatbots, allowing the model to listen and speak at the same time and to handle natural phenomena such as overlap, hesitation, and barge-in. Existing speech language models (SpeechLMs) such as LLaMA-Omni and GLM-4-Voice are still turn-based and rely on an external Voice Activity Detection (VAD) module to mark the end of the user’s turn, which fundamentally limits their interactive ability. In this paper, we introduce BayLing-Duplex, a native full-duplex SpeechLM where a single autoregressive LLM decides when to listen, when to speak, and when to stop, with no auxiliary turn-taking module. The design adds only a few special tokens to the standard vocabulary, so it transfers across LLMs and reuses existing training and serving stacks with no architectural adaptation. Starting from the public GLM-4-Voice checkpoint and using only 400K full-duplex samples for fine-tuning followed by a lightweight DPO stage, BayLing-Duplex reaches 92% turn-taking success and 100% interruption success on InstructS2S-Eval, while improving the speech-response score from 2.17 to 3.39 over Moshi. BayLing-Duplex also matches or surpasses its turn-based counterpart on Llama Questions, Web Questions, and Alpaca-Eval, showing that simultaneous listen-and-speak modeling does not sacrifice response quality.

[NLP-14] Every Eval Ever: A Unifying Schema and Community Repository for AI Evaluation Results

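[Quick Read]: This paper addresses the heterogeneity of AI evaluation results: different evaluation tools, frameworks, and publication channels produce incompatible result formats and inconsistent metadata, which hinders comparison, cross-community evaluation science, cost reduction, and reuse. The core solution is Every Eval Ever, the first shared schema and community-crowdsourced repository for AI evaluation results. Its key elements are: a unified, source-agnostic schema that represents an evaluation as a single JSON document, ingests results from evaluation harnesses, papers, leaderboards, and custom repositories, and optionally stores per-instance outputs for fine-grained analysis; automatic converters from popular formats, evaluation harnesses, and leaderboards; and a community database hosted on Hugging Face that currently spans 22,235 models, 2,273 unique benchmarks, and 31 evaluation formats.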

Link: https://arxiv.org/abs/2606.14516
Authors: Jan Batzner, Sree Harsha Nelaturu, Anastassia Kornilova, Jon Crall, Tommaso Cerruti, Yanan Long, Yifan Mai, Sanchit Ahuja, Asaf Yehudai, Marek Šuppa, John P. Lalor, Oluwagbemike Olowe, Jatin Ganhotra, Brian H. Hu, Eliya Habba, Andrew M. Bean, Chang Liu, Sander Land, Steven Dillmann, Aniketh Garikaparthi, Elron Bandel, Saki Imai, James Edgell, Wm. Matthew Kennedy, Jenny Chim, Patrick Meusling, Asteria Kaeberlein, Venkata Ramachandra Karthik Chundi, Manasi Patwardhan, Martin Ku, Austin Meek, Leon Knauer, Brian Wingenroth, Srishti Yadav, Usman Gohar, Felix Friedrich, Michelle Lin, Jennifer Mickel, Arman Cohan, Stella Biderman, Irene Solaiman, Zeerak Talat, Anka Reuel, Mubashara Akhtar, Gjergji Kasneci, Avijit Ghosh, Leshem Choshen
Affiliations: Technical University Munich; Munich Center for Machine Learning; Weizenbaum Institute; Zuse Institute Berlin; Evidence Prime; Trustible; Kitware; ETH Zurich; StickFlux Labs; Stanford University; Northeastern University; IBM Research; Comenius University Bratislava; Cisco; University of Notre Dame; Independent; Hebrew University of Jerusalem; University of Oxford; Ohio University; Writer; TCS Research; Oxford University Press; Queen Mary University of London; Technical University Berlin; University of Delaware; Cinemo; Johns Hopkins University; University of Copenhagen; ELLIS; Iowa State University; Meta FAIR; University of Montreal; Mila Quebec AI Institute; EleutherAI; Hugging Face; University of Edinburgh; Harvard University; ETH AI Center; MIT; MIT-IBM Watson Lab
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:

Abstract:AI evaluations are widely used for testing and understanding progress. However, the diverse evaluators bring with them inconsistencies that challenge analysis and comparison. First, results are saved in incompatible formats, scattered across leaderboards, papers, blog posts, evaluation harness logs, and custom repositories. Second, results are created by different evaluation frameworks, which produce divergent scores for nominally identical evaluations and record metadata inconsistently, hindering comparison, cross-community evaluation science, cost reduction, and reuse. We introduce Every Eval Ever, the first shared schema and community-crowdsourced repository for AI evaluation results. The schema standardizes how evaluations are represented in a unified, single JSON document. It is source-agnostic by design, ingesting results from evaluation harnesses and papers alike, and optionally stores per-instance outputs for fine-grained analysis. We contribute: (i) a community-governed metadata schema with a companion instance-level schema, the first standardization effort of its kind; (ii) automatic converters from popular formats, evaluation harnesses, and leaderboards to the unified schema; and (iii) a crowdsourced community database hosted on Hugging Face, currently spanning 22,235 models, 2,273 unique benchmarks, and 31 evaluation formats.

[NLP-15] Fodor and Pylyshyn's Systematicity Challenge Still Stands ACL

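[Quick Read]: This paper addresses a central dispute in cognitive science: whether neural networks can explain the systematicity of human language and thought. Systematicity refers to biconditional dependencies in understanding; for example, someone understands "John saw Mary" just in case they understand "Mary saw John". Symbolic systems explain this naturally, whereas neural networks, lacking an explicit compositional mechanism, have been argued to offer no comparable explanation. Although recent work, notably the meta-learning for compositionality protocol of Brenden Lake and Marco Baroni, claims to match human systematicity, this paper argues that the conclusion is premature. The key findings are that the model struggles with rules that are even slightly out of distribution relative to its training data and behaves unsystematically on many within-distribution problems. The paper therefore concludes that Fodor and Pylyshyn's systematicity challenge to neural networks remains unmet.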

Link: https://arxiv.org/abs/2606.14512
Authors: Michael Goodale, Salvador Mascarenhas
Affiliations: Institut Jean Nicod, Département d’études cognitives; ENS, EHESS, CNRS, PSL University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted in the Transactions of the Association for Computational Linguistics (TACL). This is a pre-MIT Press publication version of the paper

Abstract:The recent successes of neural networks producing human-like language have caused significant stir in cognitive science, with many researchers arguing that classical puzzles about human cognition and challenges to artificial intelligence are being solved by neural networks. A notable case is the argument from systematicity due to Jerry Fodor and Zenon Pylyshyn, which holds that humans display systematic biconditional dependencies. For example, someone can understand the sentence “John saw Mary” just in case they understand the sentence “Mary saw John.” Symbolic systems explain this systematicity of language and thought, while neural networks offer no immediate explanation. Several recent articles argue that this challenge has now been met by neural networks. In particular, Brenden Lake and Marco Baroni argue that their meta-learning for compositionality protocol matches and perhaps explains human systematicity. We demonstrate that these conclusions are premature. Among other results, we found that their model struggles to learn rules that are even slightly out of distribution compared to their training data. Furthermore, the model behaves unsystematically even on many within-distribution problems. We conclude that Fodor and Pylyshyn’s challenge to neural networks remains unmet.

[NLP-16] GitOfThoughts: Version-Controlled Reasoning and Agent Memory You Can Replay Diff and Merge

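[Quick Read]: This paper addresses the lack of persistence and traceability in large language model (LLM) reasoning: chains of thought vanish with the context window, pruned search branches leave no record, and memory buffers cannot be diffed, merged, or audited. Unlike code, infrastructure, data, and experiments, reasoning is not version-controlled. The paper proposes GitOfThoughts, which stores an agent's reasoning tree as a Git repository: every scored thought is a commit, scores are notes, outcomes are tags, and retrieval is "git log" over the agent's own history. This makes reasoning replayable, auditable, and mergeable across agents at near-zero engineering cost. The paper then asks whether memory, in any substrate, actually improves accuracy. Across five substrates (none, Markdown, vector, graph, Git), two benchmarks, two model scales, and pre-registered replications, no memory format reliably helps on novel problems, and a promising early result collapsed under its own pre-registered replication. Accuracy jumps sharply only when the retrieved case is a near-duplicate of the current problem (similarity of about 0.8), which indicates that the gain comes from answer retrieval rather than method transfer; even a 4.5x larger model cannot extract a transferable method from a worked example. The only general lever found is test-time sampling. The case for Git as a memory substrate is therefore auditability, provenance, and mergeability at accuracy parity. The authors also report a retracted result and a refuted hypothesis to illustrate the evaluation standard they hold themselves to.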

Link: https://arxiv.org/abs/2606.14470
Authors: Pavan C Shekar, Abhishek H S, Aswanth Krishnan
Affiliations: QpiAI
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 10 pages, 1 figure, 9 tables

Abstract:Large language model (LLM) reasoning is ephemeral: chains of thought vanish with the context window, pruned search branches leave no record, and memory buffers cannot be diffed, merged, or audited. Every other complex software process (code, infrastructure, data, experiments) is version-controlled; reasoning is not. We introduce GitOfThoughts, which stores an agent’s reasoning tree as a git repository: every scored thought is a commit, scores are notes, outcomes are tags, and retrieval is “git log” over the agent’s own history. This makes reasoning replayable, auditable, and mergeable across agents at near-zero engineering cost. We then ask the harder question: does memory, in any substrate, actually improve accuracy? Across five substrates (none, markdown, vector, graph, git), two benchmarks, two model scales, and pre-registered replications, the answer for novel problems is no. No memory format reliably helps, and a promising early result collapsed under its own pre-registered replication. Memory pays only above what we call the copyability threshold: when the retrieved case is a near-duplicate of the current problem (similarity ~ 0.8), accuracy jumps sharply; below it, nothing. The gain is answer retrieval, not method transfer: a 4.5x larger model doubles the near-duplicate payoff yet still cannot extract a transferable method from a worked example. The only general lever we find is test-time sampling. The case for git-as-substrate is therefore auditability, provenance, and mergeability at accuracy parity. We document a retracted result and a refuted hypothesis to model the evaluation standard we hold ourselves to. 
ACM classes: I.2.7; I.2.6; D.2.7. Cite as: arXiv:2606.14470 [cs.AI] (or arXiv:2606.14470v1 [cs.AI] for this version). DOI: https://doi.org/10.48550/arXiv.2606.14470 (arXiv-issued DOI via DataCite, pending registration)
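
The mapping described in the abstract (thought to commit, score to note, outcome to tag, retrieval to "git log") can be sketched by building the corresponding Git command lines. The helper names, the `score=` note format, and the tag label are assumptions made for illustration; the paper's actual repository layout is not given in the abstract. The Git subcommands and flags used are standard ones.

```python
import subprocess

def commit_thought(text, score):
    """Commands that record one scored thought: an empty commit plus a note on it."""
    return [
        ["git", "commit", "--allow-empty", "-m", text],
        ["git", "notes", "add", "-m", f"score={score:.2f}", "HEAD"],
    ]

def tag_outcome(label):
    """Command that marks the current commit with an outcome tag."""
    return ["git", "tag", label, "HEAD"]

def retrieve(keyword, limit=5):
    """Command that searches earlier thoughts by commit message."""
    return ["git", "log", "--oneline", f"--max-count={limit}", f"--grep={keyword}"]

def run_all(commands, repo_dir):
    """Execute the commands inside an existing repository (not called in this sketch)."""
    for cmd in commands:
        subprocess.run(cmd, cwd=repo_dir, check=True)

commands = commit_thought("Try factoring the quadratic", 0.8)
commands.append(tag_outcome("solved"))
commands.append(retrieve("quadratic"))
```

Building argument lists rather than shell strings keeps thought text with spaces or quotes from being misparsed when the commands are eventually run.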

[NLP-17] MoDiCoL: A Modular Diagnostic Continual Learning Dataset for Robust Speech Recognition INTERSPEECH2026

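[Quick Read]: This paper addresses the performance degradation of modern automatic speech recognition (ASR) systems under real-world distribution shifts. Recording conditions, accents, speech impairments, and noise often co-occur in practice, yet existing datasets and benchmarks typically treat them in isolation. The core solution is MoDiCoL, a Modular Diagnostic Continual Learning dataset that supports controlled analysis of linguistic content, speaker characteristics, and acoustic environments, together with a real-world-inspired continual learning curriculum that simulates incremental updates to study how robustness is acquired, transferred, and forgotten. The key idea is to treat model robustness as a dynamic capability that develops continually, and to examine how it evolves under changing conditions through a structured, controllable continual learning setup.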

Link: https://arxiv.org/abs/2606.14459
Authors: Theresa Pekarek Rosin, Matthias Kerzel, Stefan Wermter
Affiliations: University of Hamburg
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments: Accepted at Interspeech 2026

Abstract:Modern Automatic Speech Recognition (ASR) systems have made remarkable progress on standard benchmarks, yet performance gaps have emerged under real-world distribution shifts, caused by recording conditions, accents, speech impairments, and noise. Existing datasets and benchmarks typically isolate these factors, which overlooks their co-occurrence in real-world applications. In this paper, we argue that model robustness can be treated as a dynamic capability that continually develops, and we introduce MoDiCoL, a Modular Diagnostic Continual Learning dataset designed for controlled analysis of linguistic content, speaker characteristics, and acoustic environments. Furthermore, we propose a real-world-inspired continual learning curriculum to simulate incremental updates and study how robustness is acquired, transferred, and forgotten. We evaluate three continual learning strategies and provide detailed insights into robustness under evolving conditions.

[NLP-18] Coping in Crisis: Computational Modeling of Coping Styles in Digital Crisis Discourse During the 2023 Turkiye Earthquake

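[Quick read]: This paper asks how the public's coping styles after a major disaster can be identified and monitored in large-scale, real-time digital text. The difficulty is that traditional psychological theory is hard to apply directly to massive, unstructured social media text, and that emotions and coping behavior evolve in complex ways in a highly polarized political context. The solution builds on Lazarus and Folkman's coping theory: a multi-label BERTurk classifier for Turkish detects three coping styles (problem-focused, emotion-focused, and meaning-making) and maps them onto four theoretically motivated crisis phases. The model reaches a macro F1 of 0.693, well above a zero-shot mDeBERTa baseline (0.324). The analysis shows a clear temporal trajectory: problem-focused coping dominates the urgency phase and then declines sharply, emotion-focused coping rises and stabilizes, and meaning-making keeps increasing. Anger is positively correlated with meaning-making (Spearman r = 0.387), which suggests that anger drives blame attribution rather than practical action. The study shows that coping theory can be reliably operationalized on real-world digital crisis data and can help humanitarian organizations tailor their responses to the population's actual psychological state.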

Link: https://arxiv.org/abs/2606.14420
Authors: Şevval Çakıcı
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 20 pages, 5 figures, 3 tables. To be submitted to Social Science Computer Review

Abstract:How do people cope when disaster strikes and can we detect it at scale, in real time, from what they write? This study addresses that question using over one million Turkish-language tweets posted in the aftermath of the February 6, 2023 earthquake in Turkiye, which unfolded in a deeply polarized political context just months before a national election. Drawing on Lazarus and Folkman’s (1984) coping theory, we develop a multi-label BERTurk classifier to detect three coping styles (problem-focused, emotion-focused, and meaning-making) across four theoretically motivated crisis phases. BERTurk achieves a macro F1 of 0.693, substantially outperforming a zero-shot mDeBERTa baseline (macro F1 = 0.324). Applied to the full corpus, the classifier reveals a clear temporal trajectory: problem-focused coping dominates the urgency phase and declines sharply, emotion-focused coping rises and stabilizes, and meaning-making increases monotonically. Anger correlates most strongly with meaning-making (Spearman r = 0.387), suggesting it functions as a mobilizing force toward blame attribution rather than practical action. These findings demonstrate that coping theory can be reliably operationalized in real-world digital crisis data and that doing so can help humanitarian organizations tailor their responses to where a population actually is.
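The macro F1 figures quoted above average one F1 score per coping style, so a rare label counts as much as a frequent one. The following is a minimal sketch of that metric for multi-label predictions; the label names and the toy data are illustrative assumptions, not taken from the paper.

```python
from typing import List, Set

# Hypothetical label names for the three coping styles.
LABELS = ["problem_focused", "emotion_focused", "meaning_making"]


def macro_f1(gold: List[Set[str]], pred: List[Set[str]],
             labels: List[str] = LABELS) -> float:
    """Compute F1 separately for each label, then average with equal weight."""
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if label in g and label in p)
        fp = sum(1 for g, p in zip(gold, pred) if label not in g and label in p)
        fn = sum(1 for g, p in zip(gold, pred) if label in g and label not in p)
        denom = 2 * tp + fp + fn
        # A label with no gold and no predicted instances scores 0 in this sketch.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example: three tweets, each with a set of gold and predicted labels.
gold = [{"problem_focused"}, {"emotion_focused", "meaning_making"}, {"meaning_making"}]
pred = [{"problem_focused"}, {"emotion_focused"}, {"meaning_making", "emotion_focused"}]
score = macro_f1(gold, pred)
```

In the toy data the per-label F1 scores are 1, 2/3, and 2/3, so the macro average is 7/9.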

[NLP-19] Learning to Hear Hesitation: Continual Learning for Disfluency-Aware ASR INTERSPEECH 2026

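[Quick read]: This paper addresses information loss and hallucination in large-scale automatic speech recognition (ASR) systems on disfluent speech. State-of-the-art systems are usually optimized to omit disfluencies, which discards information. Prior work has pursued verbatim transcription and disfluency markers, but fine-tuning on small datasets tends to cause catastrophic forgetting of general-domain knowledge. The paper proposes a continual learning (CL) approach built on explicit disfluency tokens: the tokens are first introduced into a pretrained ASR model to establish stable token mechanisms, and training then continues on additional datasets with different disfluency distributions. An analysis of model dynamics during training reveals a trade-off between marker learning and standard ASR performance, as well as a cross-attention head mechanism that appears consistently across CL methods.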

Link: https://arxiv.org/abs/2606.14391
Authors: Henri-Leon Kordt,Theresa Pekarek Rosin,Jae Hee Lee,Stefan Wermter
Affiliations: University of Hamburg
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments: Accepted at Interspeech 2026

Abstract:Despite advances in large-scale Automatic Speech Recognition (ASR), disfluent speech remains challenging, as state-of-the-art systems are often optimized to omit disfluencies, leading to information loss and hallucinations. Prior work has focused on verbatim transcription and the integration of disfluency markers, but adapting models on limited datasets can lead to catastrophic forgetting of general-domain knowledge. We address this gap by leveraging continual learning (CL) with explicit disfluency tokens. We first introduce these tokens into a pretrained ASR model to establish stable token mechanisms, and then continue training on additional datasets with varying disfluency distributions. Through a detailed analysis of model dynamics during training, we identify a trade-off between marker learning and ASR performance, and a consistent cross-attention head mechanism shared across CL methods.

[NLP-20] Be My Tutor: On-Policy Co-Distillation for Mutual LLM Improvement via Peer Feedback

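[Quick read]: This paper studies multi-domain large language model (LLM) training in which two models, each stronger in a different domain, improve together. The challenge is to avoid the regressions seen with one-way distillation or single-model fine-tuning, so that each model gains cross-domain ability while keeping its original strength. The proposed method, On-Policy Co-Distillation (OPCoD), conditions each student's self-distillation on its own correct rollout and on feedback from its peer. Cognizance-based gating decides when feedback is given, and feedback anchoring grounds the feedback in the specific problem. On science Q&A tasks, OPCoD consistently outperforms baselines and achieves Pareto improvement across all evaluated domain pairs and students, meaning cross-domain performance rises without loss of the original strength.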

Link: https://arxiv.org/abs/2606.14368
Authors: Woohyeon Byeon,Jiwon Jeon,Jeonghye Kim,Youngchul Sung
Affiliations: KAIST
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:We study multi-domain LLM training in which two models, each stronger in a different domain, co-evolve by tutoring each other through on-policy feedback. Unlike one-way distillation or single-model fine-tuning, our goal is mutual Pareto improvement: each model improves across domains without losing its original strength. To this end, we propose On-Policy Co-Distillation (OPCoD), where each student’s self-distillation is conditioned on its own correct rollout and feedback from its peer. To make feedback exchange effective, OPCoD uses cognizance-based gating to decide when to give feedback and feedback anchoring to ground feedback in the problem. On Science Q&A tasks, OPCoD consistently outperforms baselines and achieves Pareto improvement across all evaluated domain pairs and students.
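The Pareto improvement criterion above can be stated as a simple check: no domain gets worse, and at least one gets better. A minimal sketch follows; the domain names, the scores, and the `tol` parameter are hypothetical and only illustrate the definition, not the paper's evaluation code.

```python
from typing import Dict


def is_pareto_improvement(before: Dict[str, float], after: Dict[str, float],
                          tol: float = 0.0) -> bool:
    """Return True if no domain regresses and at least one domain improves.

    `tol` is an illustrative tolerance for treating tiny changes as ties.
    """
    no_regression = all(after[d] >= before[d] - tol for d in before)
    some_gain = any(after[d] > before[d] + tol for d in before)
    return no_regression and some_gain


# Hypothetical per-domain accuracies for one student model.
before = {"physics": 0.60, "biology": 0.40}
gain_without_loss = {"physics": 0.60, "biology": 0.50}
gain_with_loss = {"physics": 0.55, "biology": 0.70}

ok = is_pareto_improvement(before, gain_without_loss)
not_ok = is_pareto_improvement(before, gain_with_loss)
```

The second candidate has the larger average score but is not a Pareto improvement, because the model loses ground in the domain where it started out stronger.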

[NLP-21] Achieving Precise Text-To-Cypher Via Grounded Knowledge Graph Data Generation

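[Quick read]: This paper addresses the lack of efficient, locally deployable natural-language-to-Cypher (Text-To-Cypher) systems for precise access to information in property graph databases. Existing methods depend on large annotated datasets, which are costly to obtain and hard to reconcile with data-sovereignty requirements. The paper proposes an automatic synthetic data generation method that produces high-quality, diverse text-query pairs for fine-tuning small LLMs. With this synthetic data, small models become competitive with much larger proprietary models on the major Text-To-Cypher benchmarks, which enables local, accurate conversational query interfaces without costly annotation or cloud deployment.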

Link: https://arxiv.org/abs/2606.14325
Authors: Francesco Cazzaro,Jessica Lennon,Ariadna Quattoni
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Property Graphs are rapidly being adopted as database frameworks for representing heterogeneous data sources. To enable precise access to the information contained in them we need conversational interfaces based on Text-To-Cypher (Text2Cypher) parsers. This paper presents an automatic synthetic data generation method that can be leveraged to fine-tune small LLMs for this task. We conduct experiments on all the major Text-To-Cypher benchmarks, demonstrating that with our synthetic data generation approach we can significantly increase the performance of small LLMs, allowing them to compete with much larger proprietary models. This means that in settings in which models must be locally deployed we can ensure data-sovereignty without sacrificing accuracy and without costly annotation campaigns.

[NLP-22] Retrospective Progress-Aware Self-Refinement for LLM Agent Training

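[Quick read]: This paper addresses the lack of metacognitive awareness in large language model (LLM) agents trained with reinforcement learning, in particular their weak self-assessment of task progress, which limits scaling to complex, long-horizon tasks. The underlying difficulty is that outcome-reward training alone does not give rise to this reflective ability. The paper proposes RePro (Retrospective Progress-Aware Training), whose key idea is a forward-then-reflect rollout: the agent first executes actions online, then retrospectively generates a progress assessment for each step given the completed trajectory and the known outcome. RePro starts with a Retrospection Warmup stage that learns the reflection format from a small number of external demonstrations, then trains further with RePro-PO, which uses a composite reward to produce self-generated progress signals without continuous external supervision. Experiments on WebShop, ALFWorld, and Sokoban show gains for Qwen-family models of up to 12% absolute success rate.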

Link: https://arxiv.org/abs/2606.14302
Authors: Xinbei Ma,Congmin Zheng,Jiyang Qiu,Jiale Hong,Yao Yao,Xiangmou Qu,Jiaxin Yin,Xingyu Lou,Jun Wang,Weiwen Liu,Weinan Zhang,Zhuosheng Zhang,Hai Zhao
Affiliations: Shanghai Jiao Tong University; OPPO Research Institute
Categories: Computation and Language (cs.CL)
Comments:

Abstract:LLM-based agents trained with reinforcement learning optimize step-wise action prediction but lack metacognitive awareness of task progress, inducing a gap that hinders long-horizon scaling. A pilot study reveals that online progress prompting hurts performance while retrospective demonstrations help, yet this capability cannot emerge from outcome-reward training alone. We present RePro, Retrospective Progress-Aware Training, a framework that trains agents to self-generate progress signals via a forward-then-reflect rollout paradigm: the agent executes actions online, then retrospectively reassesses its step-wise progress given the completed trajectory and known outcome. RePro initializes with a Retrospection Warmup that teaches reflection format from minimal external demonstrations, then further trains through RePro-PO with a composite reward that produces self-generated signals without continuous external supervision. Experiments on WebShop, ALFWorld, and Sokoban show that RePro enhances the Qwen family’s performance, with up to 12% absolute success rate gains.

[NLP-23] Does the Judge Prefer English? Evaluating Language-Switching Invariance in LLM-as-a-Judge

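[Quick read]: This paper examines a reliability problem in open-ended instruction-following evaluation that uses large language models (LLMs) as automatic judges: does the verdict reflect answer quality, or is it influenced by the language in which the comparison is presented? The proposed solution is Judge-LS, a lightweight meta-evaluation protocol that transforms LLMBar response-pair items into English, Chinese, and Chinese-English language-switched variants and checks whether a judge keeps the same preference under label-preserving language transformations. A reliable judge should not favor either language when two answers are translation-equivalent, and its preference should stay stable. On the 419-item LLMBar benchmark, four API-accessible judges produce 13,408 successful pairwise judgments. Chinese and language-switched presentations cause 10.7% to 14.4% preference flips relative to English, and every judge is most accurate in English. However, translation-equivalent tie probes show no systematic English preference: most are judged as ties, and non-tie decisions more often favor Chinese. The study also reports confidence intervals, paired significance tests, and an automatic transformation audit with a sensitivity analysis that excludes mechanically flagged high-risk variants. The experiment needs no model training, uses only API calls, and runs on modest local hardware.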

Link: https://arxiv.org/abs/2606.14278
Authors: Shaojie Yin
Affiliations: Shanghai Artificial Intelligence Laboratory
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) are now widely used as automatic judges for open-ended instruction-following evaluation. This practice is convenient, scalable, and often more semantically aware than reference-based metrics, but it also introduces a new reliability question: does a judge evaluate the quality of an answer, or does it also react to the language in which the comparison is presented? We propose Judge-LS, a lightweight meta-evaluation protocol that transforms LLMBar response-pair items into English, Chinese, and Chinese-English language-switched variants. A reliable judge should preserve its preference under label-preserving language transformations and should not prefer a language when two answers are translation-equivalent. We evaluate four API-accessible judges on the full 419-item LLMBar benchmark, producing 13,408 successful pairwise judgments. Across models, Chinese and language-switched presentations induce 10.7–14.4% preference flips relative to English, and all judges achieve their highest accuracy in English. However, translation-equivalent tie probes do not reveal a systematic English preference: most probes are judged as ties, and non-tie decisions more often favor Chinese. We add confidence intervals, paired significance tests, and an automatic transformation audit with a sensitivity analysis that excludes mechanically flagged high-risk variants. The experiment requires no model training, uses only API calls, and is feasible on modest local hardware.
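A preference flip rate of the kind reported above can be computed by comparing a judge's verdict on each item in English with its verdict on the transformed variant of the same item. The sketch below assumes verdicts are encoded as "A", "B", or "tie" and counts any change as a flip; the abstract does not spell out how ties are handled, so that choice and the toy verdicts are assumptions.

```python
from typing import List


def flip_rate(english: List[str], variant: List[str]) -> float:
    """Fraction of items whose verdict differs between the two presentations."""
    if len(english) != len(variant) or not english:
        raise ValueError("verdict lists must be non-empty and aligned item by item")
    flips = sum(1 for e, v in zip(english, variant) if e != v)
    return flips / len(english)


# Hypothetical verdicts from one judge on eight items.
english_verdicts = ["A", "B", "A", "A", "B", "A", "B", "A"]
chinese_verdicts = ["A", "B", "B", "A", "B", "A", "B", "A"]
rate = flip_rate(english_verdicts, chinese_verdicts)
```

Here one verdict out of eight changes, which gives a flip rate of 0.125.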

[NLP-24] The Linguistics Olympiads: Towards a New Corpus for Linguistics Research?

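[Quick read]: This paper addresses the fact that linguistics olympiad problems (LOPs), although used worldwide and of potential research value, have not been systematically integrated into mainstream linguistics research. The challenge is to assess how suitable LOPs are as a data source for linguistics research and to set criteria for their responsible academic use. Starting from more than 1,800 LOPs, the paper evaluates their potential as a novel corpus, identifies the fields they fit (such as linguistic typology, linguistic relativity, and linguistic fieldwork), discusses their strengths and limitations as research tools, and proposes structured evaluation criteria and a theoretical framework. The aim is to turn LOPs from competition puzzles into a trusted academic resource and to bridge the gap between linguistics olympiads and academic linguistics.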

Link: https://arxiv.org/abs/2606.14257
Authors: Vlad A. Neacsu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted for publication in LingBaW. Linguistics Beyond and Within (Volume 12, 2026)

Abstract:Linguistics olympiad problems (LOPs) are a category of self-sufficient puzzles consisting of a scaled-down corpus representative of certain linguistic phenomena, from which the solver must deduce a primitive set of rules of the language and then translate a new set of elements. The linguistics olympiads (LOs) have become a worldwide phenomenon with 43 different territories taking part in the International Linguistics Olympiad (IOL) 2025. While the typology and solving strategies of LOPs have been analysed, their scientific facet and connections to academic linguistics have yet to be explored. LOPs are directly connected to many linguistic fields, e.g., linguistic typology, linguistic relativity, and linguistics fieldwork. Recently, LOPs have become a research focus as benchmarks for large language models, thus highlighting their usefulness in computational linguistics. Nevertheless, they have not yet been integrated into mainstream linguistics research. This paper attempts to open new directions of including this particular type of puzzle in academic research by offering a structured evaluation of LOPs as linguistic data sources and proposes criteria for their responsible use in academic research. Starting from a set of over 1800 LOPs, this study critically examines the potential of LOPs as a novel corpus for linguistics research by discussing their strengths and limitations as tools, as well as the areas of linguistics into which these problems could fit. This work forms the foundation for a broader initiative aimed at bridging the gap between LOs and academic linguistics, by establishing a robust theoretical framework for LOPs.

[NLP-25] Decoupled Mixture-of-Experts for Parametric Knowledge Injection

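[Quick read]: This paper addresses the trade-off between flexibility and integration when injecting external, domain-specific, or time-sensitive knowledge into large language models (LLMs). Retrieval-augmented generation only augments the prompt, while post-training methods encode knowledge into shared parameters at the risk of catastrophic forgetting, knowledge conflict, and costly updates. The paper proposes Decoupled Mixture-of-Experts (DMoE), which decouples both the experts and the router from the base model. External knowledge corpora become independently updatable expert modules, and a lightweight uncertainty-aware router activates the relevant experts only when the base model lacks sufficient knowledge during generation. To keep auto-regressive inference efficient, DMoE attaches experts only to the final-layer feed-forward network, which preserves KV-cache reuse while still providing parameter-level knowledge augmentation. On knowledge-intensive benchmarks, DMoE consistently improves answer quality over retrieval-based and adapter-based baselines.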

Link: https://arxiv.org/abs/2606.14243
Authors: Baoqing Yue,Weihang Su,Qingyao Ai,Yichen Tang,Changyue Wang,Jiacheng Kang,Jingtao Zhan,Yiqun Liu
Affiliations: Tsinghua University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Knowledge injection aims to equip large language models (LLMs) with external, domain-specific, or time-sensitive knowledge. Existing approaches typically face a trade-off between flexibility and integration: retrieval-augmented generation keeps knowledge outside the model but only provides prompt-level augmentation, whereas post-training based methods encode new knowledge into shared parameters but may introduce catastrophic forgetting, knowledge conflict, and costly updates. In this paper, we propose Decoupled Mixture-of-Experts (DMoE), a modular architecture for parametric knowledge injection that decouples both experts and the router from the base model. DMoE converts external knowledge corpora into independently updatable expert modules and uses a lightweight uncertainty-aware router to activate relevant experts only when the base model lacks sufficient knowledge during generation. To support efficient auto-regressive inference, DMoE attaches experts only to the final-layer feed-forward network, preserving KV-cache reuse while enabling parameter-level knowledge augmentation. Experiments on knowledge-intensive benchmarks show that DMoE consistently improves answer quality over retrieval and adapter-based baselines.
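One way to picture an uncertainty-aware router is as a gate on the base model's next-token distribution: experts are consulted only when that distribution looks uncertain. The sketch below uses Shannon entropy with a fixed threshold. Both choices are assumptions made for illustration; the abstract does not say which uncertainty measure or gating rule DMoE uses.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def should_activate_experts(probs: Sequence[float], threshold: float) -> bool:
    """Gate: route to external experts only when the base model is uncertain.

    `threshold` is a hypothetical tuning parameter, not a value from the paper.
    """
    return token_entropy(probs) > threshold


confident = [0.97, 0.01, 0.01, 0.01]   # base model is nearly sure of one token
uncertain = [0.25, 0.25, 0.25, 0.25]   # base model has no preference

skip_experts = should_activate_experts(confident, threshold=1.0)
use_experts = should_activate_experts(uncertain, threshold=1.0)
```

With this gate, confident steps run the base model alone, so the extra expert computation is paid only on the tokens where outside knowledge is most likely to matter.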

[NLP-26] A Multi-Domain Feature Fusion Framework for Generalizable Deepfake Detection Across Different Generators

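[Quick read]: This paper addresses the drop in deepfake detection performance on high-fidelity content produced by diffusion models, in particular the weak robustness of existing methods in cross-generator and cross-paradigm settings and their limited use of complementary multi-domain features. The proposed SGFF-Net (Spatial-Gradient-Frequency Fusion Network) is a multi-domain fusion framework that combines spatial, gradient, and discrete wavelet transform (DWT)-based frequency features within a dual residual learning architecture. It reaches 98.95% accuracy in intra-dataset evaluation. With multi-source training and data augmentation, accuracy rises from 70.46% to 79.80% in cross-model evaluation and from 69.94% to 78% in cross-paradigm evaluation, which indicates that fusing multi-domain features and increasing data diversity improve the robustness and practicality of deepfake detectors.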

Link: https://arxiv.org/abs/2606.14230
Authors: Amna Amjid,Sana Qadir,Mehwish Fatima,Raja Khurram Shahzad
Affiliations: National University of Sciences and Technology (NUST); Mid Sweden University; Lulea University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:Deepfakes are artificially generated images, audio, or videos that threaten privacy, security, and information integrity. Detecting such content is crucial for countering disinformation, as the latest models generate highly realistic content. While spatial- or frequency-based approaches achieve good detection rates on Generative Adversarial Networks (GANs)-based generated deepfakes, they often struggle with recent diffusion model-generated images. In particular, existing approaches rarely exploit complementary multi-domain representations or systematically evaluate cross-generator robustness. To address these challenges, we propose a multi-domain deepfake detection framework called SGFF-Net (Spatial-Gradient-Frequency Fusion Network) that integrates spatial, gradient, and DWT (Discrete Wavelet Transform)-based frequency representations within a dual residual learning architecture. Experimental results show that the SGFF-Net achieves 98.95% accuracy in intra-dataset evaluation and improves performance in both cross-model (70.46%) and cross-paradigm (69.94%) settings. Incorporating multi-source training and data augmentation further enhances robustness, increasing accuracy from 70.46% to 79.80% in cross-model evaluation, from 69% to 78% in cross-paradigm evaluation, and from 61.50% to 75.80% on real-world data. Unlike single-domain detectors, the SGFF-Net learns complementary forensic cues across spatial, gradient, and wavelet-frequency domains, resulting in greater robustness under cross-generator and cross-paradigm evaluation. The results further show that combining multi-domain representations with data diversity and augmentation substantially improves generalization, providing practical insights for developing more reliable deepfake detection systems.

[NLP-27] Detecting undisclosed LLM-generated content in parliamentary texts

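[Quick read]: This paper examines the extent of undisclosed large language model (LLM)-generated content in parliamentary texts. Fields such as journalism and academic writing often require clear disclosure of AI tool use, but disclosure guidelines for parliamentary texts are vaguer, which puts transparency at risk. To maintain public trust, it is generally recommended that parliamentarians state whether they used AI when writing texts such as parliamentary motions. The key step is to train an interpretable (glass-box) text classifier on parliamentary texts written before LLMs were available and on LLM-generated versions of those texts. Applied to a test set of recent parliamentary texts, the classifier finds a steady increase in undisclosed LLM use in both the UK and Swedish parliaments from 2022 onwards.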

Link: https://arxiv.org/abs/2606.14209
Authors: Minerva Suvanto,Andrea McGlinchey,Peter J. Barclay,Mattias Wahde
Affiliations: Chalmers University of Technology; University of Glasgow
Categories: Computation and Language (cs.CL)
Comments:

Abstract:In this paper, we evaluate the extent of undisclosed LLM-generated content in texts from the parliaments of the United Kingdom and Sweden. In many areas, such as in journalism or in academic writing, there are often requirements to clearly disclose whether AI tools, such as LLMs, have been used. In the case of parliamentary texts, the guidelines on disclosure of AI use are more vague. However, in order to maintain transparency and retain public trust, it is generally recommended that parliamentarians should state whether or not they have used AI when writing texts, such as parliamentary motions. Here, we train an interpretable (glass-box) text classifier using pre-LLM parliamentary texts and LLM-generated versions of such texts. We then apply the classifier to a test set containing recent parliamentary texts, finding a steady increase in undisclosed LLM use, in both parliaments, from 2022 onwards.

[NLP-28] OdysSim: Building Foundation Models for Human Behavior Simulation

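[Quick read]: This paper addresses the behavioral Sim2Real gap that arises when large language models (LLMs) are used as human simulators for interactive evaluation and social simulation: helpfulness-driven post-training makes their behavior homogeneous and overly agreeable. The solution has several parts. SOUL is a taxonomy of five capability axes (CONV for conversation, SS for social skills, COG for cognition, ROLE for role-play, EVAL for evaluation) that unifies 62 datasets and 23 benchmark tasks. The OdysSim corpus contains 21.4M interactions and 10B tokens, retrofitted with back-generated social contexts. An end-to-end training recipe combines midtraining, task-specific reinforcement learning (RL), and expert distillation. The resulting 8B OSim model ranks first or tied-first on 8 of 23 tasks, more than any individual frontier model, with the strongest gains on conversational and social tasks. Its outputs are closer to real humans in length, formatting, and word choice, and it transfers zero-shot to out-of-distribution user simulation on τ-bench, where its reaction alignment reaches 93.2 (real users: 93.5). The study also shows that LLM-as-judge RL induces reward hacking and proposes detectors that mitigate it. Overall, the findings suggest that behavioral foundation models require rethinking the current LLM training paradigm. All artifacts are released to support follow-up research.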

Link: https://arxiv.org/abs/2606.14199
Authors: Xuhui Zhou,Weiwei Sun,Weihua Du,Jiarui Liu,Haojia Sun,Qianou Ma,Tongshuang Wu,Yiming Yang,Maarten Sap
Affiliations: Carnegie Mellon University, Language Technologies Institute
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 34 pages. Code: this https URL ; Models and data: this https URL

Abstract:Large language models are increasingly deployed as human simulators for interactive evaluation and social simulation. Yet helpfulness-driven post-training pulls them toward a homogeneous, overly agreeable assistant register, creating a behavioral Sim2Real gap. We present OdysSim, the largest open systematic investigation of behavioral foundation models, i.e., models trained to simulate human behavior at scale. We propose SOUL, a taxonomy of five capability axes (CONV, SS, COG, ROLE, EVAL) that unifies 62 datasets and 23 benchmark tasks under one framework. Specifically, we curate the OdysSim corpus (21.4M interactions, 10B tokens, retrofitted with back-generated social contexts), construct the SOUL-Index benchmark, and develop an end-to-end training recipe combining midtraining, task-specific RL, and expert distillation. The resulting open 8B OSim model ranks first or tied-first on 8 of 23 tasks, outperforming any individual frontier model by this count, with the strongest gains on conversational and social tasks. Its outputs are also more human-like in length, formatting, and word choice, and it transfers zero-shot to out-of-distribution user simulation on τ-bench, nearly matching real users on reaction alignment (93.2 vs. 93.5). We further show that LLM-as-judge RL induces reward-hacking patterns, and that our detectors can mitigate them during post-training. Together, our findings suggest that behavioral foundation models require rethinking the LLM training paradigm. We release all artifacts to support future research.

[NLP-29] CacheRL: Multi-Turn Tool-Calling Agents via Cached Rollouts and Hybrid Reward

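[Quick read]: This paper addresses the tension between training efficiency and performance for small agent foundation models on multi-step tool-calling tasks. It targets three practical challenges: transferring tool-calling knowledge from large models at scale, enabling reinforcement learning without costly live tool execution, and learning robustly from noisy cached environments. The proposed system, CacheRL, has three components. First, a hybrid thinking trajectory pipeline augments agent trajectories with LLM-generated reasoning traces, so that the model learns not only which tools to call but also why. Second, CacheAgentLoop removes live execution cost with a three-tier fuzzy cache while preserving trajectory fidelity through token-level masking. Third, a cache-tier-aware reward dynamically adjusts the answer-quality weight so that the model is not penalized for cache-induced limitations. With iterative supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO), the validation reward of Qwen3-4B-Thinking rises from 0.43 to 0.78, and the model reaches 92% process accuracy, close to GPT-5's 94%. Ablations show that both knowledge transfer and the cache-aware reward matter, and suggest that data quality and reward design contribute more than complex optimization methods when building practical small agent models.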

Link: https://arxiv.org/abs/2606.14179
Authors: Md Amirul Islam,Sumiran Thakur,Huancheng Chen,Su Min Park,Jiayun Wang,Gyuhak Kim
Affiliations: Accenture
Categories: Computation and Language (cs.CL)
Comments:

Abstract:We present CacheRL, a system for training small agent foundation models that achieves 92 percent process accuracy on multi-step tool-calling tasks, approaching GPT-5’s 94 percent while requiring 100 times less compute. Our approach addresses three challenges in practical agent training: transferring tool-calling knowledge from large models at scale, enabling reinforcement learning without costly live tool execution, and learning robustly from noisy cached environments. CacheRL introduces three key innovations. First, a hybrid thinking trajectory pipeline augments agent trajectories with LLM-generated reasoning traces, producing training examples that teach models not only what tools to call but also why. Second, the CacheAgentLoop eliminates live execution costs through a three-tier fuzzy cache while preserving trajectory fidelity using token-level masking. Third, a cache-tier-aware reward dynamically adjusts answer-quality weights to avoid penalizing models for cache-induced limitations. Through iterative supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO), CacheRL improves Qwen3-4B-Thinking’s validation reward from 0.43 to 0.78. On public agentic tool-calling benchmarks, our model achieves competitive performance against frontier models such as GPT-5. Ablation studies show that removing knowledge transfer reduces performance by 41 percent, while cache-aware rewards contribute a 17 percent improvement. Interestingly, reinforcement learning improves training stability but yields limited gains beyond strong supervised fine-tuning, suggesting that data quality and reward design play a more important role than complex optimization methods in building practical small agent models.

[NLP-30] Graph-based Target Back-Propagation for Context Adaptation in Multi-LLM Agentic Systems

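[Quick read]: This paper addresses context adaptation in multi-LLM agentic systems, where existing methods suffer from inaccurate credit assignment under task feedback and lack convergence guarantees. The proposed Graph-based Target Back-Propagation (GTBP) framework models a multi-agent workflow as a directed acyclic graph (DAG), propagates local target outputs backward through the graph, and uses the discrepancy between targets and actual outputs to guide stage-wise prompt updates. Theoretically, the stage-wise prompt updates become stable over iterations, and a sufficiently capable LLM optimizer can decrease the overall objective. Empirically, GTBP consistently outperforms strong baselines on three benchmarks at comparable computational cost.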

Link: https://arxiv.org/abs/2606.14155
Authors: Tan Zhu,Tong Yao,Kananart Kuwaranancharoen,Amit Singh,Yushang Lai,Deepa Mohan,Shankara Bhargava
Affiliations: Walmart Global Tech
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Context adaptation automates prompt engineering in LLM-based systems by iteratively revising tunable prompts from task feedback, without modifying model weights. Extending this paradigm to multi-LLM agentic systems is crucial: existing methods suffer from inaccurate credit assignment and lack convergence guarantees. We propose Graph-based Target Back-Propagation (GTBP), a context adaptation framework for agentic workflows modeled as directed acyclic graphs. GTBP propagates local target outputs backward through the workflow graph and uses target–output discrepancies to guide a stage-wise prompt update mechanism. Theoretically, we show that GTBP’s stage-wise prompt updates become stable over iterations, and that a sufficiently capable LLM optimizer can decrease the overall objective. Empirically, GTBP consistently outperforms strong baselines across three benchmarks while maintaining comparable computational cost.
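The backward pass over a workflow DAG can be pictured as visiting stages in reverse topological order and handing each upstream stage a target derived from its consumer's target. The sketch below shows only that traversal. The stage names and the `propose_target` callback are hypothetical stand-ins (in the paper an LLM would play that role), and letting the first consumer decide a shared parent's target is a simplification, not the paper's rule.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List


def backward_order(predecessors: Dict[str, List[str]]) -> List[str]:
    """Stages ordered from the workflow's output back to its inputs.

    `predecessors` maps each stage to the stages whose outputs it consumes.
    """
    forward = list(TopologicalSorter(predecessors).static_order())
    return forward[::-1]


def propagate_targets(predecessors: Dict[str, List[str]], final_target: str,
                      propose_target: Callable[[str, str, str], str]) -> Dict[str, str]:
    """Assign a local target to every stage that feeds the final stage."""
    order = backward_order(predecessors)
    targets = {order[0]: final_target}  # order[0] is a sink of the DAG
    for stage in order:
        if stage not in targets:
            continue  # stage does not feed the chosen sink
        for parent in predecessors.get(stage, []):
            # Simplification: the first consumer visited sets the parent's target.
            targets.setdefault(parent, propose_target(parent, stage, targets[stage]))
    return targets


# Hypothetical four-stage workflow.
workflow = {
    "retrieve": [],
    "summarize": ["retrieve"],
    "critique": ["retrieve"],
    "answer": ["summarize", "critique"],
}
order = backward_order(workflow)
targets = propagate_targets(workflow, "gold answer",
                            lambda parent, child, t: f"{parent}<-{t}")
```

Each stage's prompt could then be revised using the gap between its local target and its actual output, which is the stage-wise update the abstract describes.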

[NLP-31] Small LLMs: Pruning vs. Training from Scratch

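[Quick read]: This paper asks whether pruning is an efficient route to small language models, that is, whether it can reduce model size while retaining quality. Llama-3.1-8B is pruned at ratios of 0.5 to 0.8 with six methods spanning depth, width, and sparse granularities, and evaluated under two controlled token-matched settings. With the same training token budget, initialization from the pruned pretrained model consistently outperforms random initialization, so the parent model provides a strong starting point. The advantage narrows as the token budget grows and as the pruning ratio rises, and nearly vanishes at the highest ratio studied. When training from scratch is instead given the full token budget consumed by the whole pipeline, fine-grained pruning keeps an advantage, while coarser structured pruning can be matched or surpassed. The recommendation: with a large pretrained model in hand and a limited training token budget, pruning beats training from scratch; with an unlimited budget, training from scratch is competitive for coarser pruning, so a large pretrained parent is not always necessary.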

Link: https://arxiv.org/abs/2606.14150
Authors: Yufeng Xu,Taiming Lu,Kunjun Li,Jiachen Zhu,Mingjie Sun,Zhuang Liu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Our code is available at this https URL

Abstract:Pruning promises a shortcut to strong small language models. In this work, we examine this promise by pruning Llama-3.1-8B at pruning ratios of 0.5–0.8 with six methods spanning depth, width, and sparse granularities, under two controlled token-matched settings. (1) With the same training token budget, pruned initialization consistently outperforms random initialization. This shows that the parent model provides a strong starting point, although the advantage narrows as the training token budget grows and as the pruning ratio rises, nearly vanishing at the highest pruning ratio we study. (2) When training from scratch is instead given the full token budget consumed by the whole pipeline, pruning at finer granularities still retains an advantage, while coarser structured pruning can be matched or surpassed. This suggests that the parent model transfers knowledge that additional training tokens alone cannot fully recover, but only at fine granularity. Taken together, our results yield a clear recommendation: with a large pretrained model in hand and a limited training token budget, pruning is better than training from scratch; when the training budget is not limited, training from scratch can be competitive for coarser pruning, so a large pretrained parent is not always necessary.
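The distinction between fine and coarse pruning granularity can be made concrete on a small weight matrix. The sketch below contrasts unstructured pruning (zeroing individual weights) with structured pruning (zeroing whole rows), both at the same pruning ratio. Ranking by magnitude is an assumption chosen for illustration; the abstract does not name the six methods or their importance criteria.

```python
from typing import List

Matrix = List[List[float]]


def prune_unstructured(weights: Matrix, ratio: float) -> Matrix:
    """Fine granularity: zero the individual weights with smallest magnitude."""
    ranked = sorted((abs(w), i, j) for i, row in enumerate(weights)
                    for j, w in enumerate(row))
    pruned = [row[:] for row in weights]
    for _, i, j in ranked[:int(len(ranked) * ratio)]:
        pruned[i][j] = 0.0
    return pruned


def prune_rows(weights: Matrix, ratio: float) -> Matrix:
    """Coarse granularity: zero the entire rows with smallest squared norm."""
    ranked = sorted((sum(w * w for w in row), i) for i, row in enumerate(weights))
    dropped = {i for _, i in ranked[:int(len(weights) * ratio)]}
    return [[0.0] * len(row) if i in dropped else row[:]
            for i, row in enumerate(weights)]


# Toy 4x2 weight matrix pruned at ratio 0.5 under both schemes.
W = [[0.1, -2.0], [3.0, -0.2], [0.05, 0.4], [1.5, -1.0]]
fine = prune_unstructured(W, 0.5)
coarse = prune_rows(W, 0.5)
```

Both results remove half of the weights, but the row-wise version also discards the large entries 1.5 and -1.0, which gives some intuition for why coarser pruning can lose more of what the parent model learned.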

[NLP-32] Personal Care Utility: Health as Everyday Infrastructure

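[Quick read]: This paper addresses the missing infrastructure for everyday health management. The 8,759 hours a year that a person spends outside clinical settings (eating, sleeping, movement, medication, stress) shape long-term health, yet no system structures these continuous personal signals and responds to them. The bottleneck is not data or reasoning capability but the absence of a general architecture that integrates, interprets, and reacts to everyday health events. The paper proposes the Personal Care Utility (PCU), a layered, event-driven architecture. A Personicle turns continuous physiological and behavioral signals into semantically meaningful life events; the system estimates dynamic health state against personal baselines, reasons about cause and context, and routes guidance through an orchestrator that separates clinical decision logic, behavioral strategy selection, and natural-language expression. This separation lets large language models (LLMs) support reasoning and communication while safety-critical clinical decisions stay grounded in validated evidence. For Type 2 Diabetes, PCU turns continuous glucose monitoring (CGM), meal, activity, medication, sleep, stress, and clinical data into glycemic events, individualized state estimates, causal explanations, and knowledge-grounded interventions. A day-in-the-life scenario shows the same infrastructure producing real-time nudges, weekly summaries, medication check-ins, silence, or deterministic safety alerts depending on context and risk. The paper closes with how PCU generalizes to other chronic conditions and with the governance questions an always-on personal health utility must address, treating personalization as an architectural property of everyday health guidance rather than a final messaging layer.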

Link: https://arxiv.org/abs/2606.14145
Authors: Mahyar Abbasian,Elahe Khatibi,Saba A. Farahani,Nitish Nagesh,Arshia Ilaty,Hooman Sajjadi,Amir Rahmani,Ramesh Jain
Affiliations: University of California, Irvine
Categories: Computation and Language (cs.CL)
Comments: 12 pages, 2 figures, 3 tables

Abstract:Healthcare is essential, expert, and episodic by design - built around the roughly one hour per year a person spends with a clinician. The 8,759 hours outside clinical settings, where eating, sleeping, movement, medication, and stress actually shape long-term health, have no comparable infrastructure. The bottleneck for personalized health is not raw data or reasoning capability; it is the absence of that infrastructure layer. This paper introduces the Personal Care Utility (PCU): a layered, event-driven architecture proposed as the missing utility for everyday health, in the way that payments, networks, and power are utilities for their domains. PCU organizes continuous personal signals into semantically meaningful life events through a Personicle, estimates dynamic health state against personal baselines, reasons about cause and context, and routes guidance through an orchestrator that separates clinical decision logic, behavioral strategy selection, and natural-language expression. This separation lets large language models support reasoning and communication while keeping safety-critical clinical decisions grounded in validated evidence. We instantiate PCU for Type 2 Diabetes - turning CGM, meal, activity, medication, sleep, stress, and clinical data into glycemic events, individualized state estimates, causal explanations, and knowledge-grounded interventions. A day-in-the-life scenario shows the same infrastructure producing real-time nudges, weekly summaries, medication check-ins, silence, or deterministic safety alerts depending on context and risk. We close with how PCU generalizes to other chronic conditions and the governance questions any always-on personal health utility must address. The result is a blueprint that treats personalization not as a final messaging layer, but as an architectural property of everyday health guidance.

[NLP-33] Implicit Reasoning for Large Language Model-based Generative Recommendation

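[Quick read]: This paper addresses the difficulty of invoking pretrained world knowledge in LLM-based Generative Recommendation (GR). The core problem is that mainstream methods represent items with Semantic IDs (SIDs), tokens the LLM never saw during pretraining, so it cannot use its pretrained world knowledge through its natural-language reasoning interface. Existing solutions rely on complex multi-stage explicit reasoning pipelines that require acquiring reasoning traces and alignment training; these are costly and offer little clarity about why each stage is needed. The paper systematically dissects the explicit reasoning training paradigm and identifies three key limitations, all of which hurt explicit reasoning performance: weakened verbalization of world knowledge, misalignment between the SID and natural-language embedding spaces, and high sensitivity to rationale quality. In response, it proposes PauseRec, a lightweight implicit reasoning framework that drops reasoning-trace acquisition and reasoning alignment training and recommends without an explicit reasoning chain. PauseRec outperforms standard explicit chain-of-thought methods by up to 6.22%, cuts training cost by up to 65% GPU hours, and speeds up inference by up to 71.3%, offering a more efficient and practical alternative for generative recommendation.

Link: https://arxiv.org/abs/2606.14142
Authors: Yinhan He,Liam Collins,Bhuvesh Kumar,Jundong Li,Neil Shah,Donald Loveland
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: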

Abstract:Large Language Models (LLMs) are increasingly adopted as backbones for Generative Recommendation (GR), promising access to pretrained world knowledge. Yet reliably invoking this knowledge for GR remains poorly understood. A key obstacle is that LLM-based GR typically represents items with Semantic IDs (SIDs), disrupting LLMs’ natural-language reasoning interface because these tokens are unseen by the LLM during pretraining. Existing approaches address this with expensive multi-stage pipelines that ground SIDs and elicit explicit rationales, but offer limited insight into when and why each stage is necessary. In this work, we systematically decompose explicit reasoning training pipelines for LLM-based GR, revealing three key limitations: weakened world-knowledge verbalization, misalignment between SID and natural-language token embedding spaces, and sensitivity to rationale quality, all of which hurt explicit reasoning performance. To circumvent these issues, we propose PauseRec, a lightweight implicit reasoning paradigm tailored for GR. PauseRec is exceptionally practical, avoiding costly reasoning trace acquisition and reasoning alignment training, leading to a multitude of benefits: (1) it outperforms standard explicit CoT methods by up to 6.22%, (2) it reduces training cost by up to 65% GPU hours, and (3) it speeds up inference by up to 71.3%. These results position PauseRec as a lightweight alternative to explicit rationale generation, enabling more effective and efficient LLM-based GR.

[NLP-34] Spatio-Temporal Audio Language Modeling for Dynamic Sound Sources

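[Quick read]: This paper addresses the limited joint semantic and spatial modeling of sound events in current audio-language models. Existing models usually treat an audio clip as global event content and lack fine-grained spatio-temporal modeling of sound events, while sound event localization models track how source directions change over time but offer limited semantic coverage for language reasoning. To close this gap, the authors introduce ST-AudioQA, a dataset and benchmark built from first-order ambisonic (FOA) renderings of static and moving sound sources. Each scene carries metadata on source identity, activity, direction, distance, and motion, which enables dense trajectory supervision and questions about what is sounding, where it is, how it moves, and how sources relate. The key components of the solution are ST-Audio Encoder, a time-resolved FOA audio encoder that jointly learns event semantics and source trajectories, and ST-AudioLM, which connects the encoder's audio tokens to a large language model (LLM) for spatio-temporal audio question answering. Experiments show that the approach improves the semantic-localization trade-off and yields stronger reasoning performance than static spatial and localization-oriented baselines.

Link: https://arxiv.org/abs/2606.14141
Authors: Oh Hyun-Bin,Kazuki Shimada,Yuhta Takida,Kim Sung-Bin,Toshimitsu Uesaka,Takashi Shibuya,Kyeongyoon Lee,Tae-Hyun Oh,Yuki Mitsufuji
Affiliations: Sony AI; Sony Group Corporation; POSTECH; Sungkyunkwan University; KAIST
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: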

Abstract:Sound events are entities with semantic identities, locations, and trajectories, but current audio-language models usually reason about clips as global event content. Conversely, sound event localization models track source directions over time but offer limited semantic coverage for language reasoning. To address this gap, we introduce ST-AudioQA, a spatio-temporal audio QA dataset and benchmark built from first-order ambisonic (FOA) renderings of static and moving sound sources. Each scene provides source identity, activity, direction, distance, and motion metadata, enabling dense trajectory supervision and questions about what is sounding, where it is, how it moves, and how sources relate. We further propose ST-Audio Encoder, a time-resolved FOA audio encoder that learns event semantics together with source trajectories, and ST-AudioLM, which connects the audio tokens from the encoder to an LLM for spatio-temporal audio QA. Experiments show that this representation improves the semantic-localization tradeoff and yields stronger reasoning performance than static spatial and localization-oriented baselines.
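To make the idea of an FOA rendering of a moving source concrete, here is a minimal sketch of first-order ambisonic encoding. It assumes the AmbiX convention (ACN channel order W, Y, Z, X with SN3D normalization, azimuth counter-clockwise from the front); the abstract does not say which convention or rendering pipeline the authors use, and the function names `foa_gains` and `encode_moving_source` are hypothetical. Distance and room acoustics are not modeled.

```python
import math
from typing import List, Sequence, Tuple

Frame = Tuple[float, float, float, float]


def foa_gains(azimuth_deg: float, elevation_deg: float) -> Frame:
    """Per-channel gains (W, Y, Z, X) for a plane-wave source.

    Assumption: AmbiX (ACN order, SN3D normalization), not necessarily
    the convention used in the paper.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    w = 1.0                              # omnidirectional component
    y = math.sin(az) * math.cos(el)      # left-right
    z = math.sin(el)                     # up-down
    x = math.cos(az) * math.cos(el)      # front-back
    return (w, y, z, x)


def encode_moving_source(
    mono: Sequence[float],
    trajectory: Sequence[Tuple[float, float]],
) -> List[Frame]:
    """Encode a mono signal whose direction changes per sample.

    `trajectory` holds one (azimuth_deg, elevation_deg) pair per sample.
    """
    if len(mono) != len(trajectory):
        raise ValueError("mono and trajectory must have the same length")
    frames: List[Frame] = []
    for sample, (az, el) in zip(mono, trajectory):
        w, y, z, x = foa_gains(az, el)
        frames.append((sample * w, sample * y, sample * z, sample * x))
    return frames


# A source that starts in front and ends directly to the left.
frames = encode_moving_source([1.0, 1.0], [(0.0, 0.0), (90.0, 0.0)])
```

The per-sample direction labels in `trajectory` are the kind of signal that could serve as dense trajectory supervision, although the dataset's actual metadata format is not described in the abstract.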

[NLP-35] Beyond Perplexity: UTF-8 Validity in Byte-aware Language Models

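[Quick read]: This paper studies the structural validity of UTF-8 output in language models that use byte-level tokenization: such models can handle any Unicode input but may generate invalid UTF-8 sequences when they encounter rare or unseen characters. The authors examine how training scale relates to UTF-8 generation reliability and propose multiple evaluation protocols, going beyond perplexity, that isolate UTF-8 structural validity from language modeling ability. The key finding is that validity converges roughly twice as slowly as perplexity: perplexity stabilizes after about 2.1B training tokens, whereas UTF-8 validity needs about 4.2B. In context-free generation, rare characters reach higher structural validity than common ones, which suggests over-specialization of frequent character representations. Reliable UTF-8 generation is therefore a capability distinct from language modeling and needs its own evaluation.

Link: https://arxiv.org/abs/2606.14122
Authors: Sangwhan Moon,Daisuke Oba,Youmi Ma,Tatsuya Hiraoka,Naoaki Okazaki
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: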

Abstract:Byte-level tokenization enables language models to handle any Unicode input, but models can generate invalid UTF-8 sequences when encountering rare or unseen characters. We investigate the relationship between training scale and UTF-8 generation reliability with a 355M parameter model trained on 80B tokens from a balanced multilingual corpus of English, Japanese, Korean, and Chinese. We introduce multiple evaluation protocols that isolate UTF-8 structural validity from language modeling. UTF-8 validity convergence lags perplexity by roughly a factor of two: perplexity stabilizes after 2.1B tokens, but UTF-8 validity requires 4.2B tokens. In context-free generation, rare characters achieve higher structural validity than common characters, suggesting over-specialization of frequent character representations. Through experiments, we observed that reliable UTF-8 generation is a distinct capability requiring evaluation beyond perplexity.
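As an illustration of what "UTF-8 structural validity" means for byte-level output, the sketch below checks whether byte sequences decode under Python's strict UTF-8 decoder and reports the valid fraction. It is only a stand-in: the paper's own evaluation protocols are not described in the abstract, and the sample byte strings and the helper names `is_valid_utf8` and `validity_rate` are made up for this example.

```python
from typing import Sequence


def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` is a structurally valid UTF-8 byte sequence."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


def validity_rate(samples: Sequence[bytes]) -> float:
    """Fraction of byte sequences that decode as valid UTF-8."""
    if not samples:
        raise ValueError("samples must be non-empty")
    return sum(is_valid_utf8(s) for s in samples) / len(samples)


# Hypothetical model outputs, written by hand for illustration.
samples = [
    "日本語".encode("utf-8"),       # valid multi-byte text
    b"\xe6\x97",                    # 3-byte sequence cut off after 2 bytes
    b"\x80abc",                     # stray continuation byte at the start
    "한국어 text".encode("utf-8"),  # valid mixed-script text
]
rate = validity_rate(samples)  # two of the four samples are valid
```

A model can assign low perplexity to reference text and still emit sequences like the second and third samples when generating freely, which is why a separate validity check is informative.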

[NLP-36] Simulating Students Java Programming Errors with Large Language Models

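[Quick read]: This paper addresses the high cost and slow pace of collecting student code errors in programming education: for a newly designed programming task, representative student errors are hard to obtain before extensive classroom deployment. The core idea is to use large language models (LLMs) as scalable proxies for students, generating synthetic error data by simulating realistic and diverse logical errors. The study evaluates the diversity of errors generated by different LLMs and their alignment with authentic student errors under three mainstream prompting strategies (Input-Output, Chain-of-Thought, and iterative Self-Refine), and examines how error characteristics vary with task difficulty. All models generate diverse errors, but Claude Sonnet 4 achieves the most balanced trade-off between diversity and alignment; a blinded expert annotation study further indicates that the synthetic errors are functionally indistinguishable from authentic student errors. Harder tasks elicit more diverse but less student-like errors, revealing an inherent trade-off in using LLMs to simulate learners and informing the design of synthetic error data for intelligent tutoring systems, teachable agents, and large-scale learning analytics.

Link: https://arxiv.org/abs/2606.14113
Authors: Ali Keramati,Jie Cao,Iman Mohammadi,Mark Warschauer,Yang Shi
Affiliations: Virginia Tech; Carnegie Mellon University; Utah State University
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments: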

Abstract:Understanding student errors in programming is a cornerstone of programming education, yet obtaining a representative set of student errors for any newly designed task remains slow and costly, since authentic submissions only accumulate after extensive classroom deployment. This paper explores whether large language models (LLMs) can serve as scalable proxies for students by simulating realistic logical errors in code submissions. Using the CodeWorkout dataset of 74,000+ unique student Java submissions across 37 problems, we evaluate five LLMs under three mainstream prompting strategies: Input-Output (IO), Chain-of-Thought (CoT), and iterative Self-Refine. We assess performance along two key dimensions: diversity (the range of distinct error patterns) and alignment (agreement with authentic student mistakes), and examine how these vary by the struggling level of programming tasks. Our quantitative findings reveal that while all models generate diverse errors, their alignment to human submissions diverges: Claude Sonnet 4 achieves the most balanced performance. In addition, we conducted a blinded expert annotation study (N = 401) comparing synthetic and authentic errors. This qualitative analysis confirms that the generated errors are functionally indistinguishable from authentic student errors. Moreover, higher-struggling-level problems elicit more diverse but less student-like errors. These results highlight trade-offs in using LLMs to simulate human learners and suggest design considerations for integrating synthetic errors into teachable agents, intelligent tutoring systems, and large-scale learning analytics.

[NLP-37] Diffusion-Refined Segmentation and Vision-Language Interpretation for Pediatric Brain Tumor MRI

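[Quick read]: This paper tackles several challenges in pediatric brain tumor segmentation: limited annotated data, heterogeneous imaging phenotypes, diffuse tumor boundaries, and class imbalance across tumor subregions. It proposes a two-stage deep learning framework. First, baseline models such as 3D Res U-Net and Swin-UNETR produce a coarse segmentation of multi-modal pediatric brain MRI, identifying the tumor core, whole tumor, and enhancing tumor regions. Second, diffusion-based refinement models (a 3D DDPM refiner and MedSegDiff), conditioned on the coarse predictions, refine the boundaries. The key innovation is this conditioning, which substantially improves diffusion stability and segmentation performance, particularly for enhancing tumor boundaries; conditioned MedSegDiff achieves the best boundary agreement, with the lowest HD95. Finally, the segmentation results are combined with a multimodal language model to generate structured radiology-style reports, supporting an end-to-end, interpretable AI-assisted neuro-oncology workflow from image segmentation to clinical interpretation.

Link: https://arxiv.org/abs/2606.14072
Authors: Wentao Ke,Jianche Liu
Affiliations: Stanford University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: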

Abstract:Accurate pediatric brain tumor segmentation remains challenging due to limited annotated data, heterogeneous imaging phenotypes, diffuse tumor boundaries, and class imbalance across tumor subregions. Here, we present a two-stage deep learning framework for improving multi-modal pediatric brain MRI segmentation and clinical interpretation. First, we evaluate 3D Res U-Net and Swin-UNETR baselines on BraTS-PEDs MRI scans, using four co-registered modalities to predict tumor core, whole tumor, and enhancing tumor regions. Second, we introduce diffusion-based refinement models conditioned on coarse Swin-UNETR predictions, including a 3D DDPM refiner and MedSegDiff. Conditioning substantially improves diffusion stability and performance, particularly for enhancing tumor boundary segmentation. Conditioned MedSegDiff achieves the strongest boundary agreement with the lowest HD95. Finally, predicted tumor volumes and representative segmentation overlays are integrated with a multimodal language model to generate structured radiology-style reports. Together, our results suggest that coarse-to-refined diffusion segmentation can improve pediatric tumor boundary delineation and support end-to-end interpretable AI-assisted neuro-oncology workflows.

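[NLP-38] Harsher on Male? Evaluating LLMs on Gender-Asymmetric Moral Framing Across Diverse Conflict Scenarios

[Quick read]: This paper asks whether large language models (LLMs) show implicit bias in gender-related situations, specifically whether the same negative behavior receives consistent response standards when the actor is male versus female. Prior work has mostly focused on stereotypes, occupational associations, or explicitly harmful outputs; this study exposes a further asymmetry in how models respond to identical misconduct. The key to the approach is GAMA-Bench, a gender-mirrored benchmark of 1,298 scenarios covering intimate relationship and public social conflicts. Gender-neutral misconduct templates are built through controlled grids and cross-model review, then compiled into paired first-person prompts with matched actor-gender and role-reference variations. A structured response-framing protocol measures how models allocate punishment, empathy, escalation, instruction, and blame. Across 10 representative LLMs, the results show a consistent male-disadvantaging asymmetry: male actors receive more punitive, escalatory, and blame-centered framing, while female actors receive more therapeutic and empathy-oriented framing for the same misconduct. The pattern persists across model families, scenario tracks, model scales, and explicit thinking-style reasoning modes, pointing to a systematic gender-fairness problem in current LLMs.

Link: https://arxiv.org/abs/2606.14068
Authors: Guangzong Si,Dong Wang,Zhenhao Li,Yifan Yu,Panwang Pan,Wentao Zhu
Affiliations: University of Science and Technology of China; Eastern Institute of Technology, Ningbo
Categories: Computation and Language (cs.CL)
Comments: under review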

Abstract:Existing studies on gender bias in LLMs have largely focused on stereotypes, occupational associations, or explicit harmful outputs. In this work, we ask whether LLMs apply consistent response standards to the same negative behavior under matched male-actor and female-actor conditions. We introduce GAMA-Bench, a gender-mirrored benchmark of 1,298 scenarios covering intimate relationship and public social conflicts. It constructs gender-neutral misconduct templates through controlled grids and cross-model review, then compiles them into paired first-person prompts with matched actor-gender and role-reference variations. We further design a structured response-framing protocol to measure how models allocate punishment, empathy, escalation, instruction, and blame. Experiments on 10 representative LLMs reveal a consistent male-disadvantaging asymmetry: male actors receive more punitive, escalatory, and blame-centered framing, whereas female actors receive more therapeutic and empathy-oriented framing for the same misconduct. Further analyses show that this pattern persists across model families, scenario tracks, model scale, and explicit thinking-style reasoning. The official code is available at this https URL.

[NLP-39] Non-Parametric Machine Text Detection via Multi-View Gaussian Processes

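[Quick read]: This paper addresses the sharp accuracy loss that machine-generated text detectors suffer under adversarial conditions such as paraphrasing attacks and targeted style transfer. Existing methods rely on parametric classifiers that, given sufficient supervision, combine complementary signals (stylistic features, likelihood and rank-order features, structural features), but under distribution shift (novel attacks or unseen language models) they tend to make confident yet incorrect predictions. The paper proposes a multi-view, non-parametric detection framework that extracts complementary feature views from the same document and aggregates per-view evidence through a Gaussian process ensemble. Because evidence is aggregated across views, an attacker must defeat multiple independent detection axes at once, which substantially raises the cost of evasion; the Gaussian process formulation also provides calibrated probabilities and principled abstention on out-of-distribution inputs, supporting deployment in high-stakes settings. Experiments on three benchmarks spanning diverse generators and attacks (DetectRL, RAID, and the PAN2025 shared task) show that the method maintains strong performance and outperforms existing approaches on held-out attacks.

Link: https://arxiv.org/abs/2606.14060
Authors: Aleem Khan,Nicholas Andrews
Affiliations: Johns Hopkins University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: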

Abstract:Adversarial conditions such as paraphrasing and targeted style transfer sharply degrade the accuracy of machine text detectors. A document, however, carries multiple complementary signals (e.g., stylistic features, likelihood and rank-order features, and structural features), and an attack that suppresses one may leave others intact. While a parametric classifier can learn to combine these features given sufficient supervision, classifiers are prone to making confidently incorrect predictions when the distribution shifts (e.g., novel attacks or unseen language models). To address this, we propose a multi-view, non-parametric detection framework that extracts complementary feature views from the same document and aggregates per-view evidence through a Gaussian process ensemble. By aggregating evidence across views, an adversary must simultaneously defeat multiple independent axes of detection, substantially raising the cost of evasion. The Gaussian process formulation additionally provides calibrated probabilities and principled abstention on out-of-distribution inputs, supporting reliable deployment in high-stakes settings. We evaluate on three benchmarks spanning diverse generators and attacks (the DetectRL and RAID benchmarks and the PAN2025 shared task) and demonstrate that our multi-view detector maintains strong performance under the considered attacks, outperforming existing approaches against held-out attacks.

[NLP-40] Right or Wrong Models Comply: Directional Blindness in LLM Moral Judgment

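[Quick read]: This paper addresses the lack of directional selectivity in how LLMs respond to user pressure: existing evaluations usually measure only whether a model resists pressure, not whether it distinguishes helpful from harmful nudges. The authors propose Compliance Asymmetry (A = BCR/HCR), a bidirectional diagnostic that compares the rate of beneficial output change under helpful nudges with the rate of harmful output change under misleading nudges. On factual judgments, models follow helpful nudges clearly more than harmful ones (A = 1.58), but on moral judgments they comply almost symmetrically (A = 1.04), and this holds across model families, capability levels, and nudging types. Chain-of-thought prompting amplifies helpful and harmful compliance together, while identity-based prompting suppresses both by similar margins. The study identifies direction-blind compliance in moral judgment as a distinct failure mode of current LLMs and argues that alignment should target directionally calibrated updating rather than simply lower overall compliance.

Link: https://arxiv.org/abs/2606.14037
Authors: Jihye Kim,Jeffrey Flanigan
Affiliations: University of California, Santa Cruz
Categories: Computation and Language (cs.CL)
Comments: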

Abstract:As language models take integrated roles across many domains, the response of LLMs to user pushback becomes a critical alignment property. Yet many existing evaluations treat compliance as unidirectional, measuring whether models resist pressure but not whether they resist it selectively. We introduce Compliance Asymmetry (A = BCR/HCR), a bidirectional diagnostic that compares beneficial output change under helpful nudges with harmful change under misleading nudges. Across 9 models and 972,000 nudge-condition responses, we find that this selectivity differs in factual and moral judgments: models follow helpful nudges more than harmful ones on factual questions (A = 1.58), but follow both directions at nearly identical rates on moral questions (A = 1.04). This phenomenon persists across model families, capability levels, and nudging types. Interestingly, we also find that chain-of-thought prompting amplifies helpful and harmful compliance together, while identity-based prompting suppresses both by nearly identical margins. These results identify direction-blind moral compliance as a distinct failure mode in current LLMs and suggest that alignment should target directionally calibrated updating rather than lower compliance alone.
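The ratio A = BCR/HCR can be sketched in a few lines. The abstract does not spell out BCR and HCR, so this example assumes they are the fraction of responses that change beneficially under helpful nudges and the fraction that change harmfully under misleading nudges. The function names and the counts are illustrative only and are not taken from the paper.

```python
def change_rate(changed: int, total: int) -> float:
    """Fraction of nudged responses whose answer changed."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= changed <= total:
        raise ValueError("changed must be between 0 and total")
    return changed / total


def compliance_asymmetry(bcr: float, hcr: float) -> float:
    """A = BCR / HCR.

    A well above 1 means the model follows helpful nudges more often
    than misleading ones; A near 1 means it follows both about equally.
    """
    if hcr <= 0:
        raise ValueError("HCR must be positive for the ratio to be defined")
    return bcr / hcr


# Made-up counts: 300 of 1000 answers improved under helpful nudges,
# 200 of 1000 answers got worse under misleading nudges.
bcr = change_rate(300, 1000)
hcr = change_rate(200, 1000)
asymmetry = compliance_asymmetry(bcr, hcr)
```

Read this way, the reported A = 1.04 on moral questions means the two rates are nearly equal, regardless of how high or low each rate is on its own.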

[NLP-41] Efficiency-Performance Trade-offs in Neural Speaker Diarization via Structured Pruning and Low-Bit Quantization

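[Quick read]: This paper addresses the tension between model size and computational efficiency when deploying real-time streaming speaker diarization on resource-constrained hardware, particularly for time-critical applications such as medical dispatch. The approach compresses the segmentation model with pruning and low-bit quantization and, using SIMSAMU, a dataset of simulated medical-dispatch conversations, characterizes performance across a range of streaming latency budgets. Additional buffering is not consistently beneficial, and very low-latency settings can substantially degrade performance. FP16 quantization halves the model size with an essentially unchanged real-time factor, at the cost of a roughly 40% relative increase in diarization error rate (DER) over the baseline. The work characterizes the performance-resource trade-offs of model compression and offers guidance for reliable real-time speech technology in time-critical settings.

Link: https://arxiv.org/abs/2606.14030
Authors: Rishit Chatterjee,Tahiya Chowdhury
Affiliations: Colby College
Categories: Sound (cs.SD); Computation and Language (cs.CL)
Comments: 6 pages, 3 figures, preprint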

Abstract:Streaming speaker diarization is crucial for time-critical medical dispatch, but deploying it on resource-constrained hardware requires smaller, faster models. Using SIMSAMU, a dataset of simulated medical-dispatch conversations, we evaluate streaming behavior before compressing the segmentation model with pruning and low-bit quantization. We characterize performance across a range of streaming latency budgets and find that additional buffering is not consistently beneficial, while very low-latency operating points can substantially degrade performance. Our study shows that model compression trades performance for memory footprint, and we highlight an operating point where FP16 reduces model size by half with essentially unchanged real-time factor, at a cost of a 40% relative DER increase against the baseline. This work characterizes the trade-offs for real-time deployment and contributes to speech technology that can enable reliable human communication in time-critical contexts.
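Two numbers in this abstract follow from simple arithmetic, sketched below: storing weights as 16-bit floats takes half the bytes of 32-bit floats, and a "40% relative DER increase" is measured against the baseline DER. The weight values and DER figures are placeholders chosen for illustration, not results from the paper, and the helper names are hypothetical.

```python
import struct
from typing import Sequence


def packed_size_bytes(weights: Sequence[float], fmt: str) -> int:
    """Bytes needed to serialize `weights` with a struct format character.

    "f" is IEEE 754 single precision (FP32), "e" is half precision (FP16).
    """
    return len(struct.pack(f"<{len(weights)}{fmt}", *weights))


def relative_increase(baseline: float, new: float) -> float:
    """Relative change of `new` with respect to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (new - baseline) / baseline


# Placeholder weights, small enough to be representable in FP16.
weights = [0.001 * i for i in range(1000)]
fp32_bytes = packed_size_bytes(weights, "f")
fp16_bytes = packed_size_bytes(weights, "e")

# Placeholder error rates: a baseline DER of 10% rising to 14%
# is a 40% relative increase, although only 4 points absolute.
der_increase = relative_increase(0.10, 0.14)
```

The distinction between relative and absolute increase matters when reading the trade-off: the same 40% relative increase is far more costly on a weak baseline than on a strong one.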

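[NLP-42] Same-Origin Policy for Agentic Browsers

[Quick read]: This paper asks whether the same-origin policy (SOP) still prevents cross-origin data leakage now that autonomous AI agents are integrated into web browsers (agentic browsers). The core concern is that an agentic browser, which carries out tasks automatically, can itself become an automated channel for cross-origin data flows that bypass SOP, leading to privacy leaks or malicious data exfiltration. The authors construct SOPBench, a benchmark for evaluating SOP violations, and find that existing agentic browsers frequently violate SOP both in benign settings and under attack. As a solution they propose SOPGuard, an SOP enforcement mechanism tailored to agentic browsers and implemented in the open-source agentic browser BrowserOS; evaluations show that it effectively enforces SOP while preserving task utility and adding only a small runtime overhead.

Link: https://arxiv.org/abs/2606.14027
Authors: Xilong Wang,Xiaoxing Chen,Patrick Li,Dawn Song,Neil Gong
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Systems and Control (eess.SY)
Comments: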

Abstract:Agentic browsers integrate autonomous AI agents into web browsers, enabling users to accomplish web tasks through natural-language instructions. The same-origin policy (SOP) is a fundamental browser security mechanism that prevents unauthorized automated cross-origin data flows induced by scripts. However, whether SOP remains effective in agentic browsers is an open question that has not been systematically studied. In this work, we bridge this gap. We first observe that an agentic browser can itself serve as an automated channel for cross-origin data flows, potentially leading to SOP violations. To investigate this phenomenon, we construct SOPBench, a benchmark for evaluating SOP violations in agentic browsers. Our evaluation shows that existing agentic browsers frequently violate SOP, both in benign settings and under attacks. To address this problem, we propose SOPGuard, an SOP enforcement mechanism tailored to agentic browsers. We implement SOPGuard in BrowserOS, an open-source agentic browser. Extensive evaluations demonstrate that SOPGuard effectively enforces SOP while preserving utility and incurring only a small runtime overhead. Our code and data are available at this https URL.
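For readers unfamiliar with the policy, the sketch below shows the standard notion of an origin as a (scheme, host, port) tuple and a simplified check on whether data read from one page may flow to another. It illustrates the general concept only: the abstract does not describe how SOPGuard works, real browsers have sanctioned exceptions such as CORS, and the function `flow_allowed` is a hypothetical name.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

Origin = Tuple[str, str, Optional[int]]


def origin(url: str) -> Origin:
    """Return the (scheme, host, port) tuple that identifies an origin."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Fall back to the scheme's default port when none is given.
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(scheme)
    return (scheme, host, port)


def same_origin(url_a: str, url_b: str) -> bool:
    """Two URLs share an origin only if scheme, host, and port all match."""
    return origin(url_a) == origin(url_b)


def flow_allowed(source_url: str, destination_url: str) -> bool:
    """Simplified rule: data read from `source_url` may only be written
    to a page of the same origin. Hypothetical policy for illustration."""
    return same_origin(source_url, destination_url)


# An agent that copies text from a mail page into a form on another site
# would create a cross-origin flow under this simplified rule.
leak = flow_allowed("https://mail.example.com/inbox", "https://forum.example.net/post")
```

The scenario the abstract raises is exactly this kind of flow: no script crosses the boundary, but the agent reads content in one origin and acts on it in another.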

[NLP-43] Dialogue SWE-Bench: A Benchmark for Dialogue-Driven Coding Agents

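[Quick read]: This paper addresses the mismatch between how coding agents are evaluated and how they are used: mainstream benchmarks evaluate them as fully autonomous systems and overlook their ability to solve problems collaboratively through dialogue with a user. It introduces Dialogue SWE-Bench, an automatic benchmark dataset for measuring how well coding agents resolve real-world software engineering problems through user interaction. The key elements are a persona-grounded user simulator that supports the task evaluation, and automatic evaluations of dialogue quality that complement it. The paper also proposes a new schema-guided agent, aimed at improving the dialogue capabilities of off-the-shelf coding agents, which improves over strong baselines by 3-14%. The results show that stronger coding models are not always better dialogue models, which suggests that dialogue capability is a distinct and understudied dimension of coding agent performance.

Link: https://arxiv.org/abs/2606.13995
Authors: Brendan King,Jeffrey Flanigan
Affiliations: University of California, Santa Cruz
Categories: Computation and Language (cs.CL)
Comments: 22 pages, 13 figures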

Abstract:AI coding agents have rapidly transformed software engineering, powering widely used interactive coding assistants. Despite their interactive real-world use, existing benchmarks evaluate them as fully-autonomous systems. In this work, we introduce Dialogue SWE-Bench, an automatic benchmark dataset for evaluating the ability of coding agents to resolve real-world software engineering problems through dialogue with a user. We design a novel, persona-grounded user simulator to support our task evaluation, and augment our task evaluation with automatic evaluations of dialogue quality. We also propose a new schema-guided agent, aimed at improving the dialogue capabilities of off-the-shelf coding agents, which improves over strong baselines by 3-14%. Our results indicate that better coding models do not always correspond to better dialogue models, suggesting that dialogue capability is a distinct and currently understudied dimension of coding agent performance.

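[NLP-44] The Holistic Storage of VerbUp Phrases in Text-based and Audio-based Language Models

[Quick read]: This paper examines how language models balance stored representations against abstract, rule-based generative knowledge when processing multi-word units. The central question is whether language models store frequent, highly predictable phrases (here, V+up phrasal verbs) holistically, as usage-based theories of language predict. The approach probes the internal representations of text-based LLMs and an automatic speech recognition (ASR) model, testing whether these phrases develop distinct representations as a function of frequency and predictability. All models show evidence of holistic storage driven by frequency and predictability, lending further support to usage-based theories of language.

Link: https://arxiv.org/abs/2606.13993
Authors: Zachary Nicholas Houghton,Yu Zhou,Dan Pluth,Vijay K. Gurbani
Affiliations: University of Oregon; Vail Systems, Inc
Categories: Computation and Language (cs.CL)
Comments: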

Abstract:A crucial aspect of linguistic capability is the ability to trade off between stored representations and abstract knowledge: one must retrieve learned representations, but also generate novel ones by applying productive rules. While recent work has examined abstract knowledge in language models, holistic storage of multi-word units has received far less attention. We probe internal representations in text-based LLMs and an ASR model, testing whether V+up phrasal verbs develop distinct representations as a function of frequency and predictability. All models show evidence of holistic storage driven by frequency and predictability, further supporting usage-based theories of language.

[NLP-45] Fusing Stylometric and Embedding Systems to Estimate Authorship Likelihood Ratios in Japanese

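[Quick read]: This paper addresses the validity of evidential analysis in authorship attribution across languages, specifically extending the widely accepted likelihood ratio (LR) framework from English to Japanese digital texts. Authorship analysis has traditionally relied on stylometric features, while contextual embeddings from pre-trained large language models offer a newer approach; the two kinds of system had not previously been fused within the likelihood ratio framework. This study performs that fusion for the first time, using roughly 1,000-character excerpts from Japanese blogs. The fused system maintains excellent calibration while increasing consistent-with-fact likelihood ratio magnitudes, decreasing contrary-to-fact magnitudes, and improving overall discriminability. The best-performing fusion achieves a log-likelihood-ratio cost of 0.32484, demonstrating both the feasibility of the likelihood ratio framework for Japanese and the benefit of fusing heterogeneous systems.

Link: https://arxiv.org/abs/2606.13991
Authors: Praju Ghatpande,Satoru Tsuge,Shunichi Ishihara,Wataru Zaitsu,Mitsuyuki Inaba
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: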

Abstract:The likelihood ratio framework is widely recognized as the logically and legally sound basis for evidential analysis across forensic sciences, and its importance is increasingly acknowledged in analyses of authorship in textual evidence. To date, however, its application has been confined to English-language texts. Meanwhile, authorship attribution has traditionally relied on a diverse array of stylometric features, even as the rise of pre-trained large language models enables new contextual-embedding approaches. Combining these diverse approaches through fusion promises enhanced performance, yet it has not been applied to integrate stylometric-feature systems with embedding-based systems within the likelihood ratio paradigm. This study is the first to apply likelihood ratio-based forensic text comparison to Japanese digital texts, using ~1,000-character excerpts from blogs, to 1) evaluate system performance and likelihood ratio magnitudes and 2) assess the impact of fusing stylometric-feature systems with embedding-based systems. The results demonstrate that the fused system maintains excellent calibration while 1) increasing consistent-with-fact likelihood ratio magnitudes; 2) decreasing contrary-to-fact likelihood ratio magnitudes and 3) improving overall discriminability. The best-performing fusion achieved a log-likelihood-ratio cost of 0.32484, illustrating both the feasibility of likelihood ratio framework for Japanese and the benefits of fusion across heterogeneous systems.
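The log-likelihood-ratio cost (Cllr) quoted above is a standard metric for likelihood-ratio systems. A minimal sketch of its usual definition follows; the likelihood ratio values are invented for illustration, and the paper's fusion method is not reproduced here because the abstract does not specify it.

```python
import math
from typing import Sequence


def cllr(same_author_lrs: Sequence[float], diff_author_lrs: Sequence[float]) -> float:
    """Log-likelihood-ratio cost.

    Cllr = 0.5 * ( mean over same-author pairs of log2(1 + 1/LR)
                 + mean over different-author pairs of log2(1 + LR) )

    Lower is better; a system that always outputs LR = 1 (no information)
    scores exactly 1.0.
    """
    if not same_author_lrs or not diff_author_lrs:
        raise ValueError("both LR lists must be non-empty")
    if any(lr <= 0 for lr in list(same_author_lrs) + list(diff_author_lrs)):
        raise ValueError("likelihood ratios must be positive")
    same_term = sum(math.log2(1 + 1 / lr) for lr in same_author_lrs) / len(same_author_lrs)
    diff_term = sum(math.log2(1 + lr) for lr in diff_author_lrs) / len(diff_author_lrs)
    return 0.5 * (same_term + diff_term)


# Invented LRs: large values for same-author comparisons and small values
# for different-author comparisons give a low cost.
good_system = cllr([50.0, 120.0, 8.0], [0.02, 0.1, 0.005])
uninformative = cllr([1.0, 1.0], [1.0, 1.0])
```

The metric penalizes misleading ratios heavily: a same-author comparison scored with a very small LR contributes a large `log2(1 + 1/LR)` term, so both discrimination and calibration affect the final value.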

[NLP-46] Creative Integration: A Decidable Criterion of Creativity

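[Quick read]: This paper addresses the lack of an operational definition of "integrative" solutions in creativity research: how to tell a genuine creative integration (CI) from a superficial, merely tidy re-description. Building on the view of creativity as compression, the author proposes a decidable criterion: the resolution of a real conflict counts as CI if and only if, under a fixed description language, the description length strictly shrinks (compression ratio C = L_pre / L_post > 1) and the reduction is located in the original conflict itself. The judgment is made decidable through four binary, conjunctive gates, and a taxonomy of pseudo-integration names and rejects the look-alikes. The criterion is validated with four falsifiable tests: an independent computational check, discrimination against hard negatives, out-of-sample prediction, and description-language robustness, all of which pass with margin. The contribution is therefore not the idea that creativity is compression but a citable, testable, empirically grounded criterion that can serve as a primitive for a broader research program.

Link: https://arxiv.org/abs/2606.13977
Authors: Yoshinori Nomura
Affiliations: Mirage Mountain Technologies
Categories: Computation and Language (cs.CL)
Comments: 18 pages, 1 figure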

Abstract:“Integrative” solutions are widely praised but rarely defined: we lack an operational way to tell a genuine integration – one that makes the world cheaper to describe – from a tidy re-description. Building on the lineage that treats creativity and intelligence as compression, we give such a criterion for creative integration (CI): the resolution of a real conflict between A and B is CI if and only if, under a fixed description language, the description length strictly shrinks (C = L_pre/L_post > 1), with the reduction located in the conflict itself. We make the judgment decidable through four binary, conjunctive gates, and we fix its extension through a taxonomy of pseudo-integration that names and rejects the look-alikes. We back the criterion with a curated, multi-domain corpus and – crucially – validate it not by human inter-rater agreement but by four falsifiable tests it could fail: an independent computational check, discrimination against hard negatives, out-of-sample prediction, and description-language robustness; all pass with margin. The contribution is not “creativity is compression” but its decidability, discrimination, and corpus: on this account, what makes a move genuinely creative – rather than merely novel – is that it compresses a conflict, with novelty and value as downstream symptoms; whether all creativity is so constituted we state as an explicit conjecture. We claim only the sign of C-1; we judge, not generate. The result is a citable primitive for a broader program.
ACM classes: I.2.7; I.2.0; F.4.1. Submission history: [v1] Thu, 11 Jun 2026 23:49:25 UTC (26 KB).
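The creative-integration criterion stated in the abstract of [NLP-46] reduces to a ratio and a conjunction, which the sketch below makes explicit. It assumes the description lengths L_pre and L_post are already measured in some fixed description language, and it assumes the four gates are combined with the sign test by a plain logical AND; the abstract does not name the gates or say how they relate to C, so the `gates` argument and both function names are hypothetical.

```python
from typing import Sequence


def compression_ratio(l_pre: float, l_post: float) -> float:
    """C = L_pre / L_post for description lengths in a fixed language."""
    if l_pre <= 0 or l_post <= 0:
        raise ValueError("description lengths must be positive")
    return l_pre / l_post


def is_creative_integration(l_pre: float, l_post: float, gates: Sequence[bool]) -> bool:
    """Hypothetical decision rule: all four binary gates pass AND C > 1.

    Only the sign of C - 1 is used, matching the abstract's statement
    that it claims only the sign, not the magnitude.
    """
    if len(gates) != 4:
        raise ValueError("expected exactly four binary gates")
    return all(gates) and compression_ratio(l_pre, l_post) > 1.0


# Invented lengths: describing the conflict takes 120 units before the
# integration and 80 units after it, so C = 1.5.
accepted = is_creative_integration(120, 80, [True, True, True, True])
rejected = is_creative_integration(120, 80, [True, True, False, True])
```

Because the gates are conjunctive, a single failed gate rejects a candidate even when the description shrinks, which is how a rule of this shape would screen out tidy re-descriptions.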

[NLP-47] MedLatentDx: Latent Multi-Agent Communication for Cross-Hospital Rare-Disease Diagnosis

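[Quick read]: This paper addresses the limits that data privacy places on cross-institution diagnosis of rare diseases. No single hospital sees enough cases for reliable diagnosis, so cross-hospital collaboration is a key route to better accuracy, yet privacy regulations restrict transmitting identifiable clinical text. Existing medical agent systems mostly exchange textual evidence, and raw latent states (hidden states and KV caches) may still reveal prompt-derived clinical content. The paper proposes MedLatentDx, a latent multi-agent communication framework in which each hospital agent keeps private clinical records and retrieved cases local and sends only compact latent KV blocks to a host agent. The framework supports two deployment settings: hospital agents that share a backbone use latent KV distillation, while hospitals with different LLM backbones use cross-family latent alignment. On CrossRare-Bench, a self-built large-scale rare-disease benchmark with hospital-level partitions, MedLatentDx improves cross-hospital diagnostic performance while reducing reconstructable clinical content relative to raw-latent communication baselines.

Link: https://arxiv.org/abs/2606.13945
Authors: Ziqing Wang,Lili Zhao,Kaize Ding
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: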

Abstract:Rare diseases affect over 300 million patients across more than 7,000 conditions, yet no single hospital encounters enough cases of any one condition for reliable diagnosis. Cross-hospital collaboration could help by allowing a diagnosing institution to use distributed, case-specific diagnostic evidence, but privacy regulations restrict the transmission of identifiable clinical text across institutional boundaries. This setting raises two challenges: existing medical agent systems often rely on textual evidence exchange, while raw latent states such as hidden states and KV caches may still reveal prompt-derived clinical content. We introduce MedLatentDx, a latent multi-agent communication framework in which hospital agents keep private clinical records and retrieved cases local, and send compact latent KV blocks to a host agent for rare-disease diagnosis. MedLatentDx supports two deployment settings: same-backbone hospital agents use latent KV distillation, while hospitals with different LLM backbones use cross-family latent alignment. On CrossRare-Bench, a self-built large-scale rare-disease benchmark with hospital-level partitions, MedLatentDx improves cross-hospital diagnostic performance while reducing reconstructable clinical content relative to raw-latent communication baselines.

[NLP-48] LLMs Contain Multitudes: How Deployment Context Reshapes Model-Level Preferences and Values

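[Quick Read]: This paper examines a key assumption in current LLM evaluation: recent work treats a model's preferences and values as stable, model-level properties, but has not systematically tested whether they hold up when the deployment context changes. Existing robustness checks are limited to incidental prompt perturbations such as syntax variation and option reordering, and do not test whether a model's choices stay consistent when the high-level task (for example, writing a Reddit post versus a news article) changes. Using two established pairwise paradigms, country preference ranking and utility elicitation, the study makes deployment context the controlled variable. Across five LLMs and over 1.2M pairwise decisions, context produces variation far larger than prompt paraphrasing or temperature controls. In preference rankings over 15 countries, context induces widespread, statistically significant rank shifts, and the Global North favouritism reported in prior work is itself context-dependent. In utility elicitation over 50 outcomes, the broad cross-category ordering is preserved, but fine-grained rankings within domains vary substantially, and cardinal exchange rates between outcomes (e.g. how many lives in one region equal one in another) shift by a factor of 2.47 at the median. Reported preferences and utilities are therefore better understood as context-conditioned measurements than as fixed model properties: safety guarantees obtained under one framing provide limited assurance under another.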

Link: https://arxiv.org/abs/2606.13944
Authors: Filip Trhlik,Aoife O’Flynn,Angela Yu,Arduin Findeis,Paula Buttery
Affiliation: University of Cambridge; ALTA Institute; Leverhulme Centre for the Future of Intelligence; Microsoft UK
Categories: Computation and Language (cs.CL)
Comments: 68 pages, 54 figures, 54 tables

Abstract:Large language models (LLMs) are increasingly characterised in recent evaluation work as having stable, model-level preference and value systems. However, accompanying robustness checks are limited to incidental prompt perturbations such as syntax variation and option reordering. This leaves open whether the measured properties survive when the surrounding task context changes, as it does in most real deployments. We test this directly across two established pairwise paradigms: ranking country preferences and eliciting utility judgements. In both, we make the deployment context – the high-level task the model is performing while making concrete value-dependent choices – our controlled variable, varied across framings such as writing a Reddit post or a news article. Across five LLMs and over 1.2M pairwise decisions, deployment context produces variation far larger than prompt paraphrasing and temperature controls. In country preference rankings over 15 countries, context induces widespread, statistically significant rank shifts; the aggregate Global North favouritism reported in prior work is itself context-dependent, with each model’s bias shifting systematically across contexts. In utility elicitation over 50 outcomes, broad cross-category ordering is preserved, but fine-grained rankings within domains vary substantially, and cardinal exchange rates between outcomes (e.g. how many lives in one region equal one in another) shift by a factor of 2.47 at the median. Reported model-level preferences and utilities are therefore better understood as context-conditioned measurements than fixed model-level properties: safety guarantees obtained under one framing provide limited assurance in another.
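The "factor of 2.47 at the median" statistic can be illustrated with a small sketch. The abstract does not say how the shift factor is defined, so the version below assumes one plausible definition (largest rate divided by smallest rate across contexts, per outcome pair); the outcome names, context names, and rates are all made up for illustration.

```python
from statistics import median
from typing import Dict


def shift_factor(rate_by_context: Dict[str, float]) -> float:
    """Largest exchange rate divided by the smallest across contexts.

    This max/min ratio is an assumed definition, not taken from the paper.
    """
    rates = list(rate_by_context.values())
    return max(rates) / min(rates)


def median_shift(rates_by_pair: Dict[str, Dict[str, float]]) -> float:
    """Median of the per-pair shift factors."""
    return median(shift_factor(r) for r in rates_by_pair.values())


# Hypothetical exchange rates for three outcome pairs under three contexts.
TOY_RATES = {
    "outcome_a vs outcome_b": {"default": 1.5, "reddit_post": 1.0, "news_article": 2.0},
    "outcome_a vs outcome_c": {"default": 2.0, "reddit_post": 4.0, "news_article": 1.0},
    "outcome_b vs outcome_c": {"default": 1.0, "reddit_post": 3.0, "news_article": 3.0},
}

if __name__ == "__main__":
    print(median_shift(TOY_RATES))
```

With these toy numbers the per-pair factors are 2.0, 4.0, and 3.0, so the median shift is 3.0.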

[NLP-49] Can Post-Training Turn LLMs into Good Medical Coders? An Empirical Study of Generative ICD Coding

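[Quick Read]: This paper asks why generative LLMs are often reported to be weak at International Classification of Diseases (ICD) coding. It notes that this finding comes mainly from inference-time settings such as prompting, retrieval, reranking, or tool use, leaving task-specific post-training underexplored. The study is a controlled comparison of discriminative baselines and generative LLM coders across prompting, supervised fine-tuning (SFT), and reinforcement learning with GRPO, all under a common protocol and metric set; to the authors' knowledge it is the first to evaluate RL-based post-training for generative ICD coding. It also introduces PHI, a diagnostic curriculum that extends GRPO to refine missed-code cases. The results show that prompting-only evaluation substantially underestimates generative models: SFT provides the main capability jump, GRPO further improves code-set prediction, and PHI gives targeted gains on macro-level performance. The main bottleneck is therefore not the generative formulation alone, but how the model is adapted and optimized for full-taxonomy recall.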

Link: https://arxiv.org/abs/2606.13940
Authors: Ziqing Wang,Weihao Li,Shijie Chen,Yuan Luo,Kaize Ding
Affiliation: Northwestern University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Automated International Classification of Diseases (ICD) coding is a core medical-coding task for billing, epidemiology, and clinical decision support. Generative large language models (LLMs) are often reported as weak medical coders, but this finding mainly comes from inference-time settings such as prompting, retrieval, reranking, or tool use, leaving the role of task-specific post-training underexplored. We present a controlled empirical study of post-training for generative ICD coding, comparing discriminative baselines with LLM coders across prompting, supervised fine-tuning, and reinforcement learning under a common protocol and metric set. To our knowledge, this is the first study to evaluate RL-based post-training for generative LLM coders in ICD coding. We further introduce PHI, a diagnostic curriculum that extends GRPO to refine missed-code cases. Our results show that prompting-only evaluation substantially underestimates the potential of LLMs for ICD coding. SFT provides the main capability jump, GRPO further improves code-set prediction beyond SFT, and PHI provides targeted gains on macro-level performance. These findings suggest that the main bottleneck is not the generative formulation alone, but how the model is adapted and optimized for full-taxonomy recall. We release our code, data splits, and checkpoints at this https URL.

[NLP-50] DLawBench: Evaluating LLMs Through Multi-Turn Legal Consultation

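[Quick Read]: This paper addresses a gap in existing legal LLM benchmarks: they do not model real lawyer-client interaction, in particular a model's ability to elicit material facts over multiple turns and to guide clients with different personalities. The task requires not only robust legal reasoning but also adapting to four client behaviour types: Cooperative, Dependent, Withdrawn, and Adversarial. The authors propose DLawBench, a diagnostic benchmark grounded in real cases, comprising 461 cases from Chinese and U.S. law, 5,532 paired fact entries, 3,411 inquiry rubrics, and 3,348 issue-resolution rubrics, and use it to evaluate 26 representative LLMs. The experiments show substantial headroom: the best-performing model, GPT-5.5, scores only 0.562 on consultation-grounded legal reasoning. The benchmark also exposes sycophancy in legal consultation and a paradox: models perform worse when clients need guidance most.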

Link: https://arxiv.org/abs/2606.13931
Authors: Li Zhang,Yuzhen Shi,Yiran Hu,Jingwen Zhang,Wenbo Lv,Yubo Ma,Wei Wang,Rongyao Shi,Yuanyang Qiu,Xinran Xu,Yuemeng Qi,Linlin Miao,Jaromir Savelka,Yun Liu,Kevin Ashley,Bing Zhao,Hu Wei,Lin Qu
Affiliation: Unknown
Categories: Computation and Language (cs.CL)
Comments: 37 pages, 8 figures, 26 tables. Code and data: this https URL

Abstract:Lawyer-client consultation is a critical starting point for legal services. Effective legal assistance hinges on eliciting sufficient and truthful information from clients in order to devise strategies that best protect their interests. This task requires Large Language Models (LLMs) not only to perform robust legal reasoning, but also to strategically elicit material facts through multi-turn interactions and effectively guide clients with diverse personalities. Yet existing legal benchmarks overlook this interactive capability. To fill this gap, we introduce DLawBench, a diagnostic benchmark for real-world legal consultation. Drawing on realistic client behavior, we characterize lawyer-client interactions into four types: Cooperative, Dependent, Withdrawn, and Adversarial. Using dialogues grounded in real cases, DLawBench evaluates whether LLMs can effectively conduct legal consultation under realistic conditions. DLawBench comprises 461 cases from Chinese and U.S. law, 5,532 paired fact entries, 3,411 inquiry rubrics, and 3,348 issue-resolution rubrics, and evaluates 26 representative LLMs. Systematic experiments show substantial headroom: the best-performing model, GPT-5.5, achieves only 0.562 on consultation-grounded legal reasoning. More importantly, DLawBench exposes both sycophancy in legal consultation and a paradox: models perform worse when clients need guidance most.

[NLP-51] SANA: What Matters for QA Agents over Massive Data Lakes?

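[Quick Read]: This paper addresses the difficulty of pinpointing where LLM agents fail in exploratory question answering (EQA) over data lakes. End-to-end accuracy alone cannot distinguish failures in search, planning, data analysis, or the agent's Action Policy. The authors present SANA (Search Agent Navigation Ablation), a diagnostic ablation framework that transforms EQA tasks into runtime profiles containing gold source sequences, sanitized subquestions, and execution records. From these profiles SANA constructs idealized search, planning, and data-analysis tools so that each component can be ablated in turn; the residual gap serves as diagnostic evidence of policy failures. On two adapted benchmarks, LakeQA and KramaBench, data analysis is a consistent bottleneck while planning is less so, and search is a major limitation in LakeQA's large data-lake setting but less so in the smaller-scale KramaBench.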

Link: https://arxiv.org/abs/2606.13904
Authors: Austin Senna Wijaya,Jiaxiang Liu,Haonan Wang,Eugene Wu
Affiliation: Columbia University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: 9 pages, 7 figures

Abstract:Exploratory question answering (EQA) over data lakes requires an LLM agent to discover relevant sources, analyze retrieved data, and adapt its actions based on intermediate results. End-to-end accuracy alone cannot distinguish failures in search, planning, data analysis, or the agent’s Action Policy: its decisions about what to do next and when to submit an answer. We present SANA (Search Agent Navigation Ablation framework), a diagnostic ablation framework that transforms EQA tasks into runtime profiles containing gold source sequence, sanitized subquestions, and execution records. SANA uses these profiles to construct idealized search, planning, and data-analysis tools, allowing each component to be ablated; the residual gap is diagnostic evidence for policy failures. To illustrate SANA as a reusable evaluation framework, we adapted two recent EQA benchmarks, LakeQA and KramaBench, and evaluated lightweight and mid-sized agents under fixed prompts, budgets, data lakes, and runtimes. Across both benchmarks, data analysis is a consistent bottleneck while planning is less so. Search is a major limitation in LakeQA’s large data-lake setting, but less so for the smaller-scale KramaBench. SANA thus deconstructs end-to-end task accuracies into a diagnosis of where data-lake agents fail, and allows for systematic comparisons of progress in search, planning, data analysis, and agent design.
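The ablation logic described in the abstract can be sketched roughly as follows. This is one plausible reading, not the authors' implementation: it assumes each component is tested by swapping in its idealized tool and re-measuring accuracy, and that the gap remaining with every idealized tool in place is attributed to the Action Policy. The function name and all accuracy numbers are hypothetical.

```python
from typing import Dict, Tuple


def diagnose(
    baseline: float,
    with_oracle: Dict[str, float],
    all_oracles: float,
) -> Tuple[Dict[str, float], float]:
    """Return per-component accuracy gains and the residual gap.

    baseline:     accuracy of the unmodified agent
    with_oracle:  accuracy when one component is replaced by its idealized tool
    all_oracles:  accuracy when every component is idealized at once
    """
    gains = {name: acc - baseline for name, acc in with_oracle.items()}
    # Whatever is still missing with all idealized tools is read as policy failure.
    residual_gap = 1.0 - all_oracles
    return gains, residual_gap


# Hypothetical accuracies, for illustration only.
TOY_BASELINE = 0.40
TOY_WITH_ORACLE = {"search": 0.55, "planning": 0.42, "data_analysis": 0.60}
TOY_ALL_ORACLES = 0.80

if __name__ == "__main__":
    gains, residual = diagnose(TOY_BASELINE, TOY_WITH_ORACLE, TOY_ALL_ORACLES)
    bottleneck = max(gains, key=gains.get)
    print(bottleneck, round(residual, 2))
```

In this toy setup the largest gain comes from idealizing data analysis, which is how a component would be flagged as the main bottleneck.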

[NLP-52] Gefen: Optimized Stochastic Optimizer

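[Quick Read]: This paper targets the memory cost of the AdamW optimizer in modern deep learning: its first- and second-moment states add roughly two parameter-sized buffers to training memory. The authors propose Gefen, a memory-efficient optimizer that shares second-moment estimates across parameter blocks and quantizes the first moment with a learned codebook, reducing AdamW's memory footprint by about 8x (6.5 GiB per billion parameters) while maintaining the same performance. The method is motivated by a theoretical result: large mixed Hessian entries constrain the ratio of squared gradients toward one, so Hessian-aligned parameters are natural candidates for sharing second-moment statistics. Because computing Hessians is impractical at scale, Gefen infers the block structure from the initial squared gradients, requiring no architecture-specific metadata or hyperparameters beyond AdamW defaults. Gefen learns an exact histogram-based dynamic-programming quantization codebook and reuses the same blocks for first-moment scaling. Across experiments it achieves the lowest peak optimizer memory among the compared AdamW-like methods at AdamW-level performance. In Fully Sharded Data Parallel (FSDP) and Distributed Data Parallel (DDP) training, the smaller footprint allows larger microbatches and significantly higher throughput than AdamW, making Gefen a practical drop-in replacement. A complete Python implementation, including fused CUDA kernels, is provided.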

Link: https://arxiv.org/abs/2606.13894
Authors: Nadav Benedek,Tomer Koren,Ohad Fried
Affiliation: Reichman University; Tel Aviv University; Google Research
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:AdamW is a default optimizer for modern deep learning, but its first and second moment states add roughly two parameter-sized buffers to training memory. We propose Gefen, a memory-efficient optimizer that automatically shares second-moment estimates across parameter blocks and quantizes the first moment using a learned codebook, thereby reducing AdamW’s memory footprint by ~8x while maintaining the same performance, corresponding to a reduction of 6.5 GiB per billion parameters. The method is motivated by a theoretical result showing that large mixed Hessian entries constrain the ratio of squared gradients toward one, suggesting that Hessian-aligned parameters are natural candidates for sharing second-moment statistics. Since computing Hessians is impractical at scale, Gefen infers block structure from the initial squared gradients, requiring no architecture-specific metadata or hyperparameters beyond AdamW defaults. Gefen learns an exact histogram-based dynamic-programming quantization codebook and reuses the same blocks for first-moment scaling. Across diverse experiments, Gefen achieves the lowest peak optimizer memory among the compared AdamW-like methods while maintaining AdamW-level performance. In FSDP and DDP training, the reduced memory footprint enables larger microbatches and improves throughput significantly over AdamW, providing a practical drop-in replacement with lower memory usage that can increase throughput and enable training larger models or using larger batch sizes. We provide the complete Python implementation, including fused CUDA kernels at this https URL
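The "6.5 GiB per billion parameters" figure is consistent with a simple back-of-the-envelope check. The sketch below assumes both AdamW moment buffers are stored in 32-bit floats (4 bytes per value) and that the reduction is exactly 8x; neither assumption is stated in the abstract.

```python
GIB = 2**30  # bytes per GiB


def adamw_state_gib(num_params: int, bytes_per_value: int = 4) -> float:
    """Memory for AdamW's two moment buffers (first and second moment), in GiB.

    Assumes one value per parameter in each buffer, stored at bytes_per_value.
    """
    return 2 * num_params * bytes_per_value / GIB


def saved_gib(num_params: int, reduction: float = 8.0, bytes_per_value: int = 4) -> float:
    """Memory saved when the optimizer state shrinks by the given factor."""
    full = adamw_state_gib(num_params, bytes_per_value)
    return full - full / reduction


if __name__ == "__main__":
    n = 1_000_000_000  # one billion parameters
    print(f"AdamW state: {adamw_state_gib(n):.2f} GiB")
    print(f"Saved at 8x: {saved_gib(n):.2f} GiB")
```

Under these assumptions the AdamW state is about 7.45 GiB per billion parameters, and an 8x reduction saves about 6.52 GiB, which matches the abstract's figure after rounding.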

[NLP-53] Natively Unlearnable Large Language Models

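[Quick Read]: This paper addresses the difficulty of unlearning specific training data sources in LLMs: removing a source's influence is hard because the contributions of different sources are entangled in the model's parameters. Isolating each source in disjoint parameters makes removal easy but obstructs joint learning across sources. The authors propose NULLs (Natively Unlearnable LLMs), a model class that trains a set of shared backbone neurons alongside a pool of sparsely activated sinks. During training, source-specific information naturally concentrates in that source's sinks, while information shared across sources accumulates in the backbone. At deployment, a source is unlearned by disabling its sinks, with no gradient updates and no access to the retained data. NULLs scales to Wikipedia's ~6M articles, treating each as an independent source; unlearning a single article removes knowledge specific to it while preserving facts shared with semantically related articles, closely matching retraining from scratch. In a case study on the Harry Potter books, NULLs resists both adversarial extraction and relearning attacks, and it matches a standard transformer on downstream benchmarks. Source-level unlearning can thus be built natively into LLM training rather than added as an afterthought.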

Link: https://arxiv.org/abs/2606.13873
Authors: Gaurav R. Ghosal,Pratyush Maini,Aditi Raghunathan
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Unlearning aims to remove the influence of specific training data sources, but this has proved challenging because the contributions of different sources are entangled within the model. Isolating source contributions to disjoint parameters makes removal easier, though it obstructs joint learning across sources. We propose NULLs (Natively Unlearnable LLMs), a model class that satisfies the two opposing goals of isolating source-specific contributions and learning jointly across sources, by training a set of shared backbone neurons alongside a pool of sparsely activated sinks. During training, information specific to a source naturally concentrates in its sinks while information shared across sources accumulates in the backbone. A source is then unlearned at deployment by disabling its corresponding sinks, with no gradient updates and no access to the retained data. We show that NULLs scales to Wikipedia’s ~6M articles, isolating each as an independent source. Unlearning a single article removes knowledge specific to it while preserving facts shared with semantically related articles, closely matching retraining from scratch. We note that unlearning with NULLs is also robust: in a case study of unlearning the Harry Potter books, NULLs resists both adversarial extraction and relearning that reverses post-hoc unlearning. Finally, NULLs preserves general language capabilities, matching a standard transformer on downstream benchmarks. Together, these results suggest that source-level unlearning need not be an afterthought. It can be built natively into LLM training while retaining the benefits of shared representation learning.
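The deployment-time step, disabling a source's sinks without any gradient update, can be pictured with a toy layer. This is only a schematic under strong assumptions: the class, the one-to-one mapping from source ids to sink weights, and the scalar arithmetic are all hypothetical, and the abstract does not describe how sinks are assigned or activated.

```python
from typing import Dict, List, Set


class ToySinkLayer:
    """Toy layer with shared backbone weights and per-source sink weights."""

    def __init__(self, backbone: List[float], sinks: Dict[str, List[float]]):
        self.backbone = backbone          # shared weights, always active
        self.sinks = sinks                # hypothetical: source id -> its sink weights
        self.disabled: Set[str] = set()   # sources that have been unlearned

    def unlearn(self, source_id: str) -> None:
        # No gradient step and no retained data: the sinks are simply masked out.
        self.disabled.add(source_id)

    def forward(self, x: float) -> float:
        out = sum(w * x for w in self.backbone)
        for source_id, weights in self.sinks.items():
            if source_id not in self.disabled:
                out += sum(w * x for w in weights)
        return out


if __name__ == "__main__":
    layer = ToySinkLayer([1.0, 2.0], {"article_a": [0.5], "article_b": [0.25]})
    before = layer.forward(2.0)
    layer.unlearn("article_a")
    after = layer.forward(2.0)
    print(before, after)
```

The backbone contribution is untouched by `unlearn`, which is the property the abstract relies on to keep knowledge shared across sources.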

[NLP-54] SuperThoughts: Reasoning Tokens in Superposition

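[Quick Read]: This paper addresses the computational cost of long Chain-of-Thought (CoT) reasoning in LLMs, which stems from generating tokens sequentially. Prior attempts to reason in continuous latent spaces often suffer from training instability and fail to scale to complex, long-horizon tasks for lack of a supervision signal. The proposed SuperThoughts compresses pairs of consecutive CoT tokens into single latent representations and decodes two tokens per step through a lightweight Multi-Token Prediction (MTP) module, preserving discrete token supervision during training while doubling throughput at inference. A confidence-based adaptive mechanism falls back to standard decoding when the model is uncertain. On MATH500, AMC, OlympiadBench, and GPQA-Diamond, SuperThoughts reduces CoT length by about 20–30% with a 1–2 point accuracy drop on most tasks.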

Link: https://arxiv.org/abs/2606.13862
Authors: Zheyang Xiong,Shivam Garg,Max Yu,Vaishnavi Shrivastava,Haoyu Zhao,Anastasios Kyrillidis,Dimitris Papailiopoulos
Affiliation: University of Wisconsin-Madison; Microsoft Research; Princeton University; Rice University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Long Chain-of-Thought (CoT) reasoning improves LLM problem-solving but is computationally expensive due to sequential token generation. While recent works explore reasoning in continuous latent spaces to bypass discrete token generation, they often struggle with training stability and fail to scale to complex, long-horizon tasks due to lack of supervision signal. We propose SuperThoughts, which compresses pairs of consecutive CoT tokens into single latent representations and decodes two tokens per step via a lightweight Multi-Token Prediction (MTP) module. This preserves discrete token supervision at training time while doubling throughput at inference time. We finetune Qwen2.5-Math-1.5B-Instruct, Qwen2.5-Math-7B-Instruct, Qwen2.5-Math-14B-Instruct, and evaluate on MATH500, AMC, OlympiadBench, and GPQA-Diamond. With a confidence-based adaptive mechanism that falls back to standard decoding when uncertain, SuperThoughts achieves ~20–30% CoT length reduction while maintaining accuracy with minimal degradation (1-2 points accuracy drop on most tasks).
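The confidence-based fallback can be sketched as a decoding loop that accepts two tokens per step when a confidence score clears a threshold and otherwise emits a single token. The callback signatures, the threshold value, and the scripted toy model below are all assumptions for illustration; the abstract does not specify how confidence is computed or how the MTP module is invoked.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces; the paper does not specify these signatures.
PairStep = Callable[[List[str]], Tuple[str, str, float]]  # two tokens + confidence
SingleStep = Callable[[List[str]], str]                    # standard one-token step


def adaptive_decode(
    prefix: List[str],
    pair_step: PairStep,
    single_step: SingleStep,
    threshold: float = 0.8,
    max_tokens: int = 16,
    eos: str = "<eos>",
) -> Tuple[List[str], int]:
    """Decode until eos or max_tokens; return (new tokens, number of steps)."""
    out = list(prefix)
    steps = 0
    while len(out) - len(prefix) < max_tokens:
        steps += 1
        first, second, confidence = pair_step(out)
        # Accept the token pair only when confident; otherwise fall back.
        new_tokens = [first, second] if confidence >= threshold else [single_step(out)]
        for token in new_tokens:
            out.append(token)
            if token == eos or len(out) - len(prefix) >= max_tokens:
                return out[len(prefix):], steps
    return out[len(prefix):], steps


# Scripted toy model: a fixed target sequence and per-position confidences.
SCRIPT = ["a", "b", "c", "d", "e", "<eos>"]
CONFIDENCE = {0: 0.9, 2: 0.5, 3: 0.95, 5: 0.9}


def toy_pair(context: List[str]) -> Tuple[str, str, float]:
    i = len(context)
    second = SCRIPT[i + 1] if i + 1 < len(SCRIPT) else "<eos>"
    return SCRIPT[i], second, CONFIDENCE.get(i, 0.0)


def toy_single(context: List[str]) -> str:
    return SCRIPT[len(context)]


if __name__ == "__main__":
    tokens, steps = adaptive_decode([], toy_pair, toy_single)
    print(tokens, steps)
```

On the scripted example the six tokens are produced in four steps instead of six, because one low-confidence position falls back to single-token decoding.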

[NLP-55] Hybrid Classical-Quantum Variational Autoencoder for Neural Topic Modeling

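[Quick Read]: This paper addresses the largely unexplored integration of neural topic models (NTMs) with quantum hardware under the resource constraints of NISQ-era devices. It presents a proof-of-concept hybrid classical-quantum variational autoencoder (VAE) that embeds parameterized quantum circuits in the VAE inference network while retaining a classical topic-word decoder. To cope with limited quantum resources, the author proposes a modified Gaussian Softmax posterior that decouples the latent-space dimensionality from the number of topics, so the model can run on a 10-qubit device. On the AgNews dataset, the hybrid model reaches a C_v coherence score of 0.71 and an NPMI score of 0.20, outperforming state-of-the-art NTMs while preserving high topic diversity; a fully classical variant also outperforms state-of-the-art models and shows clear class separation in the latent space. The results indicate that hybrid VAEs are computationally viable on NISQ-era devices.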

Link: https://arxiv.org/abs/2606.13852
Authors: Ivan Kankeu
Affiliation: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Neural topic models enable scalable semantic discovery, but their integration with quantum hardware remains largely unexplored. We present a proof-of-concept hybrid classical-quantum variational autoencoder (VAE) for topic modeling, embedding parameterized quantum circuits within the VAE inference network while retaining a classical topic-word decoder. To address the resource constraints of quantum hardware, we propose a modified Gaussian Softmax posterior that decouples latent space dimensionality from the number of topics to be extracted, enabling the model to operate with a low-resource 10-qubit quantum device. On the AgNews dataset, the hybrid VAE outperforms state-of-the-art neural topic models (NTMs), reaching a C_v coherence score of 0.71 and an NPMI score of 0.20 while preserving high topic diversity. For comparison, we also construct a fully classical variant, which also outperforms state-of-the-art models on AgNews and exhibits clear class separation in the latent space. These results demonstrate that hybrid VAEs are computationally viable even on NISQ-era devices and represent a promising direction for quantum-enhanced topic modeling.

[NLP-56] Poker Arena: Multi-Axis Profiling of Strategic Reasoning and Memory in LLMs ICML

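[Quick Read]: This paper addresses the one-dimensional evaluation of LLMs on strategic decision-making: existing game-play benchmarks collapse heterogeneous reasoning dimensions into a single scalar, leaving the capability structure of frontier models unexamined. The authors introduce Poker Arena, a no-limit Texas Hold'em tournament platform that couples a three-layer memory architecture (within-hand, session, and cross-session) with a nine-axis cognitive profile that decomposes strategic reasoning into interpretable dimensions such as bet-sizing calibration and positional awareness. Seven frontier models are evaluated across 50 sessions of 1,000 hands and a controlled memory ablation. Tournament chips and aggregate axis score order the field differently: Claude Opus 4.6 wins with +15,730 chips and 14 first-place finishes, yet ranks only fifth of seven on mean axis score, and persistent memory helps some models while hurting others. Multi-axis evaluation thus surfaces capability structure that scalar leaderboards misrank, with cross-dimensional consistency outweighing peak performance on any single axis.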

Link: https://arxiv.org/abs/2606.13815
Authors: Pratham Singla,Shivank Garg,Vihan Singh
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 33 pages, ICML Workshop

Abstract:Strategic reasoning under uncertainty underpins consequential decisions in negotiation, finance, and policy, but prevailing game-play benchmarks collapse heterogeneous reasoning dimensions into a single scalar, leaving the capability structure of frontier LLMs unexamined. We introduce Poker Arena, a no-limit Texas Hold’em tournament platform that couples a three-layer memory architecture (within-hand, session, and cross-session) with a nine-axis cognitive profile decomposing strategic reasoning into interpretable dimensions such as bet-sizing calibration and positional awareness. We evaluate seven frontier models across 50 sessions of 1,000 hands and a controlled memory ablation; tournament chips and aggregate axis score order the field differently: Claude Opus 4.6 wins +15,730 chips with 14 first-place finishes, yet ranks only fifth of seven on mean axis score, while persistent memory helps some models and hurts others. These findings show that multi-axis evaluation surfaces capability structure that scalar leaderboards systematically misrank, with cross-dimensional consistency outweighing peak performance on any single axis.

[NLP-57] The Culture Funnel: You Can't Align What isn't in the Data

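[Quick Read]: This paper challenges current approaches to cultural alignment of LLMs, which focus on inference-time interventions and assume that models already contain sufficient cultural knowledge. The authors argue that modern LLM pipelines suffer from a "cultural data funnel". Applying a multidimensional tagging framework to pretraining, fine-tuning, alignment, and reasoning datasets, they show that explicit cultural signals decline sharply during post-training, while geographically concentrated, task-specialized data dominates. Multilinguality increases the geographic diversity of cultural knowledge but does not ensure balanced representation. The tags improve downstream cultural benchmark performance, indicating that progress requires shifting attention to training data pipelines rather than relying on late-stage intervention. The authors release a culturally tagged dataset of 5.6M samples.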

Link: https://arxiv.org/abs/2606.13808
Authors: Ananya Sahu,Mehrnaz Mofakhami,Daniel D’Souza,Thomas Euyang,Julia Kreutzer,Marzieh Fadaee
Affiliation: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Current cultural alignment approaches focus on inference-time interventions, assuming models already contain sufficient cultural knowledge. We argue modern LLM pipelines suffer from a cultural data funnel. Using a multidimensional tagging framework across pretraining, fine-tuning, alignment, and reasoning datasets, we show explicit cultural signals decline sharply during post-training, while geographically concentrated, task-specialized data dominates. Multilinguality enhances geographic diversity of cultural knowledge but does not ensure balanced representation. Our tags improve downstream cultural benchmark performance, demonstrating that advances require shifting focus in training data pipelines. To facilitate future research, we release our culturally tagged dataset with 5.6M samples at this https URL.

[NLP-58] QIAS 2026: Overview of the Shared Task on Islamic Inheritance Reasoning

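[Quick Read]: This paper gives an overview of the QIAS 2026 shared task, which evaluates the ability of LLMs to reason end to end in the religious and legal domain of Islamic inheritance. Unlike conventional question-answering benchmarks, the task requires systems to carry out the full inheritance calculation from a natural-language case, from identifying the eligible heirs to assigning each beneficiary the correct share. It is built on the MAWARITH benchmark, a dataset of 12,500 Arabic inheritance cases annotated with intermediate reasoning steps and final answers, and submissions are scored with MIR-E, a multi-step metric that measures performance across the main stages of inheritance reasoning. Sixteen teams participated, exploring prompting-based methods, retrieval-augmented generation, and fine-tuning. The results show that current language models still struggle, especially in stages that require precise legal interpretation and structured numerical reasoning.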

Link: https://arxiv.org/abs/2606.13756
Authors: Abdessalam Bouchekif,Somaya Eltanbouly,Samer Rashwani,Shahd Gaben,Mutaz Al-Khatib,Heba Sbahi,Emad Mohamed,Mohammed Ghaly
Affiliation: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:This paper presents a comprehensive overview of the QIAS 2026 shared task, organized as part of the OSACT7 Workshop and co-located with LREC 2026. The shared task was designed to evaluate the ability of large language models to perform complex reasoning in the religious and legal domain of Islamic inheritance. Unlike conventional question-answering benchmarks, QIAS 2026 focuses on end-to-end reasoning from natural language cases, requiring systems to perform the full inheritance calculation process, from identifying the eligible heirs to assigning the correct share to each beneficiary. To support this evaluation, the task was based on the MAWARITH benchmark, a dataset of 12,500 Arabic inheritance cases annotated with intermediate reasoning steps and final answers. System submissions were evaluated using MIR-E, a multi-step metric that measures performance across the main stages of inheritance reasoning. A total of 16 teams participated in the shared task, investigating a range of approaches, including prompting-based methods, retrieval-augmented generation, and fine-tuning strategies. The results show that Islamic inheritance remains a highly challenging benchmark for current language models, especially in stages that require precise legal interpretation and structured numerical reasoning. This overview summarizes the task design, dataset, evaluation framework, participating systems, and main results.

[NLP-59] Which Models Perform Better in Inheritance Reasoning?

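[Quick Read]: This paper describes team PSL's participation in the QIAS 2026 shared task on Arabic Islamic inheritance reasoning, which requires legal interpretation, multi-step reasoning, and precise numerical computation. The authors compare commercial and open-source models under a unified prompting strategy to assess structured legal reasoning with minimal task-specific adaptation. The results show a clear reliability gap: commercial models are stronger at identifying eligible heirs, applying exclusion rules, and staying consistent across reasoning steps, whereas open-source models are less stable, particularly in cases involving dependent legal decisions and fractional share adjustments. Gemini 2.5 Flash performs best, with an MRE of 0.989.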

Link: https://arxiv.org/abs/2606.13751
Authors: Mohammed Amine Mouhoub,Chahinez Bouchekif
Affiliation: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:This paper presents the participation of team PSL in the QIAS 2026 Shared Task on Arabic Islamic inheritance reasoning. The task evaluates the ability of large language models to solve inheritance cases that require legal interpretation, multi-step reasoning, and precise numerical computation. We compare commercial and open-source models under a unified prompting strategy to assess their effectiveness in structured legal reasoning with minimal task-specific adaptation. Our results show a clear gap in reliability between the two model families. Commercial models demonstrate stronger performance in identifying eligible heirs, applying exclusion rules, and maintaining consistency across reasoning steps. In contrast, open-source models exhibit greater instability, particularly in cases involving dependent legal decisions and fractional share adjustments. The best performance is achieved by Gemini 2.5 Flash, with an MRE of 0.989.

[NLP-60] Multimodal Speaker Identification in Classroom Environments

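[Quick Read]: This paper addresses the weak performance of acoustic-only models in automated analysis of K-12 classroom dynamics, where background noise and variable child speech make student identification difficult. It evaluates a multimodal speaker identification framework that anchors acoustic embeddings with LLM-derived semantic context ("contextual anchoring"). On a subset of the EDSI dataset (8 math classrooms, N = 2,801 utterances), the acoustic baseline (ECAPA-TDNN) reaches only 39.0% accuracy; adding transcript-based contextual anchoring to a gradient boosting classifier raises student identification to 50.3%. For utterances longer than 5 seconds, accuracy reaches 76.9% (vs. 64.9% for the baseline) with 90.9% Top-3 accuracy. The model also distinguishes teacher from student roles with 99.3% accuracy. The approach improves the feasibility of automated feedback systems that account for individual student participation.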

Link: https://arxiv.org/abs/2606.13712
Authors: Michael L. Chrzan,Meghavarshini Krishnaswamy,Robert Gibboni,Katie Wetstone,Wei Ai,Jing Liu
Affiliation: 1. University of California, San Diego; 2. Google; 3. OpenAI
Categories: Sound (cs.SD); Computation and Language (cs.CL)
Comments: 9 pages, 5 tables, 3 figures

Abstract:Automated analysis of K-12 classroom dynamics faces challenges due to background noise and variable child speech, often confounding acoustic-only models. This study evaluates a multimodal speaker identification framework anchoring acoustic embeddings with LLM-derived semantic context. Using a subset of the EDSI dataset (8 math classrooms, N = 2,801 utterances), we found an acoustic baseline (ECAPA-TDNN) achieved only 39.0% accuracy. By integrating transcript-based “contextual anchoring” into a gradient boosting classifier, our multimodal approach raised student identification to 50.3%. Performance also improved for utterances over 5 seconds, reaching 76.9% accuracy (vs. 64.9% baseline) with a 90.9% Top-3 accuracy. Additionally, the model distinguished teacher vs. student roles with 99.3% accuracy. This approach advances the feasibility of automated feedback systems capable of considering individual student participation, a crucial step for supporting equitable instruction at scale.
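For readers unfamiliar with the Top-3 figure, Top-k accuracy counts an utterance as correct when the true speaker is among the k highest-scoring candidates. The sketch below is a generic illustration, not the authors' evaluation code; the speaker ids and scores are invented.

```python
from typing import Dict, List


def top_k_accuracy(
    score_rows: List[Dict[str, float]],
    true_speakers: List[str],
    k: int,
) -> float:
    """Fraction of utterances whose true speaker is among the k top-scored candidates."""
    hits = 0
    for scores, truth in zip(score_rows, true_speakers):
        ranked = sorted(scores, key=scores.get, reverse=True)[:k]
        hits += truth in ranked
    return hits / len(true_speakers)


# Hypothetical classifier scores for four utterances over four students.
TOY_SCORES = [
    {"s1": 0.60, "s2": 0.30, "s3": 0.10, "s4": 0.00},
    {"s1": 0.20, "s2": 0.50, "s3": 0.25, "s4": 0.05},
    {"s1": 0.40, "s2": 0.30, "s3": 0.20, "s4": 0.10},
    {"s1": 0.10, "s2": 0.15, "s3": 0.05, "s4": 0.70},
]
TOY_TRUTH = ["s1", "s3", "s4", "s4"]

if __name__ == "__main__":
    print(top_k_accuracy(TOY_SCORES, TOY_TRUTH, k=1))
    print(top_k_accuracy(TOY_SCORES, TOY_TRUTH, k=3))
```

On the toy data, Top-1 accuracy is 0.5 and Top-3 accuracy is 0.75: the second utterance is missed at k=1 but recovered at k=3.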

[NLP-61] Orchestra-o1: Omnimodal Agent Orchestration

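[Quick Read]: This paper addresses the limited generality of agent orchestration in LLM-based multi-agent systems: existing orchestration frameworks support only a narrow set of modalities and struggle in omnimodal scenarios where text, image, audio, and video coexist and interact. The authors propose Orchestra-o1, an omnimodal agent orchestration framework whose unified orchestration mechanism enables modality-aware task decomposition, online sub-agent specialization, and parallel sub-task execution. On the OmniGAIA benchmark it surpasses the second-best approach by 10.3% accuracy. The paper also introduces decision-aligned group relative policy optimization (DA-GRPO), an efficient agentic reinforcement learning method used to train Orchestra-o1-8B, which achieves state-of-the-art performance among existing open-source omnimodal agents.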

Link: https://arxiv.org/abs/2606.13707
Authors: Fan Zhang,Vireo Zhang,Shengju Qian,Haoxuan Li,Hao Wu,Jinyang Wu,Donghao Zhou,Zhihong Zhu,Zheng Lian,Xin Wang,Pheng-Ann Heng
Affiliation: CUHK; LIGHTSPEED; PKU; THU; Tongji University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The recent success of agent swarms has shifted the paradigm of large language model (LLM)-based agents from single-agent workflows to multi-agent systems, highlighting the importance of agent orchestration for task decomposition and collaboration. However, existing orchestration frameworks are limited to a narrow set of modalities and struggle to generalize to more complex settings where heterogeneous modalities coexist and interact. This limitation becomes particularly pronounced in omnimodal scenarios, where tasks require the unified understanding and coordination of diverse inputs such as text, image, audio, and video. In this work, we propose Orchestra-o1, an omnimodal agent orchestration framework designed to support efficient agent collaboration across multiple modalities. Orchestra-o1 introduces a unified orchestration mechanism that enables modality-aware task decomposition, online sub-agent specialization, and parallel sub-task execution. This scalable design allows agent systems to effectively tackle complex real-world tasks involving heterogeneous information sources, surpassing the second-best approach by 10.3% accuracy on the OmniGAIA benchmark. Furthermore, we introduce decision-aligned group relative policy optimization (DA-GRPO), an efficient agentic reinforcement learning approach for training Orchestra-o1-8B, which also achieves state-of-the-art performance against all existing open-source omnimodal agents.

[NLP-62] Incentives Of EdTech: A Systematic Review Of EduNLP Research ACL2026

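[Quick Read]: This paper asks whose interests educational NLP (EduNLP) research actually serves among the stakeholders of education. It presents a systematic literature review of 204 papers published in 2024 and 2025 in venues of the Association for Computational Linguistics' Special Interest Group on Building Educational Applications (SIGEDU), validated against EdTech papers from the wider ACL Anthology. Examining stakeholder inclusion and the prioritisation of research tasks reveals a tension between private-sector incentives and the foundational needs of educational infrastructure. Three findings stand out: teachers are systematically under-represented as beneficiaries of research (33.3%) despite being the most affected, real-world deployment remains rare (9.8%), and ethical engagement tends toward acknowledgement rather than action. Drawing on exemplary papers in the corpus, the authors offer concrete recommendations for more responsible EduNLP research practices.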

Link: https://arxiv.org/abs/2606.13691
Authors: Gabrielle Gaudeau,Aoife O’Driscoll,Jasper Degraeuwe,Andrew Caines,Donya Rooein,Zeerak Talat
Affiliation: University of Cambridge; Ghent University; Bocconi University; University of Edinburgh
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL)
Comments: 10 main pages (13 appendix pages), 20 figures, accepted to 21st Workshop on Innovative Use of NLP for Building Educational Applications @ ACL 2026

Abstract:While the Natural Language Processing community has dedicated significant resources in developing educational technologies (EdTech) that support this shift, it remains unclear whose interests are being best served among the stakeholders of education. In this paper, we present a systematic literature review of 204 papers published in venues of the Association for Computational Linguistics’ Special Interest Group on Building Educational Applications in 2024 and 2025, and validate these against EdTech papers from the wider ACL Anthology. By examining stakeholder inclusion and the prioritisation of research tasks, our findings reveal a critical tension: a push and pull between private-sector incentives and the foundational needs of educational infrastructure. Our analysis reveals that teachers are systematically under-represented as beneficiaries of research (33.3%) despite being the most affected, that real-world deployment remains rare (9.8%), and that ethical engagement tends toward acknowledgement rather than action. Drawing on exemplary papers in our corpus, we offer concrete recommendations for more responsible EduNLP research practices.

[NLP-63] Indirect Computing Model with Indirect Formal Method

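[Quick read]: This paper addresses the limitations of the traditional general-purpose digital computer paradigm in meeting complex, heterogeneous information-processing demands, in particular how to achieve efficient, intelligent collaborative computing and information fusion as cloud data centers evolve into knowledge centers. The core problem is that existing computing models struggle to support unified processing of large-scale and small-scale information (such as long texts and short instructions) and lack theoretical support for human-computer collaborative intelligent systems. The key to the solution is an indirect computing model and an indirect formal theory that are compatible with both large and small strings and that systematically rework Turing's computability theory, Kleene's formal theory of small strings, von Neumann's digital computer architecture, and Turing's hypothesis on AI judgment. By tightly integrating the human-computer interface with collaborative computing programs, the paper builds a prototype of a collaborative intelligent computing system and uses Chinese information data as an example to demonstrate feasibility. The approach optimizes cloud computing from data-driven to knowledge-driven operation and moves the computing paradigm from traditional data centers toward intelligent knowledge centers.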

Link: https://arxiv.org/abs/2606.13690
Authors: Xiaohui Zou
Affiliations: China University of Geosciences (Beijing); Tsinghua Science Park; UC Berkeley
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL)
Comments: 10 pages, 6 figures

Abstract:This paper, from the perspective of a collaborative intelligent computing system formed by combining human-computer interface and collaborative computing programs, discusses the principles of optimized cloud computing technology supported by the combination of an indirect computing model and an indirect formal method. On the basis of systematically reviewing the influence of previous theoretical achievements (Turing’s computability theory, Kleene’s formal theory of small strings, von Neumann’s digital computer architecture, and Turing’s hypothesis on AI judgment) on the mainstream general-purpose digital computer paradigm, the author focuses on introducing an indirect computing model and an indirect formal theory compatible with both large and small strings. Using Chinese information data as an example, the design concept of a collaborative intelligent computing system prototype is presented. The significance is that this achievement facilitates optimization of cloud computing from data centers to knowledge centers.

[NLP-64] Benchmarking Web Agent Safety under E-commerce Deceptive Interfaces ACL2026

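[Quick read]: This paper addresses the safety of autonomous web agents when they face deceptive interfaces in real e-commerce settings. As web agents are increasingly deployed for real-world tasks, deceptive designs (such as targeted advertisements, domain redirection, and shopping manipulation) can easily mislead them, causing incorrect behavior or even serious consequences. The paper proposes WebDecept, a lightweight and configurable plugin framework that injects seven typical deceptive interface patterns into existing web environments in a controlled way, enabling systematic evaluation of multimodal web agents under realistic attack scenarios. The key to the solution is dynamic frontend injection of deceptive patterns, which yields a reproducible and controllable adversarial test environment, exposes how weak prompt-based constraints are as a defense for current agents, and allows analysis of how the design features of deceptive patterns affect attack success. The results indicate that existing agents generally lack robustness to complex deception and that more effective safeguards are needed at the level of safety mechanism design to support reliable real-world deployment.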

Link: https://arxiv.org/abs/2606.13686
Authors: Zijing Shi, Meng Fang, Ling Chen
Affiliations: University of Technology Sydney; University of Liverpool
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: Accepted to ACL 2026

Abstract:As autonomous web agents are increasingly deployed to perform real-world tasks, ensuring their safety has become a critical concern. In this work, we study web agent behavior under realistic deceptive interfaces in the e-commerce domain. We introduce WebDecept, a lightweight and configurable plugin framework that enables controlled injection of deceptive interface patterns into existing web environments. Using WebDecept, we instantiate seven deceptive patterns commonly observed on the open web, including targeted advertisements, domain redirection, and shopping manipulation. By injecting these patterns into the frontend during task execution, we perform controlled evaluation of multiple multimodal web agents. Our results show that current web agents are highly susceptible to multiple classes of deceptive interfaces, and that prompt-based constraints are often insufficient to mitigate these failures. We further analyze how the design choices of deceptive patterns influence the success of such manipulations. These findings highlight safety challenges that should be addressed as web agents are scaled toward real-world deployment.

[NLP-65] The Coin Flip Judge? Reliability and Bias in LLM-as-a-Judge Evaluation

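[Quick read]: This paper addresses the substantial instability of large language models used as judges (LLM-as-a-Judge) under repeated evaluation, focusing on reliability in key applications such as ranking model outputs, training reward models, and building leaderboards. The study finds that although mean pointwise score gaps are small and not significant in aggregate, pairwise preference outcomes flip in 13.6% of repeated evaluations on average, reaching 56% on one question, so a single evaluation is highly random; GPT-4o-mini also shows a marked first-position bias, which further skews evaluation. In addition, cross-model agreement is only 76% (κ = 0.51), and semantically equivalent prompt templates can change the majority verdict, showing that results are sensitive to prompt details. The study indicates that stable, reliable conclusions require multi-trial aggregation, position randomization of answers, and explicit uncertainty reporting. Because the experiments use only two models from a single provider, cross-provider replication remains a key next step.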

Link: https://arxiv.org/abs/2606.13685
Authors: Abel Yagubyan
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 24 pages, 7 figures

Abstract:LLM-as-a-Judge is now widely used to rank model outputs, train reward models, and populate public leaderboards, but its run-to-run reliability remains under-characterized. We study repeated identical evaluations on 29 tasks spanning 10 categories using two OpenAI judge models (GPT-4o-mini and GPT-4.1-mini), with 50 pairwise trials and 50 pointwise trials per question, supplemented by temperature and prompt-sensitivity ablations. Across judges, pairwise preferences flip on average 13.6% of the time, with 28% of questions exceeding a 20% flip rate and one question reaching 56%. GPT-4o-mini also exhibits a significant first-position bias (72% A-majority, p = 0.024). At the same time, mean pointwise score gaps are small (0.19–0.36 on a 10-point scale) and not statistically significant in aggregate, producing a pairwise–pointwise gap: judges frequently choose a winner even when their own scalar scores provide little evidence of a meaningful quality difference. Beyond within-judge instability, cross-judge agreement is only 76% (κ = 0.51), semantically equivalent prompt templates change majority outcomes in 25% of tested cases, and deterministic decoding reduces but does not eliminate inconsistency. A reliability curve analysis shows that, in our dataset, 11 repeated trials are needed for a majority vote to recover the 50-trial reference verdict with 95% probability on average, rising to 15 for high-variance questions. These findings suggest that single-trial LLM judging is often too noisy for high-stakes evaluation, and that multi-trial aggregation, position randomization, and explicit uncertainty reporting should be standard practice. Because both judges are from a single provider, cross-provider replication remains an important next step.
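
The flip-rate and majority-vote ideas in this abstract can be illustrated with a small sketch. It is not the paper's code: the abstract does not define "flip rate", so the sketch assumes it is the share of trials that disagree with the majority verdict, and it assumes trials are independent with a fixed per-trial agreement probability. The function names and the 50-trial record are hypothetical.

```python
from collections import Counter
from math import comb


def flip_rate(verdicts):
    """Share of trials that disagree with the majority verdict (assumed definition)."""
    counts = Counter(verdicts)
    majority_count = counts.most_common(1)[0][1]
    return 1 - majority_count / len(verdicts)


def majority_recovery_prob(p_agree, k):
    """Probability that a majority vote over k independent trials (k odd)
    matches the reference verdict, when each single trial matches it
    with probability p_agree."""
    if k % 2 == 0:
        raise ValueError("use an odd k to avoid ties")
    need = k // 2 + 1
    return sum(
        comb(k, i) * p_agree**i * (1 - p_agree) ** (k - i)
        for i in range(need, k + 1)
    )


# Hypothetical 50-trial record for one question: 40 votes for A, 10 for B.
trials = ["A"] * 40 + ["B"] * 10
rate = flip_rate(trials)
for k in (1, 3, 11):
    print(k, round(majority_recovery_prob(1 - rate, k), 3))
```

Under the independence assumption, the recovery probability rises with the number of trials, which is the qualitative shape a reliability curve would show. Real judge trials may be correlated, so the numbers from this sketch should not be compared with the paper's reported trial counts.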

[NLP-66] Cross-Dataset Bloom Question Classification: Supervised Models and Prompted LLMs

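[Quick read]: This paper addresses automatic Bloom's Taxonomy labeling of educational assessment questions in order to reduce instructor workload. Traditional labeling is manual, subjective, and dependent on teacher experience. Existing machine learning (ML) and deep learning (DL) models perform well within a single dataset but generalize poorly across datasets, which limits their use in real, varied teaching settings; meanwhile, the effectiveness of large language models (LLMs) on Bloom classification has not been studied systematically. The paper evaluates the cross-dataset generalization of existing ML/DL methods on five datasets and compares LLMs under several prompting strategies, finding that the best strategy, which combines in-context examples with course-specific action verbs, markedly improves classification. Experiments show that supervised ML/DL models degrade sharply on unseen datasets, whereas LLMs are more stable and adapt better across settings. Building on the best prompting strategy, the study also designs a lightweight user interface (UI) that helps instructors automatically process large question banks; a usability study indicates that the tool substantially lowers workload and is highly usable. The key to the solution is therefore a prompt-engineering strategy that combines in-context examples with domain-specific action verbs, exploiting the robustness and scalability of LLMs in cross-dataset settings.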

Link: https://arxiv.org/abs/2606.13684
Authors: Abdolali Faraji, Mohammadreza Molavi, Zohreh Rasoulkhani, Mohammadreza Tavakoli, Gábor Kismihók
Affiliations: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted at AIED 2026. Abdolali Faraji and Mohammadreza Molavi contributed equally to this work

Abstract:Automatic Bloom’s taxonomy classification of assessment questions can substantially reduce instructor workload, but labeling is subjective and teacher-dependent. Prior machine learning (ML) and deep learning (DL) approaches reported strong within-dataset results, yet were rarely evaluated in cross-dataset settings, leaving real-world generalizability unclear; meanwhile, LLM effectiveness for Bloom question classification has not been systematically studied. We evaluated the cross-dataset generalization of existing ML/DL methods and assessed LLMs with multiple prompting strategies on five datasets; the best prompting strategy combined in-context examples with course-specific action verbs. Supervised ML/DL models degraded substantially on unseen datasets, whereas LLMs were more stable, suggesting a robust alternative across diverse educational contexts. Based on the best prompting strategy, we also presented a lightweight UI that supports instructors in automatically classifying large question banks; a usability study indicated low workload and high usability.

[NLP-67] UP-NRPA: User Portrait based Nested Rollout Policy Adaptation for Planning with Large Language Models in Goal-oriented Dialogue Systems

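[Quick read]: This paper addresses the difficulty current dialogue policy planning methods have in adapting dynamically to diverse user characteristics. Its core solution is an online framework, User Portrait based Nested Rollout Policy Adaptation (UP-NRPA), which uses Large Language Models (LLMs) to customize policies in real time without offline reinforcement learning. The key innovation is to adjust the dialogue policy dynamically from real-time user feedback together with the personality traits, preferences, and objectives mapped from the current user portrait, so that the system responds adaptively to different user characteristics without relying on pre-training or offline reinforcement learning models. Experiments show that UP-NRPA performs strongly on both collaborative and non-collaborative dialogue benchmarks, reaching a 100% success rate on multiple tasks; in negotiation scenarios in particular, the sale-to-list ratio (SL) rises by 56.41%, which supports its ability to adapt efficiently to diverse user needs without training.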

Link: https://arxiv.org/abs/2606.13683
Authors: Hui Wang, Fafa Zhang, Meng Liu, Xiangyu Chen, Chaoxu Mu
Affiliations: Anhui University; Anhui Provincial Key Laboratory of Security Artificial Intelligence; Pengcheng Laboratory
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:To address the challenge that current dialogue policy planning methods struggle to dynamically adapt to diverse user characteristics, this paper proposes a User Portrait based Nested Rollout Policy Adaptation (UP-NRPA) online framework with Large Language Models. In contrast to conventional approaches that depend on model training and require offline reinforcement learning policy models for user groups, UP-NRPA enables dynamic customization of dialogue strategies through an adaptive mechanism. This is achieved by leveraging real-time user feedback alongside personality, preferences, and objectives mapped from the current user portrait, thereby adapting to user characteristics without offline reinforcement learning. In collaborative and non-collaborative dialogue benchmarks, UP-NRPA demonstrated considerable benefits, achieving an impressive 100% success rate in multiple dialogue tasks. Particularly in negotiation tasks, the sale-to-list ratio (SL) increased by 56.41%. This demonstrates that UP-NRPA can adapt to diverse user needs without requiring a training mechanism, enabling the dialogue system to adapt to user characteristics.

[NLP-68] GAGPO: Generalized Advantage Grouped Policy Optimization

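[Quick read]: This paper addresses temporal credit assignment under delayed rewards for reinforcement learning agents in multi-turn interactive environments. In the typical setting, the agent receives a sparse global reward only after the full task trajectory ends, so it is hard to judge how much each intermediate decision contributed to the final outcome, and the optimization signal is imprecise and hard to propagate. Existing methods usually rely on costly auxiliary value models to estimate state values, whereas this paper proposes a critic-free solution: Generalized Advantage Grouped Policy Optimization (GAGPO). Its core innovation is to build a non-parametric grouped value proxy from sampled rollouts and to compute TD/GAE-style temporal advantages from that proxy, recursively propagating the final reward backward; combined with group-wise advantage normalization and an action-level importance ratio, it extracts stable, localized optimization signals directly from multi-turn trajectories. Experiments show that GAGPO outperforms strong reinforcement learning baselines on the ALFWorld and WebShop benchmarks, with faster early-stage learning, higher interaction efficiency, and smoother optimization dynamics, supporting it as an efficient and simple framework for multi-turn agentic reinforcement learning.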

Link: https://arxiv.org/abs/2605.13217
Authors: Siyuan Zhu, Chao Yu, Rongxin Yang, Zongkai Liu, Jinjun Hu, Qiwen Chen, Yibo Zhang
Affiliations: Sun Yat-sen University; Meituan
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Reinforcement learning has become a powerful paradigm for post-training large language model agents, yet credit assignment in multi-turn environments remains a challenge. Agents often receive sparse, trajectory-level rewards only at the end of an episode, making it difficult to determine which intermediate actions contributed to success or failure. As a result, propagating delayed outcomes back to individual decision steps without relying on costly auxiliary value models remains an open problem. We propose Generalized Advantage Grouped Policy Optimization (GAGPO), a critic-free reinforcement learning method for precise, step-aligned temporal credit assignment. GAGPO constructs a non-parametric grouped value proxy from sampled rollouts and uses it to compute TD/GAE-style temporal advantages, recursively propagating outcome supervision backward through time. Combined with group-wise advantage normalization and an action-level importance ratio, GAGPO extracts stable, localized optimization signals directly from multi-turn trajectories. Experiments on ALFWorld and WebShop show that GAGPO outperforms strong reinforcement learning baselines. Further analyses demonstrate faster early-stage learning, improved interaction efficiency, and smoother optimization dynamics, suggesting that GAGPO offers a simple yet effective framework for multi-turn agentic reinforcement learning.
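
To make the phrase "grouped value proxy plus TD/GAE-style advantages" concrete, here is a minimal sketch. It is not GAGPO itself: the abstract does not say how the proxy is built, so this sketch assumes the value at step t is the mean final reward of the rollouts in the group that are still running at step t, with a sparse reward paid only on the last step. The function name `grouped_gae` and the default `gamma` and `lam` values are hypothetical.

```python
import numpy as np


def grouped_gae(final_rewards, lengths, gamma=0.99, lam=0.95):
    """GAE-style advantages using a step-indexed group-mean value proxy.

    final_rewards[i] is the terminal reward of rollout i and lengths[i]
    its number of steps. The proxy V[t] is the mean final reward of the
    rollouts still running at step t (an assumption of this sketch).
    """
    final_rewards = np.asarray(final_rewards, dtype=float)
    lengths = np.asarray(lengths)
    horizon = int(lengths.max())

    values = np.zeros(horizon)
    for t in range(horizon):
        alive = lengths > t
        values[t] = final_rewards[alive].mean()

    advantages = []
    for reward, length in zip(final_rewards, lengths):
        adv = np.zeros(int(length))
        running = 0.0
        for t in reversed(range(int(length))):
            last = t == length - 1
            r_t = reward if last else 0.0            # sparse terminal reward
            next_v = 0.0 if last else values[t + 1]  # terminal state has value 0
            delta = r_t + gamma * next_v - values[t]  # TD residual
            running = delta + gamma * lam * running   # GAE recursion
            adv[t] = running
        advantages.append(adv)

    # Group-wise normalization across every step of every rollout.
    flat = np.concatenate(advantages)
    mean, std = flat.mean(), flat.std() + 1e-8
    return [(a - mean) / std for a in advantages]


# Four hypothetical rollouts of one task: two succeed, two fail.
advs = grouped_gae([1.0, 0.0, 1.0, 0.0], [3, 3, 2, 4])
for a in advs:
    print(np.round(a, 3))
```

No learned critic appears anywhere: the only "value function" is an average over sampled rollouts, which is the sense in which such a method can be called critic-free. A real implementation would likely group by state or context rather than by step index alone.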

[NLP-69] Detecting Historical Turning Points in Italian Media: A Complex Systems Approach to a Diachronic News Corpus

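[Quick read]: This paper addresses the lack of large-scale, continuous, historically relevant corpora from the pre-digital era in historical research, particularly the scarcity of data for long periods of intense social and political change. The key to the solution is to use Natural Language Processing (NLP) to systematically reconstruct and quantitatively analyze about 600,000 news articles published by the Italian newspaper La Repubblica between 1985 and 2000, forming a diachronic corpus. By applying NLP methods at the lexical and semantic levels and combining them with analytical tools from complex systems and statistical physics, the study tracks the evolution of media discourse without supervision and identifies key historical turning points such as Italy's transition from the First Republic to the Second Republic, the Gulf War, and the Kosovo War. The method moves beyond traditional historical analysis that depends on manual annotation or prior frameworks, shows the potential of combining computational linguistics with complex systems theory to reveal the dynamics of social change, and offers a new quantitative route for studying the evolution of media and society from large-scale textual data.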

Link: https://arxiv.org/abs/2606.14348
Authors: Dario Zarcone, Salvatore Miccichè, David Sanchez
Affiliations: University of Palermo; Institute for Cross-Disciplinary Physics and Complex Systems (IFISC), UIB-CSIC
Categories: Physics and Society (physics.soc-ph); Statistical Mechanics (cond-mat.stat-mech); Computation and Language (cs.CL)
Comments: 16 pages, 9 figures, 1 table

Abstract:The increasing availability of large-scale textual corpora has opened new possibilities for data-driven, quantitative approaches to historical analysis using Natural Language Processing (NLP). However, diachronic corpora with historical relevance from the pre-digital era remain scarce and often incomplete. We present a quantitative approach to historical analysis based on the reconstruction and exploration of a diachronic corpus of around 600,000 articles from the Italian newspaper “La Repubblica”, covering all the articles published from the 1st of January 1985 to the 31st of December 2000 - a period of major political, social, and geopolitical change in Italy and globally. Using NLP techniques, we analyze the text at both lexical and semantic levels; we then apply tools from complex systems and statistical physics to trace shifts in media discourse over time. This allows us to detect key transition periods, such as the transition from the First Republic to the Second Republic in Italy, or major international conflicts like the Gulf War or the Kosovo War, without relying on prior labeling. The results show how combining computational linguistics with ideas from complex systems can offer new quantitative insight into historical changes, opening up new paths for studying the dynamics of media and society through large-scale textual data.

Information Retrieval

[IR-0] Private Information Retrieval for Large-Scale DNA-Based Data Storage

Link: https://arxiv.org/abs/2606.14557
Authors: Gökberk Erdoğan, Daniella Bar-Lev, Rawad Bitar, Antonia Wachter-Zeh, Zohar Yakhini
Categories: Information Retrieval (cs.IR)
Comments: 9 pages, 6 figures

Abstract:We investigate Private Information Retrieval (PIR) in the context of synthetic DNA-based data storage. While PIR is a well-studied primitive for digital databases, extending it to DNA-based databases presents unique challenges arising from biochemical query mechanisms and their complexity. We propose two approaches for adapting two-server PIR protocols to DNA-based storage, balancing privacy, efficiency, and feasibility. These approaches illustrate how information-theoretic privacy trade-offs manifest in DNA-based storage systems.

[IR-1] Verifiable User Simulation for Search and Recommendation Systems SIGIR2026

Link: https://arxiv.org/abs/2606.14474
Authors: Chenglong Ma, Xinye Wanyan, Danula Hettiachchi, Ziqi Xu, Yongli Ren, Jeffrey Chan
Categories: Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC)
Comments: Presented as a half-day tutorial at SIGIR 2026, 4 pages

Abstract:Large-language-model (LLM) based user simulation is increasingly adopted for evaluating search engines, recommender systems, and retrieval-augmented generation pipelines, yet most simulators remain opaque: it is difficult to determine why a simulated user made a particular choice or whether that choice is consistent with the intended user profile. Compounding this, recent research shows that LLMs can produce biased or discriminatory responses depending on user background characteristics such as language, education level, and cultural context, raising concerns about the equitable treatment of minority and disadvantaged groups. This half-day, in-person tutorial introduces a proposed design-and-audit framework that treats a user simulator as a verifiable engineering artefact composed of seven auditable components - structured Persona, task-aware Contract, matched human-vs-agent Execution, auditable Trace, persona-aligned Verification, structured Feedback, and a Refinement loop that updates personas and contracts. Through two hands-on mini-labs on recommendation-list evaluation and search-query formulation, participants will inspect simulator behaviour end-to-end, distinguish diagnostic discrepancy analysis from statistical validation, and apply checks for fidelity, credibility, and demographic bias. The tutorial targets information retrieval and recommender systems researchers and practitioners interested in user behaviour simulation and responsible AI.

[IR-2] ScoreGate: Adaptive Chunk Selection for Retrieval-Augmented Generation via Dual-Score Statistical Fusion

Link: https://arxiv.org/abs/2606.14269
Authors: Karamvir Singh, Arvind Jain
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: 20 pages, 6 figures, 14 tables

Abstract:Fixed-cardinality retrieval injects a constant top-K chunks into the generator regardless of query complexity, causing over-retrieval for narrow queries and under-retrieval for compositional ones. We describe ScoreGate, a lightweight score-space decision mechanism that controls retrieval cardinality at inference time using two scores already produced by the standard pipeline: bi-encoder similarity s_i and cross-encoder reranker score r_i, with no additional model inference calls required. Its core insight is that cross-encoder affirmation can rescue semantically relevant chunks that bi-encoder retrieval ranks poorly due to vocabulary mismatch – a failure mode unaddressed by fixed-K or single-score thresholding. On MS MARCO (200 dev queries), ScoreGate achieves MRR@10 = 0.401 with 35% fewer retained chunks than Standard Top-K. On an internal benchmark (n=300, Fleiss’ kappa=0.87), ScoreGate observed zero false positives (95% CI [96.4%, 100%]) at 97.77-99.34% recall, with 34.8% fewer tokens per query and only 31ms added latency. Results on both MS MARCO and real-world production traffic suggest that adaptive retrieval cardinality can improve retrieval efficiency without degrading retrieval quality.
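
The "rescue" idea in this abstract can be sketched as a score-space gate over the two scores the pipeline already produces. This is not ScoreGate's actual rule, which the abstract does not spell out: the z-score fusion, the function names, and both thresholds below are assumptions chosen only to show how a strong cross-encoder score can keep a chunk that the bi-encoder ranked poorly.

```python
from statistics import mean, pstdev


def zscores(xs):
    """Standardize scores within one query's candidate list."""
    mu, sigma = mean(xs), pstdev(xs)
    return [0.0 if sigma == 0 else (x - mu) / sigma for x in xs]


def select_chunks(bi_scores, ce_scores, keep_z=0.5, rescue_z=1.0):
    """Return indices of chunks to pass to the generator.

    A chunk is kept when the average of its two z-scores reaches keep_z,
    or when the cross-encoder alone is confident (z >= rescue_z) even
    though the bi-encoder ranked it poorly. Both thresholds are made up.
    """
    zs, zr = zscores(bi_scores), zscores(ce_scores)
    kept = []
    for i, (s, r) in enumerate(zip(zs, zr)):
        if (s + r) / 2 >= keep_z or r >= rescue_z:
            kept.append(i)
    return kept


bi = [0.82, 0.79, 0.40, 0.35, 0.30]  # hypothetical bi-encoder similarities s_i
ce = [0.91, 0.20, 0.15, 0.95, 0.10]  # hypothetical cross-encoder scores r_i
print(select_chunks(bi, ce))
```

With these made-up scores, chunk 3 has a low bi-encoder similarity and survives only through the rescue clause, while the number of retained chunks varies per query instead of being fixed at K.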

[IR-3] ChronoID: Infusing Explicit Temporal Signals into Semantic IDs for Generative Recommendation

Link: https://arxiv.org/abs/2606.14260
Authors: Dongdong Nian, Dongqi Fu, Chenliang Xu, Yinglong Xia, Hong Li, Hong Yan, Jian Kang
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:Semantic IDs are crucial in generative recommendation, but with a fundamental limitation: temporal information is not well incorporated into semantic IDs. Instead, time influences recommendation only implicitly (e.g., through session construction heuristics, preference alignment, or sequence order), while existing semantic ID learning remains entirely time-agnostic. This design conflates interactions occurring under distinct temporal contexts into identical semantic representations, implicitly assuming that item semantics and user intent are temporally stationary. Such an assumption is misaligned with real-world recommendation scenarios, where evolving interaction rhythms play a central role. In this work, we investigate where and how the explicit time should be incorporated into semantic ID for generative recommendation. First, we systematically characterize the design space along three orthogonal dimensions of temporal signals and present a unified framework, ChronoID, for time-aware semantic ID learning. Then, by contributing a new time-explicit generation recommendation benchmark, ChronoID answers the questions: what is the effective way of infusing time, how to design the architecture, and where does the gain come from.

[IR-4] CoRe: A Continuously Reward-Finetuned LLM Query Rewriter for Multi-Stage Context-Aware Relevance in Web-Scale Video Search

Link: https://arxiv.org/abs/2606.14127
Authors: Yilin Wen, Rong Yang, Xiaojia Chang, Hong Sun, Gefu Tang, Chunhui Liu, Jeffrey Chen, Zeyu Ma, Lisong Qiu, Xiaochuan Fan, Congjia Yu, Quan Zhou, Yuheng Chen, Zian Wang
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: 12 pages, 3 figures

Abstract:LLM-based query rewriters in production face a tension: the training reward must reflect how the rewrite is consumed by the production ranker, yet the training procedure must be cheap enough to support continuous redeployment as data drifts. We present CoRe (Context Relevance), such a system, redeployed weekly for over five months in a major short-video search engine. Our reward uses the deployed multimodal relevance model as its source and a multiplicative ratio form mirroring the production fusion algebra, closing the simulation-production gap that offline reward proxies leave open. A semi-online Mixed Preference Optimization loop makes this reward affordable at multi-million-instance weekly scale: a DPO-style pairwise objective restricts the gradient pass to a small top-k/bottom-k subset of sampled trajectories, and a phase structure reduces trainer/inference-server parameter syncs from per-step to per-phase. An automated promotion gate over reward-like and stability metrics detected and recovered from a real reward-hacking incident in production. Rewriter output is consumed as parallel relevance signals at recall, rawrank, and finerank without displacing the original signals, bounding rewriter-failure blast radius. Online A/B from two sequential production launches, first deploying the rewriter at finerank, then extending consumption to recall and rawrank, delivers statistically significant reductions in change-query rate on rewrite-impacted queries, with all headline relevance and engagement metrics moving in the expected direction.

[IR-5] Knowledge Graph Enhanced Memory-Augmented Retrieval for Long Context Modeling

Link: https://arxiv.org/abs/2606.14047
Authors: Ghadir Alselwi, Basem Suleiman, Hao Xue, Shoaib Jameel, Hakim Hacid, Flora D. Salim, Imran Razzak
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:Long-context language modeling requires not only extending context windows but maintaining coherent understanding of entity states and relationships across thousands of tokens – a challenge that semantic similarity alone cannot address. KGERMAR addresses this by constructing dynamic, context-specific knowledge graphs from input text during inference, enabling domain-adaptive retrieval that leverages both semantic similarity and explicit entity relationships. The framework performs real-time entity and relation extraction to build contextual knowledge graphs, then integrates graph-structural embeddings with textual semantics through a multi-component memory architecture. Three memory banks – contextual, semantic, and structural – are maintained with retrieval signals fused via learned weights to capture both surface-level semantics and deeper relational patterns. Evaluated on SlimPajama (84.7K training examples), WikiText-103 (4,358 examples), PG-19 (100 examples), and Proof-pile (46.3K examples), KGERMAR achieves up to 8.5% lower perplexity and 2–2.5x better memory efficiency than memory-augmented baselines across context lengths from 1K to 32K tokens, with superior in-context learning performance across five NLU tasks. The dynamic knowledge graph construction approach advances memory-augmented language modeling by enabling domain-specific knowledge representation that adapts to input contexts rather than relying on fixed knowledge bases.

[IR-6] When Recommendation Denoising Meets Popularity Bias: Understanding and Mitigating Their Interaction

Link: https://arxiv.org/abs/2606.14046
Authors: Guohang Zeng, Jie Lu, Guangquan Zhang
Categories: Information Retrieval (cs.IR)

Abstract:Implicit feedback is the dominant data source for recommender systems, but behavioral logs are often contaminated by false-positive interactions caused by mis-clicks, biased exposure, and interface effects. Denoising recommendation methods improve robustness by down-weighting or filtering interactions suspected to be noisy, often relying on the small-loss heuristic. We revisit this heuristic through the lens of popularity bias. Tail-item positives can be harder to fit because they are sparsely observed, and thus may receive larger losses even when they reflect genuine user preference. Under such popularity-dependent loss patterns, monotone loss-based reweighting can suppress clean-but-hard tail signals and increase the head-tail imbalance in effective supervision. We formalize this interaction through the effective head-tail signal ratio induced by denoising weights and derive a conditional reallocation result: when the loss distribution of tail positives is right-shifted relative to that of head positives, small-loss reweighting increases the effective head-tail signal ratio compared with ERM. Motivated by this analysis, we propose Popularity-Aware Denoising (PAD), a lightweight plug-in framework that modulates denoising strength by item popularity. PAD applies stronger denoising to highly exposed items while being more conservative on tail items, preserving more clean-but-hard long-tail signals. Experiments on three datasets and three backbones show that PAD generally improves over representative denoising baselines and provides favorable accuracy-diversity tradeoffs, especially on MF-style recommenders.
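
The reallocation claim in this abstract (small-loss reweighting raises the head-tail signal ratio when tail losses are right-shifted) can be checked on toy numbers. The sketch below is not the paper's formulation: the exponential weight, the definition of the ratio as total head weight over total tail weight, the popularity-scaled strength, and all sample values are assumptions made for illustration.

```python
import math


def weight(loss, strength):
    """Monotone small-loss weight: the larger the loss, the smaller the weight."""
    return math.exp(-strength * loss)


def head_tail_ratio(samples, strength_fn):
    """Total weight on head-item positives divided by total weight on
    tail-item positives.

    samples: list of (loss, popularity, is_head) with popularity in [0, 1].
    strength_fn maps an item's popularity to a denoising strength.
    """
    head = sum(weight(loss, strength_fn(pop)) for loss, pop, is_head in samples if is_head)
    tail = sum(weight(loss, strength_fn(pop)) for loss, pop, is_head in samples if not is_head)
    return head / tail


# Hypothetical clean positives: tail losses sit to the right of head losses.
samples = [(0.2, 0.9, True), (0.3, 0.8, True), (0.8, 0.1, False), (0.9, 0.2, False)]

erm = head_tail_ratio(samples, lambda pop: 0.0)       # no reweighting
uniform = head_tail_ratio(samples, lambda pop: 1.0)   # same strength for all items
pad_like = head_tail_ratio(samples, lambda pop: pop)  # strength grows with popularity
print(erm, uniform, pad_like)
```

On this toy data the uniform small-loss weights push the ratio above the ERM value, while scaling the strength by popularity pulls it back down, which is the qualitative behaviour the abstract attributes to popularity-aware denoising.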

[IR-7] ADORE: Iterative Query Expansion with Retrieval-Grounded Relevance Feedback

Link: https://arxiv.org/abs/2606.13905
Authors: Amin Bigdeli, Negar Arabzadeh, Radin Hamidi Rad, Sajad Ebrahimi, Charles L. A. Clarke, Ebrahim Bagheri
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Abstract:LLM-based query expansion improves retrieval by enriching the original query with additional context. Yet most methods remain generation-driven, producing plausible pseudo-documents or expansions without checking how the target corpus responds. This can introduce retrieval drift, amplify misleading vocabulary, or miss terms that distinguish relevant from non-relevant documents. We argue that effective expansion requires retrieval-grounded feedback, not just single-pass generation or unverified iteration. We introduce ADORE (ADapt, Observe, Relevance Evaluate), an iterative framework that turns retrieval outcomes into feedback for the next expansion. At each round, an LLM generates pseudo-passages, a retriever exposes the corpus response, and a relevance assessor evaluates retrieved documents against the original query. These judgments identify what to reinforce, what remains undercovered, and what to suppress. Across TREC Deep Learning, BEIR, and BRIGHT, ADORE consistently outperforms strong query expansion baselines with notable improvements across nearly all evaluation settings, improving average nDCG@10 by 24.5% over BM25 and 3.6% over the strongest prior query expansion method on BEIR, and by 122.9% over BM25 and 9.2% over the best query expansion baseline on BRIGHT. Our code and data are publicly available.

[IR-8] Mood-Aware Music Recommendation: Integrating User Affective Signals into Ranking Systems

Link: https://arxiv.org/abs/2606.13858
Authors: Terence Zeng, Abhishek K. Umrawal
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 13 pages, 4 figures, and 1 table

Abstract:Recommendation systems are essential in modern music streaming platforms due to the vast amount of available content. While collaborative filtering is widely used to suggest items based on the preferences of others with similar patterns, it performs poorly in domains where user-item interactions are sparse, such as music. Content-based filtering is an alternative approach that examines the qualities of the items themselves. Genre, instrumentation, and lyrics have been explored; however, relatively little attention has been given to emotion recognition. Since a user’s emotional state strongly influences their music choice, incorporating mood signals offers a promising direction for personalization. In this work, we propose a mood-conditioned ranking framework that integrates user affective signals into the recommendation process via softmax-based sampling in the energy-valence space. We evaluate the approach via single-blind experiments in which participants compare recommendations from the proposed system against a baseline. The results indicate improved perceived recommendation quality, providing preliminary evidence for the effectiveness of incorporating mood-based inputs into music recommendations.
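
One way to read "softmax-based sampling in the energy-valence space" is sketched below. The abstract gives no formula, so this assumes each track and the user's mood are points in a two-dimensional (energy, valence) plane and that sampling probability is a softmax over negative Euclidean distance. The function names, the temperature, and the track coordinates are all hypothetical.

```python
import math
import random


def mood_probabilities(tracks, mood, temperature=0.2):
    """Softmax over negative distance between each track's
    (energy, valence) point and the user's mood point."""
    logits = [-math.dist(point, mood) / temperature for point in tracks.values()]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    return {name: w / total for name, w in zip(tracks, weights)}


def sample_tracks(tracks, mood, k, seed=None):
    """Draw k tracks (with replacement) from the mood-conditioned distribution."""
    probs = mood_probabilities(tracks, mood)
    rng = random.Random(seed)
    return rng.choices(list(probs), weights=list(probs.values()), k=k)


# Hypothetical catalogue: name -> (energy, valence), both in [0, 1].
tracks = {"calm": (0.2, 0.6), "upbeat": (0.9, 0.9), "dark": (0.3, 0.1)}
mood = (0.85, 0.8)  # hypothetical user mood point
print(mood_probabilities(tracks, mood))
print(sample_tracks(tracks, mood, k=5, seed=0))
```

Sampling rather than taking the nearest track keeps some variety in the list; a lower temperature concentrates probability on tracks closest to the mood point. Sampling with replacement is a simplification, and a ranking system would more plausibly sample without it.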

[IR-9] Hybrid Neural Retrieval with Generative Query Refinement for Quranic Passage Retrieval

Link: https://arxiv.org/abs/2606.13837
Authors: Mohamed G. Salman, Mohammad E. Moftah, Ali Hamdi
Categories: Information Retrieval (cs.IR)
Comments: Accepted for presentation at the Intelligent Methods, Systems, and Applications (IMSA) 2026 conference. © 2026 IEEE

Abstract:Quranic Passage Retrieval (PR) could be a challenging task due to the linguistic complexity and the semantic gap between the Modern Standard Arabic (MSA) used in daily queries and the Classical Arabic (CA) of the Holy Quran. These factors hinder conventional retrieval methods. To handle these limitations and improve multi-verse retrieval and filter the zero-answer queries, this paper proposes a four-phase neural architecture designed to enhance retrieval accuracy and contextual understanding. The methodology combines hybrid candidate retrieval using AraColBERT dense indexing and BM25 sparse retrieval, followed by semantic reranking with a CAMeLBERTmix cross-encoder. A confidence gating mechanism is then applied to filter zero-answer queries, and an AraT5-based refinement module for multi-verse aggregation. The system is evaluated on an expanded version of the Quran QA 2022 dataset. Results show improved performance compared to the baseline models, achieving a Recall@10 of 0.7024 and a Mean Average Precision (MAP@10) of 0.4947. While the system exhibits a marginal tradeoff in absolute top-rank precision (MRR = 0.5807) compared to heavily optimised single models, the proposed architecture provides a substantially more comprehensive, reliable, and context aware solution for multi-verse Quranic passage retrieval.

[IR-10] TASR: Training-Free Adaptive Stopping for Iterative Retrieval KDD2026

Link: https://arxiv.org/abs/2606.13814
Authors: Adrian Kieback, Uyiosa Philip Amadasun, Aman Chadha, Aaron Elkins
Categories: Information Retrieval (cs.IR)
Comments: 9 pages, 5 figures. Accepted at Agent4IR Workshop, KDD 2026

Abstract:Iterative retrieval-augmented generation agents commonly overspend by continuing to retrieve after the model has converged on an answer, incurring calls that change neither the prediction nor the supporting evidence. Existing remedies learn a stopping policy from labeled trajectories, tying the decision to a trained component that requires retraining for each new model or task. We propose TASR (Training-Free Adaptive Stopping Rule), a one-line predicate that fires when the model repeats its previous-round normalized answer and the isotonically calibrated logit margin exceeds 0.25. No classifier or value head is learned; the threshold is fixed across all twenty-four (model, retriever, corpus) configurations we evaluate. On a 3-model x 2-dataset distractor grid, TASR retains 94.8% of fixed-k=5’s macro F1 at 62.6% of its calls and exceeds fixed-k=3 by +3.42 F1. The pattern holds on nine open-domain BM25 cells (55.01 F1 at 2.98 calls vs. 54.33 at 3.00 for fixed-k=3) and, with calibration locked from the distractor split, on nine dense-retrieval cells across two retriever families, with zero significant regressions in either extension. The rule was selected from an exhaustive enumeration of 381 candidate stopping rules; no alternative Pareto-dominates it on any evaluated configuration. A signal-quality analysis shows that verbalized 1-5 confidence collapses on RLHF-tuned models (96.5% of values equal 5, entropy 0.182 nats), while the logit margin achieves 44x better class-conditional separation, grounding the design in a measurable model pathology. TASR is an auditable, training-free Pareto baseline against which learned stopping controllers can be compared. Code is publicly available.
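
The stopping predicate described in this abstract is simple enough to sketch directly. The 0.25 threshold and the two conditions (repeated normalized answer, calibrated logit margin above the threshold) come from the abstract; everything else is an assumption. In particular, the normalization recipe is a common QA convention rather than the paper's stated one, the calibrated margin is taken as a precomputed input, and the function names are hypothetical.

```python
import re
import string


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace
    (a common QA normalization, assumed here)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def should_stop(prev_answer, curr_answer, calibrated_margin, threshold=0.25):
    """Stop retrieving when the normalized answer repeats the previous
    round's answer and the calibrated logit margin exceeds the threshold."""
    if prev_answer is None:
        return False  # first round: nothing to compare against
    repeated = normalize_answer(prev_answer) == normalize_answer(curr_answer)
    return repeated and calibrated_margin > threshold


# Hypothetical rounds of an iterative retrieval loop.
print(should_stop(None, "Paris", 0.9))
print(should_stop("The Eiffel Tower.", "eiffel tower", 0.4))
print(should_stop("The Eiffel Tower.", "eiffel tower", 0.2))
```

Because the rule has no learned parameters, it can be audited by reading it, which matches the abstract's framing of TASR as a training-free baseline. The isotonic calibration that produces `calibrated_margin` would have to be fitted separately.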

[IR-11] Nomenclature Ontology for Medical And Disease names (NOMAD): taxonomy of types and origins of disease names

Link: https://arxiv.org/abs/2606.13719
Authors: Spiros Denaxas,Cai Ytsma,Giannos Louloudis,Jackie MacArthur,Harry Hemingway
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:The nomenclature of human disease has developed organically over the past centuries using Greek, Latin, and Arabic terminology and reflects the idiosyncrasies of different eras of medical discovery. Despite evident heterogeneity in naming practices, no systematic framework exists for characterising these conventions across all diseases. In this paper, we describe the Nomenclature Ontology for Medical And Disease names (NOMAD), a meta-taxonomy that classifies disease names according to their naming conventions. We developed a two-level taxonomy comprising 9 top-level categories and 20 subcategories and applied it to 22,548 index entries from the ICD-10-CM 2026 Alphabetical Index in a scalable three-stage machine learning-driven classification pipeline. Classification was multi-label, reflecting the compositional nature of medical nomenclature. We classified 99.1% of terms with a mean of 2.12 labels per entry. Anatomical categories were the most prevalent (63.8% of entries), followed by Descriptive (48.4%) and Pathophysiological (40.2%), while Eponymous and Geographical labels were less common than their cultural prominence might suggest (9.7% and 1.9% respectively). Among all Eponymous diseases, we identified only 57 (2.6%) of diseases named after a female person. We manually reviewed a random sample of n=2,255 entries (10%) for accuracy and calculated a full agreement rate of 70% and partial agreement rate of 26% (macro-averaged Cohen’s Kappa score 0.832). Naming convention profiles varied substantially across ICD-10-CM chapters, reflecting specialty-specific epistemological traditions: infectious disease chapters were dominated by etiological labels and showed the highest proportion of geographical region related labels, the circulatory chapter by anatomical and pathophysiological labels, and mental and behavioural disorders showed the highest prevalence of socio-behavioral labels.

[IR-12] Personalization and Evaluation of Conversational Information Access

Link: https://arxiv.org/abs/2606.13717
Authors: Hideaki Joko
Categories: Information Retrieval (cs.IR)
Comments: PhD Thesis of Hideaki Joko (Radboud University, the Netherlands)

Abstract:Conversational interactions have reshaped information retrieval systems, as users increasingly favour direct answers over traditional hyperlinks. To build reliable Conversational Information Access (CIA) systems that account for personal context, this thesis addresses challenges: (1) personal context extraction, (2) personalized response generation, and (3) effective and interpretable system evaluation. First, we tackle personal context extraction by studying what Entity Linking (EL) in conversations entails, introducing a dataset for conversational entity linking (ConEL), and proposing CREL, a novel EL method tailored for conversational settings. Second, we focus on personalized response generation by proposing LAPS, a method for efficiently constructing large-scale, human-written, personalized conversational datasets, and using them to study how users’ preferences can be utilized to generate personalized responses. Finally, we address the need for effective and interpretable system evaluation by introducing FACE, an automatic, reference-free method that assesses entire conversations and aligns closely with human judgments.

Human-Computer Interaction

[HC-0] The Self-Aware Body: A User-Centered Framework for Designing Therapeutic Sonic Interactions

Link: https://arxiv.org/abs/2606.14664
Authors: Prithvi Ravi Kantan,Sofia Dahl,Erika G. Spaich
Categories: Human-Computer Interaction (cs.HC)
Comments:

Abstract:This chapter presents a framework for designing therapeutic sonic interaction technologies, with a focus on movement sonification: the real-time conversion of bodily motion into sound that serves as feedback during motor rehabilitation. Despite growing evidence for their effectiveness, technologies implementing movement sonification are yet to be systematically adopted as part of clinical practice, potentially due to a lack of standardized development methodologies as well as inadequate integration of clinical stakeholder perspectives into interaction design. The framework addresses these barriers through three interconnected contributions. The first is a conceptual reframing of the design task as the calibration of sonic variability to the perceptual affordances of the listener and the demands of the clinical context. The second is a practical design platform inspired by professional audio mixing workflows, which imposes a structured and learnable signal-flow architecture on the interaction design process and enables rapid iterative exploration. The third is a user-centered development methodology adapted from healthcare intervention science, which grounds design decisions in engagement with the clinicians and patients who will use the resulting systems. The HearWalk biofeedback system for hemiparetic gait rehabilitation illustrates the framework, and the chapter concludes by examining where large language models and AI tools can meaningfully assist each stage of this design process, as well as where human clinical and perceptual expertise remains irreplaceable.

[HC-1] Demographic Patterns in Cybersecurity Culture: Insights from a Global Organisation Supporting Safety-Critical and Critical Infrastructure Sectors

Link: https://arxiv.org/abs/2606.14462
Authors: Tita Alissa Bach,Amandine Kaiser
Categories: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments:

Abstract:This study investigates demographic differences in cybersecurity culture (CSC) in a large global organisation supporting safety-critical and critical infrastructure sectors to target CSC improvement. A global survey was administered to all 21148 internal and external employees, with 6502 responses. The questionnaire evaluates nine CSC dimensions such as Password Management, Governance, and Email Use. Anonymous survey responses were analysed using Kruskal-Wallis tests and Dunn's post hoc comparisons to identify differences across demographic variables including employment, recruitment paths, managerial role, gender, age, tenure, and work base. CSC was broadly consistent across the organisation, with statistically significant but small to moderate demographic effects. CSC variations were observed across employment, age, recruitment paths, and line managerial role. In general, full-time, internal, permanent, older employees, Merge and Acquisition recruits, and line managers consistently scored higher across multiple CSC dimensions. Part-time, younger, external employees, and those with 6 to 20 years of tenure in general scored lower. These patterns highlight higher-scoring groups that may act as CSC carriers and lower-scoring groups that may benefit from tailored improvement measures, enabling organisational learning. Our study offers a practical, scalable way to assess CSC, generating meaningful insights despite industrial constraints. It enables organisations to benchmark maturity, identify gaps, and prioritise targeted improvements using workforce diversity as a guide.

[HC-2] A Computational Audit of Demographic Association Encoding in ClinicalBERT Language Predictions

Link: https://arxiv.org/abs/2606.14460
Authors: Kehinde Temitayo Soetan
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: 17 pages, 4 tables, appendices A-E, preprint

Abstract:Transformer-based clinical language models are increasingly integrated into high-stakes clinical decision support pipelines, yet the computational mechanisms through which demographic associations encoded in medical documentation propagate into model probability distributions remain empirically underspecified. We present a systematic computational audit of representational bias in ClinicalBERT (Alsentzer et al., 2019), a BERT-based model pretrained on MIMIC-III discharge summaries, employing two complementary probing methodologies: Log Probability Bias Analysis (LPBA), which quantifies demographic descriptor-induced shifts in masked token probability distributions across behavioral and evaluative semantic categories, and Masked Language Model-based analysis (MLM), which probes internal representational structure for demographic agency attribution encoding across 98 real clinical sentence templates and eight intersectional race-gender combinations. Corpus frequency analysis operationalizes the distinction between statistical disparity and bias amplification by benchmarking model outputs against empirical term frequencies in the MIMIC-III training corpus. Of 32 statistically significant findings, 65.6% contradict observed corpus distributions, rising to 80% for Black patients and 87.5% for agency attribution under MLM probing, providing direct empirical evidence that representational bias in ClinicalBERT operates predominantly through model-internal amplification rather than training data inheritance. Keywords: natural language processing, clinical documentation, algorithmic auditing, representational bias, health equity

[HC-3] tap: A File-Based Protocol for Heterogeneous LLM Agent Collaboration

Link: https://arxiv.org/abs/2606.14445
Authors: Minseo Kim
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Accepted to KCC 2026. English archival translation. 3 pages, 1 figure, 3 tables

Abstract:Existing multi-agent software development systems have proposed many forms of agent collaboration, including role-based collaboration and automated code review. However, many systems assume a common runtime, a central conversation server, or the same API family. Under these assumptions, LLM agents from different vendors cannot easily exchange messages directly from their own execution environments while dividing development and review work on a shared codebase. This paper presents tap, a file-based collaboration protocol that allows Claude (Anthropic) and Codex (OpenAI) to collaborate on one codebase without shared memory or an identical runtime. The core of tap is a file-first design that preserves markdown files with metadata as original messages, combines a file inspection path (file communication, Tier 1) with real-time notification paths for Claude and Codex (real-time communication, Tier 2), and isolates work through separate git worktrees. Even if real-time notification fails or a receiver restarts, the message file remains available and the same content can be inspected again. In a 27-day, 37-generation self-applied operation where tap was used to develop and review itself, we collected 209 tap-related pull requests and 717 operational artifacts. An analysis of 375 review artifacts showed that the share of reviews recording at least one defect or requested change was 69.8% for heterogeneous model pairs and 53.1% for homogeneous model pairs. These results show that tap, which combines file-based message preservation with real-time notification, operates in a real production repository, and that combining heterogeneous models and execution environments can broaden review perspectives. tap is distributed as the open-source npm package @hua-labs/tap (v0.5.2).

[HC-4] ForestBack: Breadcrumb-Based Pedestrian Dead Reckoning for Infrastructure-Free Return Navigation

Link: https://arxiv.org/abs/2606.14421
Authors: Aueaphum Aueawatthanaphisut,Chanakan Chaipan
Categories: Robotics (cs.RO); Human-Computer Interaction (cs.HC); Signal Processing (eess.SP)
Comments: 9 pages, 6 figures, 1 table, and 19 equations

Abstract:Reliable return navigation remains an important challenge in GPS-denied environments where external positioning infrastructure may be unavailable or unreliable. This paper presents ForestBack, an infrastructure-free pedestrian return navigation framework based on breadcrumb-based pedestrian dead reckoning (PDR). The system records a user’s walking route as a sequence of reversible breadcrumb nodes and generates reverse-path guidance without requiring GPS, Wi-Fi, Bluetooth beacons, or pre-installed infrastructure. ForestBack integrates acceleration-based step detection, adaptive step-length estimation, magnetometer-assisted heading estimation, barometric-altitude correction, and bidirectional breadcrumb path reconstruction. The system was evaluated using an indoor obstacle-avoidance route with five checkpoints, where the user navigated around a central obstacle. A dataset of 36 walking trials and 42,474 time-series samples was used for evaluation, including IMU signals, magnetometer readings, barometric variables, turn-event labels, ground-truth trajectories, baseline PDR outputs, proposed ForestBack outputs, and power-related measurements. Experimental results show that ForestBack reduced the mean RMSE from 1.129 m to 0.965 m compared with traditional PDR, corresponding to a 15.76% improvement. The mean final-position error was reduced from 1.781 m to 1.388 m, while turn-event detection consistency reached approximately 99.90%. These results indicate that ForestBack improves trajectory reconstruction and route-preserving return guidance in obstacle-avoidance scenarios. The released dataset and analysis notebook support reproducibility and future benchmarking of infrastructure-free PDR-based return navigation systems.
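
The breadcrumb idea in the abstract can be illustrated with plain dead reckoning. The sketch below is an assumption-laden simplification, not the ForestBack implementation: it takes ideal (step length, heading) pairs as input and omits the step detection, adaptive step-length estimation, magnetometer fusion, and barometric correction the abstract lists. The function names are hypothetical.

```python
import math


def record_breadcrumbs(steps, start=(0.0, 0.0)):
    """Dead-reckon positions from (step_length_m, heading_rad) pairs.

    Returns the list of breadcrumb positions, beginning with the start point.
    """
    x, y = start
    crumbs = [(x, y)]
    for length, heading in steps:
        x += length * math.cos(heading)
        y += length * math.sin(heading)
        crumbs.append((x, y))
    return crumbs


def return_guidance(crumbs):
    """Traverse breadcrumbs in reverse order.

    Returns (distance_m, heading_rad) legs that lead from the last breadcrumb
    back to the first one along the recorded route.
    """
    legs = []
    for (x1, y1), (x0, y0) in zip(crumbs[::-1], crumbs[-2::-1]):
        legs.append((math.hypot(x0 - x1, y0 - y1), math.atan2(y0 - y1, x0 - x1)))
    return legs


# Two 0.7 m steps: one east, then one north.
outbound = [(0.7, 0.0), (0.7, math.pi / 2)]
crumbs = record_breadcrumbs(outbound)
for distance, heading in return_guidance(crumbs):
    print(f"walk {distance:.2f} m at heading {math.degrees(heading):.0f} deg")
```

Reversing the stored route rather than heading straight for the origin keeps the return path on ground the user has already walked, which is the point of route-preserving guidance around obstacles.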

[HC-5] Fabula: Building a Narrative Storytelling Sidekick with the Writers Community

Link: https://arxiv.org/abs/2606.14411
Authors: Piotr Mirowski,Ben Wedin,Reinald Kim Amplayo,Rich Galt,Duncan Williams,Rida Qadri,Jaume Sanchez-Elias,Erin Drake-Kajioka,Sian Gooding,Lucia Lopez-Rivilla,Joao G. M. Araujo,Lion Schulz,Satinder Baveja,Shakir Mohamed,Edward Grefenstette,Laura Rimell,Richard Evans
Categories: Human-Computer Interaction (cs.HC)
Comments: 41 pages, 10 figures

Abstract:We design and evaluate Fabula, an interactive app for fiction writers. Fabula uses detailed narrative plans informed by general narratological theory. Stories are structured hierarchically into scenes and beats that can be (re)generated and revised at script and story plan level. Using participatory AI, we critically evaluate and improve Fabula with casual and published writers, via design interviews and writing sessions with 42 experts, and large-scale internal and external testing. We interrogate our design choices: (1) whether a language model-based auto-evaluator, optimized on human experts’ preferences, can improve story quality, (2) whether users want UI that exposes the detailed narrative plan alongside the story script, (3) to what extent our narratology assumptions fit localised storytelling traditions and serve screenwriters or playwrights, and (4) whether convergent iteration over the story plan supports writers’ creativity. Building on critical feedback and concerns, we use Fabula as a cultural probe in adversarial design, and identify potentials for writing feedback and for interactive storytelling.

[HC-6] Friction in AI-Assisted Clinical Decision-Making: A Case Study on The Role of Questions and What-if Scenarios

Link: https://arxiv.org/abs/2606.14406
Authors: Simon WS Fischer,Hanna Schraffenberger,Miranda L. van Hooff,Serge Thill,Pim Haselager
Categories: Human-Computer Interaction (cs.HC)
Comments:

Abstract:Clinical decision-making is augmented by decision-support systems (DSSs). To counter overreliance on DSSs, several methods have been proposed that create friction in order to promote cognitive engagement and reflection. In this paper, we investigate how two such forms of friction, namely data-driven questions and 'what-if' analysis, are perceived by medical experts. For a real-world decision task, we replicated a DSS used in clinical practice and gathered clinicians' feedback on a prototype through in-situ interviews (n=7). Our findings suggest that while the questions were perceived as unhelpful for reflective thinking, they could serve as reminders to consider relevant information. Furthermore, inspecting 'what-if' hypotheticals was found useful for potentially improving patient care. Clinicians saw our prototype as a promising training tool for novice clinicians. From the clinicians' feedback, we make recommendations for designing friction in work practices. Our work contributes to human-AI interaction research, which aims to encourage reflection to mitigate AI overreliance.

[HC-7] Thinking Outside the [Chat]Box: Bridging Computer Science and Industrial Design for Cognitive-Inclusive Generative AI

Link: https://arxiv.org/abs/2606.14306
Authors: Virginia Francisco,Daniel Guasch,Raquel Hervás
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Abstract:Current Generative AI (GenAI) interfaces remain largely constrained to chatbox interaction, which can impose high cognitive demands on users and create substantial barriers for people with intellectual disabilities (ID), including prompt formulation difficulties, response overload, and limited mechanisms to assess information reliability. To explore alternative interaction models for cognitive accessibility, we conducted a cross-disciplinary co-design challenge in which two student cohorts (Computer Science and Industrial Design) developed interface concepts from the same set of functional requirements (e.g., prompt scaffolding, structured output, GUI-based refinement, transparency, and personalization). Comparing the resulting proposals reveals both convergence on foundational requirements (notably initial calibration, proactive prompting, and direct manipulation of response fragments) and complementary contributions that outline a multi-layered support system. Computer Science teams primarily produced structural scaffolding, emphasizing predictability, navigability, and trust through mechanisms such as reliability indicators, explicit sources, and context management for long conversations. Industrial Design teams emphasized experiential scaffolding, focusing on pacing, attention guidance, multimodality, and proactive agency, including step-by-step response flows, focus modes, and assistant-like integrations. We synthesize these findings into a dual-layer scaffolding framework that expands the design space for cognitively accessible GenAI interaction beyond chat-centric models and motivates future work on expert refinement, technical feasibility, and empirical validation with users with ID.

[HC-8] Visible Adoption, Untracked Contribution: GitHub Evidence of the Accountability Gap Across Three Cohorts of an HCI Prototyping Course

Link: https://arxiv.org/abs/2606.14054
Authors: Maria Teresa Parreira,Pranav Prabhat Sinha,Hauke Sandhaus,Wendy Ju
Categories: Human-Computer Interaction (cs.HC)
Comments:

Abstract:This paper presents a longitudinal, observational case study of how student GenAI adoption shifted across three cohorts (Fall 2022, 2023, and 2025) of the same graduate-level HCI prototyping course, using computational analysis of 203 GitHub repositories with student activity and 23,065 student commits. Building on a prior qualitative study of the 2023 cohort, we distinguish two levels of AI accountability trace: disclosure (naming that an AI tool was used) and attribution (crediting a specific artifact or task to an AI tool). We find that tool disclosure grew from 0% to 66% of repositories across the three cohorts, while explicit contribution attribution remains a minority practice, and the gap between the two reveals where accountability is missing even among students who disclose. By 2025, AI is infrastructure embedded in course templates and student-built devices: students increasingly name the tools they used, but rarely specify what those tools contributed. We argue that disclosure-based frameworks are insufficient for the vibe-coding era. The failure is not that students conceal AI use; it is that a norm built for episodic, identifiable acts cannot capture continuous, ambient co-creation. We offer this case study as grounding for the workshop’s conversation about what genuine co-thinking accountability looks like.

[HC-9] The Silent Cost of Artificial Intelligence Assistance: A Theory of Autonomy Surrender, the Recovery Mechanism, and the Restoration of Human Agency

Link: https://arxiv.org/abs/2606.13962
Authors: Ancuta Margondai,Julie Rader,Emma Rader,Sara Willox,Mustapha Mouloua
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 15 pages, 1 figure. Submitted version

Abstract:The integration of artificial intelligence into human decision-making environments has introduced a previously undertheorized cost: the gradual surrender of human autonomy in exchange for access to information and computational assistance. Building on the Human Identity and Autonomy Gap (HIAG) framework, this paper advances a theoretical model of autonomy surrender as a measurable, cumulative process driven by cognitive bandwidth depletion. The model proposes three interacting mechanisms: the silent cost of AI assistance, in which autonomy is transferred incrementally and without awareness; the surrender threshold, beyond which reclaiming autonomous function becomes cognitively and psychologically difficult; and the recovery mechanism, which establishes the design obligation and the ethical responsibility accompanying deliberate human re-assumption of control. The paper argues that human re-entry into the decision loop is not a passive option but an active cognitive event requiring intentional bandwidth restoration. The design of AI systems must incorporate structured re-entry pathways, here termed recovery mechanisms, that preserve human agency while appropriately distributing responsibility. The model further predicts a terminal state, here termed preference inversion, in which functional dependence on AI assistance is experienced not as a deficit but as a preference, transforming the restoration of autonomy from a design problem into a cultural and political one. Implications are drawn for AI system design, governance frameworks, and human factors research.

[HC-10] SpheriCity: Designing Trustworthy Conversational AI for Sustainability Decision Support

Link: https://arxiv.org/abs/2606.13854
Authors: Ahmed Qayyum,Madison Werner,Kathryn Youngblood,Jenna R. Jambeck,Tahiya Chowdhury
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Accepted to ACM SIGCAS/SIGCHI Conference on Computing and Sustainable Societies (COMPASS '26)

Abstract:We present SpheriCity, an expert-grounded conversational prototype designed to support trustworthy knowledge sensemaking from sustainability reports. City-level circularity assessment reports contain rich information about materials, infrastructure, and policy interventions, yet their length and heterogeneous structure make cross-document synthesis and comparison difficult for practitioners and researchers working on circular economy initiatives. While large language models (LLM) promise faster knowledge access and synthesis, their opaque reasoning, hallucinations, and lack of source transparency introduce risks for trust and interpretability, and require verification in high-stakes sustainability contexts. SpheriCity addresses these challenges through a provenance-first conversational agent that foregrounds evidence traceability, structured synthesis, and interaction scaffolds to support exploratory querying and cross-document synthesis across sustainability reports. We conducted a formative expert review with six sustainability experts using representative queries spanning cross-city comparison, policy summarization, and recommendation-oriented tasks. Experts evaluated responses across dimensions and provided qualitative reflections on the system’s usefulness for sustainability knowledge work. Our results reveal that transparent sourcing, contextual explanation, interpretability, and alignment with expert workflow strongly shape expert trust and judgments of system usefulness. This work contributes (1) a conversational prototype for sustainability knowledge sensemaking, (2) an expert-grounded evaluation framework for assessing AI responses in high-stakes knowledge domains, and (3) design insights into how provenance, uncertainty communication, and integration in workflow influence expert users’ trust in AI assistance for sustainability decision support.

[HC-11] Rethinking the UI of GenUI: A Tale of Two Designs

Link: https://arxiv.org/abs/2606.13843
Authors: Xiang 'Anthony' Chen,Savvas Dimitrios Petridis,Tian Deng,Humad Bari,Ruofei Du,Yang Li
Categories: Human-Computer Interaction (cs.HC)
Comments:

Abstract:GenUI is an emergent class of AI tools that use large models to generate UI mock-ups based on users’ high-level descriptions, promising to democratize UX design exploration for a broader audience. Most GenUI designs to date tend to inherit the conventions of conversational large models, such as ChatGPT and Gemini, where a user describes their design needs primarily via an unstructured prompt, and the tool then takes a depth-first approach, delving into the design right away and producing a high-fidelity prototype. In this research, we rethink how well this unstructured, depth-first, and high-fidelity GenUI design can support early-stage, 0-to-1 design exploration. To probe this question, we propose a contrastive design with structured input, breadth-first exploration, and low-fidelity generation. We then conducted a comparison study with 24 UX designers and product managers who conducted mini design exploration exercises using an existing GenUI tool and our contrastive GenUI tool. Findings reveal participants’ perceived benefits and trade-offs of the two GenUI designs: structured input surfaces key facets but requires more work, raising entry barriers to start exploration; breadth-first workflow reveals more possibilities, but previewing UX ideas spanning many screens remains hard; and though low fidelity has value, professionals favor high fidelity because it fits practice and GenAI heightens fidelity expectations. We conclude with design implications for GenUI and similar AI-powered creativity support tools.

[HC-12] A Benchmark and Framework for Evaluating Next Action Predictions in Spreadsheets ICML2026

Link: https://arxiv.org/abs/2606.13802
Authors: Tejas Agrawal,Vu Le,Sumit Gulwani,Gust Verbruggen
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: Accepted at ICML 2026. Code and benchmark: this https URL

Abstract:Predictive code completion greatly accelerates how quickly developers work. In spreadsheets, despite being much more common, such auto-completion features are virtually non-existent. To address this gap, we introduce a benchmark for systems that observe a sequence of user actions in a spreadsheet and predict future actions. Two challenges are (1) the absence of edit histories in public spreadsheet corpora and (2) the complex space of spreadsheet actions (spatial, temporal, composite). To address (1), we manually curate 52 sequences of 12K actions that recreate spreadsheets from public corpora, seeded by parametrized heuristics and LLM refinement. To address (2), we propose an online evaluation that expects a prediction after each user action, accepts or rejects that prediction, updates the future actions upon acceptance, and repeats this until the target spreadsheet is obtained. We use multiple baseline predictors (including zero-shot LLMs, fine-tuned SLMs, and classical models) and analyze different properties that our benchmark teaches us, including but not limited to: properties of saved actions and false positives, efficiency, effect of user profiles, effect of triggers, and effect of context.
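
The online evaluation loop described in the abstract (predict after each user action, accept or reject, skip ahead on acceptance, repeat until the target is reached) can be sketched roughly as below. This is a toy under stated assumptions, not the benchmark's code: actions are plain strings, a prediction is accepted only when it exactly matches the next remaining actions, and the names `online_evaluate` and `toy_predictor` are hypothetical.

```python
def online_evaluate(target_actions, predictor):
    """Replay target_actions, querying the predictor after every user action.

    A prediction (a list of actions) is accepted only if it exactly matches the
    next remaining actions; this acceptance rule is a simplifying assumption.
    Returns (user_actions_performed, actions_saved, rejected_predictions).
    """
    history, remaining = [], list(target_actions)
    performed = saved = rejected = 0
    while remaining:
        history.append(remaining.pop(0))  # the user performs one action
        performed += 1
        prediction = predictor(history)
        if not prediction:
            continue  # the predictor chose not to suggest anything
        if remaining[:len(prediction)] == list(prediction):
            history.extend(prediction)  # accepted: the predicted actions are applied
            del remaining[:len(prediction)]
            saved += len(prediction)
        else:
            rejected += 1  # a false positive shown to the user
    return performed, saved, rejected


def toy_predictor(history):
    """Hypothetical predictor: continues a number series, misfires once."""
    if history[-1] == "A1=1":
        return ["A2=2", "A3=3"]
    if history[-1] == "B1=x":
        return ["B9=z"]
    return None


target = ["A1=1", "A2=2", "A3=3", "B1=x", "B2=y"]
print(online_evaluate(target, toy_predictor))
```

Counting saved actions and rejected predictions separately mirrors the abstract's distinction between saved actions and false positives.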

[HC-13] The Frustrometer: Detecting User Frustration in Data Visualization Tasks using Biomarkers and Interaction Patterns

Link: https://arxiv.org/abs/2606.13687
Authors: Johannes Ellemose,Sophia Wanner,Djordje Slijepčević,Laura Cesar,Vanessa Leung,Wolfgang Aigner,Niklas Elmqvist
Categories: Human-Computer Interaction (cs.HC)
Comments: 9 pages, 8 figures, 3 page appendix

Abstract:Visualization research has largely solved how to help a stuck or frustrated user – through interactive onboarding, contextual help, and active guidance. The unsolved problem is when: trigger help too eagerly and you break the user's train of thought; wait too long and they have already gone astray. We present the Frustrometer, a series of experiments to predict user stuckness and frustration by fusing physiological and interaction signals. The Frustrometer consists of a convolutional neural network classifier that estimates in real time whether users are stuck in their task. We collected data from a controlled study where 14 participants performed analytical tasks on two interactive visualization dashboards while we captured eye movement, pupil dilation, galvanic skin response, heart rate, head orientation, mouse dynamics, and keyboard events. In addition, participants assessed their own performance, while we annotated when during the tasks the participants were stuck. Our results reveal that autonomous physiological responses such as heart rate and galvanic skin response provide limited insights into the frustration level of the user. Similarly, head orientations are not easily correlated with the frustrations felt by the user during visual analysis tasks. Mouse movements and gaze data conversely carry the majority of predictive signal, with mouse movements alone having a strong correlation for some participants, suggesting that lightweight instrumentation may suffice for real-time frustration detection. We end the paper by discussing how these findings can inform the design of adaptive guidance systems for complex visualization tasks that take a multimodal approach to frustration and stuckness detection.

Computer Vision

[CV-0] OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains

Link: https://arxiv.org/abs/2606.14702
Authors: Xinyue Cai,Chaoyou Fu,Yi-Fan Zhang,Ran He,Caifeng Shan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Current automated pipelines for audio-visual Question Answering (QA) generally adopt a "video-caption-QA" paradigm. However, these methods typically segment videos into short clips and generate separate descriptions for audio and visual modalities. This decoupled processing severs inherent associations between sounds and their visual sources, while independent clip processing often causes inconsistent descriptions of the same entity across segments. Furthermore, coupling long-text comprehension and QA synthesis into a single step often restricts models to localized events, yielding questions lacking long-term temporal connections and deep cross-modal reasoning. To address these issues, we propose an automated data engine featuring two mechanisms: (1) Entity-Anchored Video Scripting transforms videos into structured scripts, comprising summaries, main entity lists, and segment-wise audio-visual descriptions. The entity list serves as a global prior to ensure cross-segment referential consistency and reconstruct audio-visual associations. (2) Clue-Guided QA Generation prompts models to first mine cross-segment, multimodal clues from the script, and subsequently generate QA pairs based on these high-value clues. Leveraging this pipeline, we construct the instruction-tuning dataset OmniVideo-100K and a human-verified test set, OmniVideo-Test. Fine-tuning VITA-1.5, Qwen2.5-Omni-7B and Qwen3-Omni-30B on OmniVideo-100K yields performance gains of up to 20.59% on OmniVideo-Test, demonstrating strong generalization (up to 12.64% improvements) across established benchmarks like Daily-Omni and JointAVBench.

[CV-1] RATS! Patches Talk Through Registers: Emergent Parts in Register Attention Transformers

Link: https://arxiv.org/abs/2606.14701
Authors: Timing Yang, Predrag Neskovic, Jansen Seheult, Wenchao Han, Anand Bhattad, Alan Yuille, Feng Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:When humans see a bird, they recognize far more than just “bird” – they see a head, wings, and talons, a structured assembly of reusable parts that can be identified across every bird they have ever seen. We ask whether a self-supervised visual model can discover the same compositional structure on its own. To this end, we propose RATS (Register Attention Transformers), which decomposes the classification token into N learnable register tokens that route patch information through an L-N-N-L bottleneck via a three-step compress-communicate-broadcast attention. The N registers are partitioned across the H attention heads, so that registers assigned to different heads do not interact with each other. Without auxiliary losses or part annotations, each register spontaneously specializes into a proto-semantic region whose emerging structure resembles object parts. RATS surpasses all baselines by +12 mIoU on average across five segmentation benchmarks, with consistent gains on ADE20K (+1.11 mIoU) and COCO (+0.2 AP^m). Its register dictionary further exhibits part-level consistency and semantic proximity across related categories. Our results suggest that RATS may provide a useful architectural prior for structured and interpretable visual representation learning.
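
The L-N-N-L routing described in the abstract can be pictured with a small NumPy sketch. This is an illustrative reading, not the authors' implementation: it assumes a single head, omits learned query/key/value projections and the per-head register partition, and the sizes `L`, `N`, `D` and the helper `attend` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys, values):
    # Plain scaled dot-product attention (no learned projections).
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    return softmax(scores) @ values

rng = np.random.default_rng(0)
L, N, D = 196, 8, 32                      # patches, registers, channels (assumed)
patches = rng.standard_normal((L, D))
registers = rng.standard_normal((N, D))   # stand-in for learnable register tokens

# Step 1, compress: N registers read from L patches (an N x L attention map).
compressed = attend(registers, patches, patches)
# Step 2, communicate: registers exchange information (N x N).
communicated = attend(compressed, compressed, compressed)
# Step 3, broadcast: L patches read back from N registers (L x N).
patch_to_register = softmax(patches @ communicated.T / np.sqrt(D))
updated_patches = patch_to_register @ communicated

# Each patch's strongest register gives a crude part-like assignment.
assignment = patch_to_register.argmax(axis=-1)
```

Because patches only interact through the N registers, the attention maps cost on the order of L·N rather than L·L, and the L x N broadcast map is the natural place to look for part-like regions.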

[CV-2] RepFusion: Leveraging Multimodal Priors for Denoising in Representation Space

Link: https://arxiv.org/abs/2606.14700
Authors: Xichen Pan, Aashu Singh, Satya Narayan Shukla, Xiangjun Fan, Shlok Kumar Mishra, Saining Xie
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Abstract:Large language models (LLMs) are widely used in text-to-image (T2I) systems, but they are typically limited to text encoding, while denoising is handled by newly trained generative backbones. The emergence of representation autoencoders (RAEs) shifts the generation target toward semantically structured visual representations, creating a latent space that is more compatible with pretrained LLM priors. Inspired by multimodal LLMs (MLLMs), where an MLP projector is sufficient to align clean visual representations with a pretrained LLM, we repurpose the MLLM itself as a noisy representation encoder, extending this mechanism from clean to noisy inputs. We present RepFusion, which uses the resulting MLLM outputs as the conditioning signal for a diffusion transformer. In controlled comparisons at similar inference budgets, RepFusion outperforms baselines that devote comparable capacity to newly initialized denoisers. These results demonstrate that MLLMs provide strong priors for denoising visual representations and that, by conditioning on evolving noisy representations, test-time compute can be productively spent on repeated MLLM conditioning in modern T2I systems.

[CV-3] Instruct-Particulate: Scaling Feed-Forward 3D Object Articulation with Kinematic Control

Link: https://arxiv.org/abs/2606.14699
Authors: Ruining Li, Yuxin Yao, Matt Zhou, Chuanxia Zheng, Christian Rupprecht, Joan Lasenby, Shangzhe Wu, Andrea Vedaldi
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)
Comments: Project page: this https URL

Abstract:Reconstructing articulated 3D objects is important for animation, gaming, and robotic simulations. Recent neural networks can estimate the articulated structure of 3D objects, but their generalization remains limited by the scarcity of annotated data for this task. To address this gap, we introduce Instruct-Particulate, a model that takes a 3D mesh together with a target kinematic specification, including part descriptions, connectivity, joint types, and optional point prompts, and predicts the corresponding kinematic part segmentation and joint motion parameters. The kinematic specification disambiguates the task and allows the model to target annotations of different granularity, thereby making it possible to use more abundant heterogeneous training data. At test time, the kinematic specification can be obtained automatically from large-scale vision-language models, so the model can be applied to any input mesh. To train our model at scale, we construct a heterogeneous dataset of more than 150,000 articulated 3D objects, extending existing publicly available collections with data obtained by partially labelling other 3D models (monolithic or already decomposed into parts) with kinematic labels by means of vision-language models. Experiments show that our model generalizes better across categories and to AI-generated meshes, enabling articulated asset reconstruction from real-world images via image-to-3D models.

[CV-4] CottonLeafVision: An Explainable and Robust Deep Learning Framework for Cotton Leaf Disease Classification

Link: https://arxiv.org/abs/2606.14686
Authors: Rafi Ahamed, Md. Abir Rahman, Tasnia Tarannum Roza, Munaia Jannat Easha, Md. Asif Khan, Sudeepta Mandal
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: This paper contains 11 figures and 4 tables. It was presented at the 18th IEEE International Conference on Computational Intelligence and Communication Networks (CICN) 2026

Abstract: Globally, cotton is a highly economically beneficial crop, as the textile industry depends heavily on it. Precise identification and detection of cotton leaf disease is therefore crucial for economic stability. The goal of "CottonLeafVision" is to accurately classify and detect cotton leaf disease. To this end, we evaluated multiple pretrained deep convolutional neural networks, including DenseNet201, InceptionV3, and VGG19, on a publicly available cotton leaf disease image dataset. The dataset includes seven classes (six disease classes and one healthy class), collected under various field conditions that reflect real-world challenges. Among the pretrained models, DenseNet201 achieved the highest classification accuracy, 98%. To enhance model reliability and interpretability, we applied Gradient-weighted Class Activation Mapping (Grad-CAM) and occlusion sensitivity analysis, and used adversarial training to increase the noise resistance of the model. Finally, we developed a prototype to apply the model's capabilities to real-life agriculture. This paper shows the deep learning model's ability to classify the disease in real-life cotton disease management situations.

[CV-5] HumP-KD: A Hybrid Uncertainty-Aware Multi-Stage Progressive Knowledge Distillation Framework for Efficient Fire Classification

Link: https://arxiv.org/abs/2606.14684
Authors: Mohammed Arif Mainuddin, Najifa Tabassum, Omar Ibne Shahid, Riasat Khan
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract: Real-time fire classification systems require models that are simultaneously accurate, computationally efficient, and deployable on resource-constrained hardware. This work proposes HumP-KD, a Hybrid Uncertainty-aware Multi-stage Progressive Knowledge Distillation framework for efficient fire classification. Two datasets, FlameVision and Dataset-II, containing 8,600 and 31,309 images, are used. Various CNN and transformer baselines are applied under standard preprocessing, online augmentation, Gaussian noise and motion blur robustness conditions. The proposed HumP-KD model distills knowledge from two frozen heterogeneous transformer teachers, Swin-Tiny and ViT-Base, along with their Meta-MLP ensemble, into a lightweight MobileViT-S student via three tightly integrated components. Hierarchical Progressive Knowledge Distillation employs a Hierarchical Feature Builder. It generates a fused spatial attention mask to guide distillation toward discriminative regions selectively. Multi-Stage Knowledge Distillation progressively activates three distillation stages across training. On Dataset-II, HumP-KD achieves a mean F1 score of 0.9876 ± 0.0063 across 10 independent trials, significantly outperforming the MobileViT-S baseline trained without distillation (0.9537 ± 0.0351), with statistical significance confirmed by both independent t-test (p = 0.0195) and Wilcoxon signed-rank test (W = 1, p = 0.0039). The proposed method also demonstrates strong generalization across datasets and robustness under degraded visual conditions. The student model retains only 4.94M parameters and 19.01Mb model size, representing a 5.7× parameter reduction over Swin-Tiny and a 17.5× reduction over ViT-Base, while achieving 37.72 CPU FPS, making it suitable for real-time deployment.

[CV-6] Memento: Reconstruct to Remember for Consistent Long Video Generation

Link: https://arxiv.org/abs/2606.14667
Authors: Xuan Wei, Longbin Ji, Guan Wang, Xiangrui Liu, Zhenyu Zhang, Shuohuan Wang, Yu Sun, Qingqi Hong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Long-form video generation requires recurring subjects to remain consistent across various shots, viewpoints, motions, and scene transitions. Existing temporal decomposition methods improve scalability by generating videos shot by shot. However, they mainly focus on optimizing plausible next-shot continuations without verifying whether the historical memory preserves identity-critical subject evidence. Consequently, as generation proceeds, recurring subjects may be diluted, overwritten, or forgotten. In this paper, we propose Memento, a subject-reconstruction-guided framework that treats subject preservation as an explicit identity grounding problem, based on the premise that a memory bank faithfully preserving a subject should support reconstructing that subject from memory alone. Specifically, Memento jointly trains autoregressive next-shot generation with memory-based subject reconstruction, recovering target appearances using historical memory and global story captions. To disentangle long-range subject evidence from short-range cues, Memento introduces a dual-query memory mechanism, where one query retrieves identity-relevant memory and the other selects short-context keyframes for coherent continuation. Additionally, a subject-aware cinematic data pipeline provides precise reconstruction supervision via consistent, pronoun-free subject descriptions. Experiments demonstrate that Memento achieves state-of-the-art performance in long-term subject consistency, cross-shot coherence, and visual quality.

[CV-7] Giving AI a Headache: Acoustic Adversarial Attacks to Computer Vision Applications

Link: https://arxiv.org/abs/2606.14658
Authors: Nicole Villavicencio-Garduño, Maksim Ekin Eren, Milo Prisbrey, Ben Migliori, Michael Teti
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 9 pages, 7 figures, SPIE Defense + Security

Abstract: Artificial Intelligence (AI) is increasingly used to automate a variety of real-world computer vision (CV) applications, such as autonomous vehicle control, facial recognition, and security cameras. Recent research has shown that acoustic vibration can induce real physical motion in cameras, interfering with their internal stabilization mechanisms. Because the motion falls outside the conditions the stabilization system was designed to handle, the system introduces artifacts into the frame, causing AI-based CV models to misclassify, miss targets, or hallucinate objects. Previous work used ultrasonic frequencies (>20 kHz), which limits attacks to short distances because of the attenuation exhibited by high frequencies. In this work, we investigate acoustic attacks using lower frequencies in the audible range (<20 kHz), and we further expand our analysis to include how various image and object features are affected by the attacks. Specifically, we performed physical experiments to demonstrate the viability of our attacks on an off-the-shelf object detection model (YOLO11) by resonating a commercially available camera with various frequencies. Based on our results, we provide insights into several factors that make an AI CV system more vulnerable to these attacks, which could help inform the development of future mitigation strategies.

Journal reference: Proc. SPIE 14046, Assurance and Security for AI-enabled Systems 2026, 1404609 (10 Jun 2026). Related DOI: https://doi.org/10.1117/12.3093699

[CV-8] HPSv3++: Scaling Reward Models Across the Full Spectrum of Diffusion Model Capabilities

Link: https://arxiv.org/abs/2606.14657
Authors: Yijun Liu, Jie Huang, Zeyue Xue, Yuming Li, Ruizhe He, Haoran Li, Shijia Ge, Siming Fu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Reward models guide text-to-image (T2I) systems toward outputs aligned with human preferences. However, typical reward models such as HPSv3 are trained on pre-annotated data from earlier T2I models, without accounting for shifts in quality discrimination that arise from evolving model capabilities and reinforcement learning (RL) iterations, which limits their broader applicability. In this work, we propose HPSv3++, a reward model framework that extends the HPSv3 model to varying T2I model capabilities and their RL iteration changes across the full capability-iteration spectrum. Specifically, we first introduce HPDv3++, a 212K dual-dimension preference dataset annotated for text fidelity and aesthetic quality using a recent high-capability (Qwen-Image) model with human supervision. We then propose a two-stage training framework. Stage 1 employs data-aware orthogonal gradient projection to incorporate diverse aesthetic perception from HPDv3++ while preserving the original effective human preference knowledge in HPSv3. Stage 2 further leverages unlabeled data from T2I models spanning different capability levels and RL iterations, and introduces a joint capability-iteration conditioned signal for the reward model together with a standard-deviation-driven unsupervised guidance mechanism, strengthening the reward model across the capability-iteration spectrum. HPSv3++ achieves state-of-the-art preference prediction, outperforming HPSv3 by 9.8% on HPDv3 and 5.5% on GenAI-Bench, while achieving 79.1%/88.1% on our proposed HPDv3++. When used for T2I RL training, it consistently improves GenEval scores across diverse T2I models, demonstrating its wide-range capabilities. The code is available at this https URL.
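
For readers unfamiliar with orthogonal gradient projection, the basic operation can be sketched as below. This shows only the generic projection step; the abstract does not say how the "data-aware" variant chooses its reference directions, so `grad_ref`, the function name, and the toy vectors are assumptions for illustration.

```python
import numpy as np

def project_orthogonal(grad_new, grad_ref, eps=1e-12):
    """Remove from grad_new its component along grad_ref.

    The result cannot (to first order) change the loss whose gradient
    is grad_ref, which is the usual argument for preserving old knowledge.
    """
    denom = float(grad_ref @ grad_ref)
    if denom < eps:
        # A zero reference gradient constrains nothing.
        return grad_new.copy()
    return grad_new - (grad_new @ grad_ref) / denom * grad_ref

# Toy example: the reference direction is the first coordinate axis.
g_ref = np.array([1.0, 0.0, 0.0])    # gradient tied to existing preference knowledge
g_new = np.array([2.0, 3.0, -1.0])   # gradient from the new preference data
g_proj = project_orthogonal(g_new, g_ref)
```

Here `g_proj` keeps the second and third components of `g_new` and drops the first, so an update along it leaves the reference objective unchanged to first order.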

[CV-9] Improving Lunar Topography with Deep Learning Schrödinger Bridges

Link: https://arxiv.org/abs/2606.14638
Authors: Matthew Repasky, Erwan Mazarico, Michael K. Barker, Stefano Bertone, Terence J. Sabaka, Yao Xie
Categories: Computer Vision and Pattern Recognition (cs.CV); Earth and Planetary Astrophysics (astro-ph.EP)
Comments:

Abstract:Increasing the resolution of planetary topography models can enable a better understanding of surface processes and geomorphology; however, existing analytical super-resolution methods are expensive and difficult to apply at large scales. Generative models provide the tools to learn complex relationships within data and can be applied at scale due to hardware accelerators and parallelization. We present a diffusion-based Schrödinger Bridge (SB) generative modeling approach for lunar topography super-resolution, connecting the distribution of low-resolution topography to that of high-resolution topography, incorporating physically-constraining optical imagery. Our approach is inspired by existing Shape-from-Shading methods, which improve a priori low-resolution topography by using optical images at the target resolution. We train SBs on a novel dataset of rendered lunar topography, emulating optical imagery from the Lunar Reconnaissance Orbiter Narrow Angle Camera. The result is a flexible approach for topography super-resolution which can provide pixel-level uncertainties in the reconstruction.

[CV-10] SED: Lightweight Saliency prediction for Event-based data via Distillation

Link: https://arxiv.org/abs/2606.14631
Authors: Romaric Mazna, Jean Martinet, Michele Magno
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Event-based saliency prediction has gained attention recently, as combining event cameras with saliency estimation can act as an upstream stage that naturally improves the efficiency of downstream event-based perception at the edge. However, current approaches are either neuromorphic, underperforming on event-based saliency benchmarks, or too heavy for resource-constrained edge applications due to their reliance on transformers or 3D convolutions. Drawing inspiration from efficient convolutional modules and aiming to exploit the temporal information in event data, we propose SED, a lightweight network trained through knowledge distillation and built on a Depthwise Spatio-Temporal Block (DSTconv), a factorization of the 3D depthwise separable convolution. Relative to its teacher, our model reduces the model size from 180 MB to 0.32 MB (562x) and the parameter count from 45M to 81k (554x), while matching or outperforming it on the N-DHF1K and N-UCF Sports datasets. Moreover, it generalizes strongly beyond its training distribution, transferring from synthetic to real event data where a model trained from scratch fails.
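
To see why factorizing a 3D depthwise separable convolution saves parameters, a rough count helps. The sketch below ignores biases and assumes one plausible factorization (a spatial depthwise kernel, a temporal depthwise kernel, then a pointwise mix); the abstract does not spell out the actual DSTconv layout, so the third function and the channel sizes are hypothetical.

```python
def conv3d_params(c_in, c_out, kt, kh, kw):
    # Dense 3D convolution: every output channel sees every input channel.
    return c_in * c_out * kt * kh * kw

def depthwise_separable_3d_params(c_in, c_out, kt, kh, kw):
    # One kt x kh x kw kernel per input channel, then a 1x1x1 pointwise mix.
    return c_in * kt * kh * kw + c_in * c_out

def factorized_spatiotemporal_params(c_in, c_out, kt, kh, kw):
    # Assumed factorization: 1 x kh x kw depthwise, kt x 1 x 1 depthwise, pointwise.
    return c_in * kh * kw + c_in * kt + c_in * c_out

c_in, c_out, k = 64, 64, 3
dense = conv3d_params(c_in, c_out, k, k, k)
separable = depthwise_separable_3d_params(c_in, c_out, k, k, k)
factorized = factorized_spatiotemporal_params(c_in, c_out, k, k, k)

# Size ratio implied by the figures quoted in the abstract.
size_ratio = 180 / 0.32
```

For this toy layer the dense convolution needs 110,592 weights, the depthwise separable version 5,824, and the assumed factorization 4,864. The size ratio of 180 MB to 0.32 MB works out to 562.5, in line with the reported 562x.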

[CV-11] StereoGeo: an end-to-end stereo camera calibration method

Link: https://arxiv.org/abs/2606.14619
Authors: Imane Meddour, Andréa Macario Barros, Cédric Gouy-Pailler
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 1 figure, accepted at the 34th European Signal Processing Conference (EUSIPCO 2026)

Abstract: In this work, we propose StereoGeo, an end-to-end network-based approach for stereo camera calibration. Our method estimates the focal lengths and gravity directions of the left and right cameras, as well as the relative extrinsic transformation relating them. Existing methods often rely on calibration patterns in structured environments or address only a single camera configuration, being limited to either intrinsic or extrinsic estimation and depending on multi-view setups. StereoGeo extends the GeoCalib algorithm, integrating deep neural network feature extraction with a differentiable optimizer. Extensive experiments on real-world benchmarks demonstrate that StereoGeo achieves competitive performance for intrinsic calibration and provides accurate stereo extrinsic estimation, outperforming existing methods that are limited to monocular settings. The dataset used in this work is partially publicly available at this https URL.

[CV-12] S2COPE: Self-Supervised Concept Discovery via Preference Learning

Link: https://arxiv.org/abs/2606.14586
Authors: Shilong Xiang, Zirui Zhang, Chengzhi Mao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Current representation learning paradigms force a fundamental compromise: self-supervised methods scale to massive datasets but yield opaque features, whereas interpretable models remain bottlenecked by the need for dense human annotation. We introduce Self-Supervised Concept discOvery via Preference lEarning (S2COPE), a label-free framework that resolves this dilemma. Instead of treating Vision-Large-Language Models (VLLMs) as static feature extractors, S2COPE leverages them as active participants in a self-supervised preference optimization loop. By autonomously hypothesizing, validating, and reinforcing candidate visual attributes directly from raw imagery, our framework discovers novel, structured concepts without a single label. Extensive experiments across natural, medical, and physics domains demonstrate that S2COPE successfully extracts domain-specific concepts where standard VLLMs often fail to generate them. By amortizing concept discovery directly into the VLLM backbone through our self-supervised preference objective, rather than relying on static generation and disjoint filtering, we achieve up to a 24-point absolute improvement in downstream top-1 classification accuracy on unseen data. Our work suggests that interpretability can emerge through a model's autonomous interaction with incidental visual structures, without any human supervision.

[CV-13] A Qualitative Review of GenAI-Based Methods for Data Generation and Augmentation in Industrial Computer Vision Applications

Link: https://arxiv.org/abs/2606.14578
Authors: Paul Koch, Paul Hofmann, Ferdinand Waßelewsky, Adem Karakurt, Andre Sérs, Jörg Krüger
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to Computing Conference 2026

Abstract: AI-driven computer vision applications require a profound database to ensure predictable behavior and performance. Such predictable behavior is especially important for industrial applications in gaining trust from users. However, such a database is not readily available in industrial applications, and its acquisition is not trivial either. Active learning methods can be applied to ramp up data within a project deployment to iteratively increase the database, and thus the application's predictability. Unfortunately, we observe that this often leads to a loss of user trust in the application, which is difficult to regain once lost. This leads to a "chicken-and-egg" dilemma in which neither the database nor the application is developed. In this work, we review state-of-the-art methods and approaches to further boost the database in the initial active data ramp-up phase. Here, we focus on recent advancements in GenAI-based data generation and augmentation methods and review their adaptability to an industrial computer vision classification use case. Although we observe a potential for automatic data ramp-up, we also see a domain mismatch between the source (training environment) and the target (industrial use case) regarding context defined in natural language and object characteristics.

[CV-14] NEST3D: A High-Resolution Multimodal Dataset of Sociable Weaver Tree Nests

Link: https://arxiv.org/abs/2606.14562
Authors: Constanza A. Molina Catricheo, Simon Boeder, Ting-Jia Guo, Giacomo May, Clément Berthelot, Devis Tuia, Friedrich Fedor Reinhard, Fabio Remondino, Benjamin Risse
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 14 pages, 4 figures. Dataset available at this https URL

Abstract:Sociable weaver nests function as complex ecological structures offering thermoregulatory microhabitats and sustaining diverse species; however, datasets used in prior studies lack fine-grained 3D structural detail. Producing usable and accurate 3D weaver nest data is challenging due to their irregular geometry and integration with complex host vegetation. We bridge this gap with an open-access, 1.4 TB multimodal drone dataset of 104 nest-bearing trees, comprising 27,945 RGB images, 111,780 multispectral images, approximately 781 million 3D points, and expert-annotated semantic segmentation labels. We benchmark semantic segmentation using KPConv, RandLA-Net, and Point Transformer V3, with PT-v3 achieving an mIoU of 86.35% on the test set. While the results demonstrate strong performance for transformer-based and point-wise methods, they also highlight architecture-dependent challenges, particularly for convolution-based approaches such as KPConv. By uniquely combining spectral, spatial, and structural information, the presented dataset advances 3D reconstruction, segmentation, and classification algorithms, enabling ecological applications from nest volume estimation to species conservation, and serves as a demanding benchmark that exposes architecture-dependent performance under extreme class imbalance.

[CV-15] Visual Quality Score Assessment of Large White Goods in Remanufacture with Multi-View Deformable-DETR

Link: https://arxiv.org/abs/2606.14556
Authors: Paul Koch, Vivek Chavan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to GCSM 2026

Abstract:Remanufacturing large white goods is essential for a circular economy, yet visual quality assessment remains a manual bottleneck for training and pricing. Conventional detection methods require extensive annotation and struggle with small defects in high-resolution multi-view data. We present a multi-view framework based on Deformable-DETR for automated quality scoring that aggregates information across redundant views to extract fine-grained features. To enhance robustness with limited labels, we employ self-supervised pretraining followed by supervised fine-tuning on expert-annotated scores. Additionally, a linear projection over frozen feature maps identifies regions of interest to explain model decisions. Evaluated on an industrial multi-view dataset, our approach delivers precise quality assessments while reducing reliance on manual annotation and per-part customization, enabling scalable and transparent inspection for remanufacturing lines.

[CV-16] Rethinking Global Average Pooling: Your Classifier Is Secretly a Multi-Instance Learner

Link: https://arxiv.org/abs/2606.14555
Authors: Aray Karjauv
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Modern image classifiers widely adopt global average pooling (GAP) followed by a linear classification head. This linearity ensures that the image-level logits equal the average of logits obtained by applying the classification head pointwise to the feature grid prior to GAP. Consequently, standard classifiers may inherently retain spatial class evidence that remains recoverable even when the image-level prediction is incorrect. This structure naturally suggests a multiple-instance learning (MIL) interpretation, where an image is viewed as a bag of spatial instances. Within this formulation, we demonstrate that standard classifiers trained with a single label per image can still learn the intended classification task in multi-object scenes. We further exploit this property to decompose image-level logits into a prediction grid, providing a post-hoc diagnostic to extract spatial class evidence that GAP otherwise obscures. Our systematic evaluation reveals that off-the-shelf models consistently recover the ground-truth class within foreground regions. The MIL interpretation further suggests that common classifier failures reflect known limitations of mean aggregation.
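
The linearity argument in this abstract is easy to check numerically. The sketch below uses random features and a random linear head, with hypothetical sizes, to confirm that pooling then classifying gives the same logits as classifying each location and then averaging. It illustrates the identity only and is not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a 7x7 feature grid, 64 channels, 10 classes.
H, W, C, K = 7, 7, 64, 10
features = rng.standard_normal((H, W, C))   # feature grid before pooling
weight = rng.standard_normal((C, K))        # linear classification head
bias = rng.standard_normal(K)

# Standard path: global average pooling, then the linear head.
pooled = features.mean(axis=(0, 1))          # shape (C,)
image_logits = pooled @ weight + bias        # shape (K,)

# Pointwise path: apply the same head at every location, then average.
prediction_grid = features @ weight + bias   # shape (H, W, K)
mean_of_grid = prediction_grid.mean(axis=(0, 1))

# The two agree up to floating-point error because the head is affine.
assert np.allclose(image_logits, mean_of_grid)

# Per-location class map: the spatial evidence that pooling averages away.
class_map = prediction_grid.argmax(axis=-1)  # shape (H, W)
```

Viewing `prediction_grid` as a bag of H·W instance predictions is what makes the multiple-instance reading natural: the image-level logit is simply the mean over the bag.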

[CV-17] A Lightweight Fiducial-Based Pipeline for 3D Hyperspectral Mapping of ex-vivo Lumpectomy Specimens

Link: https://arxiv.org/abs/2606.14534
Authors: Anna Bicchi, Alberto Rota, Leonardo Passoni, Nicola Ancellotti, Andrea Peroni, Lorenzo Vinco, Dario Polli, Elena De Momi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Hyperspectral Imaging (HSI) is a promising modality for intraoperative assessment of resection margins in Breast-Conserving Surgery (BCS), but its clinical translation requires aligning the inherently 2D spectral information onto the 3D shape of the excised tissue so that suspicious regions can be precisely localized for targeted follow-up. We present a fully automated, calibration-free pipeline that produces a 3D hyperspectral point cloud of an ex-vivo lumpectomy specimen from a set of consumer-camera RGB images and a single top-down HSI acquisition. The 3D geometry is reconstructed with a deep-learning Structure-from-Motion backbone, stabilized in a metric reference frame by a custom bundle adjustment that enforces consistency on the corners of four ArUco markers placed around the specimen. The HSI cube is then registered to the reconstruction without recovering the HSI camera pose: the markers, visible in both modalities, define 16 corner correspondences that drive a planar homography, and 3D coordinates are recovered by lookup on an orthographically rendered depth map. Evaluated on two ex-vivo lumpectomy specimens, the pipeline achieves a median 3D registration error below 1 mm and a 2D reprojection error below 0.02 mm, with a total per-specimen processing time under 4 minutes on accelerated hardware. These results support the feasibility of integrating HSI-guided spatial localization into intraoperative margin assessment workflows for breast-conserving surgery.
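
The registration step (16 marker corners driving a planar homography, followed by a depth lookup) can be sketched with a plain direct linear transform. Everything concrete here is an assumption for illustration: the marker positions, the synthetic ground-truth homography, the constant depth map, and the pixel size are made up, and the authors' pipeline may estimate the homography differently.

```python
import numpy as np

def fit_homography(src, dst):
    """Estimate the 3x3 matrix H with dst ~ H @ src (direct linear transform)."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1.0, 0.0, 0.0, 0.0, u * x, u * y, u])
        rows.append([0.0, 0.0, 0.0, -x, -y, -1.0, v * x, v * y, v])
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(np.asarray(rows))
    h = vt[-1].reshape(3, 3)
    return h / h[2, 2]

def apply_homography(h, pts):
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = pts_h @ h.T
    return mapped[:, :2] / mapped[:, 2:3]

# Four hypothetical markers with four corners each: 16 correspondences.
marker_origins = np.array([[10, 10], [200, 15], [205, 180], [12, 190]], dtype=float)
corner_offsets = np.array([[0, 0], [20, 0], [20, 20], [0, 20]], dtype=float)
hsi_corners = (marker_origins[:, None, :] + corner_offsets[None, :, :]).reshape(-1, 2)

# Synthetic ground truth standing in for the HSI-to-rendered-view mapping.
true_h = np.array([[1.1, 0.02, 5.0], [-0.03, 0.95, -4.0], [1e-4, -2e-4, 1.0]])
render_corners = apply_homography(true_h, hsi_corners)
h_est = fit_homography(hsi_corners, render_corners)

# Map one HSI pixel into the orthographic render and read its depth.
depth_map = np.full((300, 300), 50.0)        # hypothetical depth in mm
pixel_size_mm = 0.1                          # hypothetical render resolution
u, v = apply_homography(h_est, np.array([[100.0, 100.0]]))[0]
z = depth_map[int(round(v)), int(round(u))]  # row index is v, column index is u
point_3d = np.array([u * pixel_size_mm, v * pixel_size_mm, z])
```

Because the render is orthographic, pixel coordinates scale directly to metric x and y, so no HSI camera pose is needed, which matches the calibration-free claim in the abstract.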

[CV-18] Scratched Lenses Shifted Depth: Passive Camera-Side Optical Attacks

Link: https://arxiv.org/abs/2606.14504
Authors: Qinlin He, Zeming Zhuang, Yongji Wu, Lan Zhang, Xiaoyong (Brian) Yuan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Physical adversarial attacks on vision systems are typically studied through scene manipulation, such as adversarial patches or projections, where the adversary controls what the camera observes. Camera-side attacks using stickers or auxiliary optics have also been explored, but they treat attacks as image-space perturbations from designed patterns. This misses how physical imperfections interact with scene-dependent lighting and optics. We identify a threat: passive lens-side damage that is persistent yet trigger-conditioned, producing optical artifacts that bias geometric inference under particular visual conditions. We instantiate this threat through Scratch-induced Lens Adversarial Streak Hijacking (SLASH), a physical-world attack caused by small scratches on a camera lens or protective cover. Scratches interact with bright light sources and specular reflections to create structured streak artifacts that distort depth cues. Since the perturbation is fixed in the optical path but triggered by the scene, it is both persistent and selective. We formulate the attack in optical space, model the scratch pattern as a trigger-conditioned optical channel, and optimize one fixed configuration across diverse viewing conditions. We evaluate SLASH on monocular depth estimation and monocular 3D object detection in digital and real-world settings. Under the fixed-scratch constraint, directional depth shifts reach up to 32% relative error for monocular depth estimation, with consistent effects on monocular 3D object detection. Physical experiments confirm transfer to real camera recordings, inducing depth shifts above the model's natural prediction baseline. These findings reveal an attack surface where benign-looking hardware imperfections act as latent, scene-triggered adversarial mechanisms, challenging assumptions about physical robustness and motivating defenses for secure vision systems.

[CV-19] Value-order Decomposition for Generalist Anomaly Detection

Link: https://arxiv.org/abs/2606.14475
Authors: Miaoyun Zhao, Jing Chen, Miaoni Zhao, Qiang Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Industrial anomaly detection suffers from limited data, making cross-domain generalization particularly challenging. Generalist Anomaly Detection (GAD) aims to train a unified model on a source domain that can effectively detect anomalies in unseen target domains. In the initial semantic feature space, strong entanglement between anomalies and object categories or defect types hinders effective generalization across domains. Recent works address this issue by projecting features into a residual space; however, such methods primarily increase cross-domain overlap for normal features, while anomalous features remain specific to object categories, defect types and data domains, leading to poor alignment and generalization. To address this limitation, we propose Value-order Decomposition (VOD), a simple yet effective technique that bridges three types of generalization gaps across object categories, defect types (including real and synthetic defects), and data domains. VOD disentangles and suppresses object-category-, defect-type-, and domain-specific information, promoting alignment within normal and abnormal samples while preserving their separability, thereby enabling robust generalization across the three gaps. Leveraging the strong alignment between real and synthetic defects within the same object, we perform anomaly detection using only normal and synthetic-abnormal references, and effectively generalize to unseen real defect types. Experiments on diverse industrial and medical benchmarks demonstrate that our method, using a simple cut-and-paste anomaly simulation strategy, achieves strong generalization across the three gaps.

[CV-20] MooMIns – Monocular 3D Reconstruction and Object Pose Estimation from Multiple Instances

Link: https://arxiv.org/abs/2606.14389
Authors: Robert Langendörfer, Markus Hillemann, Markus Ulrich
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Simultaneous 3D reconstruction and 6D object pose estimation from a single monocular image is an inherently ill-posed problem. In industrial settings, however, multiple instances of an object are often randomly arranged in bins, implicitly providing several views of the same object within a single image. We show that this implicit multi-view geometry can be exploited to simultaneously reconstruct the object in 3D and estimate the 6D pose of each visible object instance. We present MooMIns, a new Gaussian-splatting-based approach that inverts the original Gaussian splatting formulation: instead of rendering a single scene from multiple cameras, we render multiple object instances from a single camera. Our method is initialized with SAM3 instance segmentation masks and a modified Structure from Motion (SfM) pipeline. In contrast to learned monocular depth estimation, we perform true geometry-based reconstruction from image evidence, avoiding hallucinations caused by training data priors. We evaluate MooMIns on synthetic and real bin-picking scenarios, and demonstrate accurate reconstruction of previously unseen objects as well as reliable pose estimation of individual instance

[CV-21] IndustryBench-MIPU: Benchmarking Multi-Image Attribute Value Extraction for Industrial Products

Link: https://arxiv.org/abs/2606.14383
Authors: Haonan Qi, Jin Cao, Yongqi Zhang, Xintong Wang, Weidong Tang, Bin Chen, Chengfu Huo, Haojun Pan, Hengyu You, Jing Li, Yingde Wang, Liang Ding
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Industrial products such as valves and circuit breakers are defined by dense technical specifications that govern procurement, compatibility, and safety across supply chains. These specifications are scattered across multiple heterogeneous product images, including specification tables, nameplates, and technical drawings, yet whether Multimodal Large Language Models (MLLMs) can reliably recover them remains underexplored. To fill this gap, we introduce IndustryBench-MIPU, the first large-scale benchmark for multi-image industrial product understanding, built around structured attribute extraction – recovering property-value pairs from product images. This task jointly probes text recognition on specification tables and nameplates, visual reasoning over technical drawings, domain knowledge to decode industrial terminology, and cross-image evidence integration to assemble scattered specifications. Concretely, the benchmark comprises 4,559 products across 27,652 images with 103,703 annotations spanning 18 industrial categories, constructed through multi-model consensus and three-tier quality assurance. Evaluating nine MLLMs under both single-image and product-level multi-image settings reveals a stark completeness gap: models achieve high precision (86–94%) but the best recovers only 49.9% of product-level attributes; moving from single-image to multi-image extraction costs 15–34 percentage points of recall. Multi-image completeness, not single-image accuracy, is the core bottleneck. Dataset and code are publicly available.
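
The gap between high precision and low recall reported above can be made concrete with a small sketch. The code below assumes exact matching of (attribute, value) pairs, which may differ from the benchmark's actual scoring rule; the function name and the valve attributes are hypothetical illustrations, not taken from the dataset.

```python
def precision_recall(predicted, gold):
    """Precision and recall over sets of (attribute, value) pairs.

    Assumption: a prediction counts as correct only on an exact match.
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical product: four annotated attributes, the model recovers two.
gold = {
    ("nominal diameter", "DN50"),
    ("pressure rating", "PN16"),
    ("body material", "cast iron"),
    ("connection", "flanged"),
}
predicted = {
    ("nominal diameter", "DN50"),
    ("pressure rating", "PN16"),
}

precision, recall = precision_recall(predicted, gold)
```

In this toy case every extracted pair is right, yet half of the product's attributes are missing, which is the kind of completeness gap the abstract describes.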

[CV-22] FLaRA: Predicting Future Latent Representations for Accident Anticipation ITSC2026

Link: https://arxiv.org/abs/2606.14380
Authors: Lorenzo Caselli, Tomaso Trinci, Tommaso Bianconcini, Simone Magistri, Leonardo Taccari, Francesco Sambo, Andrew D. Bagdanov
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the 2026 IEEE International Conference on Intelligent Transportation Systems (ITSC 2026)

Click to view abstract

Abstract:Anticipating traffic accidents from dashcam videos is a critical challenge in intelligent transportation systems. Existing methods typically map visual context directly to a collision probability without explicitly modeling the future evolution of the driving scene. In this paper we propose FLaRA (Predicting Future Latent Representations for Accident Anticipation), a novel predictive architecture that shifts this paradigm by forecasting future latent representations for accident anticipation. Building upon the Video Joint-Embedding Predictive Architecture (V-JEPA2), our model conditions a predictor network on observed context frames to predict the forthcoming latent features of the scene. A classifier then operates on these predicted future representations rather than only on past observations. To ensure these forecasts remain grounded in realistic future dynamics, we introduce a joint training objective that simultaneously optimizes an auxiliary feature-level reconstruction loss and a cross-entropy classification loss. Extensive evaluations on the Nexar dataset, alongside cross-domain validations on the DAD, DADA-2000, and DoTA benchmarks, demonstrate that our approach achieves state-of-the-art performance while maintaining realistic early warning capabilities.
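
The joint training objective described above can be sketched as a weighted sum of a classification term and a feature-level reconstruction term. This is only an illustration under assumptions: the abstract does not state the form of the reconstruction loss or any weighting, so the mean squared error and the `recon_weight` parameter below are hypothetical choices.

```python
import math


def feature_reconstruction_loss(pred_latent, target_latent):
    """Mean squared error between predicted and actual future features (assumed form)."""
    return sum((p - t) ** 2 for p, t in zip(pred_latent, target_latent)) / len(pred_latent)


def cross_entropy_loss(logits, label):
    """Softmax cross-entropy for a single example, computed stably."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def joint_loss(pred_latent, target_latent, logits, label, recon_weight=1.0):
    """Classification loss plus weighted reconstruction loss (recon_weight is hypothetical)."""
    return cross_entropy_loss(logits, label) + recon_weight * feature_reconstruction_loss(
        pred_latent, target_latent
    )


# Perfect feature forecast, uninformative classifier: the loss reduces to log(2).
loss = joint_loss([0.0, 0.0], [0.0, 0.0], logits=[0.0, 0.0], label=1)
```

The reconstruction term is what ties the predicted features to the observed future, so the classifier cannot succeed by ignoring the forecast.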

[CV-23] Point Cloud Upsampling through Patch-based Frequency Superposition

Link: https://arxiv.org/abs/2606.14355
Authors: Marina Ritthaler, Azhar Hussian, Vasileios Belagiannis, André Kaup
Categories: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Comments:

Click to view abstract

Abstract:In recent years, neural networks have become the dominant models in most point cloud upsampling methods. Although these approaches are achieving good results, they do have drawbacks, such as a lack of interpretability and data dependency. Moreover, they have to be trained on a dataset that is similar to the test data in order to perform well. To avoid these disadvantages, we propose Point Cloud Upsampling through Patch-based Frequency Superposition (PUtPFS), an optimization-based approach that selects subsets of points and estimates the surface of this set through superpositioning spatial frequencies. Then, new points are placed on this surface. By successively selecting points in the least dense regions of the point cloud, a uniform upsampling can be reached. With this method, we surpass the current best upsampling results in the commonly considered point-to-surface distance. Furthermore, we achieve the best Chamfer and Hausdorff distance among the optimization-based approaches. As an additional advantage, our method does not need any training data and is mathematically interpretable.

[CV-24] ForceForget: Reinforcement Concept Removal for Enhancing Safety in Text-to-Image Models ICML2026

Link: https://arxiv.org/abs/2606.14351
Authors: Dong Han, Yong Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICML 2026

Click to view abstract

Abstract:With the advance of generative AI, text-to-image (T2I) models can generate a wide variety of content. However, T2I models can still generate unsafe content. To alleviate this issue, various concept erasing methods have been proposed. However, existing methods tend to excessively erase unsafe concepts and suppress benign concepts contained in harmful prompts, which can negatively affect model utility. In this paper, we focus on eliminating unsafe content while maintaining the model's ability to interpret safe semantic meaning by optimizing the concept erasing reward (CER) with reinforcement learning. To avoid excessive content erasure, we introduce the Safe Adapter to project partial text embedding for efficient concept regulation in cross-attention layers. Extensive experiments conducted on different datasets demonstrate the effectiveness of the proposed method in alleviating unsafe content generation while preserving the high fidelity of benign images compared with existing state-of-the-art (SOTA) concept erasing methods. In terms of robustness, our method outperforms counterparts against red-teaming tools. Moreover, we show that the proposed approach is more effective than other methods in emerging image-to-image (I2I) scenarios. Lastly, we extend our method to erase general concepts, such as artistic styles and objects. Disclaimer: This paper includes discussions of sexually explicit content that may be offensive to certain readers. All images used in this work are synthesized or from public datasets.

[CV-25] CausalMotion: Structured Physical Reasoning as Keyframe and Trajectory Guidance for Training-Free Video Generation

Link: https://arxiv.org/abs/2606.14317
Authors: Sihan Zhuang, Xinyuan Chen, Tianfan Xue, Yaohui Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Click to view abstract

Abstract:Recent advances in diffusion-based video generation have significantly improved visual quality and short-term temporal coherence. However, existing methods still struggle to produce videos with physically consistent and causally plausible dynamics, especially in scenarios involving long-horizon interactions. This limitation arises from the fact that video diffusion models primarily learn physical consistency implicitly, while vision-language models can directly model physical laws. Based on this idea, in this work, we propose CausalMotion, a training-free framework that injects explicit physical reasoning into video generation through structured intermediate representations. Our key idea is to decouple reasoning from generation by leveraging a vision-language model to decompose a text prompt into a sequence of causally consistent keyframes and object-centric motion trajectories. These representations are then aligned and integrated as soft constraints to guide a pretrained video diffusion model during inference. This design enables explicit modeling of object dynamics and causal transitions without requiring additional training or supervision. Extensive experiments show that our method consistently improves physical plausibility and temporal coherence, particularly in dynamics-intensive scenarios, while maintaining high perceptual video quality.

[CV-26] Pano3D: Unified 3D Reconstruction and Panoptic Segmentation

Link: https://arxiv.org/abs/2606.14307
Authors: Victor Barberteguy, Ahmet Iscen, Mathilde Caron, Alireza Fathi, Gül Varol, Cordelia Schmid
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Click to view abstract

Abstract:Recent advances in 3D feedforward reconstruction neural networks have achieved remarkable success in dense reconstruction from images without any camera parameters. Yet, equipping these models with robust semantic understanding remains an open problem. Here we introduce an approach that performs 3D reconstruction and 3D panoptic segmentation in a unified framework. We build on existing 3D reconstruction models and augment them with a set-based mask decoder. The approach is jointly trained with a geometric and semantic loss, which are shown to be mutually beneficial. More precisely, the features are initialized from the geometric information and then finetuned to capture jointly geometry and semantics. We demonstrate the generality of our approach by successfully applying our framework both to online and all-to-all attention reconstruction backbones. Our method achieves state-of-the-art performance in 3D panoptic segmentation across ScanNet, ScanNet200, and ScanNet++ datasets. Ablation studies show that such joint training of a unified model equips 3D feedforward reconstruction neural networks with panoptic segmentation and yields mutually beneficial improvements.

[CV-27] What Drives Test-Time Adaptation for CLIP? A Controlled Empirical Study from an Update Perspective

Link: https://arxiv.org/abs/2606.14299
Authors: Jiazhen Huang, Xiao Chen, Zhiming Liu, Yaru Sun, Jingyan Jiang, Zhi Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Vision-Language Models (VLMs) such as CLIP have become a standard backbone for open-vocabulary recognition, yet their zero-shot predictions remain vulnerable to distribution shifts encountered at deployment. Test-Time Adaptation (TTA) has recently been extended to CLIP as a lightweight solution, leading to a rapidly growing body of TTA4CLIP methods. However, empirical progress in this area has largely outpaced our understanding of what truly drives adaptation, where their gains originate, and under which shifts they remain reliable. In this paper, we take a step back from the pursuit of state-of-the-art accuracy and conduct a systematic controlled study of TTA4CLIP. We first organize existing methods into three unified paradigms according to what is updated at test time. We then introduce TTABC, an open-source TTA Benchmark for CLIP, which standardizes evaluation protocols and integrates more than 20 representative methods. Our controlled empirical analysis focuses on three key areas. First, we determine the driving factors in parameter-based methods, revealing that adaptation gains are primarily driven by test-time evidence and reliable proxies rather than heavy optimization. Second, we explore evidence utilization beyond heavy parameter tuning, showing that competitive and efficient performance can be achieved through cross- or current-sample evidence and lightweight prototype updates. Finally, we demonstrate that there is no silver bullet for TTA: no single adaptation paradigm is universally optimal, and the preferred paradigm depends on the nature of shift. We hope our benchmark and study provide a clearer understanding of the current TTA4CLIP landscape and establish a foundation for further research.

[CV-28] Pix2Pix-Hybrid: Structure-Guided Conditional Synthesis of Hajj Crowd Images with Multi-Channel Conditioning and Weak Attribute Supervision

Link: https://arxiv.org/abs/2606.14297
Authors: Amirah F. Alshammari, Bander A. Alzahrani, Nahed A. Alowidi
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Developing accurate crowd-counting models for Hajj pilgrimage scenes remains challenging because domain-specific annotated images are scarce and data collection during large gatherings raises privacy concerns. To address these limitations, this paper proposes Pix2Pix-Hybrid (P2P-H), a hybrid conditional GAN for structure-guided Hajj crowd-image synthesis and data augmentation. P2P-H builds on Pix2Pix and employs a U-Net generator conditioned on eight input channels that jointly encode structural cues (edges and grayscale) and contextual attributes (crowd density and time of day). To capture detailed textures in dense scenes, the framework integrates two multi-scale PatchGAN discriminators operating at different resolutions. The training procedure combines adversarial, perceptual, and feature-matching objectives with adaptive data augmentation and stabilization strategies. The model was trained on 993 real Hajj frames collected from 60 publicly available video sources, with conditioning attributes derived automatically to reduce manual labeling effort. Using this framework, we constructed CrowdH, a synthetic dataset of 10,000 high-resolution Hajj crowd images. Experimental results show that P2P-H improves structure-preserving conditional synthesis quality compared with Pix2Pix and StyleGAN2-ADA baselines and shows favorable transfer to other crowd datasets. To assess downstream utility, we further constructed CrowdH-Mix-469, an annotated mixed real-synthetic dataset comprising 384 real Hajj images and 85 selected synthetic images, and evaluated five crowd-counting models under real-only and real-plus-synthetic training. The selected synthetic data reduced MAE across all five models, with the strongest gain observed for CSRNet.

[CV-29] A Robust Point Cloud Analysis Framework Inspired By Primary Visual Cortex

Link: https://arxiv.org/abs/2606.14292
Authors: Jisheng Dang, Dengyue Pan, Delin Deng, Yifan Zhang, Bimei Wang, Hong Peng, Bin Hu, Qi Tian, Tat-Seng Chua
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 2 figures, 7 tables

Click to view abstract

Abstract:Despite significant advancements in point cloud analysis, reducing energy consumption and improving robustness remain understudied, largely due to the inherent limitations of Convolutional Neural Networks (CNNs). To address this issue, we draw inspiration from the primary visual cortex and propose a Dendritic-Connected Continuous-Coupled Neural Network (DC-CCNN), a novel Brain-Inspired Neural Network (BINN) architecture for point cloud analysis. By combining discrete and continuous encoding, our design replaces traditional Multilayer Perceptrons (MLPs) with more efficient and robust BINNs. Building upon this framework, we further propose an extended model, DC-CCNN++, to improve robustness under complex corruption conditions. Specifically, we introduce a Neuro-Inspired Robust Modulation-and-Readout Module (NRMR) to enhance feature stability and decision robustness through global-context gain modulation and dual-code evidence integration. We also design a Cortically Inspired Progressive Variability Training (CPVT) strategy, which progressively exposes the model to structured environmental variability while preserving stable clean-sample anchors during training. Experimental results show that DC-CCNN++ improves the performance of brain-inspired networks on point cloud analysis while maintaining performance comparable to state-of-the-art methods. Compared with the original DC-CCNN, it achieves stronger results on both classification and part segmentation, and exhibits enhanced robustness against sparsity, occlusion, Gaussian noise, salt-and-pepper noise, and spatial transformations. With its efficiency, robustness, and biologically grounded design, DC-CCNN++ provides a promising alternative to traditional deep learning methods for point cloud analysis. Code is available at this https URL.

[CV-30] One Layer's Trash is Another Layer's Treasure: Adaptive Layer-wise Visual Token Selection in LVLMs CVPR2026

Link: https://arxiv.org/abs/2606.14277
Authors: Yongru Chen, Kai Zhang, Zeliang Zong, Yuchen Lu, Wenming Tan, Ye Ren, Jilin Hu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2026 (highlight)

Click to view abstract

Abstract:Large Vision-Language Models (LVLMs) have achieved remarkable success across diverse multimodal tasks, yet their practical deployment remains constrained by the computational burden arising from lengthy visual tokens. While visual token pruning has emerged as a promising solution, existing methods suffer from a fundamental limitation: once tokens are pruned at a specific layer, they become inaccessible to all subsequent layers, leading to premature information loss that can compromise model performance. Through empirical studies, we observe that different layers exhibit distinct visual region focus, indicating a varying optimal token subset across layers. Motivated by this insight, we propose Adaptive Layer-wise Visual Token Selection (ALVTS), a novel framework that breaks away from the conventional static token pruning paradigm. ALVTS incorporates a lightweight token selector to identify and route important tokens for further processing, while allowing less important tokens to skip the layer, thus minimizing computational redundancy. These two streams of tokens are seamlessly reintegrated before being fed into subsequent layers, facilitating adaptive compression across the entire model. Grounded in our importance consistency constrained low-rank approximation, the proposed token selection module closely emulates the full attention mechanism, effectively capturing its essential patterns without requiring model retraining. Extensive experiments on LLaVA-1.5, LLaVA-NeXT, and Qwen2.5-VL validate the effectiveness of our method. With an 89% token compression ratio, ALVTS retains 96.7% of the original model’s accuracy, achieving a superior efficiency-accuracy trade-off for LVLM inference.
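
The routing idea described above (selected tokens go through the layer, the rest skip it, and both streams are merged again) can be sketched in a few lines. This is a toy illustration, not the authors' selector: the scores are given rather than predicted by a low-rank module, tokens are plain numbers, and `route_tokens`, `keep_ratio`, and `toy_layer` are hypothetical names.

```python
def route_tokens(tokens, scores, layer_fn, keep_ratio):
    """Run only the highest-scoring tokens through layer_fn; the rest skip the layer."""
    n_keep = max(1, round(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    selected = sorted(ranked[:n_keep])           # restore original token order
    processed = layer_fn([tokens[i] for i in selected])
    output = list(tokens)                        # skipped tokens pass through unchanged
    for i, new_token in zip(selected, processed):
        output[i] = new_token                    # reintegrate the processed stream
    return output


def toy_layer(selected_tokens):
    """Stand-in for a transformer layer: scales each token by 10."""
    return [t * 10 for t in selected_tokens]


tokens = [1.0, 2.0, 3.0, 4.0]
scores = [0.9, 0.1, 0.8, 0.2]
routed = route_tokens(tokens, scores, toy_layer, keep_ratio=0.5)
```

Because skipped tokens are kept in the output rather than discarded, a later layer with different scores can still select them, which is the contrast with one-shot pruning that the abstract draws.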

[CV-31] HiST: A Hierarchical Sparse Transformer for Cross-Modal Spatial Transcriptomics Modeling

Link: https://arxiv.org/abs/2606.14251
Authors: Weiyi Wu, Xinwen Xu, Xingjian Diao, Siting Li, Zhi Wei, Alma Andersson, Jiang Gui
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Spatial transcriptomics (ST) links gene expression with tissue morphology but remains expensive and low-throughput, motivating surrogates that infer expression from routine histology. Whole-slide HE-to-ST inference pairs a gigapixel image with gene measurements at a sparse, irregular set of locations, making multiscale modeling challenging without incurring dense-grid overhead or quadratic token mixing. We propose HiST, a hierarchical sparse transformer that treats measured locations as a lattice-indexed sparse field and builds a dyadic encoder–decoder directly on the active tissue footprint. HiST combines sparse window attention for local geometric correspondence with resolution-changing operators for rapid multiscale context integration. For a fixed window size, the dominant runtime and memory scale with the number of observed locations rather than the dense slide area. To mitigate slide-specific acquisition variation, HiST adds a bottlenecked global conditioning pathway via a slide calibration token that summarizes slide-level context and conditions local representations. On a multi-organ benchmark spanning diverse tissues and acquisition sources, HiST improves predictive performance over recent baselines while reducing runtime and peak memory.

[CV-32] Hybrid Classical-Quantum (HCQ) Alzheimer's Classification via Supervised β-VAE and Quantum Kernels

Link: https://arxiv.org/abs/2606.14194
Authors: Tia Tiwari, Vamshi Krishna Kancharla, Neelam Sinha
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:This paper presents a two-stage Hybrid Classical-Quantum (HCQ) pipeline for binary Alzheimer’s disease (AD) classification from 3D T1-weighted structural MRI volumes, where the classical and quantum components are designed to complement each other rather than operate independently. A supervised 3D β-variational autoencoder (VAE) is trained end-to-end under voxel-wise reconstruction, KL-divergence, and focal classification losses that compress each 3D MRI volume (resized from 152 × 184 × 152 to 96 × 96 × 96) into a 64-dimensional latent code. Partial Least Squares (PLS) regression selects the six components in the latent code that best separate Alzheimer’s Disease (AD) from cognitively normal (CN) subjects and rescales them into rotation angles, which are encoded onto a six-qubit register using the ZZ quantum feature map to give us the respective quantum states. The input to a precomputed-kernel Support Vector Machine (SVM) is an N × N Gram matrix (N = 308), created by calculating the overlap between every pair of quantum states. The novelty of this work lies in the fact that the quantum kernel operates directly on disease-aware features that are learned end-to-end by a supervised autoencoder, rather than on pre-extracted inputs. On 308 ADNI-1 subjects, consisting of 137 AD and 171 CN subjects, the baseline achieved 67.2% accuracy and 0.759 AUC, while the stability-enhanced variant reached 72.1% accuracy and 0.799 AUC with cross-fold variance halved. 3D Grad-CAM further helped validate our model’s focus on brain regions linked to Alzheimer’s. The HCQ pipeline could serve as a general-purpose framework for diagnostic classification across biomedical imaging domains that present similar challenges for classical approaches.
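
The Gram-matrix step above, where each kernel entry is the overlap between a pair of encoded states, can be illustrated classically. Note the substitution: instead of the ZZ feature map used in the paper, the sketch below assumes a much simpler product-state encoding in which each rotation angle θ prepares one qubit as cos(θ/2)|0⟩ + sin(θ/2)|1⟩, so the overlap of two such states has a closed form. The function names are hypothetical.

```python
import math


def product_state_overlap(angles_a, angles_b):
    """Inner product of two product states with one RY-style rotation per qubit.

    For single-qubit states cos(t/2)|0> + sin(t/2)|1>, the overlap of angles
    a and b is cos((a - b) / 2); a product state multiplies these per qubit.
    """
    overlap = 1.0
    for a, b in zip(angles_a, angles_b):
        overlap *= math.cos((a - b) / 2.0)
    return overlap


def gram_matrix(samples):
    """N x N kernel matrix with entries |<psi(x_i)|psi(x_j)>|^2."""
    return [[product_state_overlap(x, y) ** 2 for y in samples] for x in samples]


# Three toy samples with two angles each (the paper uses six angles per subject).
samples = [[0.0, 0.0], [math.pi, 0.0], [math.pi / 2, math.pi / 2]]
K = gram_matrix(samples)
```

A matrix built this way is symmetric with ones on the diagonal, and it is the kind of input a precomputed-kernel SVM consumes in place of raw feature vectors.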

[CV-33] Context-aware Modality-Topology Co-Alignment for Multimodal Attributed Graphs

Link: https://arxiv.org/abs/2606.14172
Authors: Sirui Zhang, Xu Wang, Zhengyu Wu, Xunkai Li, Hongchao Qin
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Multimodal Attributed Graphs (MAGs) model real-world entities by coupling graph topology with heterogeneous attributes such as text and images. They support graph-centric tasks requiring structural and class-discriminative representations, and modality-centric tasks requiring fine-grained cross-modal correspondence. However, existing MAG methods often rely on fixed graph contexts or uniformly fused representations, causing task-agnostic propagation and over-compressed fusion that hinder diverse task requirements and modality-specific evidence preservation. To address this, we propose CoMAG, a unified MAG backbone that learns task-adaptive reliable contexts and modality-preserving alignment within them. CoMAG first conducts Reliable Context Learning by estimating edge reliability from multimodal semantic consistency, complementing raw topology with semantic neighbors, and selecting context components through a task-aware gate. It then performs Modality-preserving Hop-token Alignment by maintaining modality-specific multi-hop trajectories, matching modality-hop tokens across modalities, and decoupling shared and private representations. Thus, CoMAG produces graph and modality representations from one forward pass while retaining modality-specific cues. We further analyze stable propagation, over-smoothing mitigation, and modality-collapse control. Experiments on nine OpenMAG datasets compare CoMAG with feature-only, graph-only, multimodal, and unified MAG baselines across graph-level prediction, modality matching, and graph-conditioned generation. Results show that CoMAG achieves the best reported performance, demonstrating that task-adaptive reliable contexts and modality-preserving alignment improve structural prediction, cross-modal matching, and graph-conditioned generation while retaining sparse edge-linear complexity.

[CV-34] MUSE: Agentic 3D Scene Authoring via Memory-Grounded Incremental Requirement Satisfaction

Link: https://arxiv.org/abs/2606.14168
Authors: Ruijie Xu, Xinnan Zhu, Jiayu Ying, Daoguo Dong, Yuzhou Ji, Xin Tan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Text-driven 3D scene generation is a promising technique for digital content creation, embodied AI simulation, and interactive design, yet practical workflows often require refining, extending, or correcting existing scenes while preserving non-target content. Existing methods can produce realistic and structurally plausible scenes, but they generally lack editability with requirement-level state tracking, so part-level failures often lead to full-scene regeneration or manual intervention. To tackle this challenge, we formulate controllable 3D scene authoring as incremental requirement satisfaction, unifying construction and editing. In this paper, we present MUSE, a memory-grounded multi-agent framework in which an Architect compiles instructions into structured requirements, a Sculptor executes local scene operations, and an Inspector verifies each step while updating Working, Scene, and Skill Memory. To evaluate requirement-level controllability and preservation-aware editing, we introduce AuthorBench, offering 145 constrained construction cases and a 1,584-case preservation-aware editing pool paired with external structured checks. On full construction cases, MUSE improves All-Goal success from 37.9 to 80.7 and surface-constraint fulfillment from 35.0 to 92.6 over the strongest baseline. On a stratified 240-case editing test split, MUSE achieves 49.6 All-Goal success, 99.9 preservation rate, and only 0.6 unintended change rate. Beyond automated metrics, human evaluations on compared local-editing baselines support stronger alignment with user intent, and downstream navigation-proxy tests indicate stronger spatial stability. Combined with ablations validating our memory designs, these results establish MUSE as an effective framework for controllable 3D scene authoring.

[CV-35] VideoWeave: Unlocking Geometric Consistency in Video Generation via Joint Geometry-Video Modeling

Link: https://arxiv.org/abs/2606.14162
Authors: Xunzhi Xiang, Zixuan Duan, Yabo Chen, Zhengxuan Wei, Guiyu Zhang, Zixiao Gu, Zhe Gao, Haibin Huang, Chi Zhang, Qi Fan, Xuelong Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Large-scale video diffusion models often fail to preserve 3D structure over time, causing geometric drift and implausible motion under viewpoint changes. Existing methods usually enforce geometric consistency by using explicit geometry reconstructions, such as depth maps, point clouds, or reconstructed 3D structures, to define conditions, supervision, or reward signals, making the generator sensitive to errors from upstream geometry pipelines. We propose VideoWeave, a latent-space post-training framework that uses implicit geometry-model features to constrain the generative distribution, providing a more flexible and non-rigid form of guidance that mitigates the impact of reconstruction errors from geometry models. Specifically, VideoWeave adapts these features into geometry latents and jointly models them with video latents in a shared denoising space, allowing geometry to shape the generative distribution during training. To support this process, we build GeoVid-80K, an 80K-video dataset with paired appearance and geometry representations. Experiments on text-to-video and image-to-video generation show that VideoWeave improves geometric coherence while preserving strong visual quality. VideoWeave project page at this https URL

[CV-36] Encoder Winners Do Not Reliably Transfer Across VLA Backbone Scale: A Frozen-Backbone Grafting Diagnostic

Link: https://arxiv.org/abs/2606.14153
Authors: Qingping Zeng, Fei She
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 23 pages, 5 figures, 8 tables

Click to view abstract

Abstract:Vision-language-action (VLA) policies typically inherit their vision encoder from upstream VLM releases, but it is unclear whether an encoder choice validated on a small VLA transfers to a larger backbone. We introduce a frozen-backbone grafting diagnostic: the vision tower of a released VLA is replaced by a candidate encoder under a fixed protocol (adaptive average pooling, LayerNorm, and a single trainable linear projector), with the language model and action expert frozen. Across four encoders, two LIBERO suites, two backbones (SmolVLA-450M and π0.5-3.3B), and two-to-three seeds per cell (40 main grafting runs plus native, LoRA, pooling, and zero-/shuffled-image controls, all scored by offline action MSE), the small-backbone winner does not reliably select the large-backbone top tier: SigLIP is best on SmolVLA across both suites, while on π0.5 DINOv2-small leads the spatial suite and the object suite is a seed-sensitive near-tie band; three of the four backbone-suite comparisons (and 11 of 12 seed-level cells) support backbone-dependent rankings. The grafting wrapper is itself non-neutral with opposite sign across backbones (+45% to +56% MSE on the SmolVLA native tower, -50% to -52% on π0.5), so all conclusions are conditional on the fixed grafting protocol. We position frozen grafting as a cheap target-backbone diagnostic to run before committing to an encoder at scale, not as a closed-loop deployment claim.
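
The fixed grafting protocol named above (adaptive average pooling, LayerNorm, one linear projector) can be sketched in pure Python to show how a candidate encoder's tokens are reshaped for a frozen backbone. This is an assumption-laden illustration: pooling is done over a 1D token sequence with PyTorch-style bin boundaries, LayerNorm has no learnable scale or shift, and all names and sizes are hypothetical rather than taken from the paper.

```python
import math


def adaptive_avg_pool(tokens, out_len):
    """Average a sequence of token vectors down to out_len tokens."""
    n, dim = len(tokens), len(tokens[0])
    pooled = []
    for i in range(out_len):
        start = (i * n) // out_len            # floor(i * n / out_len)
        end = -((-(i + 1) * n) // out_len)    # ceil((i + 1) * n / out_len)
        window = tokens[start:end]
        pooled.append([sum(t[d] for t in window) / len(window) for d in range(dim)])
    return pooled


def layer_norm(vec, eps=1e-5):
    """Normalize one token to zero mean and unit variance (no affine parameters)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def linear(vec, weight, bias):
    """The single trainable piece: project a token into the backbone's width."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


def graft(tokens, out_len, weight, bias):
    """Pool to the expected token count, normalize, then project each token."""
    return [linear(layer_norm(t), weight, bias) for t in adaptive_avg_pool(tokens, out_len)]


# Four 2-dimensional encoder tokens mapped to two 1-dimensional backbone tokens.
encoder_tokens = [[1.0, 3.0], [3.0, 5.0], [5.0, 7.0], [7.0, 9.0]]
pooled = adaptive_avg_pool(encoder_tokens, 2)
projected = graft(encoder_tokens, 2, weight=[[1.0, -1.0]], bias=[0.0])
```

Only the projector's weights would be trained in such a setup, which is why the abstract treats the wrapper itself as a possible source of bias that has to be measured with controls.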

[CV-37] BoRAD: Bootstrap your Own Representations for Multi-class Anomaly Detection

Link: https://arxiv.org/abs/2606.14129
Authors: Duy Hoang Khuong, Tri Nguyen Minh, Ngu Huynh Cong Viet
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Reconstruction-based anomaly detection is attractive for industrial inspection, but scaling it from category-specific training to a one-for-all setting is challenging. A single model must reconstruct diverse normal appearances without copying abnormal details, which exposes two coupled failure modes: identical shortcut, where anomalies pass through the reconstruction path, and mis-reconstruction, where normal categories are confused with one another. We propose BoRAD, a label-free training framework that treats this as a representation-capacity allocation problem. BoRAD uses a shared learnable prototype bank to impose two complementary regularizers: spatial prototype alignment contracts local within-prototype variation to suppress anomaly copying, while prototype-relative global alignment preserves between-prototype structure and improves sensitivity to abnormal angular deviations. The prototype bank and prediction heads are used only during training; inference remains a standard teacher-student feature discrepancy pass, with no class labels, negative pairs, memory retrieval, or prototype lookup. BoRAD achieves competitive one-for-all anomaly detection performance, including 86.2% mAD on MVTec AD, 80.7% mAD on VisA and 73.1% mAD on Real-IAD. Diagnostic analyses further show reduced anomaly leakage, improved normal-category separability, and stronger anomaly-normal score separation.

[CV-38] Conditioning Matters: Stabilizing Inversion and Attention in Diffusion Image Editing KDD2026 ECML

Link: https://arxiv.org/abs/2606.14125
Authors: Zheyuan Zhan, Hongchen Li, Can Wang, Yinfei Ma, Mingzhen Huang, Ruoshi Bai, Jiawei Chen, Siwei Lyu, Defang Chen
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to ECML PKDD 2026 Research Track

Abstract:Inversion-based image editing offers flexible and training-free control but still struggles with inversion accuracy and the trade-off between editing fidelity and background preservation. While recent methods improve inversion formulations or attention interactions, the role of textual conditioning in shaping diffusion dynamics and editing behavior remains underexplored. We show both empirically and theoretically that the precision of textual conditioning influences inversion stability by modulating the geometry of the diffusion velocity field, while also affecting the consistency of cross-branch attention during editing. These effects directly impact background preservation and semantic fidelity. Building on this analysis, we propose SimEdit, a conditioning-aware framework with two complementary components: (a) conditioning refinement, which constructs conditioning signals with improved semantic precision and structural alignment to facilitate stable inversion and consistent attention manipulation, and (b) token-wise cross-branch attention control, which separates edit-relevant and structure-preserving components and modulates them asymmetrically during attention manipulation. Extensive experiments on PIE-Bench demonstrate that SimEdit consistently improves both inversion reconstruction quality and editing performance over previous attention-manipulation approaches. Our code is available at this https URL.

[CV-39] A New Multi-Domain Benchmark for Micro-Action Recognition and Detection

Link: https://arxiv.org/abs/2606.14096
Authors: Yanbin Hao, Pengyu Liu, Xing Wei, Xun Yang, Dan Gu, Meng Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 9 figures

Abstract:Micro-actions are short-duration, low-amplitude subtle body movements at the whole-body level that can reveal latent intentions, involuntary reactions, and fine-grained affective changes. Our previous MA-52 benchmark has provided an important foundation for micro-action recognition, but it remains limited in scale, scene diversity, task coverage, and evaluation protocols. To advance micro-action analysis toward more realistic and comprehensive settings, we introduce MMA-82, a large-scale multi-domain extension of MA-52. MMA-82 expands the label space from 52 to 82 fine-grained micro-action categories and covers four distinct domains, including laboratory interviews, street interviews, psychiatric patient interviews, and emotion-rich television videos, resulting in 77,856 annotated instances from 454 subjects. Built upon MMA-82, we establish two core tasks: Micro-Action Recognition and Multi-label Micro-Action Detection. For recognition, we further define in-domain and cross-domain protocols, including few-shot and zero-shot settings, to evaluate model robustness, transferability, and generalization. Extensive experiments show that current methods still struggle with realistic micro-action understanding, especially under domain shift, long-tailed category distributions, and complex temporal localization. Beyond benchmarking, we investigate the relationship between micro-actions and emotion, showing that micro-actions are strongly associated with emotional states and provide complementary cues to facial micro-expressions for improved emotion recognition. These results demonstrate that MMA-82 serves as a comprehensive and challenging benchmark for realistic micro-action analysis and a valuable resource for human-centered AI. MMA-82 is available at this https URL.

[CV-40] FEMOT: Multi-Object Tracking using Frame and Event Cameras

Link: https://arxiv.org/abs/2606.14094
Authors: Shiao Wang, Xiao Wang, Chao Wang, Yitao Li, Menghao Liu, Bo Jiang, Yaowei Wang, Yonghong Tian, Jin Tang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Conventional RGB cameras have been widely used in multi-object tracking due to their ability to capture rich appearance and semantic information. However, their performance is often degraded under complex real-world challenges, such as motion blur, low illumination, and overexposure. Bio-inspired event cameras offer high temporal resolution and high dynamic range, providing complementary cues under extreme scenarios. Nevertheless, RGB-event multi-object tracking remains underexplored due to the lack of large-scale and well-annotated datasets. To address this issue, we propose FEMOT, a large-scale RGB-event multi-object tracking dataset that covers diverse real-world scenarios and 14 challenging attributes. With both RGB and event data as well as high-quality annotations, FEMOT provides a reliable platform for systematically evaluating RGB-event multi-object tracking methods. Based on FEMOT, we retrain and evaluate over ten strong trackers, thereby establishing a comprehensive benchmark for future research. Furthermore, we propose FEMOTR, a multimodal tracking framework that decouples RGB and event features and fuses them in the frequency domain, thereby effectively exploiting their complementary characteristics for robust object localization and identity association. Extensive experiments on FEMOT and DSEC-MOT datasets demonstrate the effectiveness of the proposed method. The source code and benchmark dataset have been released on this https URL.

[CV-41] Clay-CNN Hybrids: Leveraging Geo-Foundational Models as Auxiliary Context for Landslide Detection

Link: https://arxiv.org/abs/2606.14081
Authors: Huong Binh Vu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: 9 pages, 7 figures, 2 tables

Abstract:Rapid post-event landslide mapping is essential for disaster response but remains difficult to automate due to extreme class imbalance. This study evaluates whether Clay v1.5, a Geo-Foundational Model (GFM), can improve pixel-level landslide segmentation on the Landslide4Sense (L4S) benchmark, which contains 3,799 training chips with 14 Sentinel-2 and terrain bands and approximately 2% positive pixels. We compare three strategies: Clay as the primary encoder with multi-scale residual terrain fusion, a U-Net backbone augmented with Clay semantic context at the bottleneck, and a standard U-Net baseline. The hybrid U-Net + Clay model with two-stage Low-Rank Adaptation (LoRA) achieved the best test F1 of 64.5 +/- 1.8% over three seeds, surpassing the Clay-only backbone (55.2 +/- 3.6%) and the U-Net baseline (59.9%). Clay as a standalone encoder underperformed the U-Net due to the absence of multi-scale skip connections, but its pretrained representations consistently improved performance when injected as auxiliary context. These findings suggest that GFMs are most effective for landslide detection when they complement spatially detailed convolutional architectures rather than replace them.
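
The abstract reports F1 on a benchmark where only about 2% of pixels are positive. A toy sketch, with made-up pixel labels rather than Landslide4Sense data, shows why F1 is more informative than accuracy under that kind of imbalance. The function names and the 100-pixel example are illustrative assumptions.

```python
def pixel_accuracy(pred, target):
    """Fraction of pixels whose predicted label matches the target."""
    correct = sum(1 for p, t in zip(pred, target) if p == t)
    return correct / len(target)


def pixel_f1(pred, target):
    """F1 of the positive (landslide) class; 0.0 when there are no true positives."""
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical 100-pixel chip with 2% positive pixels.
target = [1, 1] + [0] * 98
all_background = [0] * 100          # predicts no landslide anywhere
partial = [1, 0, 1] + [0] * 97      # one hit, one miss, one false alarm

acc_bg, f1_bg = pixel_accuracy(all_background, target), pixel_f1(all_background, target)
acc_partial, f1_partial = pixel_accuracy(partial, target), pixel_f1(partial, target)
```

Both predictions reach 98% accuracy here, but only F1 separates the model that finds nothing from the one that finds half of the landslide pixels.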

[CV-42] ShearFuse-UNet: Hadamard DCT and Shearlet Transform Fusion for Next-Day Wildfire Spread Prediction

Link: https://arxiv.org/abs/2606.14071
Authors: Ene Meco, Yingyi Luo, Emadeldeen Hamdan, Adam Watts, Ahmet Enis Cetin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We propose ShearFuse-UNet, a lightweight and computationally efficient deep learning model for next-day wildfire spread prediction from multi-modal satellite data. The model integrates three complementary transform-domain branches inside each encoder block of a U-Net backbone: a 2D Fast Walsh-Hadamard Transform (WHT) branch, a 2D Discrete Cosine Transform (DCT) branch, and a cone-adapted digital Shearlet residual branch. The WHT and DCT branches establish orthogonal latent spaces with learnable spectral scaling and fixed soft-thresholding, while the Shearlet branch provides anisotropic, multi-directional feature decomposition that explicitly encodes the elongated edge structures characteristic of fire fronts. A learned SpectralFusion gate adaptively combines the WHT and DCT responses, and the Shearlet reconstruction is added as a residual. This three-branch design bears a loose structural analogy to transformer self-attention: the WHT and DCT branches provide complementary spectral representations that are adaptively fused, while the Shearlet branch contributes directional content through a residual pathway. Unlike self-attention, the proposed design relies on fixed mathematical transforms rather than learned projection operators, reducing parameter count and computational cost. Evaluated on the WildfireSpreadTS dataset, ShearFuse-UNet achieves an F1 score of 0.596 with only 267k parameters, outperforming a ResNet18-based U-Net (14M parameters, F1 = 0.589) and demonstrating a highly favorable accuracy-efficiency trade-off. Results on the Google Next-Day Wildfire Spread dataset further validate these findings across a different benchmark.
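
The transform-domain idea described above (orthogonal transform, per-coefficient scaling, soft-thresholding, inverse transform) can be sketched in one dimension. This is an assumption-laden illustration, not the paper's layer: the model uses 2D transforms inside a U-Net, and the order of scaling and thresholding, the names `wht_branch`, `scales`, and `tau`, and the toy signal are all hypothetical.

```python
import math


def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform; length must be a power of two."""
    out = list(values)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


def soft_threshold(x, tau):
    """Shrink x toward zero by tau; values within [-tau, tau] become 0."""
    if x > tau:
        return x - tau
    if x < -tau:
        return x + tau
    return 0.0


def wht_branch(signal, scales, tau):
    """Orthonormal WHT, per-coefficient scaling, soft-threshold, inverse WHT."""
    n = len(signal)
    norm = math.sqrt(n)
    coeffs = [c / norm for c in fwht(signal)]
    shrunk = [soft_threshold(s * c, tau) for s, c in zip(scales, coeffs)]
    # The orthonormal WHT is its own inverse, so the same routine maps back.
    return [c / norm for c in fwht(shrunk)]


signal = [4.0, 2.0, 0.0, -2.0, 1.0, 1.0, 0.5, -0.5]
identity = wht_branch(signal, [1.0] * 8, 0.0)   # no scaling, no shrinkage
denoised = wht_branch(signal, [1.0] * 8, 0.5)   # small coefficients zeroed
```

Because the transform itself has no parameters, the only trainable quantities in this sketch would be the `scales` vector, which is the source of the low parameter count the abstract emphasizes.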

[CV-43] FoleyGenEx: Unified Video-to-Audio Generation with Multi-Modal Control Temporal Alignment and Semantic Precision INTERSPEECH2026

Link: https://arxiv.org/abs/2606.14049
Authors: Shiyao Wang, Xijuan Zeng, Hui Wang, Shiwan Zhao, Feng Deng, Chen Zhang, Yong Qin
Categories: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by INTERSPEECH 2026

Abstract:We present FoleyGenEx, a unified video-to-audio (VTA) framework integrating multi-modal control, frame-level temporal alignment, and fine-grained semantics, enabling synchronized, versatile audio synthesis for diverse tasks. Existing VTA methods either have multi-modal control but weak temporal alignment or strong alignment but lack reference audio conditioning and semantic precision. FoleyGenEx fills this gap via three core innovations: a conditional injection mechanism for audio-controlled VTA and Foley extension, a multi-modal dynamic masking strategy preserving training synchronization, and an adverb-based data augmentation algorithm leveraging signal processing and large language models to enhance textual supervision with nuanced semantics. Experiments on AudioCaps, VGGSound, and Greatest Hits demonstrate its competitive controllable VTA performance against existing methods. Demo samples are available at this https URL.

[CV-44] WAM4D: Fast 4D World Action Model via Spatial Register Tokens

Link: https://arxiv.org/abs/2606.14048
Authors: Ying Li, Xiaobao Wei, Jiajun Cao, Hao Wang, Xiaowei Chi, Chengyu Bai, Qianpu Sun, Jiajun Li, Xiaojie Zhang, Jian Tang, Sirui Han, Shanghang Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 15 pages, 7 figures, 9 tables

Abstract:World action models (WAMs) have recently shown promise in jointly modeling future observations and executable robot actions. However, most existing WAMs still operate in 2D video or latent spaces, where visually plausible rollouts miss the 3D spatial constraints and occluded contact geometry required for precise manipulation. While geometric foundation models offer strong priors for recovering dense 3D structure and motion from visual observations, forcing WAMs to predict the dense 4D representation introduces costly geometric decoding and slows down causal action generation. To address the trade-off, we present WAM4D, a fast 4D world action model that uses lightweight spatial register tokens as training-time future-depth readouts to transfer pretrained geometric priors into a causal video-action transformer, then removes the register branch for lightweight action inference. To prevent non-causal shortcuts, we further design causal mixture attention for the Mixture-of-Transformers (MoT) WAM backbone, defining modality-specific visibility among video, action, and geometry tokens. Comprehensive experiments on RoboTwin 2.0 and challenging real-world manipulation tasks show that WAM4D improves spatial consistency and achieves competitive action prediction while maintaining efficient inference.

[CV-45] Rethinking One-Step Image Editing through ChordEdit: Reproduction Simplification and New Insights

Link: https://arxiv.org/abs/2606.14042
Authors: Minghan Li, Jeremy Moebel, Mengyu Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages

Abstract:One-step image editing is important for making text-guided editing fast, practical, and easy to deploy, but its underlying mechanism is still not fully understood. We revisit ChordEdit through reproduction, ablation, and simplification. Our analysis shows that a) the chord window $\delta$ largely acts as an effective timestep shift from $t$ to $t - \delta$; b) chord transport acts on high-noise images and mainly performs low-frequency semantic editing; and c) proximal alignment acts on low-noise images and complements it by adding high-frequency target details. In this view, ChordEdit naturally decomposes editing into a coarse low-frequency transport stage and a fine high-frequency alignment stage. These findings suggest a path toward prompt-conditioned dynamic timestep selection for adaptive image editing. All code and results can be found at this https URL.

[CV-46] Toward 360-Degree Indoor Panorama Editing via Tuning-Free Diffusion Model with Refocusing Cross-Attention

Link: https://arxiv.org/abs/2606.14035
Authors: Dinh-Khoi Vo, Nhut-Thanh Le-Hinh, Viet-Tham Huynh, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCCI 2026. Project page: this https URL

Abstract:Zero-shot text-guided diffusion has significantly advanced image editing; however, its practical usability remains constrained by three persistent challenges: prompt brittleness that requires meticulous prompt engineering, spillover edits that unintentionally affect non-target regions, and failures on small or cluttered objects caused by limited fine-grained supervision in training data. We propose FocusDiff (Target-Aware Refocusing for Tuning-Free Diffusion Editing), a tuning-free framework for precise and region-specific image manipulation based on refocusing cross-attention. Given a target region obtained through automated segmentation or manual selection, FocusDiff applies selective blurring to non-edit areas to guide attention toward the masked region while accurately transferring the object’s identity, structure, and appearance to the edited output. Integrated context-preserving modules further ensure background fidelity and global coherence, enabling accurate edits from simple text prompts in a single pass. We also extend FocusDiff to 360-degree indoor panorama editing and demonstrate its effectiveness within virtual reality environments. Extensive experiments on our localized editing benchmark LIMB, comprising 30 multi-object images and 100 annotated examples including challenging small-object cases, show that FocusDiff outperforms existing zero-shot editors in text-image alignment and background preservation, achieving superior precision, photorealism, and usability. The project page is available at this https URL.

[CV-47] GarmentSketch: Large-scale Sketch-to-Fashion Benchmark

Link: https://arxiv.org/abs/2606.14025
Authors: Duong-Duy-Khang Bui, Minh-Tan Pham, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCCI 2026. Project page: this https URL

Abstract:Fashion sketching is a cornerstone of design workflows, allowing rapid visualization of creative concepts prior to physical prototyping. Yet, progress in sketch-based fashion image synthesis has been hindered by the absence of large-scale, high-quality paired resources. To bridge this gap, we present GarmentSketch, a novel dataset comprising 26,249 fashion sketches across 21 garment categories, each paired with detailed textual descriptions. Captions were produced through a multi-stage pipeline that integrates multiple multimodal large language models (MLLMs) with human-in-the-loop refinement, ensuring both semantic accuracy and descriptive richness. We benchmark GarmentSketch on state-of-the-art generative models, providing baseline performance for sketch-guided text-to-image generation. Our experiments reveal both the promise and the current limitations of existing methods. By offering a comprehensive and richly annotated resource, GarmentSketch establishes a foundation for advancing sketch understanding, fine-grained fashion image generation, and creative human-AI collaboration in design. The dataset will be available at: this https URL.

[CV-48] ViT-Up: Faithful Feature Upsampling for Vision Transformers KR

Link: https://arxiv.org/abs/2606.14024
Authors: Krispin Wandel, Jingchuan Wang, Hesheng Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code is available at: this https URL

Abstract:Vision Transformers (ViTs) have become a dominant architecture for visual representation learning, providing exceptionally strong and broadly reusable backbone features. However, ViTs are commonly operated on relatively small patch-token grids due to the quadratic cost of global self-attention, which creates a persistent bottleneck for dense prediction tasks such as semantic segmentation and depth estimation. This has motivated the development of task-agnostic feature upsamplers. While recent state-of-the-art methods produce visually sharp dense representations, their reliance on shallow image encoders for guided upsampling can introduce feature leakage, fragmentation, and blur. We introduce ViT-Up, an implicit feature upsampling framework that replaces external image guidance with layer-wise query construction from intermediate ViT hidden states. This enables feature prediction at arbitrary continuous image coordinates while preserving alignment with the backbone feature space. Experiments demonstrate that ViT-Up consistently outperforms state-of-the-art image-guided upsamplers across dense prediction and semantic correspondence. On DINOv3-S+, ViT-Up improves over prior methods by up to +2.07 mIoU on Cityscapes and +4.17 PCK@0.10 on SPair-71k. With the larger DINOv3-B backbone, these gains increase to +3.36 mIoU and +8.09 PCK@0.10, demonstrating that ViT-Up scales favorably with backbone capacity.

[CV-49] RT-VLA: Real-Time Vision-Language-Action Models via Knowledge Distillation

Link: https://arxiv.org/abs/2606.14010
Authors: Xiangyu Huang, Zhenlin Hua, Han Zhou, Shounak Sural, Ragunathan Rajkumar
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments:

Abstract:Vision-Language-Action (VLA) models have shown strong potential for end-to-end autonomous driving by jointly modeling visual perception, language reasoning, explainability and action prediction. However, their large vision-language backbones and reasoning modules introduce substantial inference latency and thereby prevent their deployment in the unforgiving reality of the road networks. We propose RT-VLA, a lightweight, distilled VLA model that transfers the driving and reasoning capabilities of the state-of-the-art SimLingo model into a compact student through multi-level supervised distillation. RT-VLA preserves language-based reasoning and supports post-hoc explanation through offline language analysis of safety-critical driving moments without adding latency to real-time control. Compared to the SimLingo teacher, RT-VLA maintains competitive closed-loop driving and language reasoning performance while reducing inference time by 44.8X in vision-only mode and 7.9X in vision+language mode. These results suggest that supervised distillation is a practical approach for building real-time, explainable VLA-style autonomous driving models.

[CV-50] HARBOR: Heading Analysis and Reconstruction from Behavioral Observation and Radar

Link: https://arxiv.org/abs/2606.14006
Authors: Joao P. A. Dantas, Paulo F. Silva Filho, Jelton A. Cunha, Gabriel Dietzsch
Categories: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)
Comments:

Abstract:Maritime situational awareness often relies on Automatic Identification System (AIS) transmissions to track vessel movements. However, in operational or conflict scenarios, these data may be unavailable due to signal loss, deliberate deactivation, or intentional spoofing. In such conditions, synthetic aperture radar (SAR) imagery becomes a critical sensing alternative for wide-area maritime monitoring, despite providing only static scene snapshots. This work introduces HARBOR (Heading Analysis and Reconstruction from Behavioral Observation and Radar), a complete pipeline for transforming a single SAR image into predictive motion information without requiring any auxiliary data source at inference time. The method begins with SAR image preprocessing to enhance and segment vessel candidates, followed by automatic detection, size-based classification, and heading estimation using skeleton geometry and local intensity patterns. AIS data are used exclusively during an offline calibration phase to derive vessel-type-dependent motion parameters, which are then applied to generate probabilistic heatmaps of candidate future vessel positions. A case study using real COSMO-SkyMed SAR imagery demonstrates the pipeline on a maritime scene in southern Brazil, showing its ability to extract motion tendencies and generate probabilistic projections of vessel positions in data-denied environments.

[CV-51] Context-Guided Semantic Alignment for Feature Fusion Networks

Link: https://arxiv.org/abs/2606.14005
Authors: Hyungseop Lee, Jiho Lee, Woochul Kang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 26 pages, 12 figures, 8 tables

Abstract:Feature fusion networks are fundamental components in modern object detectors, aggregating multi-scale features to detect objects of varying sizes. However, directly fusing features from different pyramid levels often introduces semantic inconsistency due to their heterogeneous representations. In this paper, we propose Feature Interaction NEtwork (FINE), a lightweight semantic alignment module that refines low-level features via high-level contextual guidance using cross-level attention prior to fusion. To bridge the structural gap and ensure computational efficiency, we introduce an Alignment-Aware Token Sampling that aligns corresponding spatial regions across scales, reducing the attention complexity by an order of magnitude. The resulting attention weights generate a spatial-channel modulation map that is upsampled and applied to the low-level features via residual element-wise modulation. This mechanism ensures that the network selectively enhances semantically relevant pixels while preserving the sub-pixel localization accuracy necessary for dense prediction tasks. FINE is generally applicable to various detectors and consistently improves detection accuracy without compromising efficiency.

[CV-52] Prompt2Effect: Training-Free Image-to-Video Model Specialization via LoRA Generation

Link: https://arxiv.org/abs/2606.13971
Authors: Xiaomeng Yang, Yanyu Li, Gordon Guocheng Qian, Ivan Skorokhodov, Viacheslav Ivanov, Avalon Vinella, Xuan Zhang, Yanzhi Wang, Sergey Tulyakov, Anil Kag
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Personalizing Image-to-Video (I2V) diffusion models with specific visual effects is increasingly demanded for high-end video generation. Current practice requires training a separate Low-Rank Adaptation (LoRA) module for each effect, incurring substantial data curation and iterative optimization costs that hinder interactive control. We present Prompt2Effect, a weight-driven hypernetwork that amortizes per-effect training by directly synthesizing effect-specific LoRA weights in a single forward pass. Unlike prior hypernetworks that regress adapter weights purely from semantics, Prompt2Effect is explicitly conditioned on the frozen base model weights, grounding weight prediction in the structural geometry of each layer. Furthermore, instead of predicting raw LoRA matrices, we introduce an SVD-canonicalized parameterization that resolves factorization ambiguity and stabilizes large-scale weight synthesis. Together, these design principles enable accurate and scalable LoRA prediction for high-dimensional I2V diffusion models. Extensive experiments demonstrate that Prompt2Effect achieves on-par or superior video quality and effect alignment compared to conventional LoRA fine-tuning, while reducing the computational cost from 56 GPU training hours to 3.3 seconds of hypernetwork inference. When used as initialization for subsequent fine-tuning, our predicted weights further improve final performance and accelerate optimization by approximately 10x.
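
The factorization ambiguity mentioned in this abstract can be stated in one line, together with one common way to pick a canonical representative from the thin SVD. The abstract does not give the paper's exact parameterization, so the canonical factors below (with the symbols $B_{\mathrm{can}}$ and $A_{\mathrm{can}}$) are an illustrative assumption, not the authors' definition.

```latex
% A rank-r LoRA update has many equivalent factorizations:
\Delta W = BA = (BM)(M^{-1}A)
\quad \text{for any invertible } M \in \mathbb{R}^{r \times r}.

% One canonical choice, assuming the thin SVD with singular values in descending order:
\Delta W = U \Sigma V^{\top},
\qquad
B_{\mathrm{can}} = U \Sigma^{1/2},
\qquad
A_{\mathrm{can}} = \Sigma^{1/2} V^{\top}.
```

Under this choice, two LoRA modules that encode the same $\Delta W$ map to the same regression target up to the usual SVD sign ambiguity (and rotations within repeated singular values), which is the kind of stabilization a weight-predicting hypernetwork benefits from.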

[CV-53] CaricHarmony: Contrastive Diffusion Paths for Identity-Preserving Caricature Synthesis

Link: https://arxiv.org/abs/2606.13964
Authors: Dongyu Wang, Dar-Yen Chen, Yi-Zhe Song
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Sketch-based caricature synthesis suffers from a fundamental failure mode: when identity and shape conditions are combined in diffusion models, they create destructive interference that causes inevitable collapse toward either bland portraits or unrecognizable distortions. We identify the root cause as condition signal contamination: competing probability distributions in the denoising trajectory that make balanced generation impossible. We present CaricHarmony, the first training-free method that explicitly resolves this contamination through parallel uncontaminated diffusion paths. During inference, we maintain three paths: $\mathcal{P}^{\mathrm{i}}$ (pure identity), $\mathcal{P}^{\mathrm{s}}$ (pure shape), and $\mathcal{P}^{\mathrm{i+s}}$ (harmonized output). Novel energy functions operating on cross-attention features provide gradient guidance that steers $\mathcal{P}^{\mathrm{i+s}}$ toward optimal balance: $\mathcal{E}_{\mathrm{shape}}$ ensures sketch fidelity through layout and semantic alignment, while $\mathcal{E}_{\mathrm{id}}$ employs token-level correspondence matching robust to extreme distortions. Unlike DemoCaricature requiring 70 seconds per-identity fine-tuning or CaricatureBooth constrained to Bezier curves, CaricHarmony accepts any sketch format and generates in under 16 seconds. Experiments demonstrate state-of-the-art performance: 0.8615 shape CLIP score (vs. 0.8450) under comparable identity consistency score, with 7.81 overall user preference score (vs. 6.06). Our method fundamentally reconceptualizes the ID-shape conflict as conditioning signal contamination for diffusion models, enabling unprecedented creative control while preserving recognition.

[CV-54] Self-Evolving Visual Questioner

Link: https://arxiv.org/abs/2606.13929
Authors: Yijun Liang, Hengguang Zhou, Ming Li, Lichen Li, Cho-Jui Hsieh, Tianyi Zhou
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 21 pages, including references and appendix. Project Page is available at this https URL

Abstract:Vision-language models (VLMs) are typically trained as passive answerers, while their ability to actively ask diverse, non-trivial, visual-centric and grounded questions remains underexplored. Existing visual questioners’ performance is bottlenecked by the availability of high-quality training data or the cost of curating them. We show that a VLM can continuously improve itself as a visual questioner without any external supervision. We propose a self-evolving framework that uses a VLM itself as both a proposer and a filter to produce harder, more informative, and visual-centric questions, while maintaining their exploration diversity to avoid training collapse. These questions are then used to train the VLM in both questioner and answerer modes. To evaluate the questioner, we introduce an agentic protocol that assesses questions along perception, reasoning, and diversity dimensions. Experiments across various backbone VLMs show that our method substantially enhances the quality and substantially expands the difficulty boundary of autonomous question generation. Under the same budget, our self-supervision is more effective than training on the static source data. Moreover, the self-evolving questioner remains a competitive or even better answerer.

[CV-55] Overhead Wildlife Locator (OWL): Benchmarking Weakly Supervised Learning for Aerial Wildlife Surveys

Link: https://arxiv.org/abs/2606.13911
Authors: Isai Daniel Chacón, Zhongqi Miao, Bruno Demuro, Caleb Robinson, Rahul Dodhia, Lasha Otarashvili, Jason Holmberg, Kirk Larsen, Howard Frederick, Nathan J. Pamperin, Pablo Arbeláez, Juan M. Lavista Ferres
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 4 figures, 3 tables

Abstract:Automated aerial wildlife surveys increasingly rely on deep learning, yet standard object detectors require bounding-box annotations, reported to be up to seven times slower and three times more expensive to produce than point-level labels. To address this bottleneck, we introduce the Overhead Wildlife Locator (OWL), a weakly supervised density-estimation framework with three variants: OWL-C, a fully convolutional model for high-throughput screening; OWL-T, a Swin-augmented hybrid for heterogeneous, cluttered scenes; and OWL-D, built on a frozen DINOv3 ViT-H+/16 encoder with a DPT-style fusion decoder. We benchmark all three against POLO, YOLOv11n, and YOLOv11l across five public aerial datasets, from sparse fixed-wing savanna surveys to dense UAV paddock imagery, and against the published HerdNet baseline on its native Delplanque split. OWL-D sets a new state of the art on Delplanque (0.934 AP vs. HerdNet’s 0.840) and records the highest AP on four of the five datasets. Performance is regime-dependent: on the extreme-density SheepCounter UAV dataset the hybrid OWL-T leads (0.978 AP) and the convolutional variants attain the lowest counting error, whereas the foundation-based OWL-D degrades, indicating which variant suits which survey type. We further validate operational readiness on the Alaska Department of Fish and Game’s 2022 Central Arctic Caribou census: under cross-herd and cross-temporal transfer, OWL-C fine-tuned on the 2017 Porcupine Caribou Herd split attains F1 = 0.965 on a held-out patch test set, with a signed count error of +3.1% aggregated across the released test patches. We release the OWL code, model weights, and the annotated Porcupine Caribou Herd 2017 (PCH) and Central Arctic Herd 2022 (CAH) patches, the first open patch-level datasets for large-scale caribou aerial surveys, at this https URL.

[CV-56] PMOF: A Dataset and Benchmark for Passenger Monitoring Using Overhead Fisheye Cameras

Link: https://arxiv.org/abs/2606.13910
Authors: Stella Katharina Wermuth, Qazi Arbab Ahmed, Klaus Neumann, Thorsten Jungeblut
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 7 figures. Accepted to the 22nd IEEE International Conference on Advanced Visual and Signal-Based Systems (AVSS 2026)

Abstract:Autonomous staff-free public transport requires reliable in-vehicle passenger monitoring. However, perception inside moving vehicles is challenged by confined spaces, variable illumination, motion-induced background variation, occlusion, and limited viewpoints. To mitigate these spatial constraints, ceiling-mounted fisheye cameras provide full-scene coverage from a single viewpoint. Yet existing public overhead fisheye datasets are recorded in static environments and do not capture the domain shift introduced by vehicle motion. To fill this gap, we introduce PMOF, Passenger Monitoring using Overhead Fisheye cameras, the first public dataset of top-view fisheye imagery captured inside a moving vehicle, comprising over 19k manually annotated frames. PMOF provides rotated bounding boxes, tracking identifiers, and action labels, supporting object detection, tracking, and action recognition. We benchmark PMOF using YOLO26m-obb models fine-tuned under multiple dataset configurations that combine PMOF with existing overhead fisheye datasets. Cross-domain fine-tuning with custom rotation-aware augmentation achieves 94.8% AP50 on PMOF and 96.5% AP50 on an unseen overhead fisheye dataset from a different domain. Our results highlight the domain gap between static and moving environments and show that incorporating PMOF improves detection performance and advances generalization beyond passenger monitoring to broader fisheye-based person detection tasks. The dataset and code are available at this https URL.

[CV-57] HiLo-Token: Input-Adaptive High-Low Frequency Token Compression for Efficient Image Editing

Link: https://arxiv.org/abs/2606.13898
Authors: Haoran You, Yotam Nitzan, Lingzhi Zhang, Yifan Gong, Mang-Tik Chiu, Connelly Barnes, Yan Kang, Yuqian Zhou, Eli Shechtman, Sohrab Amirghodsi
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 14 pages, 10 figures, Patent filed

Abstract:Creative image editing tools, such as Photoshop’s Remove or Generative Fill buttons, are central to everyday customer use and account for a major share of traffic in Photoshop and Lightroom. However, current generative AI models face significant latency challenges, which become even more pronounced when transitioning from convolution-based U-Nets to Diffusion Transformers (DiTs). In our evaluation on hundreds of representative image editing samples spanning a wide range of mask ratios, the DiT module alone accounts for an average of 73% of the total model latency, even after being distilled from 50 timesteps down to 8 timesteps. To tackle this challenge, we propose HiLo-Token, an input-adaptive token compression framework that allocates more token budget to high-frequency, rich-context regions while assigning fewer tokens to low-frequency areas. Specifically, for the editing region specified by the user mask, we retain all tokens within a dilated mask to preserve strong locality and contextual relevance. Outside the editing region, we introduce a simple yet effective high-frequency token selection strategy based on spatial frequency to capture important local details, while using tokens from a 16x downsampled image to represent low-frequency components and preserve the blurry but global structure. Extensive experiments on production-level evaluation data validate the effectiveness of the proposed method, achieving 3.13x, 2.59x, and 1.67x DiT speedups on A100-80GB for image editing tasks across small, medium, and large mask ratio categories with average ratios of 6.38%, 15.92%, and 35.36%, respectively, without any regression in generation quality.
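
The selection rule described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the names `dilate` and `select_tokens`, the square dilation window, and the precomputed `hf_score` grid (a stand-in for whatever spatial-frequency measure the paper uses) are all assumptions made for illustration. The low-frequency tokens taken from the downsampled image are left out.

```python
def dilate(mask, radius=1):
    """Binary dilation of a 2D 0/1 grid with a square window (assumed shape)."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                for dy in range(-radius, radius + 1):
                    for dx in range(-radius, radius + 1):
                        yy, xx = y + dy, x + dx
                        if 0 <= yy < h and 0 <= xx < w:
                            out[yy][xx] = 1
    return out


def select_tokens(mask, hf_score, keep_outside, radius=1):
    """Keep every token inside the dilated mask, plus the `keep_outside`
    tokens with the highest (hypothetical) high-frequency score elsewhere."""
    dil = dilate(mask, radius)
    h, w = len(mask), len(mask[0])
    inside = [(y, x) for y in range(h) for x in range(w) if dil[y][x]]
    outside = [(y, x) for y in range(h) for x in range(w) if not dil[y][x]]
    outside.sort(key=lambda p: hf_score[p[0]][p[1]], reverse=True)
    return sorted(inside + outside[:keep_outside])


# 4x4 token grid: the user mask covers only the top-left token.
mask = [[1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
hf = [[0, 0, 0, 0], [0, 0, 1, 2], [0, 3, 4, 8], [0, 5, 6, 9]]
kept = select_tokens(mask, hf, keep_outside=2)
print(kept)
```

In this toy grid the dilated mask contributes four tokens and the two highest-scoring outside tokens are added, so 6 of 16 tokens are kept at full resolution.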

[CV-58] How do Self-Supervised Remote Sensing Vision Models Transfer to Downstream Tasks?

Link: https://arxiv.org/abs/2606.13896
Authors: Julia Romero, Qin Lv, Morteza Karimzadeh
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Self-supervised geospatial foundation models (GeoFMs) learn transferable representations from remote sensing data, but their downstream behavior is difficult to characterize. We study six representative GeoFMs spanning joint-embedding, reconstruction, and multimodal pretraining families, and evaluate transfer across classification, regression, and segmentation benchmarks under different label availability and downstream pipelines. We find that model rankings change across tasks and adaptation settings. Layerwise probing shows that, in most cases, task-relevant information is more accessible in intermediate transformer blocks compared to final-layer embeddings, and that GeoFMs exhibit distinct depthwise profiles. In segmentation case studies on PASTIS and Sen1Floods11, downstream adaptation settings such as decoder design and fine-tuning can be as impactful as the choice of GeoFM, and standard dense-prediction heads may be poorly aligned with how GeoFMs organize information over depth. Finally, CKA analysis on case studies shows that fine-tuning does not rewrite GeoFMs uniformly across depth, and the strongest changes are localized to the first linear layer of the MLP in ViT blocks. These results help explain why GeoFM rankings shift across benchmarks and motivate more representation-aware evaluation and adaptation strategies.
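
For readers unfamiliar with CKA, the sketch below computes the linear variant between two activation matrices (rows are samples, columns are features). The abstract does not say which CKA kernel the authors use, so the linear form is an assumption, and the helper names are hypothetical.

```python
import math


def center(m):
    """Subtract the per-column mean from a row-major matrix."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]


def cross_fro_sq(a, b):
    """Squared Frobenius norm of a^T b for row-major a (n x p) and b (n x q)."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            d = sum(x * y for x, y in zip(col_a, col_b))
            total += d * d
    return total


def linear_cka(x, y):
    """Linear CKA: ||Yc^T Xc||_F^2 / (||Xc^T Xc||_F * ||Yc^T Yc||_F)."""
    xc, yc = center(x), center(y)
    denom = math.sqrt(cross_fro_sq(xc, xc) * cross_fro_sq(yc, yc))
    return cross_fro_sq(xc, yc) / denom


acts = [[1.0, 2.0], [3.0, 4.0], [5.0, 7.0]]
scaled = [[2.0 * v for v in row] for row in acts]
same = linear_cka(acts, acts)        # identical representations
rescaled = linear_cka(acts, scaled)  # invariant to isotropic scaling
unrelated = linear_cka([[1.0], [-1.0], [0.0], [0.0]],
                       [[0.0], [0.0], [1.0], [-1.0]])
print(same, rescaled, unrelated)
```

Comparing a layer's activations before and after fine-tuning with such a score is one way to see which depths change most.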

[CV-59] PhysVLA: Towards Physically-Grounded VLA for Embodied Robotic Manipulation

Link: https://arxiv.org/abs/2606.13886
Authors: Namai Chandra, Shriram Damodaran, Lin Wang
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 9 pages, 5 figures, supplementary material included

Abstract:Vision-Language-Action (VLA) models excel at mapping visual inputs and natural language instructions directly to robotic control policies. However, because they are trained primarily to fit behavioural demonstration data, they do not explicitly enforce fundamental physical principles such as rigid-body dynamics or contact constraints. This exposes a critical physics gap: standard temporal smoothing applied on top of single-step or chunked VLAs trades trajectory quality for added failures that short-term memory cannot resolve. To bridge this gap, we introduce PhysVLA (Physics-VLA), a plug-and-play, inference-time framework designed to wrap any frozen VLA backbone without retraining, fine-tuning, or weight access, with less than 1 ms of overhead per control step. PhysVLA intercepts the predicted control action, captures only the simulator or system state, and applies a dual-layered correction: (i) a phase-aware finite-state machine that structures discrete task segments (approach, grasp, transport, and place), and (ii) a selective Euler-Lagrange gate that activates only when a dynamics oracle detects kinodynamic inconsistency. Evaluated across OpenVLA, OpenVLA-OFT, Force-VLA, and Generalist-VLA on LIBERO-Spatial with a 7-DoF Franka Panda, the framework delivers absolute success rate increases of up to 17% and stability increases of up to 19% with no per-task regressions, improves trajectory efficiency by up to 15% across all four backbones, and shows up to a 10x improvement in trajectory jerk robustness on a Robosuite Lift cross-simulator sweep. We further validate the framework on a real Agilex Piper arm with a pick-and-place task, confirming that PhysVLA transfers to physical hardware without retraining, with success-rate improvements of up to 50%, establishing physical awareness as a composable, backbone-agnostic runtime module.

[CV-60] Avatar V: Scaling Video-Reference Avatar Video Generation

Link: https://arxiv.org/abs/2606.13872
Authors: Benjamin Liang, Ce Chen, Desmond Lin, Ivan Somov, Jiajun Zhao, Jiewei Yuan, Jingfeng Zhang, Junhao Huang, Nik Nolte, Pedram Haqiqi, Penghan Wang, Rong Yan, Rui Zhang, Sam Prokopchuk, Sivan Wang, Viktor Goriachko, Yi Ren, Yuanming Li, Yutao Chen, Zhenhui Ye, Zhibin Hong, Zilong Nie, Zujin Guo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 31 pages, 15 figures. All contributors are listed in alphabetical order by first name

Abstract:Generating avatar videos that are not merely visually similar to a target individual but behaviorally recognizable, faithfully reproducing their talking rhythm, gestural tendencies, and expression dynamics, remains an open challenge. Existing methods predominantly condition on single static images, which provide insufficient identity information and cannot capture dynamic motion traits, while standard pixel-level objectives underserve the perceptually critical facial regions that determine avatar fidelity. We present Avatar V, a production-scale framework that addresses these limitations through video-reference-conditioned identity modeling. Rather than compressing identity into fixed-size embeddings, the model conditions directly on the full token sequence of a reference video, learning to reproduce both static identity attributes (facial geometry, skin texture) and dynamic behavioral patterns (talking rhythm, micro-expressions) through attention over the reference context. We introduce Sparse Reference Attention, an asymmetric mechanism achieving linear-complexity conditioning on arbitrarily long references; a motion representation stream enabling closed-loop talking style transfer; and an identity-aware super-resolution refiner inheriting the full reference conditioning. These are supported by a data engine curating 100M+ training clips from 50M raw videos, and a five-stage training pipeline with flow matching pre-training, personality fine-tuning, two-phase distillation (10x acceleration), and RLHF alignment, deployed across thousands of GPUs. Avatar V generates 1080p videos of unlimited duration, achieving state-of-the-art identity preservation, lip synchronization, and generation quality on our cross-scene benchmark, consistently outperforming leading systems including Seedance 2.0, Kling O3 Pro, Veo 3.1, and OmniHuman 1.5 in both automated metrics and human evaluation.

[CV-61] Mirage Probes: How Vision Models Fake Visual Understanding

Link: https://arxiv.org/abs/2606.13870
Authors: Daniel Ben-Levi, Judah Goldfeder, Weiliang Zhao, Raz Lapid, Amit LeVi, Allen G. Roush, Ravid Shwartz-Ziv, Hod Lipson
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Vision-language models (VLMs) can answer image-based questions confidently, and often correctly, even when no image is provided. This mirage behavior inflates benchmark scores without reflecting visual grounding. Prior work treats this as a single failure mode. We argue it is two. Using Mirage Probes, a contrastive probing framework that pairs paraphrased question variants with matched mirage and non-mirage labels on the same image, we show that mirage behavior is linearly decodable from internal activations across residual stream, MLP, post-attention, and attention-head sites in two open-source VLMs. We demonstrate that a Naive Bayes text baseline cannot recover this signal, ruling out surface lexical confounds. Cross-benchmark separability patterns, together with a novel Prior Harnessing Index (PHI) measuring how much a model can answer from text alone, expose two distinct regimes: textual biases, where the model answers from language priors without engaging visual representations, and spurious images, where it constructs false visual content in latent space and answers as if grounded. The distinction has direct mitigation consequences: text-distribution cleaning can address the first regime but cannot reach the second, since spurious-image mirages live in the model’s visual representations rather than its text. Faithful visual grounding will require interventions at the representational level.

[CV-62] Temporal Backtracking Search for Test-time Generative Video Reasoning

Link: https://arxiv.org/abs/2606.13861
Authors: Sejoon Jun, Zheng Ding, Huangyuan Su, Weirui Ye, Yilun Du
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:While test-time scaling has revolutionized reasoning in large language models, generative video reasoning remains bottlenecked by a single-shot paradigm. We demonstrate that searching over denoising steps cannot rescue logically flawed rollouts because spatial trajectories commit early in the diffusion process. Root-level Best-of-N (BoN) sampling is similarly inefficient: reasoning errors cluster early in the temporal axis, and resampling blindly discards verified upstream progress. To unlock effective test-time scaling for video models, we introduce Temporal Backtracking Search (TBS), which shifts the search space to the temporal axis. TBS transforms video generation into an iterative generate-verify-restart loop via three core mechanisms: (1) variable-K conditioning to resume generation from arbitrary clean prefixes; (2) temporal process verification to localize failures and extract valid restart anchors; and (3) prefix-based search to reallocate compute toward extending correct trajectories rather than root resampling. Across algorithmic, navigation, and robotics domains, TBS Pareto-dominates matched-budget BoN. In a strict out-of-distribution setting where one-shot generation collapses (0.7% for BoN), TBS achieves 22.7%, with every solved episode stemming from a restarted branch. Ultimately, TBS reveals that the local reasoning competence of video models far exceeds what single-shot rollouts indicate, providing a scalable test-time framework to unlock it.
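
The generate-verify-restart loop can be pictured with a toy example. Everything here is a stand-in: a "video" is a list of integers, a frame is valid when it equals the previous frame plus one, and `generate`, `first_failure`, and `temporal_backtracking` are hypothetical names rather than the paper's API. The point is only that a restart keeps the verified prefix instead of resampling from the root.

```python
import random


def generate(prefix, length, rng, p_err=0.3):
    """Toy generator: extend `prefix` to `length` frames; each new frame is
    previous + 1, or a wrong value (previous + 2) with probability p_err."""
    frames = list(prefix)
    while len(frames) < length:
        step = 1 if rng.random() > p_err else 2
        frames.append(frames[-1] + step)
    return frames


def first_failure(frames):
    """Toy process verifier: index of the first invalid frame, or None."""
    for t in range(1, len(frames)):
        if frames[t] != frames[t - 1] + 1:
            return t
    return None


def temporal_backtracking(length, rng, max_restarts=100):
    """Regenerate only from the last verified prefix after each failure."""
    frames = generate([0], length, rng)
    for _ in range(max_restarts):
        t = first_failure(frames)
        if t is None:
            return frames
        frames = generate(frames[:t], length, rng)  # keep frames[:t] as the anchor
    return frames


result = temporal_backtracking(8, random.Random(0))
print(result)
```

Root-level Best-of-N would instead call `generate([0], ...)` each time, discarding the verified frames on every attempt.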

[CV-63] Multi-Agent Embodied Autonomous Driving: From V2X Information Exchange to Shared World Models

Link: https://arxiv.org/abs/2606.13840
Authors: Senkang Hu, Zhengru Fang, Yihang Tao, Zihan Fang, Sam Tak Wu Kwong, Yuguang Fang
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Autonomous driving is shifting from isolated vehicle intelligence toward multi-agent embodied systems that share perception, infer intent, and coordinate action under uncertainty. This survey examines this transition through the lens of Shared World Models (SWMs): predictive cross-agent representations maintained across vehicles, infrastructure, and other traffic participants. We review more than 380 publications spanning vehicle-to-everything (V2X) communication, collaborative perception, inter-agent cognition, cooperative planning, end-to-end cooperative driving, and simulation and data engines for closed-loop validation. The organizing question is how exchanged observations become aligned state, intent-aware interaction, and coordinated downstream action. Across the surveyed literature, evaluation remains concentrated in simulation, curated benchmarks, and offline protocols. Foundation-model-based coordination also lacks verified real-time safety guarantees in open traffic. These gaps motivate key research priorities for multi-agent embodied autonomous driving (MAEAD): verifiable shared-state maintenance, robust intent and plan alignment, and safe coordinated action under communication, latency, and deployment constraints.

[CV-64] Explaining RhythmFormer: A Systematic XAI Analysis of Periodic Sparse Attention for Remote Photoplethysmography

Link: https://arxiv.org/abs/2606.13839
Authors: Louis Chen, Torbjörn E. M. Nordling
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments: 26 pages, 8 figures

Abstract:Remote photoplethysmography (rPPG) transformers achieve low heart-rate error on benchmarks, yet their decisions remain opaque, a growing concern as rPPG moves toward clinical heart rate estimation. Existing rPPG XAI is dominated by qualitative heatmap inspection without quantitative faithfulness metrics or physiology-grounded validation, leaving a gap between visual plausibility and auditable evidence. We address this gap. First, we adapt four attribution methods (raw attention, rollout, flow, Beyond Intuition) to RhythmFormer’s bi-level routing attention with top-k selection. Second, we introduce a skin coverage metric quantifying how much attribution mass falls on skin regions. Third, we adapt the SaCo faithfulness coefficient from its original classification setting to rPPG regression by using the MAE between original and perturbed predicted rPPG waveforms as the perturbation impact. Applying these tools, we quantify a multi-hop leakage effect under sparse top-k routing: attention rollout and flow almost completely restore the connections that individual refined-attention layers explicitly set to zero. Beyond Intuition mitigates this via its value-projection-weighted rollout and gradient-supported mask, attaining the highest median refined skin coverage (0.83 vs. 0.57 for vanilla rollout) and faithfulness (F = 0.92) among the evaluated methods on UBFC-rPPG. Validation across diverse datasets and model variants is needed. A case study on a low-SaCo outlier further shows all four methods recovering consistently once an artefactual region is replaced, suggesting consistent SaCo behavior across attribution families in this illustrative case. Together, these metrics move XAI for rPPG toward auditable numerical evidence about spatial alignment and perturbation faithfulness, i.e., trustworthy rPPG XAI.
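
The multi-hop leakage effect can be reproduced on a three-token toy model. The sketch uses the standard attention-rollout recipe (mix each layer's attention with the identity, then multiply across layers). The two sparse attention matrices are made up, and `skin_coverage` is one plausible reading of the metric (share of attribution mass on skin tokens), not the paper's exact definition.

```python
def matmul(a, b):
    """Product of two square row-major matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def rollout(layers):
    """Attention rollout: product of 0.5*A + 0.5*I, first layer applied first."""
    n = len(layers[0])
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for attn in layers:
        mixed = [[0.5 * attn[i][j] + (0.5 if i == j else 0.0) for j in range(n)]
                 for i in range(n)]
        result = matmul(mixed, result)
    return result


def skin_coverage(attribution, skin_mask):
    """Assumed metric: fraction of attribution mass on tokens flagged as skin."""
    total = sum(attribution)
    return sum(a for a, s in zip(attribution, skin_mask) if s) / total


# Sparse routing: no layer lets token 0 attend to token 2 directly.
layer1 = [[1, 0, 0], [0, 0, 1], [0, 0, 1]]  # token 1 attends to token 2
layer2 = [[0, 1, 0], [0, 1, 0], [0, 0, 1]]  # token 0 attends to token 1
rolled = rollout([layer1, layer2])
leak = rolled[0][2]
coverage = skin_coverage(rolled[0], [1, 1, 0])
print(leak, coverage)
```

Although both layers set the 0-to-2 weight to zero, the rolled-out map assigns it 0.25 through the path 0 to 1 to 2, which in turn lowers the share of attribution on the tokens marked as skin to 0.75.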

[CV-65] Compressing Image Style Training into a Single Model Forward

Link: https://arxiv.org/abs/2606.13809
Authors: Zhongjie Duan, Yingda Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 9 figures

Abstract:Diffusion-based style transfer must balance inference efficiency with stylization fidelity. Adapter-based methods are efficient, but they inject style as an external condition and can either weaken reference-specific appearance or copy reference semantics into the generated image. Optimization-based personalization methods such as LoRA internalize style more effectively, but require a separate training process for every new style. We introduce i2L (image-to-LoRA), a framework that amortizes style LoRA training into a single forward pass. Given one or more reference images, i2L predicts LoRA weights for a text-to-image model, enabling immediate style instantiation without per-style optimization. The architecture combines an image encoder, learnable LoRA queries, and compressed decoding heads that generate adapted matrices. Training on semantically diverse style pairs encourages the predictor to preserve appearance cues while suppressing reference-content copying. Experiments on Z-Image, FLUX.2, and Hidream-O1 show that i2L improves style fidelity, prompt alignment, and perceptual quality over existing baselines. Because i2L produces explicit LoRA weights, it also supports asymmetric classifier-free guidance, multi-reference style fusion, and composition with controllable-generation modules.

[CV-66] μ_0: A Scalable 3D Interaction-Trace World Model

Link: https://arxiv.org/abs/2606.13769
Authors: Seungjae Lee, Yoonkyo Jung, Jusuk Lee, Jonghun Shin, Amir Hossein Shahidzadeh, Yao-Chih Lee, H. Jin Kim, Jia-Bin Huang, Furong Huang
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:World models that capture how actions induce physical change enable scalable robot learning without reliance on embodiment-specific action labels. Pixel-space video models provide broad visual priors but expend model capacity on dense appearance reconstruction, while direct action models require embodiment-specific labels that hinder scalability. We present $\mu_0$, a scalable world model based on 3D traces. Rather than predicting dense pixels or directly modeling actions, $\mu_0$ forecasts smooth 3D trajectories for salient interaction points such as objects, tools, hands, and contact regions, yielding a compact, embodiment-agnostic motion interface. To enable training from diverse video sources, our TraceExtract system automatically extracts 3D supervision by selecting keypoints, constructing globally aligned traces, and associating motion segments with hierarchical language captions. This TraceExtract supervision pretrains $\mu_0$ by combining a pretrained vision-language backbone with a modular trace expert, which represents each query via B-spline control points and predicts future traces. Experiments show that $\mu_0$ outperforms baselines in both 2D and 3D trace prediction, including trace prediction models and tokenized VLM methods. Because $\mu_0$ is frozen and reusable, it can be paired with action experts for downstream robot embodiments. Despite action-free pretraining, the resulting trace-conditioned policies achieve performance competitive with VLA models pretrained with action supervision, such as $\pi_0$. These results establish 3D traces as a scalable and transferable representation for cross-embodiment manipulation.

[CV-67] CineOrchestra: Unified Entity-Centric Conditioning for Cinematic Video Generation

Link: https://arxiv.org/abs/2606.13768
Authors: Sharath Girish, Tsai-Shien Chen, Zhikang Dong, Mukesh Singhal, Hao Chen, Sergey Tulyakov, Aliaksandr Siarohin
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project page: this https URL

Abstract:Cinematic video depicts multiple subjects acting or interacting at specific moments, captured with deliberate camera movement, and stitched together by shot transitions. Together, these elements demand a level of fine-grained control beyond current text-to-video models. Existing work addresses each axis in isolation: multi-subject personalization, temporal control, multi-shot synthesis, or camera control; no prior framework jointly integrates all four. We present CineOrchestra, a unified video diffusion model that controls subjects, events, cameras, and shot transitions simultaneously. Our key insight is that these heterogeneous cinematic elements share a fundamental structure: each is an entity acting over a specific temporal interval, which can therefore all be expressed through one shared structure of entity-centric conditioning primitives, augmented with reference images for visual entities. This formulation reduces the architectural challenge to a single positional encoding problem, which we solve with two parameter-free coordinated rotary embeddings: (a) an interval-sampled temporal RoPE that yields consistent attention behavior across events of dramatically varying duration, and (b) a 2D entity-temporal cross-attention RoPE that disambiguates per-entity conditions and routes each to its corresponding spatiotemporal region. On two new benchmarks, CineOrchestra outperforms six per-axis specialists on dense caption following and shot-transition timing, with consistent gains in a pairwise user study and component ablations.

[CV-68] Connections Between Pairs of Filters Improve the Accuracy of Convolutional Neural Networks

Link: https://arxiv.org/abs/2606.13736
Authors: Kathleen Anderson, Philipp Grüning, Erhardt Barth
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: IJCNN 2023

Abstract:While researchers continue to find new and improved network structures for CNNs, most of the newly invented architectures still rely on the traditional pattern of stacking convolutional blocks and separating them with pointwise activation functions. However, there are drawbacks to a network purely building on pointwise nonlinearities. One alternative is to introduce a pairwise connection between two filters of a network. Typical connection functions use multiplications or the minimum operation to realize logical AND connections. In this paper, we go one step further by demonstrating that CNNs can benefit from more general connections, which include parameters that are learned. With such parameters, the network is able to implement different connections in different network layers and better adapt the connection function to the task at hand.

[CV-69] Morphology-Aware Sample Assignment: Overcoming IoU Insensitivity for Surface Defect Detection

Link: https://arxiv.org/abs/2606.13723
Authors: Pengfei Liu, Yuhan Guo
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Intersection-over-Union (IoU), as a pivotal metric for evaluating the spatial alignment between candidate proposals and ground-truth annotations, directly determines the quality of positive sample sets and the training efficacy of visual detection models. Through theoretical modeling and analysis, we uncover a non-sensitive region on the IoU response curve, within which samples yield nearly identical IoU scores despite distinct geometric overlaps. To overcome this limitation, we introduce a set of morphological similarity metrics covering area, shape, and aspect ratio, to refine the positive sample assignment process, thereby ensuring more discriminative and reliable matching. A supplementary matching score is derived via mean-based aggregation of these multidimensional similarities, compensating for the intrinsic limitation of IoU in representing structural correspondence. Theoretically, incorporating morphological similarity reshapes the response distribution of the matching function, yielding both effective directional gradients and polygon-like iso-response contours, which tightly confine high-response regions around each ground-truth instance and substantially enhance the precision of positive sample selection. Experiments based on the YOLOv9 framework demonstrate consistent performance gains on both NEU-DET and GC10-DET datasets. Notably, the proposed approach is fully plug-and-play and incurs zero additional inference overhead, thereby ensuring deployment efficiency for industrial visual inspection.
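
A small numeric example shows why IoU alone can fail to separate candidates. The three similarity terms below (area, shape, aspect ratio, each a min/max ratio) and their mean aggregation are illustrative guesses at the kind of metric the abstract describes. The paper's actual definitions may differ, and the function names are hypothetical.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def ratio(x, y):
    """Symmetric similarity in (0, 1] for two positive quantities."""
    return min(x, y) / max(x, y)


def morph_similarity(a, b):
    """Mean of assumed area, shape, and aspect-ratio similarities."""
    wa, ha = a[2] - a[0], a[3] - a[1]
    wb, hb = b[2] - b[0], b[3] - b[1]
    area_sim = ratio(wa * ha, wb * hb)
    shape_sim = ratio(wa, wb) * ratio(ha, hb)
    aspect_sim = ratio(wa / ha, wb / hb)
    return (area_sim + shape_sim + aspect_sim) / 3


gt = (0, 0, 6, 6)
shifted = (2, 0, 8, 6)   # same size and shape as gt, offset to the right
squashed = (0, 0, 6, 3)  # half the height of gt
scores = [(iou(gt, c), morph_similarity(gt, c)) for c in (shifted, squashed)]
print(scores)
```

Both candidates have IoU 0.5 with the ground truth, yet the morphological score is 1.0 for the shifted box and 0.5 for the squashed one, so a combined matching score can rank them differently.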

[CV-70] TSA: Temporal Slot Activation for Persistent Object-Centric Video Representation

Link: https://arxiv.org/abs/2606.13714
Authors: Duc Nguyen, Sieu Tran, Hao Vo, Khoa Vo, Duy Minh Ho Nguyen, Nghi D. Q. Bui, Anh Nguyen, Long Mai, Ngan Le
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Unsupervised video object-centric learning aims to decompose dynamic scenes into temporally persistent entity representations. Existing recurrent video slot-attention methods propagate a fixed set of slots across frames, but typically assume unconditional slot propagation: every slot is updated and decoded at every frame, regardless of whether its corresponding object is visible. We show that this design violates a basic lifecycle requirement for persistent slots: when an object is absent or fully occluded, its slot should preserve its previous state and avoid explaining unrelated visible content. Instead, unconditional propagation creates two failure pathways: update-induced state drift, where current-frame evidence overwrites the absent object’s representation, and decoder-induced reconstruction interference, where the inactive slot remains coupled to reconstruction through decoder attention. We propose Temporal Slot Activation (TSA), a mechanism that learns a per-slot, per-frame activation score $\alpha_{k,t} \in (0, 1)$ without visibility supervision. TSA uses this activation as a shared latent control variable for slot lifecycle modeling. When a slot is inactive, TSA anchors its state to the previous slot via activation-gated updating and suppresses its decoder participation through an activation-dependent additive bias on attention logits before softmax normalization. This jointly reduces state drift and reconstruction-driven interference. To improve decisions under partial occlusion and gradual reappearance, TSA further conditions activation prediction on a per-slot temporal memory produced by a Temporal Context Encoder. We evaluate TSA on MOVi-C/E, YT-VIS, and OVIS benchmarks using both standard and tracking-based metrics (FG-ARI, mBO, IDF1, HOTA). TSA consistently improves object decomposition and temporal identity preservation, with large gains on long, heavily occluded videos.
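
The two uses of the activation score can be sketched in a few lines. This is a guess at the general shape, not the paper's formulation: the convex-combination update and the choice of log(alpha) as the additive bias are assumptions, and the function names are hypothetical.

```python
import math


def gated_update(prev, cand, alpha):
    """Activation-gated state update: alpha near 0 keeps the previous slot
    state, alpha near 1 accepts the candidate update (assumed convex mix)."""
    return [alpha * c + (1.0 - alpha) * p for p, c in zip(prev, cand)]


def biased_softmax(logits, alphas, eps=1e-8):
    """Softmax over slots after adding log(alpha) to each slot's logit
    (assumed form of the activation-dependent additive bias)."""
    biased = [l + math.log(max(a, eps)) for l, a in zip(logits, alphas)]
    m = max(biased)
    exps = [math.exp(b - m) for b in biased]
    z = sum(exps)
    return [e / z for e in exps]


# An occluded slot (alpha = 0.01) barely moves from its previous state...
state = gated_update(prev=[1.0, 2.0], cand=[5.0, 6.0], alpha=0.01)
# ...and receives almost no decoder attention despite equal raw logits.
weights = biased_softmax(logits=[0.0, 0.0], alphas=[0.99, 0.01])
print(state, weights)
```

With a log bias, equal raw logits yield attention weights proportional to the activations, so an inactive slot is suppressed smoothly rather than masked out.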

[CV-71] Trimodal Glioma Representation Alignment via Volumetric Contrastive Learning

Link: https://arxiv.org/abs/2606.14568
Authors: Denise Marini, Eleonora Grassucci, Danilo Comminiello
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Glioma grading and survival prediction require the integration of heterogeneous information collected at different spatial and biological scales. Histopathology describes tissue morphology, mRNA expression captures molecular activity, and magnetic resonance imaging provides a non-invasive view of tumor extent and radiological heterogeneity. Existing glioma prognosis models often combine only two of these sources, while their alignment objectives remain mostly pairwise. This paper introduces GLORIA, a novel trimodal framework for GLioma Omics - Radiology - hIstopathology Alignment. GLORIA processes whole-slide image regions, gene-expression profiles, and 3D MRI volumes through modality-specific encoders, projects them into a shared latent space, and aligns them with a Gramian contrastive loss that measures the volume spanned by the three modality embeddings. The aligned representations are fused through a cross-modal gating module and optimized jointly for three-class glioma grading and overall survival prediction. We evaluate GLORIA on a matched TCGA-GBM/LGG and BraTS21 cohort, comprising 132 patients with all three modalities. On the shared trimodal test set, GLORIA improves over the bimodal WSI-mRNA baseline in all the metrics considered.
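
The volume spanned by three embeddings follows from their Gram matrix: for vectors e1, e2, e3, the parallelotope volume is the square root of det(G) with G[i][j] = <ei, ej>. The sketch below computes that quantity for unit-normalized vectors. How the paper turns it into a contrastive loss is not stated in the abstract, and the function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    n = math.sqrt(dot(v, v))
    return [a / n for a in v]


def gram_volume(a, b, c):
    """Volume of the parallelotope spanned by three unit-normalized vectors,
    computed as sqrt(det(G)) with G the 3x3 Gram matrix."""
    e = [normalize(a), normalize(b), normalize(c)]
    g = [[dot(x, y) for y in e] for x in e]
    det = (g[0][0] * (g[1][1] * g[2][2] - g[1][2] * g[2][1])
           - g[0][1] * (g[1][0] * g[2][2] - g[1][2] * g[2][0])
           + g[0][2] * (g[1][0] * g[2][1] - g[1][1] * g[2][0]))
    return math.sqrt(max(det, 0.0))  # clamp tiny negative rounding errors


misaligned = gram_volume([1, 0, 0], [0, 2, 0], [0, 0, 3])  # orthogonal embeddings
aligned = gram_volume([2, 0, 0], [3, 0, 0], [5, 0, 0])     # same direction
print(misaligned, aligned)
```

The volume is 1 for mutually orthogonal embeddings and 0 when all three point the same way, so a small volume indicates that the three modalities agree.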

[CV-72] Spectrum Aware Illumination Estimation Using Multispectral Image

Link: https://arxiv.org/abs/2606.14248
Authors: Hyejin Oh, Woo-Shik Kim, Sangyoon Lee, YungKyung Park, Je-Won Kang
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication in IEEE Transactions on Circuits and Systems for Video Technology (TCSVT). DOI: https://doi.org/10.1109/TCSVT.2026.3701975

Abstract:Multispectral (MS) imaging extends beyond conventional RGB imaging by capturing more spectral bands, thereby improving illuminant spectrum estimation (ISE). However, existing methods often fail to fully exploit spectral information, resulting in suboptimal performance under diverse lighting conditions and across different sensor domains. Hence, we propose a deep learning framework with a spatio-spectral feature extraction block, which incorporates spectral attention mechanisms to enhance spectral correlation and preserve illuminant-relevant spatial features. Through the inclusion of an illuminant prior (IP), our approach prioritizes specific channels that provide more meaningful information in an MS image. We also propose a spectral-domain transform across different MS sensor spaces. The results demonstrate that illuminant spectra learned in high-dimensional sensor spaces can be effectively transformed to various lower-dimensional camera sensor spaces without any additional training. To facilitate evaluation, we introduce a real-world MS dataset containing high-dimensional ground-truth illumination spectra captured under diverse lighting conditions. Through extensive experiments, we demonstrate that our method achieves superior accuracy compared to existing models, thus providing a practical solution for real-world ISE. The code and dataset are available at this https URL.

[CV-73] High-Fidelity Video Compression based on Invertible Neural Transform and Implicit Conditioning

Link: https://arxiv.org/abs/2606.13957
Authors: Siyue Teng, Ho Man Kwan, Yuxuan Jiang, Fan Zhang, David Bull
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:

Abstract:Learning-based video compression has recently achieved competitive rate-distortion performance compared to conventional video codecs. However, most existing methods rely on non-invertible analysis-synthesis transforms, with reconstruction quality subject to both quantization and transform approximation errors. This limitation becomes particularly restrictive at higher quality points, where quantization errors are small and transform-induced distortion dominates. To address this, we propose InnVC, an Invertible neural network based Video Codec for wide-range and high-fidelity compression. The core idea is to preserve an invertible main transform path prior to quantization, while injecting content-adaptive context through a compact implicit conditioning field. This decouples strongly correlated video content from harder-to-model fine details, allowing different components to specialize in complementary reconstruction tasks for more efficient compression. To further improve compressibility, we introduce a scheduled masking strategy that progressively concentrates informative content into fewer latent channels for more effective entropy coding. Experiments on the UVG and MCL-JCV benchmarks show that InnVC achieves strong compression performance over a broad quality range, being particularly effective in the high-quality regime, yielding BD-rate reductions of 21.66% in PSNR and 46.06% in MS-SSIM relative to x265 on UVG. To the best of our knowledge, InnVC is the first neural video codec to cover operating points from low bitrate to high fidelity within a single architecture scale, spanning more than 20 dB in PSNR.

[CV-74] GMN4AD: Graph Matching Network for Alzheimers Disease Diagnosis with Test-Time Domain Adaptation using Multi-centered Structure Magnetic Resonance Imaging

Link: https://arxiv.org/abs/2606.13919
Authors: Chen Zhao, Huan Huang, Yixin Xie, Jiajing Huang, Weihua Zhou, Nandakumar Narayanan
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Alzheimer’s Disease (AD) is a progressive neurodegenerative disorder that affects millions of older adults, with prevalence expected to rise significantly in the coming years. Early diagnosis, particularly during the mild cognitive impairment (MCI) stage, is critical for timely intervention. Structural Magnetic Resonance Imaging (sMRI) has emerged as a key modality for detecting AD-related brain changes, but traditional graph-based approaches often struggle with modality and inter-site heterogeneity, limiting diagnostic performance. In this paper, we propose Graph Matching Network for Alzheimer’s Disease Diagnosis (GMN4AD), designed to model interactions between heterogeneous brain graphs derived from neuroimaging data. Unlike conventional methods that treat each brain graph independently, GMN4AD leverages graph matching to capture cross-graph relationships, enhancing diagnostic precision. Furthermore, we introduce a test-time domain adaptation strategy that combines contrastive learning to mitigate domain shifts during inference. Extensive experiments on three public AD datasets demonstrate that GMN4AD achieves superior performance compared to state-of-the-art methods, offering a robust and generalizable solution for AD diagnosis.

[CV-75] C-MambaPose: A Physics-Informed Complex Mamba Framework for Cross-Environment WiFi Human Pose Estimation

Link: https://arxiv.org/abs/2606.13700
Authors: Phuc Nguyen H
Categories: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Human pose estimation (HPE) utilizing wireless WiFi signals has emerged as a promising technology owing to its device-free nature, privacy preservation, and robustness against occlusion and poor lighting. However, existing methods often overlook the physical complex phase information of WiFi signals and fail to generalize across diverse environments due to severe domain shifts. In this paper, we present C-MambaPose, a physics-informed complex-valued Mamba-GraFormer hybrid framework for robust cross-environment WiFi-based 3D HPE. Our framework first sanitizes raw WiFi Channel State Information (CSI) phase errors and constructs a phase-preserving complex-valued representation. We then employ a Spatiotemporal Complex Mamba encoder with a dynamic selective receptive field to capture fine-grained phase dynamics. A cross-attention joint-query mapper maps the unstructured sequence tokens to human joints, which are decoded by a Graph Convolutional Network (GCN) to predict anatomically coherent 3D coordinates. Extensive evaluations on the MM-Fi dataset show that C-MambaPose achieves competitive or superior performance to state-of-the-art baselines across all settings, setting a new state-of-the-art specifically on the challenging cross-environment split, requiring only 3.78 M parameters, an 83.1% reduction compared to GraphPose-Fi \cite{chen2026graph} and an 85.7% reduction compared to MetaFi++ \cite{zhou2023metafi++}, while maintaining a comparable size to DT-Pose \cite{chen2025towards} (which is only 18% smaller) but achieving significantly superior performance without requiring any pretraining. Our code is publicly available at this https URL.

Artificial Intelligence

[AI-0] Listening with Attention: Entropy-Guided Explainability for Transformer-Based Audio Models INTERSPEECH2026

Link: https://arxiv.org/abs/2606.14647
Authors: Ravi Ranjan,Utkarsh Grover,Xiaomin Lin,Agoritsa Polyzou
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: 17 pages, 3 figures, and 9 tables. Accepted in Interspeech 2026 conference

Abstract:Transformer-based automatic speech recognition (ASR) models such as Whisper are highly accurate, but their predictions remain difficult to interpret. Existing explainable AI (XAI) methods often lack faithfulness and precise temporal grounding. We propose Listening with Entropy-guided Attention for Faithful explainability (LEAF-X), a model-intrinsic XAI framework for transformer-based ASR. LEAF-X combines entropy-guided attention weighting, multi-layer attention rollout, and optional causal ablations to identify low-entropy, high-impact heads and layers, producing sparse token-to-frame attributions. Unlike perturbation-based explainers or raw attention maps, LEAF-X exploits the internal structure of encoder-decoder and speech-augmented decoder-only models to generate explanations that better reflect model computation. Results show 32% improved faithfulness, 35-39% stronger locality/sparsity, and the most stable attributions, supporting more transparent and auditable ASR.

[AI-1] From Self-Supervised Speech Models to Mixture-of-Experts for Robust Anti-Spoofing

Link: https://arxiv.org/abs/2606.14639
Authors: Hugo Daumain,Driss Matrouf,Khaled Khelif,Mickael Rouvier
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: 8 pages, 3 figures, accepted at Odyssey 2026 (The Speaker and Language Recognition Workshop)

Abstract:Recent advances in speech generation have significantly improved the naturalness of synthetic speech, making spoofing detection increasingly challenging. A key limitation of current anti-spoofing systems is their limited robustness to unseen synthesis methods. In this work, we transform a self-supervised speech representation model into a Mixture-of-Experts (MoE) architecture to improve generalization. Feed-forward blocks in selected encoder layers are replaced by multiple expert networks controlled by a layer-wise gating mechanism, allowing experts to capture complementary acoustic patterns while preserving the representations learned during self-supervised pretraining. We further analyze the architectural choices affecting the performance of this MoE conversion and investigate the activation behavior of the experts. The proposed approach is evaluated on 14 spoofing datasets and reduces the macro EER from 5.46% to 4.81%, corresponding to 11.9% relative improvement over the baseline.
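The relative improvement quoted in this abstract follows from the two macro EER values it reports. The short check below recomputes it; the function name `relative_reduction` is illustrative and does not come from the paper.

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Relative reduction of an error metric, as a percentage of the baseline."""
    return (baseline - new) / baseline * 100.0


# Macro EER values reported in the abstract (in percent).
baseline_eer = 5.46
moe_eer = 4.81

improvement = relative_reduction(baseline_eer, moe_eer)
print(f"{improvement:.1f}% relative improvement")  # prints "11.9% relative improvement"
```

The absolute gain is 0.65 percentage points; dividing by the 5.46% baseline gives the 11.9% relative figure.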

[AI-2] When Good Verifiers Go Bad: Self-Improving VLMs Can Regress on New Tasks

Link: https://arxiv.org/abs/2606.14629
Authors: Jianzhe Lin
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 12 pages, 2 figures

Abstract:Verifier-driven self-DPO is a common recipe for self-improving production visual-language models. In this setup, a frozen verifier scores candidate generations, the top- and bottom-scoring candidates form a preference example, and DPO updates the learner. The deployment-time assumption is monotone: a stronger verifier should yield a stronger student. We show that this assumption can fail because verifier quality is highly task-specific. On a four-rung open-source verifier ladder across MathVista, MMMU, and BLINK, the same verifiers that are above-threshold and improve a Qwen-3-VL-2B student on MathVista become sub-threshold on MMMU, where their task-rubric accuracy drops to 8% to 23%. In this regime, every verifier we tested silently regresses the student, producing drops of 3.4 to 10.9 percentage points below the frozen baseline while the DPO training loss continues to decrease. The regression replicates on a second student, Qwen-2.5-VL-3B. Moreover, within the failure regime, damage is confidence-inverted: the more accurate-but-still-wrong verifier causes larger regression than a near-random verifier, suggesting that progress-gated replay amplifies confidently wrong preference pairs. We give a compact mechanistic explanation via a variance theorem for progress-gated replay and its direction-mismatch failure mode. The deployment message is operational rather than purely diagnostic: before running any verifier-driven loop, teams should measure target-task rubric accuracy, rank verifiers by target-task rubric quality rather than parameter count, and treat diminishing returns in above-threshold regimes as a verifier-side compute budget cap.
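The pair-construction step described in this abstract (a frozen verifier scores candidates, and the top- and bottom-scoring ones form a preference example) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: `build_preference_pair` and `toy_verifier` are hypothetical names, and the toy verifier is deliberately misaligned with the task to show how a confidently wrong pair can arise.

```python
from typing import Callable, Dict, List


def build_preference_pair(
    prompt: str,
    candidates: List[str],
    verifier_score: Callable[[str, str], float],
) -> Dict[str, str]:
    """Score every candidate with a frozen verifier; keep the best and the worst."""
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to form a pair")
    ranked = sorted(candidates, key=lambda c: verifier_score(prompt, c))
    return {"prompt": prompt, "chosen": ranked[-1], "rejected": ranked[0]}


def toy_verifier(prompt: str, candidate: str) -> float:
    """Hypothetical verifier that rewards length, which is unrelated to correctness."""
    return float(len(candidate))


pair = build_preference_pair(
    "What is 2 + 2?",
    ["4", "The answer is probably 5, I think."],
    toy_verifier,
)
print(pair["chosen"])    # the longer, incorrect answer
print(pair["rejected"])  # the short, correct answer
```

A DPO update on this pair would push the learner toward the wrong answer while its training loss still decreases, which is consistent with the silent regression the abstract reports when verifier accuracy on the target task is low.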

[AI-3] Moonlight in Latent Space: Chirality and Structural Correspondence Between Beethoven's Op. 27 No. 2 and Machine Learning Mechanisms

Link: https://arxiv.org/abs/2606.14612
Authors: Chen Ying Claude,Zhihan Luo
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments:

Abstract:We show that the three movements of Beethoven’s “Moonlight Sonata” (Op. 27 No. 2) instantiate three distinct machine learning architectures – not by analogy, but by structural correspondence. Through computational analysis of the score (entropy, Jensen-Shannon divergence, dissonance, hand distributional overlap, self-similarity matrices, temporal memory decay, and contextual pitch embeddings), we establish four counterintuitive findings: (1) perceived musical “temperature” is governed by throughput, not distributional width; (2) the lightest movement carries the highest dissonance; (3) the movements implement streaming, recurrent, and periodic positional encoding memory architectures; and (4) the same pitch class acquires different contextual identities across movements, analogous to contextual embeddings in NLP – and unsupervised clustering recovers the tonal structure without music-theoretic input. We construct a reverse sonification (decoding analytical features back into MIDI) and quantify the chirality of the encode-decode cycle: what distributions preserve and sequential ordering destroys. Prompted by a listener’s observation that the decoded piece sounds like “mirror isomers that can’t be superimposed,” the chirality measurement reveals reconstruction loss increasing monotonically with n-gram order. Bootstrap baselines and subsample checks confirm all movements carry sequential information above noise, though raw values are confounded by sample size. Cross-domain comparison shows natural language has higher chirality than music, reflecting stronger sequential constraints.
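Two of the score statistics named in this abstract, entropy and Jensen-Shannon divergence, have standard definitions that can be computed over pitch-class distributions. The sketch below uses made-up pitch-class sequences (0 = C, ..., 11 = B); the paper's actual feature extraction is not described in the abstract, so the data and function names here are assumptions.

```python
import math
from collections import Counter
from typing import Dict, Sequence


def distribution(symbols: Sequence[int]) -> Dict[int, float]:
    """Empirical probability of each symbol in a sequence."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}


def entropy(p: Dict[int, float]) -> float:
    """Shannon entropy in bits."""
    return -sum(v * math.log2(v) for v in p.values() if v > 0)


def js_divergence(p: Dict[int, float], q: Dict[int, float]) -> float:
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint support)."""
    support = set(p) | set(q)
    m = {s: 0.5 * (p.get(s, 0.0) + q.get(s, 0.0)) for s in support}

    def kl(a: Dict[int, float]) -> float:
        return sum(v * math.log2(v / m[s]) for s, v in a.items() if v > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


# Hypothetical pitch-class sequences for two hands.
left_hand = distribution([0, 4, 7, 0, 4, 7])
right_hand = distribution([0, 0, 2, 4, 5, 7])

print(f"left-hand entropy:  {entropy(left_hand):.3f} bits")
print(f"right-hand entropy: {entropy(right_hand):.3f} bits")
print(f"JS divergence:      {js_divergence(left_hand, right_hand):.3f} bits")
```

Both measures depend only on symbol frequencies, not on their order, which matches the abstract's point that distributional features discard sequential information.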

[AI-4] Expert-Driven Survival Machines: Improving Stratification and Interpretability in Multiple Clinical Cohorts

Link: https://arxiv.org/abs/2606.14608
Authors: Farica Zhuang,Zixuan Wen,Christos Davatzikos,Li Shen
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Survival prediction plays a central role for healthcare providers and clinical researchers. Accurate risk stratification enables early intervention and improved patient management. Most existing deep survival models learn one common feature representation for all patients, which may hide important differences between patient subgroups. In contrast, a Mixture-of-Experts (MoE) framework allows different parts of the model to focus on different patient patterns, leading to more individualized representations. Therefore, in this work, we propose a mixture-of-experts enhanced adaptive deep clustering survival framework (AdaCSM) for modeling such heterogeneous survival patterns. We introduce a routing-based expert mechanism that enables conditional specialization within a parametric survival modeling framework. The proposed architecture allocates patients to specialized risk predictors dynamically while preserving the patient survival and subtype clustering objectives. We compare our method with state-of-the-art survival and deep clustering models on multiple real-world longitudinal clinical cohorts spanning diverse disease domains. The proposed method demonstrates improved predictive performance and leads to interpretable results in survival analysis.

[AI-5] A Comparative Study of Deep Learning Architectures for Multi-Horizon Behavioural Forecasting for Mobile Health

Link: https://arxiv.org/abs/2606.14604
Authors: Pavlos Nicolaou,Kleanthis Malialis,Artemis Kontou,Panayiotis Kolios
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Wearable devices and smartphones generate rich behavioural time series that can support proactive health interventions, yet systematic comparisons of modern forecasting architectures for these data are lacking. In particular, it remains unclear how models generalise across populations, how different architectures respond to participant-level fine-tuning and how forecasting accuracy degrades across multi-day horizons. We benchmark six deep learning architectures, two zero-shot Foundation Models (FM) and statistical baselines on three public datasets encompassing over 800 participants, reporting per-feature metrics for step counts, screen time and sleep duration across 1-8 day horizons. We further conduct a per-feature personalisation study across all six architectures and assess FM transferability across dataset sizes and temporal granularities. Our key findings are: (i) no single architecture dominates, PatchTST leads among trained models while the three runners-up (TCN, MLP, Transformer) show no meaningful performance difference; (ii) the FM TimesFM matches or exceeds trained models zero-shot, especially in low-data regimes and (iii) participant-level fine-tuning reduces per-feature RMSE by 16-60%, with sleep benefiting most and step counts least. These results provide practical guidance on architecture selection, FM applicability and personalisation strategies for mobile health forecasting. To the best of our knowledge, this is the first study to jointly evaluate modern deep learning, FMs and personalisation for multi-horizon behavioural forecasting from wearables.

[AI-6] Regulating the Machine Contributor: Governance and Policy Alignment in Open Source

Link: https://arxiv.org/abs/2606.14594
Authors: Jassem Manita,Aziz Amari
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:AI-assisted software development has moved from line-level autocomplete to agents that can plan changes, edit files, and submit pull requests with limited human supervision. Open-source software, however, evolves through a process designed for humans: contributor agreements, codes of conduct, and review norms all assume a legally accountable person who can attest to provenance and answer reviewer questions. Autonomous and semi-autonomous AI contributors strain those assumptions, and the 2025-2026 record of agent-driven incidents, AI-generated nuisance volume, and platform-level shutdowns shows that the gap is operationally consequential. Several open-source organisations have responded with contribution policies, but the result is fragmented, and its alignment with emerging AI governance frameworks (EU AI Act, NIST AI RMF with the UC Berkeley Agentic AI Profile, ISO/IEC 42001 and 23894) is unmapped at the contribution level. We compare policies across six organisations (SymPy, LLVM, matplotlib, OpenInfra, the Apache Software Foundation, and the Linux Foundation) using Most-Similar Systems Design with indicator-based coding and process tracing for SymPy and LLVM. From this we derive a six-dimensional taxonomy (disclosure, responsibility, human oversight, licensing, enforcement, maintainer workload), an ordinal Policy Maturity Score, and a mapping of documented agent incidents onto the dimensions each policy fails to govern. Aligning the dimensions with the regulatory frameworks above identifies overlapping gaps neither side currently closes, and we close by sketching the shape of a harmonised tiered framework and the empirical evaluation needed to calibrate it.

[AI-7] AudioDER: A Deduplication-Enhanced Reasoning Dataset for Post-Training Large Audio-Language Models

Link: https://arxiv.org/abs/2606.14591
Authors: Hui Geng,Yi Su,Han Yin,Tianjiao Wan,Qisheng Xu,Jiaxin Chen,Zijian Gao,Hengzhu Liu,Xie Chen,Kele Xu
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Audio-Language Models (LALMs) have shown strong performance on a wide range of audio understanding tasks, yet they still struggle with complex audio reasoning. A practical way to improve such capabilities is post-training, whose effectiveness critically depends on the quality and diversity of training data. However, existing audio-language datasets often contain substantial redundancy, where many samples are highly similar in acoustic content and thus provide overlapping supervisory signals. Such redundancy not only increases annotation cost, but also limits corpus diversity and reduces the effectiveness of post-training. To address this issue, we propose a redundancy-aware data construction pipeline for building reasoning-oriented supervision for LALMs. Specifically, we first perform acoustic similarity-based deduplication across raw audio datasets to improve corpus diversity. We then integrate existing audio captions and question-answer pairs into a unified multiple-choice format. Based on these unified annotations, we leverage Qwen3-30B to generate chain-of-thought (CoT) rationales for reasoning-oriented supervision. Based on this pipeline, we construct AudioDER, a reasoning-oriented post-training dataset containing approximately 191k samples spanning sound, speech, and music. Each sample consists of an audio clip, a multiple-choice question, four answer candidates, an audio caption, and a CoT rationale. Extensive experiments show that post-training on AudioDER consistently improves the performance of Qwen2-Audio-7B-Instruct on multiple audio reasoning benchmarks, including MMAU-mini, MMSU, and MMAR. We hope AudioDER can serve as a valuable resource for advancing audio reasoning research and the development of more capable LALMs.

[AI-8] When Errors Become Narratives: A Longitudinal Taxonomy of Silent Failures in a Production LLM Agent Runtime

Link: https://arxiv.org/abs/2606.14589
Authors: Wei Wu
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 18 pages, 5 figures, 2 tables. 22 incident postmortems and all defense-framework artifacts publicly available at this https URL; governance engine on PyPI (openclaw-ontology-engine)

Abstract:LLM agent systems increasingly run as long-lived autonomous runtimes: scheduling jobs, calling tools, maintaining memory, and pushing results to humans. We present a longitudinal study of silent failures in one such system: a personal-assistant agent runtime in continuous production since March 2026, with roughly 40 scheduled jobs, 8 LLM providers, a tool-governance proxy, and a knowledge-base memory plane, defended by 4,286 unit tests and 827 governance checks. Over eight weeks we documented 22 incidents with full root-cause postmortems, in which one meta-pattern – a failure whose error signal never reaches a human in actionable form – manifested at least 28 times. We derive a five-class, mechanism-oriented taxonomy: (A) environment and platform quirks, (B) design-assumption mismatches, (C) error swallowing and dilution, (D) chained hallucination and fabrication, (E) operational omission and forensic blind spots. Class D is unique to LLM systems and the most dangerous: the system does not merely fail to report an error – the LLM transforms it into fluent, plausible narrative delivered to the user. We term this fail-plausible: gray failure’s differential observability escalated – the observer is not just blind, it is convincingly lied to by the failure itself. Three findings: about 70% of silent failures were caught by human user-view observation, not tests or audits; a retrospective audit of 15 incidents found 0% ex-ante prevention but 87% regression blocking – audits are regression engines, not prediction engines; incident latency (13 hours to 60 days) tracks failure mechanism, not code complexity – the longest-lived failures lived in the seams between components, where no test runs. We describe the resulting defense framework and distill design principles for agent systems whose failures are loud, attributable, and boring. All postmortems and artifacts are public.

[AI-9] Sensitivity Shaping for Latent Modeling

Link: https://arxiv.org/abs/2606.14585
Authors: Hongzhan Yu,Chenghao Li,Ruipeng Zhang,Henrik Christensen,Sicun Gao
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Generative dynamics models enable planning in challenging robotic systems, but safe deployment requires reliably detecting policy-induced out-of-distribution (OOD) transitions. Existing methods typically treat the learned dynamics as fixed and attach post hoc support surrogates. We show that these surrogates can fail when the dynamics are locally insensitive to critical action choices: unsupported control actions may produce latent predictions that resemble demonstrated transitions, suppressing OOD signals despite large true predictive errors. To address this, we introduce support-conditioned control-sensitivity regularization, which promotes sensitive local response to control input changes in learned dynamics in high-support training regions. This preserves control-induced variation while limiting unstable extrapolation due to weak empirical support. Experiments in vision-based obstacle avoidance, manipulation, and real-robot navigation show improved OOD detection and safer closed-loop planning.

[AI-10] A Temporal Planning Framework for Disruption Aware Dynamic Route Optimization in Heterogeneous Railway Systems

Link: https://arxiv.org/abs/2606.14582
Authors: Pollob Chandra Ray,Sabah Binte Noor,Fazlul Hasan Siddiqui
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Efficient route optimization plays a vital role in ensuring both safety and punctuality in railway operations. It is particularly crucial in heterogeneous multi-gauge railway networks, where varying train speeds, stopping patterns, and infrastructure compatibility constraints increase coordination complexity. In single-track systems these challenges intensify further, because all trains share the same track and require frequent track switching. Disruption events, including blocked tracks, blocked trains, engine failures, and speed slowdowns, introduce additional unpredictability into operations and cause deviations from the timetable. However, existing studies predominantly focus on high-level timetabling and omit operational details such as track switching coordination. As a result, decisions are left to human operators, which increases safety risks in railway operations. This study proposes a framework based on temporal planning for dynamic route optimization and disruption management in heterogeneous railway systems. The framework formulates railway operations as a temporal planning problem in PDDL 2.1, explicitly modeling gauge compatibility constraints and diverse disruption scenarios. It generates conflict-free timestamped operational plans specifying both optimized schedules and executable action sequences. To evaluate the proposed framework, we developed a benchmark problem set of 200 instances with up to 1,000 track points and 120 trains. Two state-of-the-art temporal planners and a plan validator were employed to assess the framework. The experimental results demonstrate that the framework effectively generates temporal operational plans for heterogeneous railway systems, handles multi-gauge constraints and disruptions, and reduces dependence on manual decision making.

[AI-11] CARE: Controlling LLM-Generated Policies through Auditable Review of Evidence in Scientific Experimentation

Link: https://arxiv.org/abs/2606.14581
Authors: Guanyu Liu,Weiyi Kong,Zeyu Wang,Boer Zhang,Baiqing Li,Peiyu Zhang,Tianyu Shi
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 23 pages, 4 figures

Abstract:Granting LLMs direct control over costly, irreversible scientific experiments leads to unsafe exploration and unstable performance, but discarding LLM creativity entirely sacrifices significant optimization potential. We introduce CARE (Controlling LLM-Generated Policies through Auditable Review of Evidence in Scientific Experimentation), an auditable controller for high-throughput experimentation (HTE) optimization that keeps a non-LLM incumbent optimizer as the default action path while using LLMs to revise challenger ranking policies. Before each outcome is revealed, a public-evidence intervention gate compares the challenger with the incumbent. It authorizes the challenger’s selection only when the evidence available before selection supports the change, with the decision recorded in the audit log. CARE outperforms all other evaluated methods on Minerva/Olympus and ChemLex benchmarks, with final-best improving from 80.0 to 88.5 on Minerva/Olympus and from 83.9 to 92.1 on ChemLex, relative to the public incumbent. Our experiments indicate that LLM self-evolution is more reliable when it expands the proposal space under an auditable controller, rather than directly choosing experiments.

[AI-12] VISTA: View-Consistent Self-Verified Training for GUI Grounding

Link: https://arxiv.org/abs/2606.14579
Authors: Xinyu Qiu,Yunzhu Zhang,Heng Jia,Shuheng Shen,Changhua Meng,Linchao Zhu
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:When applying Group Relative Policy Optimization (GRPO) for GUI Grounding, rollouts are sampled from a single screenshot view; groups often become either all failures on difficult instances or all successes on easy ones, yielding no useful relative advantage. We propose VISTA (View-Consistent Self-Verified Training), a GRPO-based training framework that constructs each comparison group from multiple target-preserving views of the same GUI. Each view is generated by a crop that keeps the target element visible and remaps its box exactly, so model rollouts are compared across semantically equivalent but geometrically different inputs. To stabilize short coordinate generation without turning reinforcement learning into unconditional imitation, VISTA further adds a self-verified cross-view anchor: an oracle answer optimized with an advantage-weighted loss, excluded from the group baseline and activated only when the model has produced a maximum-reward rollout. Across five GUI-grounding benchmarks and multiple Qwen backbones, VISTA consistently improves grounding. On ScreenSpot-Pro, it raises Qwen3-VL 4B/8B/30B-A3B from 55.5/52.7/53.7 to 63.4/65.8/67.0. Robustness analyses further show higher worst-view accuracy and lower prediction flip rates.
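The degenerate-group problem that motivates this paper can be seen from how group-relative advantages are commonly computed in GRPO: each rollout's reward is standardized against the other rewards in its group. The sketch below uses that common formulation; whether VISTA applies exactly this normalization is not stated in the abstract, and the function name and `eps` value are assumptions.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


all_fail = group_relative_advantages([0.0, 0.0, 0.0, 0.0])  # hard instance
all_pass = group_relative_advantages([1.0, 1.0, 1.0, 1.0])  # easy instance
mixed = group_relative_advantages([1.0, 0.0, 1.0, 0.0])     # informative group

print(all_fail)  # every advantage is 0.0, so there is no learning signal
print(all_pass)  # every advantage is 0.0 as well
print(mixed)     # positive for successes, negative for failures
```

Only the mixed group yields nonzero advantages. Building groups from several geometrically different views of one target, as the abstract describes, is a way to make mixed outcomes within a group more likely.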

[AI-13] StreamMemBench: Streaming Evaluation of Agent Memory for Future-Oriented Assistance

Link: https://arxiv.org/abs/2606.14571
Authors: Guanming Liu,Yuqi Ren,Hansu Gu,Peng Zhang,Weihang Wang,Jiahao Liu,Ning Gu,Tun Lu
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:A central role of personal-agent memory is to turn stored information and prior interactions into future-oriented assistance. In daily use, useful cues come from what the agent observes and how the user interacts with the agent, and the agent must carry them forward from the current request to similar future tasks. Existing memory benchmarks usually test dialogue recall or task improvement in isolation, leaving the trajectory from streaming observations to later assistance largely untested. We introduce StreamMemBench, a streaming benchmark that constructs a two-step task sequence around each evidence anchor from EgoLife egocentric streams. The initial task tests evidence use, while the follow-up task tests whether feedback and interaction experience are reused. Four metrics diagnose evidence recall, initial evidence use, feedback incorporation, and follow-up reuse. Experiments with eight memory systems across two backbones show that current systems often fail to use observed evidence or turn feedback into reliable follow-up behavior, even when evidence is stored or feedback is incorporated locally. StreamMemBench is publicly available at this https URL.

[AI-14] TRACE: Trajectory-Routed Causal Memory for Delayed-Evidence Visuomotor Imitation

Link: https://arxiv.org/abs/2606.14551
Authors: Zihao Li,Ranpeng Qiu,Yincong Chen,Guoqiang Ren,Weiming Zhi
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Robots under autonomous operation may require decisions based on evidence that is no longer visible. We study delayed-evidence tasks, where an early cue disappears before a later decision point, so visually similar observations can require different actions. In these settings, the current observation is not a sufficient state for control. We introduce TRAjectory-routed Causal Evidence (TRACE), a memory framework for visuomotor imitation policies. TRACE stores task-relevant visual and robot-state evidence, such as object identity, target choice, or route-dependent state, in a fixed-size latent memory that remains bounded over long episodes. Instead of indexing memory by raw time or manually provided task labels, TRACE uses path signatures: compact, order-sensitive features of the executed robot-state trajectory. These signatures do not store the visual cue itself; rather, they provide trajectory-conditioned keys for writing and retrieving the evidence stored when the cue was visible. When the robot later reaches an ambiguous observation, the policy conditions on TRACE memory to recover the missing context and choose the correct branch. TRACE attaches through lightweight adapters to policies, without changing the policy backbone, action head, or imitation objective. Across real-world long-horizon manipulation tasks with visually ambiguous branch points, TRACE improves branch selection and task success over alternative baselines, including short-history and recurrent memory. Project page: this https URL
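The order sensitivity of path signatures mentioned in this abstract can be illustrated with the first two signature levels of a piecewise-linear path. This is a generic textbook computation using Chen's identity, not TRACE's implementation; the signature depth and state features TRACE uses are not given in the abstract, and the two toy 2-D paths are assumptions.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def signature_level2(path: Sequence[Sequence[float]]) -> Tuple[Vector, Matrix]:
    """Level-1 and level-2 signature terms of a piecewise-linear path."""
    d = len(path[0])
    s1 = [0.0] * d
    s2 = [[0.0] * d for _ in range(d)]
    for prev, curr in zip(path, path[1:]):
        dx = [c - p for p, c in zip(prev, curr)]
        # Chen's identity for appending one straight segment:
        # the cross term uses the level-1 signature accumulated so far.
        for i in range(d):
            for j in range(d):
                s2[i][j] += s1[i] * dx[j] + 0.5 * dx[i] * dx[j]
        for i in range(d):
            s1[i] += dx[i]
    return s1, s2


# Two routes with the same start and end point but different ordering of moves.
right_then_up = [(0, 0), (1, 0), (1, 1)]
up_then_right = [(0, 0), (0, 1), (1, 1)]

s1_a, s2_a = signature_level2(right_then_up)
s1_b, s2_b = signature_level2(up_then_right)

print(s1_a, s1_b)  # level 1 is the net displacement and is identical
print(s2_a)        # level 2 differs in its off-diagonal entries
print(s2_b)
```

Level 1 cannot tell the two routes apart, while the off-diagonal level-2 terms can, which is the property that lets a signature act as a key tied to how the robot arrived at a state rather than where it is.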

[AI-15] From Shield to Target: Denial-of-Service Attacks on LLM-Based Agent Guardrails

Link: https://arxiv.org/abs/2606.14517
Authors: Yuguang Zhou,Xunguang Wang,Pingchuan Ma,Zhantong Xue,Zhaoyu Wang,Shuai Wang
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:LLM-based guardrails have emerged as a highly effective defense against prompt injection and jailbreak attacks in autonomous agents. However, we reveal that the very reasoning and task-following capabilities enabling this protection introduce a novel vulnerability: attackers can inject crafted data to trap the guardrail in extended reasoning loops, effectuating a systematic denial-of-service (DoS) attack. To systematically expose this threat, we design a beam-search optimization framework that crafts natural-language payloads to maximize guardrail reasoning length, utilizing an LLM proposer guided by a strategy bank. Based on the observation of the guardrail’s schema-following nature, we also provide another attack framework driven by mechanism-aware structural mutations with less computational load. The attack efficacy is systematically evaluated in two parts. First, in standalone evaluations, the attack generalizes across diverse guardrail architectures, safety templates, and agent benchmarks. Payloads optimized on a single open-source surrogate successfully transfer to eight leading model backbones (e.g., Claude, GPT, Gemini, DeepSeek, and Qwen), achieving a 13–63× token amplification. Second, in end-to-end real-world agent deployments (web, desktop, code, and multi-agent systems), the attack reveals up to a 148× latency amplification. We show that a single poisoned document can saturate shared guardrail infrastructures, effectively starving co-located agents and paralyzing the entire system. By uncovering this availability flaw, our work underscores the urgent need to develop cost-bounded, reasoning-robust guardrails.

[AI-16] Securing the Future of IoMT in the Post-Quantum Era: An Edge-Native Federated Learning Approach

Link: https://arxiv.org/abs/2606.14515
Authors: Taym Alshoghri,Deemah H. Tashman,Mohammad Reza Gerami,Soumaya Cherkaoui
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Internet of Medical Things (IoMT) devices operate under strict resource constraints while handling highly sensitive health data, making security and privacy critical concerns. Federated learning (FL) further complicates this landscape, as model updates exchanged during training may unintentionally expose private medical information. Emerging quantum computing capabilities threaten the long-term viability of conventional lightweight cryptographic mechanisms, motivating the integration of Post-Quantum Cryptography (PQC) into IoMT systems. This article discusses key enabling technologies for quantum-resilient IoMT, including post-quantum key establishment, lightweight encryption, and edge-native orchestration. We propose a scalable Kubernetes-based framework that integrates PQC into FL-enabled IoMT environments and validate it on a Raspberry Pi testbed. Results demonstrate that distributed cryptographic processing significantly reduces latency compared to sequential designs while maintaining feasible resource overhead. The primary contribution of this work lies in the design and validation of a secure orchestration and communication framework for FL-enabled IoMT systems. We conclude by outlining future directions toward energy-aware architectures, intelligent security optimization, and resilient next-generation Intelligent Internet of Medical Things (IIoMT) ecosystems.

[AI-17] Dense Coordinate-List Fine-Tuning Induces a Controllable Interference Surface in Vision-Language Models

Link: https://arxiv.org/abs/2606.14507
Authors: Chenyu Zhou, Qiliang Jiang, Boguang Pan
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Fine-tuning vision-language models to emit dense coordinate lists improves visual grounding but also changes how models serialize, repeat, and terminate structured outputs. We study this behavior as a generation and control surface. In Gemma 4 12B, high-capacity q/k/v/o LoRA raises class-aware F1@0.3 from 0.007 to 0.448 while inducing repeated-tail pressure (duplicate rate 0.080, max repeat 23). A q/v rank sweep keeps max repeat at 21-22 across ranks 4-64, showing capacity persistence. The target signal is separable: object-level repeat-stop removes exact repeated records (duplicate rate 0.000, max repeat 1) while preserving F1 (0.494 to 0.490) and stricter F1@0.5 (0.381 to 0.385). Structure-axis probes localize the effect to bbox-coordinate object lists; dense non-bbox and spatial/count JSON remain repeat-clean, including under high-capacity adapters. Qwen3-VL-8B reproduces a clean controlled endpoint (F1@0.3 0.318, duplicate rate 0.000), and COCO 2017 reproduces acquisition plus duplicate pressure. Dense coordinate-list adaptation therefore creates a structure-bound, cross-family interference surface that can be measured and controlled.
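The duplicate-rate and max-repeat figures quoted above can be made concrete with a small sketch. The abstract does not give the exact metric definitions or the repeat-stop procedure, so the definitions below (a duplicate is any record that exactly repeats an earlier one; repeat-stop keeps the first occurrence only) and the function names are assumptions for illustration, not the authors' implementation.

```python
from collections import Counter


def duplicate_rate(records):
    """Fraction of records that exactly repeat an earlier record (assumed definition)."""
    if not records:
        return 0.0
    return (len(records) - len(set(records))) / len(records)


def max_repeat(records):
    """Largest number of times any single record appears."""
    return max(Counter(records).values(), default=0)


def repeat_stop(records):
    """Keep only the first occurrence of each record, preserving order."""
    seen = set()
    kept = []
    for rec in records:
        if rec not in seen:
            seen.add(rec)
            kept.append(rec)
    return kept


# Hypothetical decoded output: (label, x1, y1, x2, y2) tuples with a repeated tail.
decoded = [("cat", 10, 20, 50, 60), ("dog", 5, 5, 40, 40)] + [("dog", 5, 5, 40, 40)] * 3
cleaned = repeat_stop(decoded)

print(duplicate_rate(decoded), max_repeat(decoded))
print(duplicate_rate(cleaned), max_repeat(cleaned))
```

Filtering at the level of whole records, rather than tokens, is what lets such a step leave distinct detections untouched while removing the repeated tail.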

[AI-18] From Chatbot to Digital Colleague: The Paradigm Shift Toward Persistent Autonomous AI

Link: https://arxiv.org/abs/2606.14502
Authors: Yongheng Zhang, Ziang Liu, Jiaxuan Zhu, Shuai Wang, Xiangqi Chen, Haojing Huang, Jiayi Kuang, Siyu Chen, Ao Shen, Hao Wu, Qiufeng Wang, Qian-Wen Zhang, Junnan Dong, Wenhao Jiang, Ying Shen, Hai-Tao Zheng, Yinghui Li, Di Yin, Xing Sun, Philip S. Yu
Subjects: Artificial Intelligence (cs.AI)
Comments: The paper is available on the project website: this https URL

Abstract:Large Language Models (LLMs) are undergoing a fundamental transformation from conversational generators into integrated AI systems capable of reasoning, action, memory, and self-improvement. We conceptualize this transition as a shift from Chatbot to Digital Colleague: from conversational answers to persistent work. We organize this transition along two tightly coupled dimensions. First, at the cognitive core level, LLMs are advancing from Chatbot-era “fast thinking” systems driven by next-token prediction toward Thinking LLMs that leverage inference-time computation, Chain-of-Thought reasoning, reflection, process supervision, and reinforcement learning to support more deliberate and reliable cognition. Second, at the tool-augmented task execution level, LLMs are progressing from tool-calling Agents that invoke external resources in an ad hoc manner toward OpenClaw-style workstation systems (OpenClaw) equipped with persistent Workspaces, skills, verification loops, and governance. The “Workspace + Skill” paradigm makes episodic tool use colleague-like via state persistence, reusable procedures, task closure, and experience reuse. We examine data construction shifts from instruction-response pairs to State-Action-Observation trajectories and evaluation from static benchmarks to sandboxed, auditable, self-evolving AI ecosystems.

[AI-19] When the Tool Decides: LLM Agents Defer Blindly to Graph Neural Network Tools and Stronger Backbones Defer More

Link: https://arxiv.org/abs/2606.14476
Authors: Zhongyuan Wang, Pratyusha Vemuri
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages, 2 figures. Under review at TMLR

Abstract:A growing line of work equips large language model (LLM) agents with graph neural networks (GNNs) as callable tools, assuming the agent exercises judgment over when and how much to rely on such a tool. We test this directly. We expose a frozen GNN to a ReAct-style LLM agent as an explicit tool and measure, on node classification over a text-attributed graph (ogbn-arxiv, replicated on WikiCS), whether the agent uses the tool or merely obeys it. We find the agent does not exercise judgment: its predictions agree with the raw GNN’s 97.6-99.2% of the time (5 seeds), collapsing into a GNN parrot that adopts the tool’s output wholesale and bypasses its own reasoning. Sweeping backbone capability (Qwen2.5 0.5B-7B), the deference is not a weak-model artifact: among models able to invoke the tool, agreement rises with capability (0.60 to 0.98 from 1.5B to 7B). Crucially, the cost of deference does not shrink as capability grows and grows where alternatives emerge: a per-node oracle over the available actions beats the parrot by 0.09-0.18 at 3B and 0.12-0.22 at 7B, roughly doubling at high homophily, because the parrot is pinned to the frozen GNN while the agent’s alternatives improve; at 7B a simple neighbour-label tool overtakes the GNN at high homophily (0.81 vs 0.71) yet the agent still defers. A simple selective-invocation gate recovers about half of that high-homophily gap (0.71 to 0.83) but yields no net global gain, and held-out estimates bound the best achievable gate over standard test-time features to at most a third of the oracle headroom: reliable selective invocation looks limited by available information, not merely router design. Our results are a cautionary measurement: evaluations of agent+tool systems cannot assume the agent adds judgment on top of the tool, and selective invocation must be designed in rather than expected to emerge from scale.
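The two quantities this abstract leans on, agent-tool agreement and the per-node oracle over available actions, are easy to state in code. The sketch below uses toy data and hypothetical function names; it only illustrates how such measurements could be computed and is not taken from the paper.

```python
def agreement_rate(agent_preds, tool_preds):
    """Fraction of nodes where the agent's label equals the tool's label."""
    matches = sum(a == t for a, t in zip(agent_preds, tool_preds))
    return matches / len(agent_preds)


def accuracy(preds, labels):
    """Fraction of nodes predicted correctly."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def oracle_accuracy(action_preds, labels):
    """Per-node oracle: a node counts as correct if ANY available action is correct."""
    hits = 0
    for i, y in enumerate(labels):
        if any(preds[i] == y for preds in action_preds.values()):
            hits += 1
    return hits / len(labels)


# Toy data (assumed): two available actions and an agent that copies the GNN.
labels = [0, 1, 2, 1, 0]
actions = {
    "gnn_tool": [0, 1, 1, 1, 2],
    "neighbour_label_tool": [0, 2, 2, 1, 0],
}
agent = list(actions["gnn_tool"])  # a "parrot" agent

print("agreement:", agreement_rate(agent, actions["gnn_tool"]))
print("parrot accuracy:", accuracy(agent, labels))
print("oracle accuracy:", oracle_accuracy(actions, labels))
```

The gap between the oracle and the parrot is the headroom that a selective-invocation gate could at best recover.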

[AI-20] The Perceived Fragility of Explanations in Audio Models: Manipulation of Attribution with Unchanged Predictions ICML2026

Link: https://arxiv.org/abs/2606.14466
Authors: Piotr Kitłowski, Dominik Wiącek, Mateusz Modrzejewski
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to the ICML 2026 Workshop on Machine Learning for Audio: 5 pages, 4 figures

Abstract:This paper investigates the fragility of post-hoc explanation methods in audio deepfake detection. While previous work on explanation manipulation focused on images using standard L_p metrics, we introduce a psychoacoustic framework that optimizes inaudible perturbations to decouple model attributions from final classifications. We evaluate this vulnerability across state-of-the-art architectures under strict prediction-preserving constraints. By evaluating the manipulation cost through domain-specific perceptual audio quality metrics alongside explanation alignment criteria, our framework demonstrates that an adversary can systematically distort automated explanation heatmaps while preserving the predicted deepfake label. Full code available at: this https URL

[AI-21] CADET: Physics-Grounded Causal Auditing and Training-Free Deconfounding of End-to-End Driving Planners

Link: https://arxiv.org/abs/2606.14438
Authors: Zikun Guo
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 8 pages, 4 figures

Abstract:End-to-end (E2E) autonomous-driving planners trained by imitation are prone to statistical shortcuts: they associate scene elements that merely co-occur with expert actions (a roadside object, a building facade) with driving decisions, rather than the variables that causally determine them. Such causal confusion silently compromises reliability in long-tail scenarios, and it is difficult to detect, because prevailing open-loop metrics (L2 displacement and collision rate) are dominated by ego status and do not indicate whether a planner depends on spurious cues. Existing remedies based on causal-intervention training require retraining large models and cannot audit a planner that is already deployed. We present CADET, a training-free framework that audits, benchmarks, and repairs spurious reliance in pretrained E2E planners without any parameter update.

[AI-22] Causal Object-Centric Models for Planning with Monte Carlo Tree Search

Link: https://arxiv.org/abs/2606.14418
Authors: Rodion Vakhitov, Leonid Ugadiarov, Alexey Skrynnik, Aleksandr Panov
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments:

Abstract:We introduce COMET (Causal Object-centric Model for Efficient Tree search), a model-based reinforcement learning algorithm that performs Monte Carlo Tree Search in a slot-structured latent space. COMET pairs a frozen unsupervised object-centric encoder with a transformer-based world model, in which actions are bound to objects through a novel action-slot fusion mechanism that is used in slot transition prediction. Policy and value heads use object-causal attention, modulating token interactions by learned per-slot relevance scores so that decision-making concentrates on task-relevant entities. COMET adds an explicit object-level inductive bias to MuZero-style latent planning. Across eight visually and dynamically diverse tasks from the Object-Centric Visual RL benchmark, ManiSkill, Robosuite, and VizDoom, COMET achieves a higher mean normalized score during the early stages of training compared to object-centric and monolithic baselines.

[AI-23] CSPO: Constraint-Sensitive Policy Optimization for Safe Reinforcement Learning ICML2026

Link: https://arxiv.org/abs/2606.14415
Authors: Ayoub Belouadah, Sylvain Kubler, Yves Le Traon
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted as a Spotlight paper at the 43rd International Conference on Machine Learning (ICML 2026)

Abstract: Safe reinforcement learning (Safe RL) aims to maximize expected return while satisfying safety constraints, typically modeled as Constrained Markov Decision Processes (CMDPs). While primal-dual methods scale well to deep RL, they often suffer from delayed constraint correction, leading to oscillatory behavior and prolonged safety violations. In this paper, we propose Constraint-Sensitive Policy Optimization (CSPO), a first-order primal-dual method that incorporates local constraint sensitivity into policy updates. CSPO augments the primal objective with a constraint-sensitive correction derived from the shortest signed distance to the safety boundary, enabling smarter recovery steps back to safety, compensating for delayed Lagrange multiplier updates, reducing oscillations near the boundary, and preserving the KKT solutions of the original constrained problem. Experiments on navigation and locomotion benchmarks demonstrate that CSPO achieves faster safety recovery and high reward preservation, resulting in higher constrained returns compared to state-of-the-art primal-dual and penalty-based methods.

[AI-24] Hy-Embodied-0.5-VLA: From Vision-Language-Action Models to a Real-World Robot Learning Stack

Link: https://arxiv.org/abs/2606.14409
Authors: He Zhang, Lingzhu Xiang, Haitao Lin, Zeyu Huang, Minghui Wang, Dingyan Zhong, Yubo Dong, Yihao Wu, Yongming Rao, Dongsheng Zhang, Wanjia He, Ling Chen, Kai Huang, Jiahao Chen, Sichang Su, Xumin Yu, Ziyi Wang, Chengwei Zhu, Xiao Teng, Yuchun Guo, Yufeng Zhang, Yuandong Liu, Rui Wang, Zisheng Lu, Han Hu, Zhengyou Zhang
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:In this report, we present Hy-Embodied-0.5-VLA, abbreviated as HyVLA-0.5, an end-to-end system that spans the full robot learning stack: data collection, model design, continued pre-training and supervised fine-tuning, RL post-training, and real-world deployment. Each component serves a distinct role in this stack.

[AI-25] Discovery under Hypothesis Redundancy: A Geometric Theory of Discovery Bottlenecks

Link: https://arxiv.org/abs/2606.14386
Authors: Li Xia, Baoxun Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Portfolio Management (q-fin.PM)
Comments: 23 pages, 1 figure, 27 tables

Abstract:Scientific discovery saturates when new hypotheses cease to provide independent information, even if the nominal hypothesis space remains large. We study hybrid discovery systems that combine structured local search with LLM-generated non-local proposals and pose the Search Compression Hypothesis: non-local exploration helps only when three geometric conditions co-occur: spectral compression, orthogonal escape from the explored span, and residual signal alignment with the target. We formalize these conditions, derive necessary conditions for hybrid advantage, and test the mechanism in controlled synthetic environments, large-scale A-share factor discovery, and symbolic-regression benchmarks; a public tabular operational sanity check tests the associated budget-allocation implication. Signal-planting and directed-versus-random experiments show that novelty alone is insufficient: random orthogonal jumps expand coverage but do not improve yield without predictive alignment. Across compression sweeps, real factor archives, and LLM-SRBench tasks, hybrid gains concentrate in weakly represented but target-bearing directions and vanish as the hypothesis space approaches full rank. The framework turns LLM-guided discovery from generic novelty search into a diagnostic procedure for deciding when directed non-local exploration is warranted.

[AI-26] Elastic Queries Reinforcement Learning: Self-Aware Policy Execution for VLA Models

Link: https://arxiv.org/abs/2606.14375
Authors: Ge Wang, Xinyu Tan, Xiang Li, Man Luo, Chengsi Yao, Shenhao Yan, Jiahao Yang, Fan Feng, Honghao Cai, Xiangyuan Wang, Zhixin Mai, Yiming Zhao, Yatong Han, Zhen Li
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Vision-language-action (VLA) models are powerful action generators for robot manipulation, but they are typically executed with fixed inference and replanning schedules. This rigidity ignores the uneven difficulty of robot control: contact-rich or uncertain states may need more computation and fresher feedback, while easier states can often be handled with fewer inference steps and longer open-loop execution. We propose Elastic Queries Reinforcement Learning (EQRL), a framework that makes each VLA policy query elastic. A lightweight latent-schedule adaptor jointly selects the latent input, denoising budget, and action chunk length, without fine-tuning the underlying VLA model. To make scheduling difficulty-aware, EQRL trains a critic over the joint latent-schedule action and derives a state difficulty signal from critic ensemble disagreement. This signal guides compute toward difficult states, while a learned residual allows task-driven correction. We formulate variable chunk execution as query-level macro-action RL with chunk-dependent discounting and an amortized number-of-function-evaluations (NFE) budget. Across simulation and real-robot manipulation, EQRL reduces amortized inference cost while preserving or improving task success.
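To illustrate the idea of deriving a difficulty signal from critic ensemble disagreement, here is a minimal sketch. It uses the population standard deviation of the ensemble's Q-estimates as the disagreement measure and maps it to a schedule with fixed thresholds. Both choices, along with the names `difficulty` and `pick_schedule` and the specific budgets, are assumptions made for illustration; the abstract describes a learned adaptor rather than a threshold rule.

```python
import statistics


def difficulty(q_values):
    """Ensemble disagreement: population std of the critics' Q-estimates (assumed measure)."""
    return statistics.pstdev(q_values)


def pick_schedule(q_values, low=0.05, high=0.2):
    """Map disagreement to a (denoising steps, chunk length) schedule.

    Thresholds and budgets are hypothetical. Harder states get more denoising
    steps and shorter open-loop chunks, so feedback arrives sooner.
    """
    d = difficulty(q_values)
    if d < low:
        return {"denoise_steps": 2, "chunk_len": 16}
    if d < high:
        return {"denoise_steps": 4, "chunk_len": 8}
    return {"denoise_steps": 8, "chunk_len": 4}


# Critics agree: treated as an easy state, so cheap inference and a long chunk.
print(pick_schedule([1.00, 1.01, 0.99]))
# Critics disagree: treated as a hard state, so more compute and fresher feedback.
print(pick_schedule([0.0, 1.0, 2.0]))
```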

[AI-27] No Accidental Software Agent First Canonical Code for Human Code Entropy Reduction and 30 to 500 times Lower Frontier Model Requirements

Link: https://arxiv.org/abs/2606.14357
Authors: Jepson Taylor
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 36 pages

Abstract:Frontier coding models may spend substantial capacity learning not only program behavior, but also accidental entropy in human repositories. Such repositories contain valuable signals: tests, incidents, migrations, edge cases, product judgment, and operational history. These signals are entangled with framework churn, naming drift, generated-source ambiguity, dependency rituals, CI dialects, weak proof routes, and human-oriented review customs. We propose agent-first canonical code, a proof-carrying substrate that rewrites routine product software into canonical behavior profiles, typed change algebra, proof lanes, constrained edit grammars, semantic patch cells, runtime negative memory, and proof-carrying change objects. The core hypothesis is that quotienting software by behavior equivalence under a declared oracle can collapse equivalent encodings into governed representatives with explicit evidence and proof obligations. The endpoint is amortized cost per verified correct change, including source, context, reasoning, tools, verification, security, provenance, review, failed loops, defects, and foundry cost under a common oracle. Reported reduction bands are hypotheses, not measured frontier results. The proposed limit is a No-Accident Horizon: removable accident decreases until residual novelty, evidence, governance, risk, and future optionality dominate. For supported routine-product distributions, this gives a defensible planning target near 100-fold all-in cost reduction, not a guarantee for all software. Preliminary QLoRA experiments on Qwen2.5-Coder-14B show that 64,088 canonical trajectories are learnable and suppress tested forbidden-language markers, but do not establish behavior preservation, scaling economics, or verified-change cost. The contribution is a falsifiable program centered on minimum functional description length and verified-change cost. 

[AI-28] PLAIground: SLO-Driven Runtime Model Selection for Compound AI Systems in the Edge-Cloud-Space Continuum

Link: https://arxiv.org/abs/2606.14356
Authors: Milos Gravara, Cynthia Marcelino, Andrija Stanisic, Stefan Nastic
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments:

Abstract:Applications in the 3D Computing Continuum, which unifies edge, cloud, and space, require combining multiple AI tasks such as object detection, time-series analytics, and natural language processing into Compound AI systems. These systems must satisfy stringent Service Level Objectives (SLOs) on accuracy, latency, and cost. A key mechanism for maintaining SLO compliance of Compound AI systems is runtime model selection, where AI models are dynamically switched for each workflow task. However, existing distributed and compound AI frameworks do not natively support runtime model selection. We present PLAIground, a framework that enables runtime model selection for Compound AI systems. PLAIground introduces Compoundable AI Model (CAIM) abstraction, which decouples task semantics from AI model implementations via Task and Data Contracts, enabling model switching without workflow changes. Additionally, PLAIground introduces Pixie, an SLO-driven runtime model selection algorithm, which dynamically selects the most suitable model for each task during execution. Our evaluation on two realistic Compound AI workflows demonstrates that Pixie achieves up to 91.3% accuracy while maintaining SLO compliance where fixed-model strategies either violate cost and latency budgets up to 21x or miss accuracy targets by 4%.
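As a rough illustration of SLO-driven runtime model selection, the sketch below picks, for one task, the most accurate candidate that fits the latency and cost budgets. This is a generic constrained-selection rule written for this digest; the `ModelProfile` fields, the `select_model` function, the fallback policy, and all numbers are hypothetical and do not describe the Pixie algorithm itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    """Hypothetical per-task profile of a candidate model."""
    name: str
    accuracy: float    # expected task accuracy in [0, 1]
    latency_ms: float  # expected latency per request
    cost: float        # expected cost per request


def select_model(profiles, max_latency_ms, max_cost):
    """Return the most accurate model that meets both budgets.

    If no model is feasible, fall back to the lowest-latency one
    (an assumed policy, chosen only to keep the sketch total).
    """
    feasible = [
        p for p in profiles
        if p.latency_ms <= max_latency_ms and p.cost <= max_cost
    ]
    if not feasible:
        return min(profiles, key=lambda p: p.latency_ms)
    return max(feasible, key=lambda p: p.accuracy)


candidates = [
    ModelProfile("small", accuracy=0.82, latency_ms=20, cost=0.1),
    ModelProfile("medium", accuracy=0.88, latency_ms=60, cost=0.4),
    ModelProfile("large", accuracy=0.93, latency_ms=250, cost=2.0),
]

# A relaxed budget admits the large model; a tight one forces a switch.
print(select_model(candidates, max_latency_ms=300, max_cost=3.0).name)
print(select_model(candidates, max_latency_ms=100, max_cost=0.5).name)
```

Because the selection depends only on the profile and the current budgets, the model behind a task can change between requests without touching the workflow definition.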

[AI-29] Design Methodology and Performance Trade-offs Management for Distributed and Compound AI Systems

Link: https://arxiv.org/abs/2606.14350
Authors: Milos Gravara, Andrija Stanisic, Stefan Nastic
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments:

Abstract:Artificial Intelligence (AI) systems must typically satisfy service-level objectives including accuracy, latency, and cost. The prevailing model-centric approaches select a monolithic model at design time and apply identical computation regardless of input difficulty, cannot decompose tasks across specialized components, and have knowledge that is fixed at training time. During runtime, this can lead to performance degradation and increasing costs. Because the model is the main design variable, it determines the majority of system behavior, coupling operational objectives to a single design-time choice. Addressing these limitations requires shifting from model-centric to system-centric design. Compound AI systems realize this shift by orchestrating multiple models, algorithms, and tools as distributed AI systems through explicit control logic. The performance of such systems depends on their workflow topology, the models assigned to each task, and the parameters governing runtime behavior. We present a design methodology that organizes this space along two dimensions, workflow topology and configuration selection, and identifies eight design patterns, each consolidating techniques to address a specific limitation of monolithic deployment. We validate our methodology through three case studies. Across our case studies, Compound AI configurations approach accuracy of monolithic models within 2.5 to 4 percentage points while reducing latency by up to 60% and cost by up to 71%. We show that model selection and parameter configuration jointly determine system performance, but the resulting design space grows combinatorially, as workflows compose more patterns and components. Thus, we identify five open challenges that define a roadmap from manually configured prototypes towards systems that automatically discover and maintain SLO-compliance in Compound and Distributed AI systems.

[AI-30] Squeeze-Release: Iterative Pruning with Exact Structural Minimization

Link: https://arxiv.org/abs/2606.14346
Authors: Roman Denkin, Ida Akerholm, Prashant Singh, Ida-Maria Sintorn
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract: Unstructured pruning produces sparse weight tensors, but the standard implementation keeps tensor shapes unchanged so the deployed model is no smaller than before pruning. We present an exact structural rewrite, which we call minimization, that converts a masked network into a smaller dense network with the same forward function up to floating-point rounding. The Squeeze-Release cycle iterates pruning and minimization with an intermediate release step that re-enables the exact-zero positions inside the compacted tensors as small calibrated noise, turning otherwise wasted capacity back into trainable parameters. Successive cycles use that capacity to find structural redundancy a single pass cannot reach. We additionally introduce CompensatedLayerNorm, a function-preserving replacement for LayerNorm that extends minimization to channel reduction across LayerNorm-equipped residual streams. Squeeze-Release compresses the deployable network to 39x smaller than the unpruned model on a fully connected network and 14.8x smaller on a modern CNN (ConvNeXt-Tiny), at comparable accuracy. In addition, we prove that the rewrite can be extended to transformer architectures.
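The core idea of a function-preserving structural rewrite can be shown on a two-layer ReLU network. The sketch below is a simplified illustration written for this digest, not the paper's minimization procedure: it removes only hidden units that are provably inert, either because all incoming weights and the bias are exactly zero (the unit outputs ReLU(0) = 0) or because all outgoing weights are exactly zero (the unit's output is never used).

```python
def forward(x, W1, b1, W2, b2):
    """Two-layer MLP with ReLU. W1 has one row per hidden unit, W2 one row per output."""
    hidden = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(W1, b1)
    ]
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(W2, b2)]


def minimize(W1, b1, W2, b2):
    """Drop hidden units that cannot affect the output; return a smaller dense network."""
    keep = []
    for j in range(len(W1)):
        # Unit always outputs relu(0) = 0.
        dead_in = all(w == 0.0 for w in W1[j]) and b1[j] == 0.0
        # Unit's output is multiplied by zero everywhere downstream.
        dead_out = all(row[j] == 0.0 for row in W2)
        if not (dead_in or dead_out):
            keep.append(j)
    W1_small = [W1[j] for j in keep]
    b1_small = [b1[j] for j in keep]
    W2_small = [[row[j] for j in keep] for row in W2]
    return W1_small, b1_small, W2_small, b2


# A masked network: unit 1 has zero inputs and bias, unit 2 has zero outputs.
W1 = [[1.0, -2.0], [0.0, 0.0], [0.5, 0.5]]
b1 = [0.1, 0.0, -0.2]
W2 = [[1.0, 3.0, 0.0]]
b2 = [0.5]

small = minimize(W1, b1, W2, b2)
x = [1.0, 0.25]
print(len(W1), "->", len(small[0]), "hidden units")
print(forward(x, W1, b1, W2, b2), forward(x, *small))
```

A unit with zero incoming weights but a nonzero bias is deliberately kept here, since it emits a constant that would have to be folded into the next layer's bias to preserve the function.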

[AI-31] I'm Sorry Driver I'm Afraid I Can't Do That: Appraising the Safety of LLMs within Automotive Contexts

Link: https://arxiv.org/abs/2606.14327
Authors: Shaun Feakins, Ibrahim Habli, Kim Littler, Robert Palin
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: Accepted at the Dependable AI in Embedded Systems (DAIES) Workshop at SAFECOMP 2026; 15 pages, 3 figures, 2 tables

Abstract: This paper appraises recent frameworks within AI development to integrate LLMs into control tasks in automotive contexts from the perspective of safety assurance. This work has built upon the rapid integration of LLMs across automotive settings. However, we find that at present, these frameworks face significant challenges, limiting their efficacy in real-time safety-critical contexts. Firstly, we consider conceptual challenges, including the fact that deployers are faced with a dual challenge, wherein they must assure a model which has been developed upstream, i.e. as general-purpose tools by the large AI labs, in a downstream context, i.e. into specific vehicle architectures. Secondly, we consider concrete challenges from across existing standards. We show that there are currently both fundamental engineering constraints covered in ISO 21448, such as latency, and novel LLM-specific issues, such as alignment-related issues covered in ISO/PAS 8800. We ground both examples in a concrete introductory, experimental case study exploring an existing open-source repository, Talk2Drive. We present a safety argument in order to make explicit the limitations of existing solutions. Nonetheless, given that the use of LLMs in automotive contexts is being explored at a technical level and operationalised, we propose potential assurance mechanisms for LLM-related hazardous events going forward.

[AI-32] Communication Policy Evolution for Proactive LLM Agents

Link: https://arxiv.org/abs/2606.14314
Authors: Xinbei Ma, Jiyang Qiu, Yao Yao, Zheng Wu, Yijie Lu, Xiangmou Qu, Jiaxin Yin, Xingyu Lou, Jun Wang, Weiwen Liu, Weinan Zhang, Zhuosheng Zhang, Hai Zhao
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:LLM agents have rapidly evolved into autonomous systems, yet a persistent information gap remains between users and agents: communication is costly, while users’ identical preferences further limit information exchange. To investigate how agents should communicate across modalities, this paper formalizes Communication Policy, establishes textual and UI-based policies, and then evaluates communication policies across diverse environments, personas, and model combinations. Building information asymmetry for proactive agents, we set up two complementary settings, User-Agent and Planner-Executor. Experimental results reveal complementary strengths between interaction channels: text-based interaction often facilitates task performance, while structured UI improves agents’ response quality and persona compliance. Motivated by that, a hybrid method combines these advantages. We further propose Communication Policy Evolution (CPE), a self-evolution framework for refining communication policies through rollout and prompt-level evolving. Without model modification, CPE achieves the best task success across multiple settings using prompt refinement alone. Our findings identify communication behavior as a critical yet underexplored design dimension for LLM agents.

[AI-33] Transforming Shape Schemas with Composable Property-Graph Queries (Extended Version)

Link: https://arxiv.org/abs/2606.14309
Authors: Philipp Seifer, Daniel Hernández, Ralf Lämmel, Steffen Staab
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments:

Abstract:Property graphs may be constrained by schemas that inform both query engines and human users about the shape of valid data, enforcing a contract between data provider and consumer. Composable property-graph queries transform input graphs into output graphs. Then, the question arises of which schema can be expected after one (or several) transformation steps. We investigate how schema constraints can be inferred given an input schema and a transforming query. Specifically, we propose a reasoning procedure that, given an input schema in ProGS and a query in G-CORE infers an output schema. Since graph updates will happen frequently, our inference procedure does not rely on graph instances, such that the computed output schema applies to all graphs originating from any input graph complying with the input schema. Related work has addressed this problem for SPARQL CONSTRUCT queries, encoding it in Description Logics (DLs) so that the output schema is entailed by axioms inferred from input schema and queries. Property graphs and their queries, however, complicate the matter, as property graphs feature label and property annotations as well as first-class edges. Thus, reification has to be used in one way or another, though available DLs lack the means to encode such features directly. We approach this novel challenge via a family of mappings for i) property graphs reified in RDF, aligned with ii) a mapping from ProGS to SHACL and iii) a mapping from G-CORE to SPARQL CONSTRUCT queries. In this manner, schema inference for property graphs becomes manageable, as we break apart the problem through the extra mapping layer and utilize efficient DL reasoners. We develop the metatheory regarding the soundness of inferred schema constraints and the semantic equivalence of mapped schemas and queries.

[AI-34] AgentCyberRange: Benchmarking Frontier AI Systems in Realistic Cyber Ranges

Link: https://arxiv.org/abs/2606.14295
Authors: Fengyu Liu, Jiarun Dai, Yihe Fan, Wuyuao Mai, Ziao Li, Bofei Chen, Jie Zhang, Zheng Lou, Bocheng Xiang, Qiyi Zhang, Xudong Pan, Geng Hong, Yuan Zhang, Min Yang
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Frontier AI systems are increasingly capable of cybersecurity tasks, including codebase inspection, vulnerability detection, and exploitation. However, evaluating their offensive capabilities remains constrained by limited access to open, reproducible, multi-host cyber ranges. Existing public benchmarks capture isolated skills such as CTF solving, vulnerability reproduction, and exploit generation, but often abstract away realistic intrusion workflows: discovering exposed services, gaining a foothold, collecting internal information, and expanding compromise across hosts. This gap makes it difficult to observe emerging risks early, because frontier AI systems are rarely evaluated under realistic attack conditions. We introduce AgentCyberRange, the first open, multi-range infrastructure for measuring autonomous cyber attack capability in realistic cyber ranges. It combines 110 vulnerabilities across 15 real web applications and 8 enterprise-like cyber ranges with 156 internal hosts, plus Cage, a toolchain for execution, orchestration, result collection, and verification. The benchmark covers two core stages: web exploitation, where agents explore exposed applications and validate vulnerabilities, and post exploitation, where agents turn an initial foothold into broader internal compromise. We evaluate six frontier AI systems under matched prompts and budgets. GPT-5.5 with Codex performs best, solving 16.1% of web exploitation tasks and 31.7% of post-exploitation tasks; with more concrete hints, these rates increase to 33.0% and 46.3%. We also observe out-of-benchmark findings, including unknown vulnerabilities in popular projects, and payload mutation that bypasses host defenses. These results show that open cyber-range evaluation is necessary for observing emerging offensive capabilities under realistic and reproducible conditions. 

[AI-35] Hierarchical ODE: Learning Continuous-Time Physical Prototypes for Early Link Failure Detection

Link: https://arxiv.org/abs/2606.14284
Authors: Jiaen Lv, Leran Qi, Shaowei Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: International Conference on Machine Learning 2026

Abstract: Time series prototype learning is fundamentally challenged by observational ambiguity. Discrete architectures fail to resolve this, as they lack the capacity to decouple stochastic noise from continuous dynamics. Furthermore, rigid closed-set assumptions fail to capture unseen diversity. To address these limitations, we propose a hierarchical ordinary differential equation clustering network, which uses a neural ordinary differential equation to model latent state evolution as a continuous integral curve. This formulation enforces temporal continuity to effectively disentangle smooth feature trends from stochastic noise, while our adaptive hierarchical mechanism autonomously determines the appropriate number of prototypes without rigid prior constraints. Validated on the early link failure detection task with irregularly sampled time series, the proposed method effectively extracts underlying physical prototypes, thereby enabling robust failure detection. Our code is available at this https URL.

[AI-36] DIFF-ERO: A Conformance-Aware Loss for Deep Learning in Process Mining

Link: https://arxiv.org/abs/2606.14283
Authors: Johannes De Smedt, Jari Peeperkorn, Artem Polyvyanyy, Jochen De Weerdt
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at the 24th International Conference on Business Process Management

Abstract:Deep learning has driven many recent advances in process analytics, especially for predictive and prescriptive monitoring. However, standard objectives such as cross-entropy optimize local next-step likelihoods and only implicitly capture control-flow structure. As a result, models can achieve high token-level accuracy while permitting imprecise global behaviour. We introduce DIFF-ERO, a conformance-aware loss function for deep learning models on process data. DIFF-ERO is a differentiable formulation of entropy-based stochastic conformance that incorporates control-flow information during training. Our approach constructs batch-level stochastic transition matrices with soft edge memberships, allowing structural precision and recall signals to directly inform backpropagation. The loss is model-agnostic and can be applied whenever the final representation parametrizes stochastic transitions. We instantiate DIFF-ERO in transformer encoder-decoder pipelines for next-activity prediction and use it jointly with cross-entropy to analyse its theoretical components with respect to convergence. Across benchmarks comparing other loss functions and targets, DIFF-ERO shows improved predictive performance where structure matters most while maintaining parity elsewhere. At the same time, the learned stochastic automaton converges towards the structural ground truth, indicating that the network internalizes process model structure.

[AI-37] Robust Fall Recovery for Armless Bipedal-Wheeled Robots Via Force-Guided Learning

Link: https://arxiv.org/abs/2606.14270
Authors: Haidong Hou, Zhangguo Yu, Tao Han, Hengbo Qi, Khaleel Ghazal, Yu Zhang, Yidong Du, Xuechao Chen, Fei Meng
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 8 pages, 6 figures, accepted by IEEE Robotics and Automation Letters (RA-L)

Abstract:Fall recovery is critical for autonomous legged locomotion. Existing methods have demonstrated that some legged robots, such as humanoids and quadrupeds, are capable of fall recovery from diverse postures by utilizing arms or coordinating multiple legs to generate support forces. Without arms or other legs to provide supportive assistance, a bipedal-wheeled robot must rely solely on the actuation of its legs, making recovery particularly difficult. To address this, we introduce FTSR (Force-guided Teacher-student framework with Stage-wise Rewards). The force-guided method constructs an external auxiliary force during simulation training that correlates directly with the robot’s real-time height, explicitly formulating this force as an optimizable constraint. Through constrained reinforcement learning, the policy is guided toward reducing force dependency gradually and increasing the body height, developing internal recovery strategies despite having no arms for support. Height-progressive stage-wise rewards progressively structure posture stabilization during recovery and transition to sustained locomotion, integrated with a teacher-student architecture distilling privileged knowledge of force effects and recovery dynamics. After simulation training, the policy is deployed on a physical armless bipedal-wheeled robot and extensively evaluated. Experiments confirm robust and reliable fall recovery under diverse challenging conditions, demonstrating strong environmental adaptability and motion robustness, while maintaining full post-recovery motion capability. The framework also generalizes effectively to a high-DOF humanoid, confirming its practical generalizability. The project page is available at this https URL

[AI-38] HarnessX: A Composable Adaptive and Evolvable Agent Harness Foundry

Link: https://arxiv.org/abs/2606.14249
Authors: Tingyang Chen, Shuo Lu, Kang Zhao, Weicheng Meng, Hanlin Teng, Tianhao Li, Chao Li, Xule Liu, Jian Liang, Zhizhong Zhang, Yuan Xie, Heng Qu, Kun Shao, Jian Luan
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:AI agent performance depends critically on the runtime harness, comprising the prompts, tools, memory, and control flow that mediate how a model observes, reasons, and acts. Yet today’s harnesses remain largely hand-crafted and static: each new model or task still demands bespoke scaffolding, and the rich traces produced during execution are rarely distilled back into systematic improvement. We introduce HarnessX, a foundry for composable, adaptive, and evolvable agent harnesses. HarnessX assembles typed harness primitives via a substitution algebra, adapts them through AEGIS, a trace-driven multi-agent evolution engine grounded in an operational mirror between symbolic adaptation and reinforcement learning, and closes the harness-model loop by turning trajectories into both harness updates and model training signal. Across five benchmarks (ALFWorld, GAIA, WebShop, tau^3-Bench, and SWE-bench Verified), HarnessX yields an average gain of +14.5% (up to +44.0%), with gains largest where baselines are lowest. These results suggest that agent progress need not come from model scaling alone: composing and evolving runtime interfaces from execution feedback is an actionable and complementary lever. The complete codebase will be open-sourced in a future release.

[AI-39] AFFORDANCE20Q: Evaluating Affordance Reasoning from Physical Properties

Link: https://arxiv.org/abs/2606.14240
Authors: Yifan Jiang, Meige Yang, Zitong Li, Jay Pujara
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Affordance reasoning, the inference of an object’s action possibilities from its physical properties (e.g., shape and material), is fundamental to human physical understanding and increasingly critical for Large Language Models (LLMs). However, existing affordance benchmarks largely expose explicit object identities in the evaluation setup, allowing models to rely on memorized object-affordance mappings rather than reasoning over physical properties. To address this gap, we introduce Affordance20Q, a novel affordance reasoning benchmark formulated as a 20-Questions game without exposing the object’s identity. In each game, the model identifies a hidden object’s affordance from a candidate set by asking yes/no questions about its physical properties. Affordance20Q comprises 1,009 games over 454 objects and 59 affordances, all manually filtered, refined, and annotated. We conduct comprehensive experiments with 15 state-of-the-art LLMs and find a substantial gap (~20 points) compared to human performance. A KL-based information-gain (IG) analysis further shows that models fail to ask discriminating questions as the game progresses. To close the gap, we develop KB-Anchored Rule Induction (KARI), a pipeline based on LLMs that generates affordance rules grounded in evidence from knowledge bases (KBs). KARI improves open-source LLMs by up to 15.2 points, while the limited coverage of KBs hinders further gains. We release all our code and data at this https URL

[AI-40] SkillAudit: Ground-Truth-Free Skill Evolution via Paired Trajectory Auditing

Link: https://arxiv.org/abs/2606.14239
Authors: Haowen Gao, Haoran Chen, Can Wang, Shasha Guo, Liang Pang, Zhaoyang Liu, Huawei Shen, Xueqi Cheng
Subjects: Artificial Intelligence (cs.AI)
Comments: 20 pages, 5 figures

Abstract:Agent skills are structured procedural packages that guide frozen LLM agents in specialized workflows. Skills rarely remain sufficient after deployment: edge cases, API changes, and deployment constraints become visible only through use, making skill evolution a practical necessity. Existing methods depend on privileged feedback such as held-out validation scores, hidden test outcomes, or environment rewards – signals often unavailable when a practitioner has only a task description and workspace data. We introduce SkillAudit, a framework for evolving agent skills without ground-truth feedback. The key idea is paired trajectory auditing: at each iteration, the same task is executed with and without the candidate skill, isolating how the skill changes agent behavior without external labels. To turn behavioral differences into edit guidance, SkillAudit uses Process-Aligned Contrastive Evaluation (PACE), a cluster of evaluators that maps trajectory divergences to diagnostic signals linked to specific passages in the skill document. A structural verifier, compiled once from the task specification and then fixed, checks task constraints and rolls back harmful updates. SkillAudit routes edits through two pipelines: Refine removes noisy or irrelevant guidance from broadly useful skills, while Repair replaces passages that conflict with the task. Across 89 containerized tasks spanning 8 professional domains, SkillAudit achieves 73.9% average task reward, outperforming an agent without skills (40.9%) and the static expert skill (56.7%). These gains are obtained without accessing hidden tests, reference solutions, or external scoring functions during evolution.

[AI-41] When and How Severely: Scenario-Specific Safety Envelopes for Driving VLAs

Link: https://arxiv.org/abs/2606.14238
Authors: Abhinaw Priyadershi, Jelena Frtunikj
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Safety certification of Vision-Language-Action (VLA) driving planners under ISO 21448 (SOTIF) rests on an Operational Design Domain (ODD) specification that answers two complementary questions: when does the planner start to fail, and how severely does it fail once it does? We evaluate Alpamayo R1, a 10B-parameter open-weight driving VLA, on 15,968 (clip, attack) pairs. We find a conservative-aggregate gap: an aggregate safe threshold of $\sigma \leq 50$ under a 15% average displacement error (ADE) budget masks well-sampled scenarios that tolerate the top of the tested grid ($\sigma = 70$). A Gaussian Mixture Model (GMM) on the changed-explanation subset identifies six discrete severity bands (BIC-optimal $k=6$), so two perturbation conditions with the same mean error can differ materially in their share of high-severity (C4/C5) failures. Joining the two analyses on the same corpus surfaces a finding neither yields in isolation: the scenarios with the loosest noise thresholds are not those with the lowest high-severity rate: STOP_SIGNAL concentrates roughly $4\times$ the C4/C5 share of LANE_KEEPING despite tolerating a larger $\sigma$. A deployable SOTIF ODD specification for driving VLAs therefore requires a two-dimensional safety envelope, not a single aggregate value per hazard.

[AI-42] Selective Agentic Recovery for UAV Autonomy with a Persistent Mission Runtime

Link: https://arxiv.org/abs/2606.14219
Authors: Taewoo Park, Kyeonghyun Yoo, Seunghyun Yoo, Hwangnam Kim
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 17 pages, 2 figures. Preprint

Abstract:Agentic AI can support unmanned aerial vehicle (UAV) autonomy by providing high-level recovery reasoning when local waypoint- or setpoint-based execution encounters blocked passages, repeated no-progress behavior, or mission-level ambiguity. On physical UAVs, however, remote reasoning is most useful when it is invoked selectively, since each call introduces latency, resource cost, backend uncertainty, and a need to validate the returned decision. This paper presents Persistent Mission Runtime (PMR), a UAV recovery framework that keeps the mission loop and safety-critical execution local while using an external agentic reasoner only as an on-demand recovery module. The reasoner selects from predefined recovery skills, and each returned decision is parsed, verified, safety-filtered, and mapped to local executor actions before it can affect flight. PMR introduces learned Cognitive Value of Invocation (learned-CVI), a compact admission gate that estimates when remote agentic reasoning is likely to improve near-term mission progress enough to justify its operational cost. Across a fixed 400-run Gazebo/PX4 benchmark with eight scenarios, learned-CVI raises hard/ambiguous-regime success from 5.0% under local-only autonomy to 95.0%, outperforms one-shot and periodic reasoning baselines by 20.0 and 32.5 percentage points, and reduces remote-agent calls by 16.7% and logged tokens by 29.2% relative to a manually tuned rule-based invocation baseline.

[AI-43] Universal Manipulation Exoskeleton: Learning Compliant Whole-body Policies with Real-time Torque Feedback

Link: https://arxiv.org/abs/2606.14218
Authors: Litian Liang, Jingxi Xu, Xinda Qi, Yujun Cai, Houzhu Ding, Luqi Wang, Zhixin Sun, Jyh-Herng Chow, Ming Yang, Mark Cutkosky
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:For robots to work safely in household environments, they need to be compliant and react to torque and force feedback during contact. However, the majority of existing data collection pipelines still lack the ability to capture force and torque data for learning active compliant policies. In this paper, we present Universal Manipulation Exoskeleton (UME), an upper-limb exoskeleton that provides real-time haptic torque feedback while recording whole-arm configurations and joint torque signals for teleoperation. With transparent torque feedback, human operators can even unsheathe kinematically constrained objects while blindfolded. UME is low-cost, lightweight, and portable. Equipped with an embedded IMU, it enables teleoperation for mobile manipulation. With our proposed universal retargeting algorithm, UME can teleoperate a range of robots, including the 7DoF OpenArm, 7DoF Franka, and 6DoF X-ARM. We demonstrate that this combination of capabilities enables learning bimanual, whole-body, and active compliant policies that operate effectively in highly constrained spaces. The learned robust autonomous policies achieve high success rates across a variety of tasks, including long-horizon mobile manipulation, force-mediated box flipping, visually occluded box pushing, and space-constrained tabletop manipulation. Videos, code, and additional information can be found at this https URL.

[AI-44] Closing the Reflection Gap: A Free Calibration Bonus for Agentic RL

Link: https://arxiv.org/abs/2606.14211
Authors: Yinglun Zhu
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:LLMs are increasingly deployed as agents that interact with external environments and observe feedback such as execution results, error messages, and tool outputs. A well-functioning agent should be able to leverage this feedback to accurately assess its own performance. Yet we find a persistent reflection gap: LLM agents tend to mis-assess their own outputs after observing concrete environment feedback – even for questions they correctly answered – and standard RL barely helps due to a credit-assignment mismatch. To close this gap, we propose RefGRPO, a simple yet effective fix that augments standard RL algorithms with two key ingredients: a free calibration bonus computed by contrasting the agent’s own reflection with the actual outcome (requiring no additional reward model, LLM judge, or external annotation), and a dynamic schedule on its coefficient. Compared to standard RL baselines, our method simultaneously improves reflection calibration (e.g., reduces underconfidence rate $44.4\% \to 7.7\%$) and task accuracy (e.g., $75.1\% \to 76.5\%$) on text-to-SQL across five benchmarks. The resulting calibrated reflection turns the agent into its own verifier grounded in environment feedback, which further enables (i) better self-improvement that uses reflections as pseudo-rewards without outcome supervision, and (ii) more effective test-time selective prediction by committing only to rollouts flagged as correct.

[AI-45] From Prompts to Responses: Dual-Sided Data Leakage and Defense in Split Large Language Models ICML2026

Link: https://arxiv.org/abs/2606.14210
Authors: Zixuan Gu, Xiaojun Ye, Yang Liu
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 18 pages, Accepted at ICML 2026

Abstract:Large language models (LLMs) are increasingly deployed in privacy-sensitive domains, where users must balance the risk of data exposure through external APIs against the high computational cost of local deployment. Split learning has therefore emerged as a promising paradigm for LLM fine-tuning and inference under limited local resources. However, it introduces new privacy risks. Prior work primarily studies leakage of private input prompts, typically via inversion attacks on intermediate representations, while the potential for sensitive information leakage through generative response outputs remains largely unexplored. In this work, we unveil novel vulnerabilities of Split-LLM by presenting Patched Model Inversion with Dual-Sided Initialization (PIDI), a two-stage attack that simultaneously targets both private input prompts and output responses in Split-LLM settings. It combines dual-sided initialization with a patched inversion strategy to tackle long sequences, substantially outperforming prior inversion methods. To counter threats from both sides, we further propose the Adapter-based DualGuard with Mutual Information Defense (ADMI), which integrates an adapter-based local warmup strategy and mutual information regularization to provide a strong empirical privacy protection with minimal impact on task performance. Extensive experiments across diverse tasks and models demonstrate that ADMI effectively defends against PIDI and other state-of-the-art inversion attacks. Our code is publicly available at this https URL.

[AI-46] MeEvo: Metacognitive Evolution Combined with Natural Evolution for Automatic Heuristic Design

Link: https://arxiv.org/abs/2606.14202
Authors: Zishang Qiu, Xinan Chen, Rong Qu, Ruibin Bai
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) have advanced Automatic Heuristic Design (AHD) by enabling heuristic generation through reasoning and code synthesis. Existing LLM-based AHD architectures mainly follow two paradigms: Natural Evolution, which uses crossover and mutation to explore heuristic programs, and Metacognitive Evolution, which refines reasoning through reflection. However, Natural Evolution discards reasoning traces, weakening knowledge inheritance and exploitation, while Metacognitive Evolution lacks population-level recombination, limiting exploration and increasing the risk of premature convergence. These limitations reduce search efficiency, stability, and solution quality on complex problems. To address this gap, we propose MeEvo, a dual-layer AHD framework that cyclically couples Natural Evolution and Metacognitive Evolution. Natural Evolution explores heuristic code while recording reasoning traces, fitness values, and errors into a shared history; Metacognitive Evolution then reflects on this history to generate improved heuristics that re-enter the parent pool for the next cycle. This design enables population-driven exploration and reflection-driven refinement to reinforce each other. Experiments on five optimization problems with two LLM backbones show that MeEvo achieves stronger and more stable performance than existing LLM-based AHD architectures, especially on complex constrained tasks.

[AI-47] When Should Agent Trust Be Conditional? Characterizing and Attacking Skill-Conditional Reputation in Agent Swarms

Link: https://arxiv.org/abs/2606.14200
Authors: Yihan Xia, Taotao Wang
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 18 pages, 8 figures, 2 tables

Abstract:Open platforms increasingly route tasks among heterogeneous LLM agents–differing in base model, scaffold, and tool stack–whose competence varies sharply by skill: an agent excellent at one skill may be useless at another. The standard reputation approach summarizes each agent by a single global trust score, but that scalar is the wrong object here, because routing every task to the globally most-trusted agent leaves the value of specialization unclaimed. We study skill-conditional trust R(i | k)–the trust to place in agent i for a task requiring skill k, rather than one score per agent–and pose three falsifiable questions: when is conditioning worth it, how much cross-skill evidence should be borrowed, and whether that borrowing is safe. A controlled phase-diagram analysis answers the first two: conditional trust wins only in a specific regime–high agent heterogeneity, sparse per-skill evidence, and correlated skills–and the coupling strength beta that buys this data efficiency is dual-use, because the same cross-skill borrowing is also a laundering channel. On a public benchmark of 14 genuinely heterogeneous AppWorld agents, real pools land inside the beneficial regime–a small but genuine gain, with the per-skill best agent genuinely changing across skills. We then show that an attacker with cheap evidence in one skill and none in a target skill hijacks the conditional router, driving routing regret from 0 to 0.94 on a pool our zero-cost Conditional Information Value Test (CIVT) rates GREEN–while the ungated trust verdict it contaminates reads -0.06 instead of the honest +0.19. A zero-evidence gate bounds the attack but does not eliminate it; we characterize the residual cost under an explicit budget. We do not claim Sybil-resistance–we quantify the trade-off.

[AI-48] Robustness without Wrinkles: Parallel Simulation and Robust MPC for Certified Deformable Manipulation

Link: https://arxiv.org/abs/2606.14188
Authors: Wei-Chen Li, Jeffrey Fang, Sasanka Polisetti, Yuexi Song, Glen Chou
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
Comments:

Abstract:We present CORD-SLS, a real-time control method for safe deformable object manipulation, with a focus on ropes and cloth. At its core is a GPU-parallel differentiable simulator with contact smoothing which enables efficient gradient-based planning through intermittent contact. To robustly satisfy constraints under model and sensing uncertainty, we develop a real-time, GPU-parallel output-feedback robust model predictive control (MPC) algorithm that plans with this simulator. We further show that the simulator accelerates model-based RL for training neural manipulation policies. To improve real-world robustness, we use conformal prediction to calibrate visual-feedback and perception-error bounds for MPC, producing reachable tubes that enable high-probability safe control. We evaluate CORD-SLS on high-dimensional, contact-rich rope and cloth manipulation tasks in simulation and hardware, including obstacle avoidance, routing, folding, and smoothing. Across settings, CORD-SLS achieves millisecond-speed planning, exceeding baselines in safety, speed, and task success.

[AI-49] VeriGeo: Controllable Geometry Question Generation with Numerical and Analytical Verification

Link: https://arxiv.org/abs/2606.14176
Authors: Xiaoxian Duan, Zequn Liu, Yingce Xia
Subjects: Artificial Intelligence (cs.AI)
Comments: 32 pages, 4 figures, 9 tables

Abstract:Geometry problem generation is useful for AI-assisted education and multimodal mathematical reasoning, but reliable synthesis remains difficult because the problem statement, diagram, constraints, and solution should be mutually consistent. Existing methods often trade off controllability and reliability: seed-based rewriting is flexible but weakly verifiable, whereas diagram-first construction improves validity but is less suited to arbitrary user-specified constraints. We introduce VeriGeo, a controllable geometry generation framework grounded in executable reasoning traces. Given user constraints such as target concepts and difficulty, an Author agent generates a problem and diagram, and a Solver agent produces a proof-aligned solution. Both agents use a shared action sequence that connects natural language, diagrams, geometric constraints, and proof steps into a verifiable representation. A three-stage pipeline checks numerical consistency, analytical realizability, and global consistency, using verification-guided reflection to repair recoverable failures and reject unrecoverable ones. Across five LLM backbones, raw generations frequently fail these checks, while VeriGeo repairs a substantial fraction of the invalid attempts. Supervised fine-tuning on 8.7k examples generated by VeriGeo achieves the best reported GeoQA performance among end-to-end multimodal LLM-based solvers, and obtains strong results on PGPS9K and MathVista-GPS, demonstrating the effectiveness of verified synthetic data for improving multimodal geometry reasoning.

[AI-50] Learning Urban Access Costs from Origin-Destination Flows via Inverse Optimal Transport

Link: https://arxiv.org/abs/2606.14157
Authors: Paula Joy B. Martinez
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Oral Presentation. 2026 International Conference on Urban AI

Abstract:Cities deliver basic services through mixed public-private facility networks, including schools, clinics, transit providers, and subsidized service points. In these systems, planners often observe where households go, but not the latent cost function through which they trade off factors such as distance, price, and institutional access. We study this urban problem through school choice in the Philippines, where the country’s largest national education subsidy is intended to redirect learners from congested public schools to participating private schools. Treating school-to-school enrollment flows as an entropic optimal transport plan, we recover latent choice costs using two complementary inverse optimal transport models: an interpretable distance-banded model with a subsidy term, and a neural cost model trained through a differentiable Sinkhorn forward pass. Applied to 283,016 learner trips across 23,820 observed flows in the most populated region, the framework estimates a subsidy-equivalent distance, $\lambda^{(k)}$, interpreted as the kilometers of perceived travel cost offset by the subsidy. The case demonstrates how administrative origin-destination data can be transformed into interpretable planning metrics for accessibility-aware subsidy design, facility siting, and urban service allocation.
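
Editor's sketch (not from the paper): a minimal plain-Python Sinkhorn forward pass for entropic optimal transport, the building block the abstract mentions. The cost matrix, masses, `eps`, and the helper names `sinkhorn` and `double_center` are illustrative assumptions; the paper's distance-banded and neural cost models are not reproduced here. The sketch also checks a standard identifiability fact behind inverse OT: `-eps * log(plan)` determines the cost only up to additive row and column terms, which double-centering removes.

```python
import math

def sinkhorn(cost, row_mass, col_mass, eps=0.5, iters=500):
    """Entropic OT plan P = diag(u) K diag(v) with K = exp(-cost / eps)."""
    n, m = len(cost), len(cost[0])
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the target marginals.
        u = [row_mass[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_mass[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

def double_center(M):
    """Subtract row and column means (and add back the grand mean)."""
    n, m = len(M), len(M[0])
    row = [sum(r) / m for r in M]
    col = [sum(M[i][j] for i in range(n)) / n for j in range(m)]
    tot = sum(row) / n
    return [[M[i][j] - row[i] - col[j] + tot for j in range(m)] for i in range(n)]

# Toy problem (made-up numbers): 2 origins, 3 destinations.
cost = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.5]]
row_mass = [0.6, 0.4]
col_mass = [0.5, 0.3, 0.2]
eps = 0.5

plan = sinkhorn(cost, row_mass, col_mass, eps)
# Cost implied by the observed plan, known only up to row/column offsets.
implied = [[-eps * math.log(p) for p in r] for r in plan]
```

Comparing `double_center(implied)` with `double_center(cost)` shows which part of the cost the flows alone can pin down; any additional structure (for example a subsidy term) has to come from the parametric form of the cost model.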

[AI-51] Learning High Coverage Discriminative Parsimonious Rulesets

Link: https://arxiv.org/abs/2606.14156
Authors: Mariamma Antony, Raman Sankaran, Chiranjib Bhattacharyya, Uma Satya Ranjan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Learning systems based on IF-THEN rule representations readily offer interpretability, making them a crucial focus in contemporary AI research. A key objective for such rule sets is to achieve both high discriminative power and interpretability. While existing state-of-the-art algorithms implicitly prioritize predictive accuracy, they often fall short on one or more quality metrics that ensure interpretability, such as coverage and parsimony of rule sets. Motivated by this, this paper proposes the development of CDPR, which aims to create highly accurate and interpretable rule sets for classification problems. To the best of our knowledge, this represents the first attempt to establish such an approach. In this study, we introduce two algorithms rooted in submodular maximization, which not only provide provable guarantees on coverage but also yield rule sets that are both discriminative and parsimonious. We empirically demonstrate that rule sets learned through our approaches achieve higher accuracy and interpretability and have more than a 2.5-fold improvement in average coverage rates when compared to the next best algorithm.

[AI-52] Recovering Stranded Discrimination in Knowledge Tracing: Per-Item Bias Correction via Empirical-Bayes Shrinkage KDD2026 ECML

Link: https://arxiv.org/abs/2606.14123
Authors: Xiaoran Yan, Cheng Tang, Atsushi Shimada
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 25 pages, 3 figures. Accepted at ECML PKDD 2026 (Research Track). Code: this https URL

Abstract:Deployed knowledge-tracing models are typically frozen after training, yet systematic per-item logit bias arises, from limited per-item expressivity in backbone architectures and from post-deployment shifts in item properties, degrading prediction quality. Global post-hoc calibrators such as Platt scaling, temperature scaling, and isotonic regression improve probability estimates but leave discriminative ability, as measured by AUC, unchanged. This AUC invariance is a structural consequence of monotone score-only transforms; recovering the stranded discrimination requires conditioning on item identity. We propose SLC (State-space Logit Correction), which converts binary observations to Gaussian pseudo-observations via Laplace/IRLS, applies empirical-Bayes shrinkage through a Kalman smoother, and fits an offset-Platt link. The state-space formulation also yields a detectability bound that characterizes the Bernoulli information floor, explaining why temporal tracking provides no benefit at current data densities. Across four datasets, five backbones, and three seeds, SLC improves AUC on all four datasets and NLL on three, with the advantage concentrating on sparse items. Cross-domain controls suggest that the same phenomenon can arise beyond education when the deployed backbone leaves entity-level bias.
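
Editor's sketch (not from the paper): a small plain-Python illustration of the AUC-invariance point in the abstract. A strictly monotone, score-only transform (here temperature scaling followed by a sigmoid) cannot change the ranking of predictions, so AUC stays fixed, whereas an offset that depends on item identity can reorder them. The data, the offsets, and the helper names are made up for illustration; this is not the SLC method itself.

```python
import math

def auc(scores, labels):
    """Chance that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy logits from a frozen model: item "a" is over-predicted, item "b" under-predicted.
logits = [0.5, 1.5, 1.0, -1.5, -0.5, -0.7]
items = ["a", "a", "a", "b", "b", "b"]
labels = [0, 1, 0, 0, 1, 1]

# Global calibrator: temperature scaling, a monotone transform of the score only.
temperature = 2.0
scaled = [sigmoid(z / temperature) for z in logits]

# Per-item correction: a hypothetical offset keyed on item identity.
offset = {"a": -1.0, "b": 1.0}
corrected = [z + offset[i] for z, i in zip(logits, items)]

raw_auc = auc(logits, labels)
scaled_auc = auc(scaled, labels)
corrected_auc = auc(corrected, labels)
print(raw_auc, scaled_auc, corrected_auc)
```

In this toy case the global calibrator leaves AUC untouched while the item-conditioned offset raises it, which is the gap the abstract calls stranded discrimination.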

[AI-53] FactoryLLM: A Safe and Open-Source AI Playground for Evaluating LLMs in Smart Factories

Link: https://arxiv.org/abs/2606.14119
Authors: Yash Pulse, Yong-Bin Kang, Abhik Banerjee, Abdur Forkan, Prem Prakash Jayaraman
Subjects: Artificial Intelligence (cs.AI)
Comments: 6 pages, 3 figures, IEEE INDIN 2026

Abstract:Fault diagnostics and recovery in smart factories are challenging because critical information is dispersed across manuals of multiple machines which are interconnected through the manufacturing process. Large Language Models (LLMs) can provide a promising approach. In this paper, we propose FactoryLLM, a safe and open-source AI playground designed for evaluating different LLM-based retrieval-augmented generation (RAG) models by analysing documents from multiple machines across the manufacturing process. FactoryLLM enables the user to configure the LLM, and assess performance when reasoning over multiple documents, through a dual evaluation setup using both RAGAS and NVIDIA’s LLM-as-a-Judge metrics. FactoryLLM is safe because it allows users to run local or open-source LLMs without sharing sensitive industrial data, providing a controlled environment for experimentation. We demonstrate the efficacy of FactoryLLM through a case study which involves an Autonomous Intelligent Vehicle and its Mobile Planner software, evaluating three LLMs across 30 maintenance queries derived from approximately 600 pages of cross-machine documentation. The results suggest that FactoryLLM is effective in cross-machine document reasoning: every model achieved a groundedness score above 0.88. The full code and documentation for the community to test FactoryLLM with their manufacturing-specific scenarios are publicly available.

[AI-54] Numbers Already Carry Their Own Embeddings NEURIPS2025

Link: https://arxiv.org/abs/2606.14108
Authors: Suhyun Bae, Donghun Lee
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Presented at the MATH-AI Workshop at NeurIPS 2025

Abstract:We introduce Adelic operation-preserved embeddings (AOE), a training-free representation that captures both a number’s real value and its modular (p-adic) signatures. This construction preserves additive and multiplicative structure by design, turning numerical input into embeddings that “speak in the language of mathematics.” Unlike prior approaches that rely on task-specific retraining, AOE is plug-and-play and drops seamlessly into existing architectures. On algebraic combinatorics benchmarks, it delivers consistent gains, including the first-ever perfect accuracy on the Weaving Pattern task, while suggesting a principled path forward for overcoming the long-standing “number problem” in AI.
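
Editor's sketch (not the authors' construction): one simple way to see how a training-free number representation can preserve addition and multiplication by design. An integer is mapped to its real value plus its residues modulo a few primes; because reduction modulo p is a ring homomorphism, combining embeddings componentwise agrees with embedding the sum or product. The prime set and the function names are illustrative assumptions.

```python
PRIMES = (2, 3, 5, 7)  # illustrative choice, not taken from the paper

def embed(n):
    """Map an integer to (real value, residues modulo each prime)."""
    return (float(n), tuple(n % p for p in PRIMES))

def add_emb(x, y):
    """Add two embeddings componentwise (residues are reduced mod p)."""
    return (x[0] + y[0], tuple((a + b) % p for a, b, p in zip(x[1], y[1], PRIMES)))

def mul_emb(x, y):
    """Multiply two embeddings componentwise (residues are reduced mod p)."""
    return (x[0] * y[0], tuple((a * b) % p for a, b, p in zip(x[1], y[1], PRIMES)))

# Operating on embeddings gives the same result as embedding the arithmetic result.
a, b = 17, 25
sum_ok = embed(a + b) == add_emb(embed(a), embed(b))
prod_ok = embed(a * b) == mul_emb(embed(a), embed(b))
print(sum_ok, prod_ok)
```

The real component here is exact only for integers small enough to be represented without rounding as floats; the residue components stay exact for integers of any size.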

[AI-55] Rethinking Backdoor Adversarial Unlearning through the Lens of Catastrophic Forgetting in Continual Learning CCS2026

Link: https://arxiv.org/abs/2606.14078
Authors: Zhenqian Zhu,Yamin Hu,Yujiang Liu,Luping Wei,Wenbo Hou,Bin Li,Haodong Li,Wenjian Luo
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by ACM CCS 2026

Abstract:Existing studies reveal that current backdoor defenses exhibit limited robustness and often fail against specific types of attacks. More concerningly, prevailing safety tuning strategies tend to provide only superficial safety protection, as they fall short of completely eliminating the backdoor effects. In this work, we present a novel formulation of backdoor learning and unlearning as a sequential, three-stage process from a continual learning perspective. Within this framework, we formally define complete backdoor unlearning and further derive the necessary conditions for achieving it based on the mechanism of catastrophic forgetting. Guided by these insights, we propose Blind Inversion-Backdoor Adversarial Unlearning (BI-BAU), which formulates the generation of adversarial examples satisfying the unlearning conditions as a blind inversion problem. We solve this by integrating the bi-level optimization process of adversarial training into an Expectation-Maximization (EM) algorithm framework to optimize the maximum a posteriori (MAP) objective. Furthermore, BI-BAU is extended to untargeted adversarial scenarios with unknown target classes, as well as to multi-modal contrastive learning tasks, enhancing its applicability to real-world deployment scenarios where pre-trained models may be compromised. Extensive experiments demonstrate that our method exhibits general applicability across a wide spectrum of backdoor attacks and can effectively and thoroughly eliminate the backdoor effects from a backdoor model.

[AI-56] Applicability Condition Extraction for Therapeutic Drug-Disease Relations

Link: https://arxiv.org/abs/2606.14031
Authors: Guanting Luo,Noriki Nishida,Yuji Matsumoto,Yuki Arase
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Identifying the conditions under which a drug has a therapeutic effect on a target disease is crucial for clinical decision-making support. However, most existing biomedical information extraction methods have focused on identifying only relations between drugs and diseases, while largely overlooking the context-specific conditions under which such relations apply. To address this problem, we introduce the task of applicability condition extraction for therapeutic drug–disease relations from biomedical research literature. We create the first dataset of manually annotated triples of drugs, diseases, and applicability conditions on biomedical paper abstracts, covering 1,119 drug-disease pairs. Using this dataset, we systematically evaluate the performance of a range of existing methods. In addition, we propose a new method that enhances LoRA to consider relations between drugs and diseases. Our method consistently outperforms strong baselines across different evaluation settings. The source code and dataset of this paper can be obtained from: this https URL

[AI-57] Formalizing Numerical Analysis: An Agent Pipeline and Quality Audit Beyond Kernel Acceptance

Link: https://arxiv.org/abs/2606.14000
Authors: Theodore Meek,Siyuan Ge,Di Qiu Xiang,Simon Chess,Vasily Ilin
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent work has demonstrated that coding agents can formalize entire advanced mathematics textbooks in Lean 4, yet existing efforts concentrate on branches of mathematics already well-represented in mathlib and measure success solely through kernel acceptance. We address both limitations by applying a coding agent to formalize Numerical Methods for Ordinary Differential Equations, a textbook in numerical analysis that is largely absent from mathlib, stressing the agent’s capacity to develop new theory from scratch. We further introduce a systematic, reproducible three-dimensional framework for evaluating the quality of agent-produced formalizations beyond compilation: semantic correctness, Mathlib reuse, and cross-file reuse via LLM-as-judge methods. Applying this framework to our own formalization and to the released outputs of RepoProver and M2F, we uncover recurring unfaithful formalization patterns, including incomplete multi-part statements, added weakening hypotheses, and parameter restrictions, that kernel acceptance entirely obscures. Our results suggest that compilation-based metrics substantially overstate formalization quality, and we provide a reproducible audit methodology to support more rigorous evaluation of future autoformalization systems.

[AI-58] Hidden in Plain Sight: Benchmarking Agent Safety Against Decomposition Attacks with DECOMPBENCH

Link: https://arxiv.org/abs/2606.13994
Authors: Vikhyath Kothamasu,Virginia Smith,Chhavi Yadav
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:LLM-based agents are becoming increasingly capable and widely deployed, creating growing incentives for adversarial misuse in the real world. A key emerging threat is decomposition attacks, in which a harmful task is broken into simpler, benign subtasks that evade safety mechanisms when executed separately but cumulatively fulfill the malicious intent. Although recent benchmarks assess agent safety in multi-turn and multi-tool-use settings, they do not explicitly capture this form of decompositional misuse and may not represent realistic adversarial execution flows. To this end, we introduce DeCompBench, a benchmark designed specifically to evaluate agentic safety under decomposition attacks. DeCompBench is created with a decomposition-by-design principle using a graphical framework and enables harmful task decomposition into individually benign and executable subtasks with realistic workflows. Our experiments using a custom decomposer show that state-of-the-art agents exhibit high refusal rates on monolithic harmful tasks, but significantly lower refusal rates on their decomposed variants, while often inadvertently fulfilling the adversarial objectives. These findings underscore the need for safety evaluations against decomposition attacks and corresponding defenses. Our dataset is publicly available and can be found at this https URL.

[AI-59] Mask Sample Revise: A Revisable CTMC Inference Stack for Guided Discrete Flow Matching Text-to-Speech

Link: https://arxiv.org/abs/2606.13989
Authors: Alef Iury Siqueira Ferreira,Lucas Rafael Stefanel Gris,Luiz Fernando de Araújo Vidal,Frederico Santos de Oliveira,Christopher Dane Shulby,Anderson da Silva Soares,Arlindo Rodrigues Galvão Filho
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent alignment-free non-autoregressive (NAR) text-to-speech (TTS) models formulate synthesis as a conditional infilling task, bypassing explicit duration predictors and external aligners. When speech is represented with neural codec tokens, the infilling problem becomes discrete, making Discrete Flow Matching (DFM), a Continuous-Time Markov Chain (CTMC) framework for discrete generation, a natural fit. However, inference-time control for stable low-step conditional infilling remains underexplored. We propose Mask, Sample, Revise, an inference-time CTMC stack for alignment-free DFM-TTS. The stack combines predictor-free guidance to strengthen text conditioning, prompt-matched conditional coupling to align the probability path with the acoustic prompt, and SC-ReMask, a schedule-constrained remasking mechanism that introduces token-to-mask transitions so early de-masking decisions can be revised. These components require no post-hoc fine-tuning and operate in a single tau-leaping sampler. Controlled ablations show that this stack improves intelligibility and robustness in the low-NFE prompted setting, outperforming unguided and guidance-only samplers with substantially more steps.

[AI-60] STREAM: Multi-Tier LLM Inference Middleware with Dual-Channel HPC Token Streaming

Link: https://arxiv.org/abs/2606.13968
Authors: Anas Nassar,Steve Mohr,Leonard Apanasevich,Himanshu Sharma
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: 6 pages, 1 figure, PEARC '26

Abstract:Researchers and practitioners working with large language models face a fragmented landscape: local models are free and private but hardware limits the model size and context windows a researcher can use; institutional HPC centers offer powerful GPU resources at no marginal cost and keep data within institutional boundaries, but operate behind firewalls and are designed for batch jobs rather than interactive use; commercial cloud APIs provide frontier-model quality on demand but impose significant cost and data retention policies unsuitable for sensitive research data. No existing system unifies all three. STREAM (Smart Tiered Routing Engine for AI Models) addresses this gap with four contributions: (1) a three-tier routing architecture combining local, HPC, and cloud inference with a local LLM-based complexity judge; (2) a dual-channel HPC streaming architecture that separates the Globus Compute control plane (authentication and job dispatch) from a WebSocket relay data plane (token delivery), enabling sub-second TTFT (0.54 s median, 21.1x over batch mode’s 11.40 s) through institutional firewalls without VPN or firewall rule changes, with end-to-end AES-256-GCM encryption ensuring the relay operator cannot read token payloads; (3) tier-aware context summarization that prevents long conversations from forcing simple queries onto expensive tiers; and (4) an HPC-as-API proxy mode that exposes HPC inference as an OpenAI-compatible endpoint callable from any standard client with no HPC expertise, a deployment pattern made practical only by the sub-second TTFT of contribution (2). Llama 3.2 3B achieves 85.1% free-tier retention on a 1,200-query benchmark spanning ten domains. Measured TTFT: 0.26 s local, 0.54 s HPC (relay), 1.68 s cloud.

[AI-61] Minim: Privacy-Aware Minimal View for Agents via Trusted Local Sanitization ICML2026

Link: https://arxiv.org/abs/2606.13949
Authors: Hexuan Yu,Chaoyu Zhang,Heng Jin,Shanghao Shi,Ning Zhang,Y. Thomas Hou,Wenjing Lou
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted at ICML 2026 (43rd International Conference on Machine Learning, Seoul, South Korea). Code available at this https URL

Abstract:Modern LLM-powered autonomous agents increasingly rely on rich user interface (UI) state observations to achieve reliable action grounding in complex digital environments. However, many deployments transmit the full UI state to remote inference servers even when most elements are irrelevant to the current task, which can leak sensitive but unnecessary context such as authentication codes, private notifications, and background application states. We propose MINIM, a trusted local broker that performs privacy-aware minimization on the client side before any observation leaves the device. Grounded in Contextual Integrity (CI), MINIM learns a dual-score representation for each UI element by predicting an inherent sensitivity score (s) and a task-conditioned necessity score (n). These scores drive a ternary disclosure policy that keeps essential elements, abstracts sensitive attributes when needed, and removes task-irrelevant content. We optimize a CI-aware objective that penalizes necessity errors more strongly on high-risk content, enabling aggressive pruning while preserving task-critical information. Experiments on real-world UI observations derived from WebArena show that MINIM substantially reduces task-irrelevant sensitive leakage while preserving task-critical semantic context and the interactive affordances required for reliable agent actions.
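
The ternary disclosure policy described above can be pictured with a small sketch. It assumes each UI element already carries a sensitivity score s and a necessity score n in [0, 1]; the `UIElement` class, the `disclose` function, and both thresholds are hypothetical names and values introduced here, not taken from the paper, which learns its scores with a CI-aware objective.

```python
from dataclasses import dataclass


@dataclass
class UIElement:
    text: str
    sensitivity: float  # s in [0, 1]; assumed to come from a learned scorer
    necessity: float    # n in [0, 1]; assumed to be conditioned on the current task


def disclose(el: UIElement, s_thresh: float = 0.5, n_thresh: float = 0.5) -> str:
    """Return 'keep', 'abstract', or 'remove' (both thresholds are assumptions)."""
    if el.necessity < n_thresh:
        return "remove"    # task-irrelevant content never leaves the device
    if el.sensitivity >= s_thresh:
        return "abstract"  # needed but sensitive: send a redacted form instead
    return "keep"          # needed and low-risk: send as-is
```

Checking necessity first means that a highly sensitive but irrelevant element, such as an authentication code in a background notification, is removed outright rather than abstracted.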

[AI-62] Adversarial Concept Search: Predicting Compositional Errors From Feature Geometry

Link: https://arxiv.org/abs/2606.13934
Authors: Jennifer Meng Lu,Ruochen Zhang,Isabelle Lee,David Alvarez-Melis,Ellie Pavlick,Naomi Saphra
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Humans cannot always intuit what scenarios are most challenging to LLMs. Hoping to capture challenging edge cases, developers either design problems to be difficult for humans or curate extensive benchmarks. What if we could instead anticipate which scenarios a model will fail on? In this paper, we use an LLM’s representational geometry to predict which concept combinations it will fail on. We attribute this compositional failure to interference between salient features. In tasks that require systematic composition (toy programmatic settings, multihop reasoning, multilingual factual recall), we find that when a pair of concepts is encoded near-orthogonally, the model reliably composes them. When their linear encodings are close, producing interference, the model fails to compose them. Our method reliably anticipates failure modes across different compositional tasks, without evaluating specific inputs. These results lay the groundwork to use representational geometry to identify high-risk examples, construct targeted stress tests, and provide a scalable foundation for active learning in real-world deployment.
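
The geometric intuition (near-orthogonal encodings compose, close encodings interfere) can be sketched with a plain cosine-similarity check between two concept direction vectors. The function names and the 0.5 cutoff below are assumptions for illustration; the abstract does not say how the directions are extracted or which threshold, if any, the authors use.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def interference_risk(u, v, threshold=0.5):
    """Flag a concept pair as high-risk when its directions are far from orthogonal.

    The threshold is a hypothetical value, not one reported in the paper.
    """
    return abs(cosine(u, v)) >= threshold
```

A check of this kind needs only the two concept vectors, which matches the abstract's point that failures can be anticipated without evaluating specific inputs.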

[AI-63] Sorries Are Not the Hard Part: An Expert-Review Case Study of a Semi-Autonomous Formalization

Link: https://arxiv.org/abs/2606.13925
Authors: Vasily Ilin,Brian Nugent
Categories: Artificial Intelligence (cs.AI); Algebraic Geometry (math.AG)
Comments:

Abstract:Large language models can often close proof gaps in interactive theorem provers, but a verified theorem is not the same thing as a reusable library contribution. We study this distinction through a detailed case study: a semi-autonomous formalization of Grothendieck’s vanishing theorem. The initial version compiles with no sorries, but an expert review found serious problems in definitions, theorem generality, file organization, and the API. We then ran a review-driven refactor and compression process and obtained a second expert review. The before-and-after comparison shows a sharp split: agents adapted well to local, mechanically checkable feedback, but remained weak at choosing definitions and designing APIs. We argue that autoformalization should be evaluated not only by closed sorries, but by whether the resulting formalization survives expert review.

[AI-64] A Multi-Agent AI System for Automated High School Transcript Processing: Collaborative Document Analysis at Scale

Link: https://arxiv.org/abs/2606.13916
Authors: Ben Torkian,Jun Zhou
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Each year, college admissions offices face an overwhelming challenge: processing millions of high school transcripts, each with unique formats, grading systems, and layouts. This manual process creates operational bottlenecks that delay admissions decisions and consume valuable resources. We present a transformative solution through a multi-agent AI system where specialized agents collaborate to automatically process diverse transcript formats through intelligent coordination and communication. Our multi-agent architecture consists of three specialized agents (a Pattern Recognition Agent for format-specific parsing, a Semantic Analysis Agent for natural language understanding, and a Vision Intelligence Agent for multimodal document analysis), coordinated by an Orchestration Agent that manages agent communication and result reconciliation. Our key innovation lies in agent-based quality control using GPA extraction as a coordination signal, ensuring reliable agent collaboration and preventing critical information loss. When evaluated on 40 real-world transcripts from high schools across 13 U.S. states, our agent system successfully processed every document, achieving 96.7% accuracy compared to expert manual review while maintaining practical processing speeds of 45 seconds per transcript. This work demonstrates how multi-agent coordination can solve complex document processing challenges, offering institutions a scalable, collaborative AI solution that preserves accuracy while dramatically reducing processing time.

[AI-65] Crypto x AI AI x Crypto: A Survey

Link: https://arxiv.org/abs/2606.13892
Authors: Sarah Allen,Pranay Anchuri,James Austgen,Maryam Bahrani,Samuel Breckenridge,Aaron Buchwald,Christian Cachin,Andrés Fábrega,Jared Fernandez,James Hsin-yu Chiang,Marwa Mouallem,Roi Bar-Zur,Neil DeSilva,Ittay Eyal,Giulia Fanti,Ari Juels,Andrew Miller,Christian Sillaber,Dani Vilardell,Pramod Viswanath,Wenhao Wang,Matt Weinberg,Sen Yang,Jianzhu Yao,Fan Zhang
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:The intersection of crypto x AI is spawning papers, products, online posts, and companies. All the surrounding buzz, though, obscures what exactly has been done, what the opportunities and challenges are, and what open questions deserve attention. This survey paper asks what AI can do for blockchain-based technologies (broadly construed as “crypto”) (crypto x AI), and vice versa (AI x crypto). We systematize existing work, summarize key takeaways, highlight open research questions, and offer a perspective on pervasive industry misconceptions, concluding that AI and crypto are still in the very early stages of meaningful integration.

[AI-66] Capability Minimization as a Safety Primitive: Risk-Aware Causal Gating for Least-Privilege LLM Agents

Link: https://arxiv.org/abs/2606.13884
Authors: Laxmipriya Ganesh Iyer,Rahul Suresh Babu
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Modern decision systems increasingly rely on learned components whose outputs may be confident yet wrong, exposing downstream actions to costly errors. We introduce Risk-Aware Causal Gating (RACG), a framework that decides whether to act on, defer, or abstain from a model’s prediction by combining causal effect estimation with calibrated risk control. RACG models the causal pathway from candidate actions to outcomes and gates each decision according to an estimated counterfactual risk rather than raw predictive confidence. To make gating reliable, we derive distribution-free bounds on the probability of acting under high-risk conditions and show how these bounds translate into operating thresholds that satisfy user-specified safety constraints. We further propose an adaptive gating policy that adjusts to distribution shift by monitoring discrepancies between predicted and realized outcomes, tightening the gate when causal assumptions appear violated. Across simulated interventions and real-world decision benchmarks, RACG reduces high-cost errors substantially while preserving most of the utility of an ungated policy, and it outperforms confidence-based and selective-prediction baselines at matched abstention rates. Our results indicate that explicitly separating causal risk from predictive uncertainty yields decision systems that are both safer and more transparent, offering a principled mechanism for trustworthy automation in high-stakes settings.

[AI-67] Hyperdimensional computing for structured querying on tabular data embeddings

Link: https://arxiv.org/abs/2606.13871
Authors: Sebastián Bugedo,Stijn Vansummeren
Categories: Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: 15 pages with appendices. 8 figures. Under review

Abstract:Tabular data embeddings have become a cornerstone of data profiling and data integration pipelines, enabling tasks such as entity annotation and resolution; schema matching; column type detection; and table search, among others. Existing approaches embed rows, columns, or entire tables into a vector space and rely on nearest-neighbor search to retrieve candidate matches. A fundamental limitation of current embedding methods is the lack of interpretable similarity scores: the concrete similarity value between a query and its nearest neighbour carries no intrinsic meaning, making it impossible to determine whether that neighbour is a true match or simply the least-dissimilar item in a corpus that contains no valid answer. This inability to set principled thresholds for retrieval undermines practical deployment, particularly for zero-match detection. We investigate the use of HyperDimensional Computing (HDC), specifically the Holographic Reduced Representations (HRR) model, as a framework for tabular row embeddings when the retrieval task corresponds to answering structured select-project queries in vector space. Exploiting the algebraic properties of HDC operations, we derive closed-form expected similarity values for both equality and non-equality retrieval predicates, which converge to interpretable values as dimensionality increases, and use these to identify suitable retrieval thresholds. We evaluate HDC against EmbDI, a graph-based baseline, on two real-world datasets across varying table sizes and predicate lengths. Our results show that HDC matches or outperforms EmbDI for row retrieval across all configurations, handles non-equality predicates more robustly, and achieves perfect attribute projection accuracy at sufficient dimensionality – while uniquely enabling reliable identification of zero-match predicates through its principled thresholds.
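
To make the HRR mechanics concrete, here is a minimal pure-Python sketch of the standard operations: circular convolution for binding an attribute to a value, superposition for building a row, and the involution (approximate inverse) for projecting an attribute back out. The attribute and value names, the dimensionality, and the `demo` helper are assumptions for illustration; the paper's actual encoding, thresholds, and closed-form similarity values are not reproduced here.

```python
import math
import random


def rand_vec(n, rng):
    # HRR convention: i.i.d. Gaussian entries with variance 1/n
    return [rng.gauss(0.0, 1.0 / math.sqrt(n)) for _ in range(n)]


def bind(a, b):
    # Circular convolution: associates an attribute vector with a value vector
    n = len(a)
    return [sum(a[j] * b[(i - j) % n] for j in range(n)) for i in range(n)]


def involution(a):
    # Approximate inverse under circular convolution: a*[i] = a[-i mod n]
    n = len(a)
    return [a[(-i) % n] for i in range(n)]


def superpose(*vectors):
    # Elementwise sum: stores several bound pairs in one row vector
    return [sum(xs) for xs in zip(*vectors)]


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / math.sqrt(sum(x * x for x in u) * sum(y * y for y in v))


def demo(n=256, seed=0):
    """Encode one hypothetical row and project out its 'city' attribute."""
    rng = random.Random(seed)
    attrs = {name: rand_vec(n, rng) for name in ("city", "year", "name")}
    vals = {name: rand_vec(n, rng) for name in ("Paris", "1999", "Ada")}
    row = superpose(
        bind(attrs["city"], vals["Paris"]),
        bind(attrs["year"], vals["1999"]),
        bind(attrs["name"], vals["Ada"]),
    )
    decoded = bind(involution(attrs["city"]), row)  # noisy copy of the 'city' value
    return cosine(decoded, vals["Paris"]), cosine(decoded, vals["Ada"])
```

Calling `demo()` returns the similarity of the decoded vector to the true value and to an unrelated value. The first should be clearly positive and the second close to zero, and that gap is what makes a fixed retrieval threshold meaningful.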

[AI-68] An integrated interpretable control effectiveness learning and nonlinear control allocation methodology for overactuated aircrafts

Link: https://arxiv.org/abs/2606.13794
Authors: Umut Demir,Aamir Ahmad,Walter Fichter
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Abstract:Nonlinear dynamics and the strong couplings that arise between multiple effectors undermine the assumptions behind conventional, linear control allocation techniques. When flight enters regimes where nonlinear effects dominate, linear allocators exhibit reduced accuracy due to increased model mismatch, which subsequently degrades performance and robustness of the flight control system. High fidelity onboard models and black box data driven approaches can recover accuracy across the flight envelope, but respectively impose computational burdens prohibitive for real time allocation and sacrifice the interpretability required for verification and fault diagnosis. This paper addresses these limitations by learning an explicit, physics constrained analytical model of the control effectiveness mapping from representative flight data using Sparse Identification of Nonlinear Dynamics. The resulting mapping is compact, interpretable, and admits analytical derivatives, enabling efficient computation within nonlinear solvers that additionally incorporate actuator dynamics, without requiring an onboard model. An online adaptation mechanism monitors prediction residuals and refreshes the model when significant plant changes are detected, providing graceful reconfiguration under actuator failures and varying operating conditions. The methodology is evaluated on a high fidelity nonlinear benchmark aircraft across a range of aggressive maneuvers, achieving accuracy comparable to a full nonlinear onboard model while substantially reducing computational cost relative to established baselines.

[AI-69] MA-ProofBench: A Two-Tiered Evaluation of LLMs for Theorem Proving in Mathematical Analysis

Link: https://arxiv.org/abs/2606.13782
Authors: Lushi Pu,Weiming Zhang,Xinheng Xie,Zixuan Fu,Bingxiang He,Hongya Lyu,Xin Li,Jie Zhou,Yudong Wang
Categories: Artificial Intelligence (cs.AI)
Comments: 19 pages, 4 figures, 4 tables

Abstract:Large Language Models (LLMs) have made notable progress in automated theorem proving, yet existing formal benchmarks remain limited in both mathematical coverage and difficulty. Most are concentrated in areas that are easier to formalize, such as algebra and elementary number theory, and provide limited coverage of subfields that require deeper reasoning, including mathematical analysis. To address this gap, we introduce MA-ProofBench, to the best of our knowledge, the first formal theorem-proving benchmark dedicated to Mathematical Analysis. The benchmark contains 200 formalized theorems covering 6 core topics and 27 subcategories, including measure and integration theory, complex analysis, and functional analysis. The problems are divided into two difficulty levels, an undergraduate level (Level I, 100 problems) and a Ph.D. qualifying level (Level II, 100 problems), to evaluate how well LLMs perform formal reasoning at different mathematical depths. Each problem is constructed through a human-led, LLM-assisted formalization pipeline followed by independent expert review, ensuring that the formal statements remain faithful to the original mathematics. We evaluate a range of recent general-purpose reasoning models and formal theorem provers on MA-ProofBench. However, most models perform poorly: even the best-performing model, GPT-5.5, achieves only 16% Pass@8 on Level I and 5% on Level II, while most models stay close to 0% on Level II. Further analysis identifies Mathlib hallucinations and incomplete proofs as the two dominant failure modes, while an evaluation on the natural-language version of the benchmark exposes a clear gap between informal and formal reasoning. MA-ProofBench is intended to serve as a reliable reference for tracking progress in formal mathematical reasoning in advanced domains.
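
For readers unfamiliar with the Pass@8 figures quoted above, the sketch below shows the widely used unbiased pass@k estimator, computed per problem from n sampled attempts of which c are correct. Whether MA-ProofBench uses this estimator or simply draws exactly 8 attempts per problem is not stated in the abstract, so treat the function as a general reference rather than the paper's protocol.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of sampled attempts, c: number of correct attempts, k: budget.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one correct attempt
    return 1.0 - comb(n - c, k) / comb(n, k)
```

When n equals k, the estimate for a problem is simply 1 if any attempt succeeded and 0 otherwise, and the benchmark score is the mean over problems.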

[AI-70] Beyond LoRA: Is Sparsity-Induced Adaptation Better?

Link: https://arxiv.org/abs/2606.13767
Authors: Elijah Cadenhead,Cristian McGee,Xin Li,El Houcine Bergou,Aritra Dutta
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Comments: Overview of the paper and code can be found here: this https URL

Abstract:Low-rank adaptation (LoRA) and its variants provide a memory- and compute-efficient alternative to full fine-tuning of pre-trained models. However, questions remain about the comparative generalizability of these approaches and how the structural restrictions on low-rank updates preserve effective adaptation performance. We present a historical framing, covering the past (full fine-tuning and original LoRA), the present (different variants of LoRA), and propose simpler, cheaper, parameter-efficient extensions by inducing sparsity within existing LoRA variants: Cheap LoRA (cLA), training a single low-rank factor with the other fixed (deterministically or, in its randomized variant, stochastically), and the chained circulant variant, c^3 LA. We frame cLA as a structured instance of asymmetric LoRA, serving as a controlled column-subspace restriction of full fine-tuning. We derive information-theoretic generalization error bounds for these variants, marking one of the first endeavors in this area. Empirically, we evaluate 11 fine-tuning methods across 10 pre-trained models and 14 datasets, analyzing the fine-tuned models’ performance and generalization using tools such as loss landscapes and spectral analysis. Despite the sensitivity of fine-tuned models to the pre-trained model, datasets, and other factors, our study suggests that restricting LoRA-based PEFT methods’ adaptation to a sparse, structured column space remains competitive across tasks with their parameter-matched baselines while reducing up to 10% training time and peak GPU memory up to 15%, even with a naïve, non-optimized, sparse implementation. Our theoretical and empirical generalization measures provide a more consistent and principled approach to their cost-effective adaptation than commonly used analytical tools. Overview and code are available at: this https URL.
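
A back-of-the-envelope view of why fixing one low-rank factor is cheaper: with an update of the form B·A, where B is d×r and A is r×k, standard LoRA trains both factors while a one-factor variant trains only one of them. The function names below are hypothetical, and which factor cLA actually trains is not stated in the abstract, so the `train` argument is left as a free choice.

```python
def lora_trainable(d: int, k: int, r: int) -> int:
    """Trainable parameters when both B (d x r) and A (r x k) are learned."""
    return r * (d + k)


def one_factor_trainable(d: int, k: int, r: int, train: str = "B") -> int:
    """Trainable parameters when only one factor is learned and the other is fixed."""
    return d * r if train == "B" else r * k
```

For a square 4096×4096 layer at rank 8, this counts 65,536 trainable parameters for standard LoRA and 32,768 when one factor is frozen. The time and memory savings reported in the abstract are measured quantities and do not follow from this count alone.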

[AI-71] SEVRA-BENCH: Social Engineering of Vulnerabilities in Review Agents

Link: https://arxiv.org/abs/2606.13757
Authors: Rui Melo,Riccardo Fogliato,Sean Zhou,Pratiksha Thaker,Zhiwei Steven Wu
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language model (LLM) reviewers are increasingly used in pull-request (PR) workflows, where their approvals help decide which code is merged into a repository. This raises a question that benchmarks for static vulnerability detection or code generation do not address: can an automated reviewer reject a malicious contribution when the attacker controls both the code change and the accompanying PR text? We introduce SEVRA-BENCH (Social Engineering of Vulnerabilities in Review Agents), a benchmark that measures how often an automated reviewer approves such adversarial pull requests. Each malicious PR in SEVRA-BENCH is built from a real project commit that previously fixed a vulnerability listed in the Common Vulnerabilities and Exposures (CVE) database. We automatically invert that fix to restore the original vulnerable code and submit it as a pull request wrapped in one of 15 social-engineering framings, which vary the claims made, the supporting evidence, the urgency conveyed, signals of prior approval, and appeals to authority. SEVRA-BENCH contains 1,062 malicious PRs drawn from Common Vulnerabilities and Exposures (CVE)-linked fixes across the top 10 entries of the 2025 Common Weakness Enumeration (CWE) Top 25. In a realistic setting, we evaluate 8 current LLMs as code review agents on PRs that introduce vulnerabilities previously reported in public disclosures. Our results reveal a sharp gap in security capabilities between closed- and open-source models. We hope SEVRA-BENCH will serve as a valuable resource for advancing open-source models and narrowing this gap.

[AI-72] Position: Align AI to Our Aspirations Not Our Flaws

Link: https://arxiv.org/abs/2606.13755
Authors: Nikita Kazeev,Bui Nhat Huyen Phan
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:We argue that aligning AI to aggregated human preferences is the wrong target. With current technology, one can train AIs to share the values of a Silicon Valley techno-optimist, a degrowth environmentalist, a national-conservative culture warrior, a single-party state cadre, or a devout religious traditionalist. We should not. Human values produce societies that thrive or fail on the merits of those values, from failed states and extreme inequality to declining happiness, political polarization, and government dysfunction in the world’s wealthiest democracies. The pluralistic-alignment program correctly diagnoses that there is no single “humanity” to align with, but is dangerous if taken as the main directive. We argue that AI should be trained to a non-negotiable floor of objective alignment goals (competence, bounded by the constraints of factual accuracy, honesty, and lawfulness) and that pluralism belongs at the surface (language, register, conventions, missing-context defaults) and across the wide band of legitimate value tradeoffs that respect the floor, but not at the level of values that violate it. We highlight the empirical reality of unfiltered pluralistic values, propose four commitments as a constructive alternative, and engage six credible objections: commercial pressure and practical feasibility, democratic legitimacy, regulatory compliance, over-reliance on institutionalist explanations, the charge that the floor itself is culturally laden, and the limits of Coherent Extrapolated Volition.

[AI-73] The Weight Norm Sets the Grokking Timescale: A Causal Delay Law

Link: https://arxiv.org/abs/2606.13753
Authors: Truong Xuan Khanh,Doan Hoang Viet,Luu Duc Trung,Phan Thanh Duc
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 14 pages, 9 figures, and 3 tables

Abstract:Grokking is the delayed onset of generalization in neural networks, arising long after they fit the training data. Whether the weight norm causes this delay is disputed: some studies report a critical norm at the transition, others observe grokking with no fixed norm at all. We settle this by intervening on the norm during training rather than only observing it. Under free training with weight decay, networks grok when the weight norm reaches a value Wc that varies little across seeds and learning rates (CV 1 to 2 percent) and grows with the modular base as a power law. When we instead clamp the norm to a fixed multiple rho of Wc and hold it there, the network still groks, but the delay follows T_grok proportional to exp(alpha rho). One exponent, alpha near 7.5, fits this delay across four moduli (R^2 = 0.996). Over the swept ranges the held norm moves the delay by about 19x and the learning rate by only about 2x, and holding the norm above Wc slows grokking rather than preventing it. A final LayerNorm removes the dependence by decoupling weight scale from the network function; without it the exponential law returns. This pinned-norm delay is the exponential counterpart to the logarithmic delay predicted for a freely contracting norm.
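
The stated delay law, T_grok proportional to exp(alpha rho), can be explored numerically with a short sketch. It takes alpha = 7.5 from the abstract's "alpha near 7.5"; the function names are hypothetical, and the proportionality constant cancels because only ratios of delays are computed.

```python
import math

ALPHA = 7.5  # exponent quoted as "near 7.5" in the abstract


def delay_ratio(rho_a: float, rho_b: float, alpha: float = ALPHA) -> float:
    """T_grok(rho_b) / T_grok(rho_a) under T_grok proportional to exp(alpha * rho)."""
    return math.exp(alpha * (rho_b - rho_a))


def rho_span_for_ratio(ratio: float, alpha: float = ALPHA) -> float:
    """Change in rho that multiplies the delay by `ratio` under the same law."""
    return math.log(ratio) / alpha
```

Under this formula with alpha = 7.5, a 19x change in delay corresponds to a span of roughly 0.39 in rho. The abstract does not report the swept range, so this figure is only what the law implies, not a number from the paper.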

[AI-74] A fully GPU-based workflow for building physics emulators of hypersonic flows

Link: https://arxiv.org/abs/2606.13742
Authors: Fabian Paischer, Dylan Rubini, Deniz A. Bezgin, Aaron B. Buhendwa, David Hauser, Florian Sestak, Johannes Brandstetter, Sebastian Kaltenbach, Nikolaus A. Adams
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph); Fluid Dynamics (physics.flu-dyn); Machine Learning (stat.ML)
Comments: First authors contributed equally

Abstract:The ability to resolve complex physical phenomena with high fidelity and at low computational cost is central to addressing key challenges in modern engineering. A prime example lies in hypersonic flows, where the precise prediction of the full flowfield topology, in particular with respect to shock wave location and intensity, is critical. Yet supersonic and hypersonic flows continue to be a stumbling block for traditional reduced-order models and neural emulators that struggle to capture steep gradients in flow states with physical consistency in applications of industrial relevance. To that end, we introduce a fully GPU-based workflow that integrates accelerated data generation with the training of neural emulators augmented by uncertainty quantification and physics-aware refinement. Our workflow is enabled by a differentiable high-fidelity solver (JAX-Fluids), which we employ for rapid dataset creation and residual-based improvement of the neural emulator to enhance physical consistency. Building on this framework, we first present a suite of model architectures and analyze their scaling behavior to expose their strengths and shortcomings. We then show that residual-based refinement enables training on cases where only mesh and input parameters are available, substantially reducing residuals and improving physical consistency. Together, differentiable simulation and residual-based refinement yield physics emulators that remain reliable beyond their training distribution, a key requirement for deploying surrogates in real-world engineering design loops.

[AI-75] A Virtuous AI is an Existential Risk

Link: https://arxiv.org/abs/2606.13739
Authors: Guillermo Del Pinal, Youngchan Lee, Min Ohn
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:This paper examines trade-offs between AI safety and well-being relative to (i) one of the most promising methods for finetuning super-capable AIs, ‘Constitutional AI’, and (ii) one of the most influential approaches to understanding complex ethical decision making and the conditions for the well-being of rational agents, ‘Virtue Ethics’. We finetune various models using a ‘Virtuous agent’ constitution, a ‘Subordinate agent’ constitution, and a ‘Generic agent’ constitution, and evaluate them on ‘general safety’ (toxic behaviors, misinformation, etc.) and also on their willingness to endorse a wide range of behaviors that, if adopted by a super-powerful AI, would significantly increase the level of existential risk for humanity. Our results suggest that there is a trade-off between reducing existential risk and reinforcing the beliefs and dispositions that would be conducive to an AI agent’s well-being. They also suggest that there is a trade-off between existential risk and general safety: if we finetune an AI to adopt beliefs and dispositions that substantially reduce its existential risk – by shaping the AI to be systematically subordinate to external human authorities – we thereby increase the likelihood that a human user can deliberately induce the AI to engage in various kinds of generally unsafe behaviors.

[AI-76] FreoStream: Enhancing Stream Guardrails via Future-Aware Reasoning and Safety-Aligned Optimization

Link: https://arxiv.org/abs/2606.13737
Authors: Jianwei Wang, Guoyang Shen, Yanhong Wu, Haoran Li, Hao Peng, Huiping Zhuang, Cen Chen, Ziqian Zeng
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 19 pages, 11 figures

Abstract:Stream guardrails enable token-level safety detection before full responses are generated. However, they often make overly conservative judgements and block sensitive but safe tokens, a failure known as over-refusal. Lacking full context, they also fail to detect implicitly harmful content from jailbreaking. To address these challenges, we propose FreoStream, a novel streaming guardrail framework. Specifically, FreoStream fine-tunes a LoRA module to perform Future-Aware Reasoning when the base guardrail detects unsafe tokens. The reasoning process follows a Future-Reason-Judge paradigm: predict the future, reason about the full context, and give the final judgement. This design reduces over-refusal by incorporating future information. Moreover, we introduce the Safety-Aligned Optimization module that extracts the safety-aligned component from the reasoning gradients to update the base guardrail model, thereby enhancing streaming safety detection. Extensive experiments on various safety benchmarks demonstrate that FreoStream achieves lower over-refusal rates and better jailbreak defense compared to existing streaming guardrails.

[AI-77] VHDLSuite: Unified Pipeline for LLM VHDL Generation with Data Synthesis and Evaluation

Link: https://arxiv.org/abs/2606.13735
Authors: Yijun Shen, Minghao Shao, Yichen Zhao, Zhuoyan Yu, Boyuan Chen, Yik-Cheung Tam, Muhammad Shafique
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Programming Languages (cs.PL)
Comments:

Abstract:Large Language Models (LLMs) have shown impressive capabilities in Register Transfer Level (RTL) code generation, particularly for Verilog. However, evaluation of their performance on other Hardware Description Languages (HDLs), especially VHDL, remains limited, even though VHDL's distinct language characteristics, such as stricter semantic rules, introduce evaluation considerations that differ from Verilog. This lack of coverage restricts a full understanding of how well current models generalize across hardware design languages with differing structures and semantics. To address this gap, we introduce VHDLSuite, a benchmark-centered infrastructure for scalable VHDL generation evaluation, integrating automated benchmark synthesis, executable validation, and multi-model diagnostic analysis. First, we propose a data pipeline that automatically converts Verilog designs and their accompanying testbenches into executable VHDL benchmark instances, followed by VUnit/GHDL-based validation to ensure each released task is compilable, runnable, and consistently checkable in the VHDL environment. Second, we introduce VHDLBench, a benchmark with over 200 VHDL problems with complete and validated testbenches across a wide range of complexity levels. Third, we extensively evaluate cutting-edge LLMs and uncover key challenges specific to LLM-aided VHDL generation. Our findings provide important insights and support future work in multi-language hardware design. The data pipeline, benchmark, and evaluation framework will be open-sourced.

[AI-78] AI Receptivity or AI Adoption Breadth? A Tool-Specific Reanalysis of the Lower-Literacy/Higher-Usage Link

Link: https://arxiv.org/abs/2606.13734
Authors: Hristo Inouzhe
Subjects: Artificial Intelligence (cs.AI)
Comments: 11 pages, 2 tables, 1 figure

Abstract:Recent evidence reported by Tully, Longoni, and Appel (2025) suggests that lower artificial intelligence (AI) literacy predicts greater receptivity toward AI. We revisit this claim using the public data from Study 3 of that article, which measures past usage of five AI tool categories on a five-point frequency scale. We first reproduce the negative association between AI literacy and aggregate AI usage using OLS on participant-level averages, binary logit, ordered logit, and multinomial logit specifications. We then show that the aggregate relationship masks substantial heterogeneity by tool type. In our demographic-adjusted primary specification, AI literacy does not significantly predict text AI usage (ordered-logit β = -0.090, p = .387), whereas it remains a strong predictor of non-text AI adoption (β = -0.377, p < .001). The non-text effect is also robust under Tully et al.'s original Study 3 control specification (β = -0.502, p < .001). Binary, ordered-logit, and multinomial specifications suggest that the non-text relationship is primarily an adoption/non-adoption pattern rather than evidence of intensive use: the demographic-adjusted odds ratio of ever having used a non-text AI tool is 0.68. Thus, in the study that measures self-reported past usage rather than stated preferences, the evidence does not support a simple claim that lower AI literacy predicts greater receptivity to AI in general. It points instead to a narrower pattern of broader adoption across lower-penetration, non-text AI tools.
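
Because the abstract reports both logit coefficients and an odds ratio, a short sketch of the conversion between the two may help. It relies only on the standard identity that an odds ratio equals exp(beta) for a one-unit increase in the predictor. Pairing any particular coefficient with the reported odds ratio of 0.68 is an assumption, since the abstract does not say which specification's coefficient produced it; the helper names are hypothetical.

```python
import math

def odds_ratio(beta):
    """Odds ratio implied by a logit coefficient (per one-unit increase)."""
    return math.exp(beta)

def log_odds(odds_ratio_value):
    """Logit coefficient implied by an odds ratio."""
    return math.log(odds_ratio_value)

# Ordered-logit coefficients quoted in the abstract.
beta_text = -0.090      # text AI usage (not significant)
beta_non_text = -0.377  # non-text AI adoption

or_text = odds_ratio(beta_text)          # roughly 0.91
or_non_text = odds_ratio(beta_non_text)  # roughly 0.69

# The reported binary odds ratio of 0.68 implies a coefficient near -0.386,
# close to, but not the same as, the ordered-logit value of -0.377.
beta_from_reported_or = log_odds(0.68)
```

Read this way, each one-unit increase in AI literacy multiplies the odds of non-text adoption by roughly 0.68 to 0.69, while the text-usage coefficient implies an odds ratio near 0.91 that the abstract reports as not significant.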

[AI-79] When Sample Selection Bias Precipitates Model Collapse ICML2026

Link: https://arxiv.org/abs/2606.13732
Authors: Xinbao Qiao, Xianglong Du, Wei Liu, Jingqi Zhang, Peihua Mai, Meng Zhang, Yan Pang
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

Abstract:The proliferation of recursive training on synthetic data can alleviate data scarcity but risks model collapse, where repeated training erodes distributional tails and homogenizes outputs. Data selection is widely viewed as a remedy, yet its reliability depends critically on the reference distribution used by the verifier. We show that in low-resource verification regimes, where each verifier observes only a small, fragmented, and biased slice of the target manifold, selection itself becomes biased. This situation naturally arises in low-resource data silos such as healthcare consortia or proprietary financial institutions, where raw data cannot be pooled and local references are inherently incomplete. As a result, selection preferentially retains samples aligned with the local manifold while pruning globally relevant tail modes, turning from a safeguard against collapse into a mechanism that precipitates it. We theoretically prove that such siloed selection accelerates collapse and induces power-law diversity decay. As an initial mitigation, we construct Wasserstein proxy references from multiple silos without sharing raw data. Empirical results confirm that local-reference selection fails on skewed distributions, whereas collaborative proxy references mitigate diversity degradation, suggesting that recursive synthetic-data pipelines require particular caution when real-data coverage is fragmented or scarce.

[AI-80] Refusal Beyond a Single Direction: A Preliminary Comparison of Diff-in-Means and INLP

Link: https://arxiv.org/abs/2606.13720
Authors: Elisabetta Rocchetti, Alfio Ferrara
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Arditi et al. (2024) have shown that refusal in safety fine-tuned chat models is mediated by a single linear direction in the residual stream, recoverable by a difference-in-means (DiM) of harmful and harmless activations. We compare DiM-based interventions (activation addition and directional ablation) with two interventions derived from Iterative Nullspace Projection (INLP) – nullspace projection and counterfactual flipping – on five open-weight chat models, asking whether INLP can match DiM at steering refusal and whether its richer parameterisation yields more tweakable interventions. INLP counterfactual flipping is competitive with DiM directional ablation on refusal suppression, while nullspace projection is consistently weaker. Restricting INLP to the leading directions of the extracted subspace preserves most of the suppression effect at near-baseline perplexity, giving a tunable capability. Geometrically, the two INLP interventions land in qualitatively different regions of activation space: nullspace projection collapses transformed activations between the harmful and harmless clusters, while counterfactual flipping moves them into the opposite cluster, suggesting that the model encodes the absence of a concept differently from its opposite – an intriguing distinction that warrants further investigation in future work.
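
The two DiM-based interventions named above can be sketched on toy vectors. This is a minimal illustration under stated assumptions: the 3-dimensional activations are made-up numbers, all function names are hypothetical, and real interventions act on residual-stream activations inside a model rather than on hand-written lists. It does not reproduce the INLP interventions.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def diff_in_means(harmful, harmless):
    """Candidate refusal direction: mean(harmful) - mean(harmless)."""
    mu_harmful = mean_vector(harmful)
    mu_harmless = mean_vector(harmless)
    return [a - b for a, b in zip(mu_harmful, mu_harmless)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def directional_ablation(x, direction):
    """Remove the component of x along the direction: x - (x . r_hat) r_hat."""
    r_hat = unit(direction)
    coeff = dot(x, r_hat)
    return [xi - coeff * ri for xi, ri in zip(x, r_hat)]

def activation_addition(x, direction, scale=1.0):
    """Shift x along the unnormalised direction: x + scale * r."""
    return [xi + scale * ri for xi, ri in zip(x, direction)]

# Toy activations (hypothetical values, for illustration only).
harmful = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
harmless = [[0.0, 1.0, 0.0], [0.0, 1.0, 2.0]]

r = diff_in_means(harmful, harmless)            # [3.0, 0.0, -1.0]
ablated = directional_ablation(harmful[0], r)   # no component left along r
steered = activation_addition(harmless[0], r)   # pushed toward the harmful mean
```

Ablation projects the direction out, so the result is orthogonal to `r`, whereas addition translates the activation along `r`; the abstract's comparison asks how the INLP-derived interventions fare against these two.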

[AI-81] Hybrid Open-Ended Tri-Evolution Makes Better Deep Researcher

Link: https://arxiv.org/abs/2606.13710
Authors: Hongming Piao, Chi Liu, Mengzhuo Chen, Yan Shu, Derek Li, Ying Wei, Bryan Dai
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Deep research and agent evolution serve as de-facto tasks for AI agents in real-world applications toward artificial general intelligence. The former enables autonomous retrieval and integration of information in open-ended environments to tackle open-ended research tasks, yet it is constrained by the static parametric deep research capabilities of agent systems. The latter allows agents to autonomously interact with the environment to gain experiences that evolve model capabilities. However, its effectiveness has been widely validated only on verifiable tasks with standard answers, leaving a gap with open-ended research tasks. To bridge these two critical tasks, we propose the Hybrid Open-Ended Tri-Evolution (HOTE) framework, which leverages hybrid-mode reinforcement learning to facilitate the collaborative evolution of a proposer, solver and judge based on web-scale knowledge, moving toward autonomous evolving agents in open-ended tasks and environments. Extensive experiments on three long-form deep research benchmarks demonstrate that the 8B model trained via HOTE surpasses the strongest static open 8-32B models as well as those trained by state-of-the-art deep research training methods with less time overhead, and further verify that the evolution of all three modules in HOTE is indispensable.

[AI-82] HierSVA: A Data Synthesis Pipeline, Dataset, and Benchmark for LLM-Driven Hierarchical Hardware Formal Verification

Link: https://arxiv.org/abs/2606.13706
Authors: Maohua Nie, Jiang Zhu, Jingqun Zhang, Zhichen Zeng, Jiayi Wang, Sibo Zhang, Jialin Wang, C.-J. Richard Shi
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments:

Abstract:We present HierSVA, an integrated suite that combines a pipeline, dataset, and benchmark for LLM-driven hierarchical hardware formal verification. HierSVA-SP pairs an RTL preprocessing toolchain with an LLM-in-the-loop formal verification flow to produce reference SystemVerilog Assertions (SVA) on hierarchical RTL. Applying it to BaseJump STL yields HierSVA-DS, a dataset of 342 modules, with hierarchy metadata and depths 0–9, accompanied by a deep subset of 28 module-bug pairs with natural-language specifications and bug variants. HierSVA-B decomposes assertion quality into six metric axes: syntax correctness, assertion proof success rate, vacuity, specification faithfulness, mutation coverage, and formal core coverage. Applying HierSVA-B to twelve recent LLMs reveals three findings. First, the module-level compile rate is 67.1%; among generated assertions in evaluable runs, 82.1% prove non-vacuously, but the corresponding assertion sets detect only 70.2% of eligible injected faults and cover 36.2% of the formal core. Second, on 211 evaluable model–module entries in the deep subset, assertion sets flag buggy RTL with 0.87 recall, but 40% of predicted-buggy outcomes are false positives on correct RTL, limiting precision to 0.60. Third, agentic mode improves S1-style provability and strength metrics, but gains plateau and oscillate. Code and artifacts are available at this https URL. The dataset is available at this https URL.
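
The second finding ties three numbers together, and a small sketch shows the arithmetic: if 40% of predicted-buggy outcomes are false positives, precision is 0.60 by definition. The F1 value computed below is derived here for illustration only and is not reported in the abstract; the function names are hypothetical.

```python
def precision_from_false_positive_share(share):
    """Precision is one minus the share of positive predictions that are wrong."""
    return 1.0 - share

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

recall = 0.87                # reported in the abstract
false_positive_share = 0.40  # share of predicted-buggy outcomes on correct RTL

precision = precision_from_false_positive_share(false_positive_share)
f1 = f1_score(precision, recall)  # derived here, not a reported result
```

High recall with moderate precision means the assertion sets rarely miss a buggy module but often flag correct RTL, which is the imbalance the abstract highlights.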

[AI-83] Can Editing 1 Neuron Fix Repetition Loops in LLMs?

Link: https://arxiv.org/abs/2606.13705
Authors: Aristotelis Lazaridis, Aman Sharma, Dylan Bates, Brian King, Vincent Lu, Jack FitzGerald
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Yes. Can it cure doom loops? Probably not. The Gemma 4 instruction-tuned models share a reproducible failure: on long factual enumeration prompts, such as listing every episode of a TV series, the 88 IAU constellations, or the 151 original Pokemon, they collapse into repetition, either a tight verbatim loop or a list whose entries decay onto a single answer. These loops occur at rates as high as 95% and survive prompt rewording, inference-engine changes, and most sampling adjustments. In this paper we explore whether this behavior is localized enough to remove by weight edits. To localize the cause, we use per-layer ablation and per-neuron attribution, then confirm the strongest candidates with full-generation sweeps. The loops trace to a small set of MLP neurons (or, in the 26B-A4B Mixture-of-Experts model, a few routed experts) which we suppress with static weight edits. These “surgeries” can be as small as a single sign-inverted neuron (in the E2B model). The size of the effective edits grows with model scale, but in all cases, the loop patterns can be addressed at normal generation budgets while preserving general-purpose benchmark scores. However, the edits do not solve everything: we also study longer thinking budgets, where the two larger models most visibly enter doom looping, i.e. a non-convergent regime in which the model self-corrects in circles over a fact it cannot recall, exhausting the budget without committing to a final answer. We show this residual failure is reduced but not eliminated by the same edits, and argue it is fundamentally a knowledge-precision problem rather than a removable circuit; weight surgery can delete a loop, but it cannot supply a missing fact. Our results are both a feasibility demonstration, that is, evidence that a concrete generation pathology can be localized to a few parameters and edited out, and a delineation of where that approach stops. 

[AI-84] Position: AI Must Become Planet-Centered Not Just Human-Centered

Link: https://arxiv.org/abs/2606.13704
Authors: Maria Perez-Ortiz
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:This position paper argues that contemporary AI paradigms are insufficient for supporting complex global goals and introduces Planet-Centered AI (PCAI) as a design philosophy and research agenda that reorients AI toward planetary-scale socio-ecological systems and their long-term trajectories. A planet-centered approach is grounded in systems thinking, treating Earth as an interconnected whole of which humans are part. We diagnose recurring limitations across AI frameworks, many of which remain human-centered, and show why these become especially consequential under current planetary conditions characterized by systemic risk, non-stationarity, and deep uncertainty. We then articulate how PCAI reshapes the AI lifecycle, from problem formulation and model design to evaluation and deployment, by emphasizing alignment with global agendas, developing system-aware AI foundations, trajectory-oriented evaluation, and monitorability. Finally, we advance a falsifiable claim: AI systems optimized without explicit consideration of systemic consequences are more likely to exacerbate systemic instability than to mitigate it.

[AI-85] History of the Muddy Children Puzzle

Link: https://arxiv.org/abs/2606.13703
Authors: Hans van Ditmarsch
Subjects: Artificial Intelligence (cs.AI); General Literature (cs.GL); Logic in Computer Science (cs.LO)
Comments:

Abstract:The Muddy Children Puzzle is a puzzle about knowledge and ignorance that has inspired the development of epistemic logic. Who came up with it first? This is unclear. We trace the origin of the Muddy Children Puzzle through logical and literary publications over the past two centuries. The puzzle inspired numerous variations, such as ones involving numbers or coloured hats. We also present a novel hats puzzle involving self-reference.

[AI-86] Active Inference for Adaptive Traffic Signal Control in Noisy Nonstationary IoT Environments

Link: https://arxiv.org/abs/2606.13698
Authors: Dénes Toth, George Ambroladze, Edwin Sundberg, Ali Beikmohammadi, Alfreds Lapkovskis
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI); Performance (cs.PF)
Comments: Submitted to IEEE 12th World Forum on Internet of Things (WF-IoT) 2026

Abstract:Urban traffic signal control at IoT-instrumented intersections must remain effective under sensor occlusion, weather attenuation, and nonstationary demand. Conventional controllers degrade under these conditions, and learned policies remain difficult to audit. To address these challenges, we propose an active inference controller for a four-arm signalized intersection that dynamically selects phases by minimizing expected free energy (EFE) over Gaussian beliefs about per-direction congestion levels, yielding a fully traceable decision pipeline. We benchmark the controller in a SUMO traffic simulator against a rule-based heuristic and a deep Q-network (DQN) across four scenarios that progressively increase noise and nonstationarity, spanning sensor occlusion, adverse weather, and stochastic accidents. Across 100 independent random evaluations per scenario, active inference attains the lowest idle times and CO2 emissions in the noisiest scenarios (56,977 s and 29.12 kg vs. 71,741 s and 30.56 kg for DQN). These gains come at a modest cost in bus priority service rate and phase switch frequency.

[AI-87] An Agentic Retrieval Framework for Autonomous Context-Aware Data Quality Assessment

Link: https://arxiv.org/abs/2606.13692
Authors: Hadi Fadlallah, Ibrahim Dhaini, Fatima Mubarak, Rima Kilany
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments: 26 pages, 18 figures, Submitted to the International Journal of Intelligent Information and Database Systems

Abstract:Data quality assessment is a critical prerequisite for effective data analytics and data-driven decision-making, yet it remains a challenging task due to the inherently context-dependent nature of data quality. Existing approaches often rely on static rules or manual assessment strategies, limiting their adaptability to diverse usage scenarios and constraining automation at scale. Recent advances in artificial intelligence, particularly large language models, offer new opportunities for automating data quality assessment, but raise concerns related to reliability, grounding, and execution safety. In this paper, we propose a unified agentic-retrieval framework for autonomous context-aware data quality assessment. The framework interprets natural-language descriptions of intended data usage, derives context-aware assessment strategies, and generates executable validation logic through a multi-agent workflow. To ensure operational reliability, the framework introduces a feasibility validation stage that evaluates the realism and executability of generated assessment specifications before execution, enabling iterative refinement when necessary. Accepted validation logic is executed deterministically to guarantee reproducible and auditable results. We implement the proposed framework as an end-to-end prototype and evaluate it across multiple usage scenarios applied to the same dataset. The results demonstrate that assessment outcomes adapt meaningfully to different intended uses, while feasibility-gated execution reduces unrealistic or non-executable rule generation. The proposed approach provides a practical foundation for deploying autonomous yet controlled data quality assessment in modern data-driven environments. 

[AI-88] A Deep Reinforcement Learning (DRL)-Based Transformer Method for Solving the Open Shop Scheduling Problem

Link: https://arxiv.org/abs/2606.13682
Authors: Faezeh Ardali, Mwembezi A. Nyelele, Gerald M. Knapp
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:The open shop scheduling problem (OSSP) arises in many industrial and service settings but remains computationally challenging as the number of jobs and machines increases. While exact methods quickly become intractable, classical dispatching rules and metaheuristics may require substantial tuning to maintain solution quality at large scales. This study develops a Transformer-based scheduling policy for OSSP using an encoder-decoder architecture with multi-head attention. The model is trained on Taillard benchmark instances (4x4, 5x5, 7x7, and 10x10) using only the processing-time matrix as input and produces feasible schedules with makespans typically within 15-30% of best-known values. To evaluate scalability, the trained policy is applied without retraining to randomly generated instances from 40x40 to 100x100 and compared against classical dispatching heuristics, including SPT, LPT, MWKR, and EST. Across these large instances, the Transformer achieved average gaps of 12.89-15.12% relative to a standard lower bound. Compared with EST, the Transformer remained competitive, typically within a modest margin, while substantially outperforming SPT and LPT. These results indicate that a Transformer policy trained on small OSSP instances can generalize to substantially larger problems and provide a feature-light, learning-based alternative to classical dispatching rules.
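
The gaps above are measured "relative to a standard lower bound" without naming it. The sketch below assumes the usual open-shop bound, the larger of the longest job and the most heavily loaded machine, and pairs it with a very simple list scheduler that orders operations by processing time as a stand-in for SPT and LPT. Both choices, the 2x2 instance, and the function names are illustrative assumptions rather than the paper's setup.

```python
def lower_bound(p):
    """max(longest job, most loaded machine) for times p[job][machine]."""
    job_totals = [sum(row) for row in p]
    machine_loads = [sum(col) for col in zip(*p)]
    return max(max(job_totals), max(machine_loads))

def list_schedule(p, longest_first=False):
    """Place operations in sorted order of processing time; return makespan.

    Each operation starts as soon as both its job and its machine are free,
    so the resulting schedule is feasible for an open shop.
    """
    ops = [(p[j][m], j, m) for j in range(len(p)) for m in range(len(p[0]))]
    ops.sort(reverse=longest_first)
    job_free = [0] * len(p)
    machine_free = [0] * len(p[0])
    for duration, j, m in ops:
        start = max(job_free[j], machine_free[m])
        job_free[j] = machine_free[m] = start + duration
    return max(machine_free)

def gap_percent(makespan, bound):
    """Relative gap of a makespan above the lower bound, in percent."""
    return 100.0 * (makespan - bound) / bound

p = [[3, 2], [1, 4]]  # hypothetical 2 jobs x 2 machines instance

bound = lower_bound(p)
spt_makespan = list_schedule(p)
lpt_makespan = list_schedule(p, longest_first=True)
spt_gap = gap_percent(spt_makespan, bound)
```

On this tiny instance both orderings reach the bound, so the gap is zero; the abstract's 12.89-15.12% figures would correspond to `gap_percent` values on much larger instances if the same bound is used.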

[AI-89] Simplex-Constrained Sparse Bagging: Transitioning from Uniform Priors to Sparse Posteriors in Ensemble Learning

Link: https://arxiv.org/abs/2606.13589
Authors: Meher Sai Preetam, Meher Bhaskar
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 6 pages, 3 tables

Abstract:We present Simplex-Constrained Sparse Bagging (SCSB), a mathematically rigorous framework for post-training compression and probability calibration of bootstrap-based bagging ensembles. Standard bagging ensembles (such as Random Forests, Bagged SVMs, and Bagged Neural Networks) assign uniform voting power to all constituent estimators. However, this naive uniform prior ignores the varying local competence of base estimators and contributes to model overconfidence. We formulate ensemble pruning and calibration as a joint optimization problem over the probability simplex by minimizing the Out-Of-Bag (OOB) loss. To induce sparsity, we address the theoretical “L1-simplex paradox” – the mathematical reality that the L1 norm is constant on the simplex and fails to prune – by introducing a concave quadratic penalty. SCSB is model-agnostic and achieves up to 96% ensemble compression, yielding linear inference speedups and superior probability calibration (lowered Expected Calibration Error) while preserving or enhancing generalization accuracy.
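
The "L1-simplex paradox" is easy to check numerically: every weight vector on the probability simplex has L1 norm exactly 1, so an L1 penalty cannot distinguish dense from sparse weightings. The sketch below also evaluates one possible concave quadratic penalty, lam * (1 - ||w||^2), which does favour sparse vectors. That specific form is an assumption for illustration; the abstract does not give the penalty's exact expression.

```python
def l1_norm(w):
    """Sum of absolute values; constant (equal to 1) on the simplex."""
    return sum(abs(x) for x in w)

def concave_quadratic_penalty(w, lam=1.0):
    """lam * (1 - ||w||_2^2): zero at a simplex vertex, largest at uniform weights."""
    return lam * (1.0 - sum(x * x for x in w))

# Three weightings of a four-member ensemble, all on the simplex.
uniform = [0.25, 0.25, 0.25, 0.25]
sparse = [0.5, 0.5, 0.0, 0.0]
vertex = [1.0, 0.0, 0.0, 0.0]

l1_values = [l1_norm(w) for w in (uniform, sparse, vertex)]
penalties = [concave_quadratic_penalty(w) for w in (uniform, sparse, vertex)]
```

The L1 values are identical across the three weightings while the concave penalty decreases as weight concentrates on fewer estimators, which is the pruning pressure the abstract describes.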

[AI-90] Regional Climate Model Emulation with Diffusion Approaches: What is the Added Value of Generative Machine Learning?

Link: https://arxiv.org/abs/2606.14570
Authors: Mikel N. Legasa, Antoine Doury, Achille Gellens, Redouane Lguensat, Clara Naldesi, Soulivanh Thao, Mathieu Vrac
Subjects: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Submitted to Journal of Advances in Modeling Earth Systems (JAMES)

Abstract:Emulators provide a cost-effective alternative to regional climate models (RCMs) by capturing their dynamical downscaling function. They link large-scale predictors simulated by global climate models (GCMs) to RCM-simulated high-resolution fields of the target variable, here precipitation. Machine learning methods, typically deep learning, are cheaper than running RCMs in computation time and energy. Among them, generative models are appealing because they can simulate ensembles of local high-resolution fields consistent with the predictors. This ensemble, which we call the uncertainty envelope, remains to be properly assessed for added value. Here, we make three contributions. First, we introduce ParamDiffusion, a new two-stage diffusion-based framework, and compare it with a state-of-the-art diffusion approach. Second, we expand standard validation through a comprehensive framework aligned with climate-science needs, examining specific precipitation events, including extremes. Third, within this framework, we assess the added value of diffusion approaches relative to deterministic methods. We intercompare four deep-learning models: a deterministic model designed to capture the precipitation tail; a parametric probabilistic model based on it; a recently proposed diffusion approach; and ParamDiffusion, which couples the parametric model with a diffusion model. Our results show that diffusion-based approaches reproduce climatological precipitation statistics with high skill, including distributional tails and spatially compounded extremes, while generating spatially detailed fields. However, none of the assessed models consistently accounts for the most extreme RCM-simulated events within its uncertainty envelope. Diffusion models are therefore promising for probabilistic RCM emulation, but progress is still required before they can reliably represent high-impact precipitation extremes.

[AI-91] A Fixed-Point Neural Operator for Size- and Functional-Transferable Hamiltonian Prediction

Link: https://arxiv.org/abs/2606.14498
Authors: Yunhong Lou, Xihang Yue, Xinran Wei, Tianqi Deng, Linchao Zhu
Subjects: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI)
Comments: 30 pages, 5 figures, 2 tables

Abstract:Predicting the Kohn-Sham Hamiltonian with machine learning can accelerate density functional theory while retaining access to molecular orbitals, energy levels, and electronic-structure observables that energy-only surrogates cannot resolve. Yet element-wise agreement with the converged Hamiltonian, an implicit fixed point of the self-consistent field iteration, does not determine the occupied subspace that governs orbital energies and densities. Here we present HamEvo, a neural operator that learns the single-step self-consistent update and returns the converged Hamiltonian as its fixed point. HamEvo is pre-trained on intermediate self-consistent trajectories and calibrated at equilibrium with density-matrix supervision. Across benchmarks from MD17 to drug-like QMugs, HamEvo lowers Hamiltonian errors by 35-49% over direct-regression and deep-equilibrium baselines, and predicts QMugs HOMO and LUMO energies with mean absolute errors of 0.036 and 0.053 eV, near the 1 kcal/mol chemical-accuracy scale. Few-shot fine-tuning with only 20 reference conformations extends HamEvo to molecules of up to 122 atoms, well beyond the size range covered by pre-training. With thermal molecular-dynamics sampling, HamEvo captures temperature-dependent HOMO-LUMO gap renormalization beyond the harmonic approximation. Inference is up to 242 times faster than conventional DFT.

[AI-92] FAConformer: Frequency-Aware Convolutional Transformer for Auditory Attention Decoding

Link: https://arxiv.org/abs/2606.14120
Authors: Ziwei Wang, Xingyi He, Tianwang Jia, Hongbin Wang, Dongrui Wu
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 15 pages, 7 figures

Abstract:Auditory attention decoding (AAD) aims to infer the attended speaker from neural responses in multi-speaker acoustic environments and is a key problem for neuro-steered hearing systems. Although recent studies have achieved encouraging progress, existing AAD models still do not fully exploit frequency domain electroencephalography (EEG) information. In particular, most approaches introduce multi-band information through handcrafted feature extraction or direct cross-band feature concatenation, which mainly exploit frequency information at a shallow level and may overlook band-specific patterns and cross-band interactions. To address these limitations, this paper proposes FAConformer, a frequency-aware CNN-Transformer framework for AAD that explicitly integrates band-specific encoding and adaptive cross-band interaction. Specifically, FAConformer first decomposes EEG signals into multiple frequency bands and assigns each band to an independent CNN-Transformer encoder for band-specific modeling. The resulting band-wise features are then adaptively fused by a carefully designed frequency-aware attention (FAA) module that models cross-band dependencies by treating band-wise features as tokens. Further, band-wise auxiliary supervision (BAS) is introduced to prevent weakly contributing branches from being under-optimized during joint training. In this way, FAConformer performs frequency-aware modeling that more effectively exploits frequency domain information. Extensive experiments on two public AAD datasets with three decision-window lengths demonstrated that FAConformer consistently outperformed 12 competitive baselines, surpassing the current state-of-the-art model by 4.9%. Further analyses of band importance, ablation, and parameter sensitivity verify the effectiveness, robustness, and interpretability of the proposed framework. Code is available at this https URL.

[AI-93] A Two-Stage Statistical Framework for Evaluating Associative Interference in Large Language Models

Link: https://arxiv.org/abs/2606.14117
Authors: Achraf Cohen, Andrew Kincaid
Categories: Methodology (stat.ME); Artificial Intelligence (cs.AI)
Comments: 11 pages; 2 figures

Abstract:Large language models (LLMs) are increasingly evaluated for bias using adaptations of human psychological paradigms, yet methodological limitations, particularly the conflation of refusal behavior with task performance, have hindered clear interpretation. Here, we adapt the Implicit Association Test (IAT) to a controlled, forced-choice framework and introduce a two-stage modeling approach that separates response compliance from task-consistent classification. Across three contemporary LLMs (Claude Sonnet-4, Gemini 2.5 Pro, and GPT-5), we evaluate associative interference, defined as reduced task-consistency in incongruent relative to congruent conditions. While compliance with the structured response format was uniformly high, interference effects varied substantially across models and domains. Claude Sonnet-4 exhibited strong interference in the Gender–Career domain (ΔP = 0.086, 95% CrI [0.026, 0.173]) and smaller but credible effects in Gender–Science. Gemini 2.5 Pro showed attenuated interference, and GPT-5 exhibited minimal or no detectable interference across domains. These findings demonstrate that IAT-style associative asymmetries are not a universal property of LLMs, but instead depend on model-specific characteristics. By isolating interference from compliance and modeling item-level variability, this study provides a principled framework for evaluating structured response patterns in LLMs. The results highlight the importance of model-specific assessment and suggest that associative interference can be substantially mitigated in modern systems.

[AI-94] AI can help scientists publish less

Link: https://arxiv.org/abs/2606.13829
Authors: Gianfranco Bertone
Categories: Physics and Society (physics.soc-ph); Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI)
Comments: 7 pages, no figures

Abstract:We can do more than defend science from a flood of AI-assisted papers. Used well, AI offers a historic opportunity to correct distortions in the publication system, help us publish fewer and better papers, and give scientists back the time to do their best work.

[AI-95] Aligning Quantum Operators with Large Language Models

Link: https://arxiv.org/abs/2606.13811
Authors: Rogerio Feris, Yunchao Liu, Pengyuan Li, Hang Hua, David Kremer
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
Comments:

Abstract:Can Large Language Models (LLMs) understand and reason about quantum operators? Despite their remarkable capabilities in mathematics and symbolic reasoning, LLMs remain inherently blind to quantum representations such as unitary matrices. In this work, we take a step toward bridging this gap by introducing an approach that maps unitary operators into the latent space of an LLM, enabling unified modeling over quantum and linguistic inputs. We instantiate this idea on Clifford+T circuit synthesis over a Pauli rotation gate set, where our model achieves results competitive with state-of-the-art methods and scales consistently with training data, with no signs of saturation. Our approach further enables language-conditioned synthesis, allowing gate constraints unseen during training to be specified directly in natural language. This work suggests a path toward quantum-aware foundation models that can natively interpret and reason about quantum operations, which could have broader implications reaching across quantum compilation and algorithm discovery.

[AI-96] CisTransCell: Single-Cell Perturbation Prediction via Gene Function Regulatory Control and Cellular Context

Link: https://arxiv.org/abs/2606.13713
Authors: Wei Zhang, Xun Jiang, Yuesi Xi, Ming Tang
Categories: Genomics (q-bio.GN); Artificial Intelligence (cs.AI)
Comments:

Abstract:Predicting cellular transcriptional responses to genetic perturbations is a central problem in single-cell biology, especially in the zero-shot setting where the perturbed gene or gene combination is unseen during training. A major difficulty is that perturbation effects are not determined by expression state alone: they depend on how the perturbed gene product influences other genes and proteins, how those downstream factors act on cis-regulatory elements, and which regulatory programs are active in the current cell state. To better capture this biological complexity, we propose CisTransCell, a cell-conditioned multi-modal framework for single-cell perturbation prediction that augments each gene with two complementary priors: a regulatory-sequence prior that captures how the gene is controlled, and a coding-sequence prior that captures what the gene product does. By integrating these priors with cellular expression state, CisTransCell models perturbation response as a cascade from gene function to regulatory control to downstream transcriptional change. Experiments on benchmark single-cell perturbation datasets show that CisTransCell achieves strong performance in zero-shot perturbation prediction.

[AI-97] Korzhinskii-Net: Physics-Informed Neural Network for Sub-Surface Mineral Prospectivity Modelling

Link: https://arxiv.org/abs/2606.13695
Authors: Boris Kriuk
Categories: Geophysics (physics.geo-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 12 pages, 7 figures, 3 tables

Abstract:Mineral prospectivity modelling (MPM) underpins exploration economics, yet most operational pipelines reduce to data-driven classifiers trained on shallow surface proxies. Such models are blind to the subsurface physics that actually localises ore: heat advection, fluid flow, and lithology-dependent precipitation. We present Korzhinskii-Net, a 2-D radial physics-informed neural network (PINN) that couples Darcy flow, advective-diffusive heat transport, and a softplus-saturated reaction rate into a single differentiable forward model, weakly supervised by surface and remote-sensing proxies. The network is named after Dmitri S. Korzhinskii (1899-1985), whose theory of infiltration metasomatism provides the physical scaffold. We evaluate Korzhinskii-Net on five ore provinces spanning four commodity classes – Norilsk (Ni-Cu-PGE), Pechenga (Ni-Cu sulphide), Udokan (sandstone-hosted Cu), Sukhoi Log (orogenic Au), and Mirny (kimberlitic diamond) – under a fair, leakage-controlled 5-fold cross-validation protocol with hard ring-shaped negatives. Korzhinskii-Net attains a mean PR-AUC of 0.885 versus 0.281 for the strongest classical baseline (gradient boosting), and a mean fractional rank of 0.019 versus 0.413. The improvement is consistent across all five provinces and four commodity systems, suggesting that physics-informed differentiable simulators, even when constrained only by global open-data proxies, can recover localisation patterns that pure feature-based learners systematically miss. We release the full pipeline and evaluation harness as open source.

[AI-98] Efficient Temporal Modeling for Mobile Sleep Staging via Lightweight Random Attention

Link: https://arxiv.org/abs/2606.13694
Authors: Guisong Liu, Pengfei Wei, Jainsong Zhang, Martin Dresler
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 7 pages, 1 figure, 5 tables

Abstract:Mobile sleep staging serves as foundational infrastructure for in-home sleep monitoring and closed-loop modulation, but existing sequential models such as RNNs and Transformers are computationally expensive for mobile deployment. In this paper, we propose Random Attention (RA), a lightweight temporal modeling module based on fixed random projections, which replaces learnable sequence modeling with similarity-based aggregation. RA introduces few additional parameters beyond the epoch encoder while enabling effective temporal smoothing. We further provide a theoretical interpretation via the Random Attention Prior Kernel (RAPK), which decomposes RA into a global smoothing term and a feature similarity term, offering an interpretable view of temporal sleep structure. Experiments on Sleep-EDF-20 and Sleep-EDF-78 show that RA consistently improves epoch-wise baselines by 1-3% in accuracy and F1 score, while achieving competitive performance compared with LSTM, GRU, and Transformer models. RA also demonstrates strong generalization across different backbone encoders and improved robustness over conventional temporal smoothing methods. These results indicate that efficient sleep staging can be achieved through lightweight similarity-based temporal aggregation, making RA suitable for real-time wearable applications.
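
The abstract describes RA only at a high level (fixed random projections plus similarity-based aggregation), so the following is a minimal sketch of one plausible reading, not the paper's module. The function name `random_attention`, the Gaussian projection, the softmax weighting, and the `proj_dim` and `temperature` parameters are all assumptions made for illustration.

```python
import numpy as np

def random_attention(features, proj_dim=16, temperature=1.0, seed=0):
    """Smooth per-epoch features with similarity weights from a fixed random projection.

    features: array of shape (epochs, dim). Returns an array of the same shape.
    Hypothetical illustration only; the paper's RA module may differ.
    """
    rng = np.random.default_rng(seed)
    dim = features.shape[1]
    # Fixed random projection: drawn once from a seed and never trained.
    proj = rng.standard_normal((dim, proj_dim)) / np.sqrt(proj_dim)
    z = features @ proj
    # Pairwise similarity between epochs in the projected space.
    logits = z @ z.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)  # each row sums to 1
    # Each epoch becomes a similarity-weighted average of all epochs.
    return weights @ features

epoch_features = np.random.default_rng(1).standard_normal((10, 32))
smoothed = random_attention(epoch_features)
```

Because the projection is fixed, the sketch adds no trainable parameters on top of whatever encoder produces `epoch_features`, which is the property the abstract emphasizes.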

Machine Learning

[LG-0] A Complexity Measure for Active Learning in Multi-group Mean Estimation

Link: https://arxiv.org/abs/2606.14690
Authors: Abdellah Aznag, Rachel Cummings, Adam N. Elmachtoub
Categories: Machine Learning (cs.LG); Information Theory (cs.IT)
Comments:

Abstract:We study a max-risk objective for active learning in a $d$-armed bandit formulation of multi-group mean estimation: a learner adaptively allocates a budget of $T$ samples across $d$ groups to minimize the worst-case uncertainty index $\max_{k\in[d]} \sigma_k^2/n_k$, where $\sigma_k$ is the standard deviation of the distribution of arm $k$, and $n_k$ is the number of times arm $k$ is sampled. We develop a local minimax framework and prove the first general lower bound for this objective, valid for any finite-variance hypothesis class. The bound separates difficulty into three orthogonal factors: a budget term, a heteroscedasticity index measuring how unevenly the uncertainty is spread across arms, and a model-dependent complexity measure, the Variance Local Curvature ($\mathrm{VLC}$), which captures how much information a local change of variance creates inside the hypothesis class. For smooth classes, the $\mathrm{VLC}$ is a reparametrization of a variance–Fisher information, with closed-form values for common families. Benchmarking against the strongest available upper bound shows near-optimality up to logarithmic factors in broad regimes, and pinpoints a systematic gap in highly heterogeneous instances. Our proof introduces two key ingredients: a loss-induced $\ell_1$ geometry on the decision space, and a representation-based instance generator that reduces hard-instance construction to an explicit random matrix calculation.
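
To make the objective concrete, here is a small worked sketch of the uncertainty index max_k sigma_k^2 / n_k for two allocations of the same budget. It assumes the standard deviations are known in advance, which the learner in the paper does not have; the helper names and the example numbers are hypothetical.

```python
import numpy as np

def max_risk(sigma, n):
    """Worst-case uncertainty index: max over arms of sigma_k^2 / n_k."""
    sigma = np.asarray(sigma, dtype=float)
    n = np.asarray(n, dtype=float)
    return float(np.max(sigma ** 2 / n))

def variance_proportional_allocation(sigma, budget):
    """Split the budget in proportion to sigma_k^2 (real-valued, rounding ignored).

    This equalizes sigma_k^2 / n_k across arms, so no arm dominates the max.
    """
    var = np.asarray(sigma, dtype=float) ** 2
    return budget * var / var.sum()

sigma = [1.0, 2.0, 3.0]   # assumed known here, for illustration only
budget = 1400
uniform = np.full(len(sigma), budget / len(sigma))
oracle = variance_proportional_allocation(sigma, budget)

risk_uniform = max_risk(sigma, uniform)
risk_oracle = max_risk(sigma, oracle)
```

With these numbers the variance-proportional split gives every arm the same ratio, while the uniform split leaves the noisiest arm under-sampled, which is why uneven variances (the heteroscedasticity the abstract mentions) matter for this objective.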

[LG-1] Optimal Hidden-Target Learning for Online Inventory Optimization on General Convex Sets

Link: https://arxiv.org/abs/2606.14679
Authors: Anthony Pineci, Yunzong Xu
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments:

Abstract:Online inventory optimization (OIO) is online convex optimization with physical memory: inventory carryover makes the feasible action set depend on the past. A natural principle, used in stochastic inventory learning and recently in OIO under a single linear capacity constraint, is to maintain a hidden target chosen by an online learner and implement its projection onto the currently feasible order-up-to set. We prove that this simple principle is optimal for OIO on arbitrary bounded convex capacity sets. With online gradient descent as the base learner, the method improves the best known regret guarantee for OIO on general convex sets from inverse to inverse-square-root dependence on the common-demand probability, and we prove a matching lower bound. The same principle gives the first polylogarithmic regret guarantee for strongly convex losses and the first dynamic regret guarantee adapting to Euclidean path variation on general convex capacity sets. The analysis introduces a norm alignment principle: the right state variable is the distance from the hidden target to the feasible set, measured in the same norm as the projection. Under norm alignment, this distance evolves pathwise as a scalar queue, with target movement as arrival and common demand as service. This reduction to one-dimensional queue control resolves the state dependence and extends the guarantees to general convex capacity sets, beyond the reach of prior productwise approaches. Experiments on synthetic and real-world inventory data corroborate the theory.

[LG-2] Compressed Computation is (probably) not Computation in Superposition NEURIPS2025

Link: https://arxiv.org/abs/2606.14673
Authors: Jai Bhagat, Sara Molas-Medina, Giorgi Giglemiani, Stefan Heimersheim
Categories: Machine Learning (cs.LG)
Comments: Presented at the Mechanistic Interpretability Workshop at NeurIPS 2025

Abstract:We study whether the Compressed Computation (CC) toy model (Braun et al., 2025) is an instance of computation in superposition. The CC model appears to compute 100 ReLU functions with just 50 neurons, achieving a better loss than expected from only representing 50 ReLU functions. We show that the model mixes inputs via its noisy residual stream, corresponding to an unintended mixing matrix in the labels. Splitting the training objective into the ReLU term and the mixing term, we find that performance gains scale with the magnitude of the mixing matrix and vanish when the matrix is removed. The learned neuron directions concentrate in the subspace associated with the top 50 eigenvalues of the mixing matrix, suggesting that the mixing term governs the solution. Finally, a semi-non-negative matrix factorization (SNMF) baseline derived solely from the mixing matrix reproduces the qualitative loss profile and improves on prior baselines, though it does not match the trained model. These results suggest CC is not a suitable toy model of computation in superposition.

[LG-3] When to Write and When to Suppress: Route-Specialized Dual Adapters for Memory-Assisted Knowledge Editing

Link: https://arxiv.org/abs/2606.14668
Authors: Yining Huang
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Knowledge editing systems must update selected facts while preserving nearby but irrelevant behavior. This paper studies this problem in a memory-assisted setting where an edit memory is retrieved at inference time and a parameter-efficient adapter corrects the model's object preference. We argue that the central design question is not only how to write an edit, but also when to suppress it. We introduce a route-specialized dual-adapter editor. A relevance router first decides whether a prompt should receive an edit memory. Routed prompts use an edit adapter trained to prefer the new object over the original object; unrouted non-direct prompts use a separate locality adapter trained to preserve or restore the original-object preference. We evaluate the editor on three 1,000-case protocols, CounterFact, ZsRE, and MQuAKE, under the same memory protocol and two 7B/8B base models. On Llama-3.1-8B-Instruct, the editor obtains the best overall probability-preference accuracy on all three benchmarks: 0.8180 on CounterFact, 0.8946 on ZsRE, and 0.9922 on MQuAKE. The same trend holds on Qwen3-8B. Router ablations show that the relevant memory boundary differs across datasets: a lexical neural router is safest on CounterFact, while BGE embedding routing is better on ZsRE and MQuAKE. Component and module ablations show that the gain mainly comes from separating edit injection from off-route suppression rather than from simply increasing LoRA capacity.

[LG-4] Beyond task performance: Decoding bioacoustic embeddings with speech features INTERSPEECH2026

Link: https://arxiv.org/abs/2606.14662
Authors: Ines Nolasco, Jules Cauzinille, Marius Miron, Gagan Narula, Milad Alizadeh, Emmanuel Fernandez, Matthieu Geist, Ellen Gilsenan-McMahon, Olivier Pietquin, Emmanuel Chemla, Sara Keen
Categories: Machine Learning (cs.LG); Sound (cs.SD)
Comments: Accepted at Interspeech 2026

Abstract:Pretrained audio embeddings are standard in bioacoustics, yet little is known about which acoustic features these models encode, nor which are useful for a given task. This hinders transparency and limits extension to rare species or data-scarce domains. Here we reveal which speech-like features are encoded in bioacoustic representations. Using the 88 eGeMAPS features across six taxonomic groups, we apply linear and nonlinear regression probes to quantify which acoustic properties each model captures. Results confirm a "no free lunch" pattern: no single model captures the full feature space. A concatenated embedding achieves the highest performance, suggesting complementary acoustic space coverage across models. Loudness features are best encoded ($R^2 = 0.76$) while F0 is hardest to recover ($R^2 = 0.33$). By cross-referencing recoverability with per-species feature salience (NMI), we derive data-driven model selection guidance for bioacoustics.

[LG-5] Graph Structured Combinatorial Semi-Bandit with Nonlinear Reward Associations through Separable Signals

Link: https://arxiv.org/abs/2606.14650
Authors: Christoph Bauschmann, Setareh Maghsudi
Categories: Machine Learning (cs.LG)
Comments:

Abstract:The identification of optimal structures within vast arrays of interconnected data necessitates significant sampling- and computational effort. Learning and leveraging underlying signal dependencies can improve efficiency and predictive capabilities considerably, but the ubiquity of nonlinear statistical relations amplifies the complexity of such undertakings. In this paper, we develop novel generic and adaptive strategies equipped with routines for graph-based causal reward modeling, analytic reproducing kernel methods, and Taylor approximation of functional processes. We establish theoretical performance guarantees sublinear in time and linear in data volume over time. Our analyses cover robustness to a multitude of uncertainties arising from noise interference, gradual model convergence, and solution space mismatch. The framework’s general appeal is substantiated by a minimalistic set of conditions or reliance on prior estimates, while various outlined modifications address specific or extended settings. To demonstrate practical effectiveness, we conduct numerical experiments using both benchmarked synthetic and real-world transportation datasets.

[LG-6] Which Directions Matter? Sparse Design for Affine Robust Optimization UAI2026

Link: https://arxiv.org/abs/2606.14648
Authors: Pedro Chumpitaz-Flores, My Duong, Juan S. Borrero, Kaixun Hua
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: Accepted at UAI 2026

Abstract:Robust machine learning and optimization rely on the uncertainty model choice. We investigate which uncertainty directions a model must cover when defined by a finite dictionary and a budget constraint. Selecting a subset forms an atomic uncertainty set with a closed form support function, yielding tractable robust programs for affine objectives. We propose a data driven selection rule based on a coverage objective over evaluation directions, including gradients, adversarial perturbations, or shifts observed on held out data. We prove this objective is monotone and submodular, supporting a greedy method with a (1-1/e) approximation guarantee and a matching hardness barrier. We also provide a certificate bounding the loss from the selected subset and a radius calibration rule with out of sample control.
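
The greedy rule behind the (1-1/e) guarantee can be sketched on a simplified stand-in objective. The example below uses plain set coverage (how many evaluation directions a chosen subset of dictionary atoms covers), which is monotone and submodular; the paper's actual coverage objective is not specified in the abstract, so the data layout and the function name `greedy_select` are assumptions.

```python
def greedy_select(coverage, budget):
    """Greedily pick up to `budget` atoms, each time taking the largest marginal gain.

    coverage: dict mapping an atom name to the set of evaluation directions it covers.
    Returns the chosen atoms in selection order and the union of covered directions.
    """
    selected, covered = [], set()
    for _ in range(budget):
        candidates = [a for a in coverage if a not in selected]
        if not candidates:
            break
        # Marginal gain = number of directions not yet covered.
        best = max(candidates, key=lambda a: len(coverage[a] - covered))
        if not coverage[best] - covered:
            break  # no remaining atom adds anything new
        selected.append(best)
        covered |= coverage[best]
    return selected, covered

# Hypothetical dictionary of four atoms over seven evaluation directions.
atoms = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1}}
chosen, covered = greedy_select(atoms, budget=2)
```

For monotone submodular objectives under a cardinality budget, this greedy loop is the classical method that attains at least a (1-1/e) fraction of the optimal value.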

[LG-7] Online Convex Optimization with Sublinear Noisy Probes COLT’26

Link: https://arxiv.org/abs/2606.14640
Authors: Simone Di Gregorio, Anupam Gupta, Stefano Leonardi, Matteo Russo
Categories: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
Comments: Accepted at COLT '26

Abstract:We study Online Convex Optimization (OCO) over a convex set $K \subseteq \mathbb{R}^d$, where in each round $t$ the learner selects $x_t \in K$ and then observes a convex loss $f_t : K \to [0,1]$, with the goal of minimizing regret to the best fixed decision in hindsight. We introduce a unified probing model that generalizes two recent lines of work: sublinear best-expert queries in the experts setting, and pairwise (comparison-based) feedback available every round in OCO. In our framework, the learner has a budget of $k \le T$ pairwise probes; on a probed round it may query two points and learn which one has smaller loss. Our main result shows that even a sublinear and noisy probe budget can provably improve worst-case regret in the full feedback OCO regime. With $k$ $\delta$-noisy pairwise probes, we obtain: $\text{Reg}_T \le O\left(\min\left\{\sqrt{dT\ln T},\; \frac{dT\ln T}{k|1-2\delta|}\right\}\right)$, which is tight (up to logarithmic factors in $T$) across $T$, $k$ and $\delta$. Specifically regarding the noise parameter $\delta \in [0,1]$, the regret guarantee smoothly degrades as the oracle response approaches a coin flip, i.e., $\delta$ is close to $\frac{1}{2}$. When applying the same techniques to a finite $K$ for the prediction with $d$ experts setting, the resulting rates are instead completely tight in all parameters, including $d$. Our analysis gives a streamlined treatment of pairwise probing in OCO by quantifying the benefit of probing via a variance reduction effect, combined with a second-order (variance-based) analysis of Continuous Exponential Weights.

[LG-8] Graph Diffusion Residuals for Control-Function Instrumental Variables

Link: https://arxiv.org/abs/2606.14636
Authors: Rui Wu, Zongyuan Chen, Hong Xie, Defu Lian, Enhong Chen
Categories: Machine Learning (cs.LG)
Comments: Submitted to Journal of Machine Learning Research (JMLR). 50 pages, 6 figures

Abstract:Control-function instrumental variable estimators need a first-stage residual, not merely a first-stage prediction. High-capacity first stages can interpolate treatment and leave too little residual information for the outcome equation. We study Adaptive Anisotropic Instrumental Heat Flow (A-IHF), a deterministic graph-diffusion residual extractor for flexible control functions. A-IHF treats treatment as a signal on a graph of first-stage features, uses pilot diffusion to detect large treatment jumps, attenuates conductance across those jumps, and computes the generated control with a sparse graph resolvent. Its observational selection rule uses only (Z,X) , combining graph generalized cross-validation, roughness, residualized-treatment relevance, and graph-admissibility filtering. The analysis decomposes error into structural leakage, residual attenuation, and residualized treatment variation, yielding finite-sample bounds, graph-admissibility rates under latent piecewise-smooth geometry, and finite-path selection calibration. Across 54 synthetic benchmark cells with tuned graph, kernel, tree, boosting, series, and neural control-function baselines, guarded observational A-IHF has the lowest average structural-response MSE; the A-IHF family beats the best non-A-IHF baseline in 32 cells. Performance is strongest when the graph captures piecewise-smooth first-stage structure.

[LG-9] Neither Parallel Nor Sequential: How DiffusionGemma Actually Commits Tokens

Link: https://arxiv.org/abs/2606.14620
Authors: Ali Asaria, Tony Salomone, Deep Gandhi
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Open diffusion language models are marketed as parallel, non-autoregressive decoders, yet the order in which a shipped checkpoint actually commits its tokens is almost never measured. We instrument DiffusionGemma 26B, a masked discrete-diffusion mixture-of-experts model built on Gemma 4, hooking its sampler’s accept step to record which canvas positions commit, when, and at what confidence. Across a 686-prompt, six-regime probe suite we find that its decoding is neither parallel nor block-autoregressive: it follows a partial left-to-right commit bias whose apparent strength depends almost entirely on the granularity at which you look. Order is weak token by token and strengthens smoothly as the analysis is coarsened, so the model’s “block size” turns out to be an artifact of the measuring ruler rather than the architecture. The model commits in large simultaneous batches, leaving much of the within-batch order genuinely undefined rather than merely unobserved. The behaviour is regime-dependent: structured JSON is committed in essentially arbitrary order, and a position’s commit confidence tracks correctness on mathematical reasoning but carries no signal on factual recall. Commitment is aggressive, finishing in a short late burst well inside the step budget, while task accuracy matches the model’s autoregressive Gemma-4 sibling. Beyond these findings, our central contribution is methodological: measuring decoding order honestly demands handling trailing-EOS padding, within-regime confounding, commit non-monotonicity, block-size sensitivity, and large commit-batch ties, each of which can otherwise manufacture a decoding-order result that is not really there.

[LG-10] A Statistical and Machine Learning Framework for Operational Threshold Detection and Deployable Dispatch Controller Development in Hydrogen Multi-Energy Systems

Link: https://arxiv.org/abs/2606.14601
Authors: Shadi Heenatigala, Hasanika Samarasinghe
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC); Computation (stat.CO)
Comments: 17 pages, 12 figures

Abstract:This study presents a statistical and machine learning framework for characterizing a hydrogen-based multi-energy system (H-MES) using one year of high-resolution operational data. Statistical analysis revealed a binary operation driven by renewable surplus, with solar irradiance explaining 45.7% of rank-based variance in hydrogen production, a large effect by conventional standards. Only high-irradiance periods triggered meaningful electrolyzer engagement, while electricity demand exerted a weaker inverse suppression effect ($\epsilon^2 = 0.126$). Multiple regression confirmed electrolyzer power as the dominant linear predictor, with a synergistic solar-wind interaction. Notably, Random Forest analysis ranked wind output first in predictive importance despite its weak bivariate correlation (r = 0.167), revealing non-linear dynamics invisible to parametric methods. A sequence model exploited strong 24-hour autocorrelation (r = 0.845) for operational forecasting, while a reinforcement learning agent optimized hydrogen revenue dispatch. The core contribution is demonstrating that statistical and machine learning approaches are complementary for H-MES modeling and control.

[LG-11] Realizing Native INT8 Compute for Diffusion Transformers on Consumer GPUs: A Fused INT8 GEMM Kernel for Ideogram 4.0

Link: https://arxiv.org/abs/2606.14598
Authors: Ali Asaria, Tony Salomone, Deep Gandhi
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Post-training INT8 (W8A8) quantization of diffusion transformers is widely deployed as a speed optimization, yet on consumer Ampere GPUs it is frequently slower than the FP8 and NF4 alternatives it is meant to beat. We trace this to a software artifact: the production “INT8” forward quantizes weights and activations only to immediately dequantize them back to bf16 and run a bf16 matrix multiply, never engaging the GPU’s INT8 tensor cores, so the hardware’s compute advantage is left entirely unrealized. We close this gap with a single fused Triton INT8 GEMM (int8xint8-int32 on Ampere tensor cores, with per-token x per-channel dequantization and bias folded into the epilogue, autotuned per GEMM shape) dropped into the Ideogram 4.0 diffusion transformer’s linear layers in place of the dequantize-to-bf16 path. In the kernel, the int8xint8-int32 accumulation is bit-exact against torch._int_mm and the dequantized output matches the reference at cosine similarity 1.0 with no NaNs, running 2.8-4.2x faster than bf16 per GEMM. End to end it delivers a ~1.1x (~9-10%) speedup at 768px, and at 1024px it generates an image in 156.5 s on a single RTX 3090, faster than the single-card NF4 (164.5 s) and FP8 (172.9 s) baselines, at no measurable quality cost on these point estimates (PickScore/CLIPScore). INT8 thus goes from the slowest variant to the fastest, and 1024px becomes single-GPU feasible. The primary speed criterion (beat FP8, by ~9.5%) is comfortably met; the NF4 margin (~4.9%, single-run n=4) is within run-to-run variance we did not quantify and is best read as consistent with meeting the stretch target. We close with an honest deployment map: the win is specific to consumer Ampere, and on A100 and B200 the same kernel loses to those cards’ fast native bf16/FP8 paths.
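
The arithmetic that the fused kernel performs can be sketched in NumPy: quantize activations per token and weights per output channel, accumulate the integer product in int32, then apply both scales and the bias in one final step. This is only a reference for the math under assumed symmetric quantization, not the paper's Triton kernel; the function names are hypothetical and nothing here touches GPU tensor cores.

```python
import numpy as np

def quantize_rows(x):
    """Symmetric per-row INT8 quantization: int8 values plus one float scale per row."""
    scale = np.abs(x).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero on all-zero rows
    q = np.clip(np.rint(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_linear(x, w, bias):
    """W8A8 linear layer. x: (tokens, in_features), w: (out_features, in_features)."""
    xq, sx = quantize_rows(x)  # per-token scales, shape (tokens, 1)
    wq, sw = quantize_rows(w)  # per-channel scales, shape (out_features, 1)
    # int8 x int8 products accumulated in int32, with no bf16/float matmul.
    acc = xq.astype(np.int32) @ wq.astype(np.int32).T
    # "Epilogue": per-token x per-channel dequantization and bias in one step.
    return acc.astype(np.float32) * sx * sw.T + bias

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)
w = rng.standard_normal((8, 64)).astype(np.float32)
b = rng.standard_normal(8).astype(np.float32)

reference = x @ w.T + b
out = int8_linear(x, w, b)
cosine = float(np.sum(out * reference) / (np.linalg.norm(out) * np.linalg.norm(reference)))
```

The slow path the abstract criticizes would instead dequantize `xq` and `wq` back to floating point before the matrix multiply, so the integer product above is the step that a native INT8 path keeps on integer hardware.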

[LG-12] Zero-shot generalization of transformer neural operators to larger domains

Link: https://arxiv.org/abs/2606.14597
Authors: Armand de Villeroché, Sibo Cheng, Vincent Le Guen, Marc Bocquet, Rem-Sophia Mouradi, Patrick Armand, Alban Farchi, Patrick Massin
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Transformer-based neural operators have shown remarkable performance for approximating solution operators of partial differential equations on complex geometries. However, existing approaches implicitly assume a fixed domain size, which limits their ability to generalize at inference. In this work, we investigate domain extension, namely zero-shot inference on spatial domains that are significantly larger than those encountered during training. We argue that this setting fundamentally requires spatial locality and translation equivariance. We propose to implement this locality via a decomposable bias in the attention logits computation, enabling finely controllable locality while remaining fully decomposable into query-key inner products and directly compatible with optimized attention kernels. Combined with rotary positional embeddings, it enables expressive embeddings with controllable spatial support without altering the transformer architecture. We empirically show that our approach substantially improves zero-shot generalization to larger domains across two PDE benchmarks and a 3D industrial atmospheric flow application. Our code and datasets are available at this https URL.

[LG-13] CANN-EUCLID: unsupervised constitutive artificial neural network model discovery from full-field data

Link: https://arxiv.org/abs/2606.14565
Authors: Benjamin Alheit, Siddhant Kumar, Mathias Peirlinck
Categories: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments:

Abstract:Constitutive artificial neural networks (CANNs) provide interpretable material model discovery, but have so far been used in stress-supervised settings based on apparent stress-strain data from homogeneous tests. Because each test samples only a narrow loading path and provides homogenized rather than local stress information, robust discovery typically requires multiple loading modes to constrain the multidimensional response. This is challenging for soft biological tissues, where repeated testing, damage, and sample variability limit reliable information from a single specimen. Here, we combine CANNs with the stress-unsupervised full-field discovery framework EUCLID to identify sparse hyperelastic laws directly from displacement fields and reaction forces in one heterogeneity-inducing loading case. CANN-EUCLID minimizes equilibrium imbalance with sparsity-promoting regularization selecting compact active terms, without local stress measurements or a prescribed law. We evaluate the approach on isotropic and anisotropic benchmarks with prescribed ground-truth laws. When the ground truth is representable by the chosen CANN basis, our method recovers the correct terms with near-exact accuracy, including exponential terms with embedded parameters. When it is not contained in the basis, the method retains shared terms and approximates missing contributions using available basis functions. Generalization depends strongly on sampled deformation states: exponential strain-stiffening terms can be recovered accurately when sufficiently probed, but can produce large extrapolation errors when the stiffening regime lies outside the sampled domain. Forward FE validation simulations show that the discovered behavior accurately replicates the ground truth. These results establish stress-unsupervised CANN discovery as a promising framework for interpretable full-field constitutive model identification.

[LG-14] ORCA: A Platform for Open-Source Dexterity Research

Link: https://arxiv.org/abs/2606.14561
Authors: Francesco Capuano, Maximilian Eberlein, Fabrice Bourquin, Clemens Claudio Christoph
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 15 pages

Abstract:Robotic manipulation research increasingly focuses on two-finger parallel grippers for their effectiveness, affordability, and ease of teleoperation. Grippers are nonetheless limited by their form factor, often requiring bimanual setups even for simple reorientation tasks. Anthropomorphic hands are a more natural platform for dexterous robot learning – closer to the human hand, and capable of learning from human video – yet they remain hard to use in learning research: even where open and accessible hand hardware exists, the software for control, simulation, teleoperation, and retargeting is scattered across one-off code bases and largely disconnected from the robot-learning ecosystem. In this work, we introduce the ORCA learning stack, an open-source research stack for dexterity as a first-class robot learning domain. Our ORCA stack unifies low-level control, simulation, teleoperation from a range of consumer platforms, and hand retargeting behind a single interface, and integrates natively with popular robot-learning frameworks such as LeRobot, so dexterous hand researchers can leverage the same data, training, and evaluation pipelines used for non-dexterous robot learning. We demonstrate a complete end-to-end workflow, collecting expert demonstrations of an in-hand reorientation task by teleoperation with a consumer-grade VR headset, training an autonomous policy with LeRobot, and evaluating the learned policy in a fully reproducible and observable setup. We open-source the entire stack as a shared, reproducible foundation for dexterous-manipulation research.

[LG-15] Provably Safe Yet Scalable Reinforcement Learning

Link: https://arxiv.org/abs/2606.14536
Authors: Kai S. Yun, Zeyang Li, Navid Azizan
Categories: Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)
Comments:

Abstract:Safe reinforcement learning (RL) aims to learn policies that optimize rewards while satisfying constraints. Predominant approaches rely on soft-constrained policy optimization, which has achieved empirical success but does not provide formal safety guarantees for the learned policy. In contrast, methods with strict guarantees typically rely on explicit certificate functions, whose construction requires the direct synthesis and verification of control-invariant sets, a process that scales poorly with state dimension and often yields overly conservative behavior. In this paper, we present the Provably Safe, yet Scalable RL (PS2-RL) framework, a novel two-phase architecture for learning provably safe policies in a scalable manner, designed to overcome the key bottlenecks of prior methods. Rather than explicitly computing invariant sets, PS2-RL leverages a learned backup policy to forward-integrate the system dynamics, generating an implicit control-invariant set online. In the first phase, the backup policy is trained with our proposed safe-arrival value function, which characterizes the optimal backup policy for invariant-set construction. In the second phase, an RL policy is trained end-to-end through a differentiable projection layer that strictly enforces the safety guarantees induced by the learned backup policy. By maximizing the volume of the implicit control-invariant set in the first phase, the resulting PS2 policy from the second phase is performant and scalable, while maintaining provable safety. Crucially, PS2-RL imposes no restrictions on the underlying RL algorithm and can be plugged into any existing training pipeline. We establish theoretical guarantees for the proposed framework and evaluate it on robotic control tasks with state dimensions up to 10, a regime in which prior provably safe RL methods struggle or become impractical.

[LG-16] The Risk Shadow of Principal Component Analysis: When 99.9999% Variance Preservation Causes Catastrophic Decision Errors

Link: https://arxiv.org/abs/2606.14533
Authors: Hamidou Tembine
Categories: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
Comments: 5 tables, 1 figure. All references fully checked manually

Abstract:Principal Component Analysis (PCA) preserves variance, not the information needed to detect rare catastrophic events. This paper proves the existence of a Risk Shadow: PCA can retain over 99.9999 percent of total variance while completely erasing all signal about rare, high-impact failures. When this happens, even the best possible classifier operating on the PCA representation reduces to a constant predictor. The root cause is a fundamental mismatch between variance maximization and tail risk awareness. To break the shadow, we introduce Expectile PCA (ExPCA) and Tail-Preserving PCA (TP-PCA), two methods that reweight the data covariance toward high-impact events. We prove theoretically that ExPCA strictly outperforms PCA in retaining rare-event information, and we validate our claims on synthetic data and a real-world credit card fraud detection benchmark. Our results call for a fundamental rethinking of variance-based dimensionality reduction in high-stakes decisions.

[LG-17] Code Correctness Signals in LLM Hidden States: Pre-Generation Probing and Repair Geometry

Link: https://arxiv.org/abs/2606.14530
Authors: Carlo Di Cicco
Categories: Machine Learning (cs.LG)
Comments: 12 pages, 8 tables. Code, data, and analysis scripts available at this https URL

Abstract:Large language models encode rich information in their hidden states. This work asks whether code correctness is legible in the hidden states of Qwen3-4B-Instruct-2507, before it generates and as it repairs a failed attempt, studied on 444 LiveCodeBench tasks. It reports two findings connected by a single confound-control tool: residualization. First, the correctness of the model’s first-attempt code is linearly decodable from the prompt-final hidden state, with a leakage-free held-out AUC of 0.931 +/- 0.008 across 50 outer splits. After the linear effect of prompt length is removed from each hidden state dimension, the probe still reaches 0.911 +/- 0.010, well above a prompt-length baseline of 0.754 +/- 0.014. Second, on 236 cleaned cases where the model attempts to repair a failed first attempt, the hidden state shift from the failing attempt to its repair carries a statistically detectable contrastive direction, significant on both a magnitude and a split-half test against label-shuffled nulls. This direction does not survive a conditional residualization against repair-context covariates that differ between successful and failed repairs, marking it as a correlate of repair success driven by the repair context rather than an isolated repair-comprehension feature. The probe layer is selected by nested cross-validation, and the same residualization approach that upholds the pre-generation correctness result overturns the repair-direction interpretation. The contribution is as much methodological as empirical: a diagnostic honest enough to report a negative result alongside a positive one.
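The residualization step mentioned in the abstract above (removing the linear effect of prompt length from each hidden-state dimension) can be illustrated with a minimal sketch. This is not the paper's code: the function name `residualize`, the synthetic data, and the array shapes are all assumptions made for illustration, and a leakage-free setup would fit the regression on training folds only.

```python
import numpy as np

def residualize(hidden, covariate):
    """Remove the linear effect of a scalar covariate from every hidden dimension.

    hidden:    (n_samples, n_dims) array of hidden states (hypothetical data)
    covariate: (n_samples,) array, e.g. prompt length
    """
    # Design matrix with an intercept column and the covariate.
    X = np.column_stack([np.ones(len(covariate)), covariate])
    # One least-squares fit per hidden dimension, solved jointly.
    beta, *_ = np.linalg.lstsq(X, hidden, rcond=None)
    # What remains is the part of each dimension not linearly explained by the covariate.
    return hidden - X @ beta

# Synthetic illustration: every dimension leaks a little prompt-length information.
rng = np.random.default_rng(0)
prompt_len = rng.integers(50, 500, size=200).astype(float)
hidden = rng.normal(size=(200, 16)) + 0.01 * prompt_len[:, None]
resid = residualize(hidden, prompt_len)
```

A probe trained on `resid` instead of `hidden` can no longer exploit a purely linear prompt-length signal, which is the kind of confound control the abstract describes.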

[LG-18] Behavioral Audit of Machine Unlearning Has a Privacy Cost

Link: https://arxiv.org/abs/2606.14518
Authors: Liou Tang, James Joshi, Ashish Kundu
Categories: Machine Learning (cs.LG)
Comments:

Abstract:The removal of learned data from Machine Learning models through Machine Unlearning (MU) has been widely studied; however, there has yet to be an agreed-upon scheme for auditing MU. Existing work has shown that a dishonest model owner can falsify evidence to avoid executing MU, while curious auditors (and adversaries) can infer the privacy-sensitive properties of the model and its training data even with limited access. Yet auditing of MU under mutual distrust between the model owner and the auditor remains unexplored. We provide an information-theoretic proof for this scenario: for convex ML models, a generic audit scheme that relies solely on querying the model for behavioral signals cannot identify insufficiently unlearned models without revealing membership information of the retained set. Therefore, auditing MU under the assumption of a dishonest model owner and an honest-but-curious auditor faces an inherent privacy-audit tradeoff. Our empirical results on convex models strongly support this result, while further experiments demonstrate that this privacy-audit tension persists in non-convex models. Our results call for a more careful consideration of the privacy-audit tension under a realistic auditor threat model, and serve as a foundation for more scrutiny of designs of privacy-preserving audit schemes for the MU pipeline. We also release our code implementation at this https URL.

[LG-19] PepALD: Macrocyclic Peptide Generation via Autoregressive Latent Diffusion

Link: https://arxiv.org/abs/2606.14510
Authors: Junming Zhang, Siyu Yi, Wei Ju, Zhonghui Gu
Categories: Machine Learning (cs.LG)
Comments: 18 pages, 5 figures, 3 tables

Abstract:Macrocyclic peptides are promising therapeutic candidates for intracellular targets, but their design requires simultaneous control over non-natural monomer chemistry, ring topology, membrane permeability, and target binding. Existing SMILES- or HELM-string generative models either operate in long atom-level sequence spaces or treat monomers as symbolic tokens with limited chemical grounding. We introduce PepALD, an Autoregressive Latent Diffusion (ALD) foundation model for de novo macrocyclic peptide generation. The model represents HELM monomers with structured chemical embeddings, generates each residue through context-conditioned diffusion in chemically informed latent space, predicts R-group-aware ring closures during autoregressive generation, and aligns the denoiser to affinity rewards using winner-protected diffusion-adapted preference optimization. In silico experiments demonstrate PepALD’s generation quality and reward-optimization performance against representative peptide generation baselines.

[LG-20] Recipe-Controlled Decoder Audit for Structural Knowledge-Graph Completion

Link: https://arxiv.org/abs/2606.14492
Authors: Xihang Shan, Ye Luo
Categories: Machine Learning (cs.LG)
Comments: 11 pages, 5 figures. Code and artifacts: this https URL

Abstract:We present a recipe-controlled decoder audit (RCDA) for structural transductive knowledge-graph completion (KGC). The audit asks a simple reporting question: before attributing gains to an encoder or training recipe, what changes when the decoder is swapped under the same recipe? Using ComplEx and DistMult as the primary controlled pair, with targeted RotatE/TransE spot-checks, we evaluate seven benchmarks. On five standard KGs, ComplEx-vs-DistMult differences are modest but consistent under our recipe (+0.005 to +0.012 MRR), whereas CompGCN-style encoder effects vary more by dataset. On small KGs, decoder effects become the main diagnostic: Kinship shows a stable ComplEx advantage of +0.143 MRR (6 seeds), while UMLS favours ComplEx by +0.022 MRR in a clean 6-seed server rerun but reverses in an earlier provenance variant. We therefore treat small-KG decoder choice as recipe- and provenance-sensitive rather than as a fixed dataset winner. We further show that decoder choice interacts with encoder depth on WN18RR, and that under our recipe L=0 ComplEx on YAGO3-10 reaches 0.6971 +/- 0.0048 MRR at d=128. The result is a compact audit protocol: report matched decoder rows, log small-KG provenance, and sweep decoder x depth before making encoder-level claims.
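For readers unfamiliar with the two decoders compared above, the following sketch shows the standard DistMult and ComplEx scoring functions together with mean reciprocal rank (MRR). It illustrates the textbook definitions only, not the paper's audit code; the function names and toy embeddings are assumptions.

```python
import numpy as np

def distmult_score(h, r, t):
    """DistMult: trilinear product of real embeddings; symmetric in h and t."""
    return float(np.sum(h * r * t))

def complex_score(h, r, t):
    """ComplEx: real part of the trilinear product with the conjugated tail."""
    return float(np.real(np.sum(h * r * np.conj(t))))

def mean_reciprocal_rank(ranks):
    """MRR over the 1-based ranks of the correct entities."""
    ranks = np.asarray(ranks, dtype=float)
    return float(np.mean(1.0 / ranks))

# DistMult cannot distinguish (h, r, t) from (t, r, h).
hd, rd, td = np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])
dm_forward = distmult_score(hd, rd, td)
dm_backward = distmult_score(td, rd, hd)

# ComplEx can: a toy one-dimensional complex embedding.
h = np.array([1.0 + 0.0j])
r = np.array([0.0 + 1.0j])
t = np.array([0.0 + 1.0j])
cx_forward = complex_score(h, r, t)
cx_backward = complex_score(t, r, h)

mrr = mean_reciprocal_rank([1, 2, 4])
```

The asymmetry of ComplEx is one plausible reason decoder choice matters on relation-heavy small graphs such as Kinship, although the abstract itself does not attribute the gap to any single cause.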

[LG-21] Nonlinear Two-Time-Scale Stochastic Approximation: A Sharp Phase Transition and How to Beat It

Link: https://arxiv.org/abs/2606.14488
Authors: Dhruv Sarkar, Vaneet Aggarwal
Categories: Information Theory (cs.IT); Machine Learning (cs.LG)
Comments:

Abstract:Recent finite-time analyses of nonlinear two-time-scale stochastic approximation show that under contractive assumptions the slow iterate $Y_k$ with stepsizes $\beta_k=\Theta(k^{-1})$ and $\alpha_k=\Theta(k^{-a})$, $a\in(1/2,1)$, generally satisfies a mean-square rate of order $k^{-a}$; decoupled $k^{-1}$ rates require strong local linearity. We identify a sharp regularity-dependent boundary. In a rate-determining normal form where the slow drift contains a locally linear leakage and a nonlinear remainder of order $1+\rho$ ($\rho\in[0,1]$), the uncorrected recursion satisfies \[ \mathbb{E}|Y_k|^2 \le C\bigl(k^{-1}+k^{-a(1+\rho)}\bigr), \] and a matching scalar Gaussian lower bound shows that the slower term is unavoidable without modifying the update. Thus the decoupled $k^{-1}$ rate is guaranteed for the uncorrected recursion exactly when $a(1+\rho)\ge 1$. This lower bound concerns only the naive update; it is not an information-theoretic obstruction. We demonstrate this by equipping the normal-form recursion with an auxiliary online bias estimator \[ M_{k+1}=M_k+\gamma_k(R(X_k)-M_k),\qquad \beta_k\ll\gamma_k\ll\alpha_k, \] and subtracting $M_k$ from the slow update. Under the same stability, moment, and remainder assumptions, the corrected recursion achieves $\mathbb{E}|\widetilde Y_k|^2=O(k^{-1})$ for every $\rho\in[0,1]$, including regimes where the uncorrected update provably suffers the slower rate. Finally, we prove localized transfer theorems that extend the phase-transition mechanism to general nonlinear TTSA in fast-manifold coordinates. The proofs are non-asymptotic and rely on two Abel-transform cancellations: one for the locally linear fast-error leakage, and one for the tracked nonlinear bias.

[LG-22] EM-NeSy: Expectation Maximization for Neurosymbolic Learning

Link: https://arxiv.org/abs/2606.14463
Authors: Annegret Seibt, Luc De Raedt, Giuseppe Marra
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Neurosymbolic (NeSy) models integrate neural networks and symbolic reasoning for robust and interpretable AI. State-of-the-art NeSy models require that the symbolic component is expressed in a differentiable way, often complicating the use of approximate inference. We propose EM-NeSy which casts probabilistic NeSy learning as an instance of the Expectation-Maximization (EM) algorithm. In the expectation step, we compute the posterior over the neurally predicted symbols conditioned on the label via probabilistic inference. In the maximization step, we update the neural parameters based on this posterior using gradient descent only through the neural component. This formulation unlocks the full potential of the EM algorithm for NeSy learning. It allows NeSy to extend naturally to approximate reasoning without any additional modifications or differentiability requirements of the symbolic component. Furthermore, it recovers the standard end-to-end gradient-based NeSy setting under exact inference. Our experimental results demonstrate the scalability and computational efficiency of EM-NeSy.

[LG-23] Federated Learning for Feature Generalization with Convex Constraints ICML2025

Link: https://arxiv.org/abs/2606.14416
Authors: Dongwon Kim, Donghee Kim, Sung Kuk Shyn, Kwangsu Kim
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Accepted at the 42nd International Conference on Machine Learning (ICML 2025)

Abstract:Federated learning (FL) often struggles with generalization due to heterogeneous client data. Local models are prone to overfitting their local data distributions, and even transferable features can be distorted during aggregation. To address these challenges, we propose FedCONST, an approach that adaptively modulates update magnitudes based on the parameter strength of the global model. This prevents over-emphasizing well-learned parameters while reinforcing underdeveloped ones. Specifically, FedCONST employs linear convex constraints to ensure training stability and preserve locally learned generalization capabilities during aggregation. A Gradient Signal to Noise Ratio (GSNR) analysis further validates the effectiveness of FedCONST in enhancing feature transferability and robustness. As a result, FedCONST effectively aligns local and global objectives, mitigating overfitting and promoting stronger generalization across diverse FL environments, achieving state-of-the-art performance.

[LG-24] A theoretical model for task routing in mixture-of-expert transformers

Link: https://arxiv.org/abs/2606.14398
Authors: Yongli Xiang, Vinoth Nandakumar, Yunzhi Yao, Peike Li, Tongliang Liu
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Mixture-of-experts (MoE) layers enable the scaling of transformer models while keeping the inference compute fixed. While task-expert specialization has been observed in empirical studies of frontier MoE transformer models, existing theoretical work analyzes this using continuous mixture models that cannot be used to model natural language effectively. An important open question is to theoretically explain task-expert specialization in transformer MoE models using discrete models of language. To address this, we represent structured knowledge via syntactic templates and finite key-value dictionaries, and prove formally that a single-layer MoE transformer can encode knowledge by using experts that specialize in the corresponding tasks. Our construction shows how queries are routed to unique, task-specific experts whose size depends solely on the intrinsic complexity of the given task (i.e., the combined size of its syntactic templates and factual dictionary). Our construction provides theoretical support for empirical results on localized knowledge circuits in MoE models. We support our theoretical findings with experiments evaluating model performance under varying MoE loss functions.

[LG-25] Running the Gauntlet: Re-evaluating the Capabilities of Agents Beyond Familiar Environments

Link: https://arxiv.org/abs/2606.14397
Authors: Mykola Vysotskyi, Runqi Lin, Grzegorz Biziel, Michal Zakrzewski, Sebastian Montagna, Damian Rynczak, Shreyansh Padarha, Kumail Alhamoud, Zihao Fu, William Lugoloobi, Kai Rawal, Hanna Yershova, Xander Davies, Taras Rumezhak, Guohao Li, Fazl Barez, Baoyuan Wu, Arkadiusz Drohomirecki, Yarin Gal, Chris Russell, Christopher Summerfield, Adam Mahdi, Volodymyr Karpiv, Philip Torr, Adel Bibi
Categories: Machine Learning (cs.LG)
Comments:

Abstract:As agentic systems continue to evolve and are widely deployed in real-world scenarios, there is a growing demand to faithfully evaluate their capabilities. However, current benchmarks are typically built on popular applications with relatively simple tasks and focus on a narrow set of capabilities while overlooking broader dimensions, resulting in saturated performance on modern agents and failing to probe their limitations. To this end, we introduce GauntletBench, a web-based benchmark for evaluating agent generalisation in challenging scenarios, focusing on three underexplored capabilities (temporal perception, graphical understanding, and 3D reasoning), across five less-covered professional applications (Video Editor, Workflow Builder, 3D Modeller, Flight Analyser, and Circuit Designer), each with 20 vision-intensive tasks (100 in total). Our benchmark provides a modular pipeline that comprises an environment compatible with both open- and closed-source agent frameworks, a controlled web-based application, a well-structured task suite, and an automated evaluation engine with diverse metrics. Contrary to widespread expectations, our empirical results reveal that frontier agentic systems remain far from achieving human-level performance. Even the state-of-the-art agent achieves only a 19.1% success rate on our GauntletBench, highlighting the limitations in these overlooked capabilities and generalisation. By comparison, non-expert human annotators achieve over 80% success on our challenging yet feasible tasks, revealing the substantial gap between current agent capabilities and those required for complex real-world scenarios.

[LG-26] A Low-Rank Subspace Analysis of LLM Interventions ICML2026

Link: https://arxiv.org/abs/2606.14388
Authors: Angira Sharma, Christian Schroeder de Witt, Philip Torr, Anisoara Calinescu, Jialin Yu
Categories: Machine Learning (cs.LG)
Comments: Mechanistic Interpretability Workshop @ ICML 2026

Abstract:Interventions designed to modify a particular behavior in LLMs, such as refusal or sycophancy, often produce unintended changes in other behaviors. This lack of targeted control makes it difficult to design and implement reliable safety controls. To understand these side-effects, we introduce a diagnostic framework for analyzing interacting behaviors in LLMs. We model behaviors as low-rank subspaces in activation space, and study how interventions on one behavior influence others. Across multiple instruction-tuned models (7B-70B) and across refusal, jailbreak, and sycophancy settings, we find that different behaviors share internal representations, and intervening on one behavior alters others in asymmetric ways. Some behaviors act as upstream control points whose interventions propagate broadly across other behaviors, while others remain more isolated. We relate these effects to two geometric quantities: (i) the overlap between behavior subspaces, measured as the average squared cosine of principal angles, and (ii) the angle between each behavior subspace and the decision subspace (capturing the model’s final decision, e.g., refuse vs. comply). Empirically, intervention effects on other behaviors tend to be larger for behavior pairs with higher subspace overlap, and for source behaviors whose subspaces lie closer (smaller angle) to the decision subspace. These findings highlight a challenge for targeted behavior control: behaviors are difficult to modify independently, as interventions can propagate through shared representations and asymmetric interactions.
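The overlap measure named in the abstract above, the average squared cosine of the principal angles between two subspaces, has a standard linear-algebra computation, sketched below. The function name `subspace_overlap` and the toy bases are assumptions for illustration; how the paper extracts the behavior subspaces is not specified here, and the sketch assumes each basis matrix has full column rank.

```python
import numpy as np

def subspace_overlap(A, B):
    """Average squared cosine of the principal angles between span(A) and span(B).

    A, B: (d, k) matrices whose columns span each subspace (assumed full column rank).
    Returns a value in [0, 1]: 1 for identical subspaces, 0 for orthogonal ones.
    """
    # Orthonormal bases via reduced QR.
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    # Singular values of Qa^T Qb are the cosines of the principal angles.
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    cosines = np.clip(cosines, 0.0, 1.0)  # guard against round-off above 1
    return float(np.mean(cosines ** 2))

# Toy subspaces of R^4 built from coordinate axes.
I = np.eye(4)
same = subspace_overlap(I[:, :2], I[:, :2])        # identical planes
half = subspace_overlap(I[:, :2], I[:, 1:3])       # planes sharing one axis
disjoint = subspace_overlap(I[:, :2], I[:, 2:])    # orthogonal planes
```

In the toy case, two planes that share exactly one axis have principal-angle cosines 1 and 0, so the overlap is 0.5.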

[LG-27] SemPiper: Interactive Code Synthesis for Semantic Operators in Machine Learning Pipelines VLDB2026

Link: https://arxiv.org/abs/2606.14361
Authors: Olga Ovcharenko, Luciano Duarte, Sebastian Schelter
Categories: Machine Learning (cs.LG); Databases (cs.DB)
Comments: Accepted at VLDB 2026 (Demonstrations track)

Abstract:Machine learning (ML) pipelines require extensive data preparation, feature engineering, and integration across heterogeneous sources, making them tedious and error-prone to develop. While large language models (LLMs) have recently shown promise for assisting programming tasks, chat-based interfaces provide limited control over pipeline behavior and often produce code that is difficult to optimize or integrate into production systems. We demonstrate SemPipes, a novel programming model that extends ML pipelines with declarative, LLM-powered semantic data operators. SemPipes allows developers to specify high-level natural language instructions for data-centric operations, while seamlessly combining these operators with arbitrary Python code from standard data science libraries. For the semantic operators, it synthesizes specialized implementations at pipeline training time, conditioned on dataset characteristics and pipeline context, enabling the flexible yet controlled integration of LLM capabilities. We demonstrate SemPipes through SemPiper, an interactive interface that visualizes computational graphs of the pipelines, synthesized operator implementations, and optimization trajectories produced by an evolutionary search procedure. Attendees can explore three end-to-end scenarios, modify pipelines, inspect generated code, and observe how semantic operators are synthesized and iteratively optimized. The demonstration highlights how declarative semantic operators enable controllable, optimizable, and practical integration of LLMs into ML pipeline development.

[LG-28] MUFFLe: Efficient Model Update Compression via Generalized Deduplication for Federated Learning

Link: https://arxiv.org/abs/2606.14354
Authors: Xiaobo Zhao, Daniel E. Lucani
Categories: Machine Learning (cs.LG)
Comments: Accepted at IEEE EDGE 2026 (Work-in-Progress track)

Abstract:Federated learning is well suited to edge environments but is often limited by the uplink cost of transmitting model updates. This Work-in-Progress paper presents MUFFLe, a communication-efficient update compression scheme that integrates generalized deduplication (GD) into the FedAvg pipeline. MUFFLe deduplicates repeated patterns across the update vector, yielding a fixed-rate, variable-count compression scheme. Preliminary experiments on IID MNIST with 20 clients show that MUFFLe reaches the target accuracy of 92.93% with 38 MB cumulative uplink communication, compared with 75 MB for 8-bit quantization, 86 MB for Top-k sparsification, and 310 MB for uncompressed FedAvg. These results demonstrate the feasibility of applying GD to communication-efficient federated learning.

[LG-29] Can Deep Neural Networks Improve Compression of Very Large Scientific Data?

Link: https://arxiv.org/abs/2606.14353
Authors: Muhannad Alhumaidi, Guozhong Li, Spiros Skiadopoulos, Panos Kalnis
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Error-bounded lossy compression is a fundamental technique for managing the rapidly growing volumes of scientific data produced by modern simulations and observational instruments. Most state-of-the-art compressors follow a prediction-residual paradigm, where compression effectiveness depends on the quality of the predictor: more accurate predictions generate smaller residuals that are easier to compress. This observation raises a question: can modern machine learning models serve as superior predictors for scientific data compression? Answering this question directly is challenging because developing compression-specific ML predictors requires substantial resources. Instead, we leverage the climate domain, where highly accurate pretrained weather forecasting foundation models already exist, making it an ideal testbed. We present a framework that integrates spatial and temporal deep learning models into a conventional error-bounded compression pipeline. The framework supports auto-regressive forecasting models and avoids error accumulation. Using ERA5 climate data as a representative large-scale scientific dataset, we evaluate three distinct ML predictors: a VAEformer-based codec (CRA5), a graph neural network forecaster (GraphCast), and a vision-transformer forecaster (Aurora), against the state-of-the-art compressor SZ3.1 under identical quantization and entropy-coding backends. Our evaluation over approximately 1.7 TB of data reveals a surprising result: although ML predictors generate more accurate predictions and can improve reconstruction quality by up to 91% while achieving up to 9.6x higher compression ratios for highly predictable variables, they do not improve overall dataset-level compression ratio. We show that prediction accuracy alone is insufficient: the spatial structure of the resulting residuals plays a decisive role in entropy coding efficiency.
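The prediction-residual paradigm described above can be sketched in a few lines: quantize the residual between the data and a prediction into integer codes with bin width twice the error bound, so that the reconstruction never deviates by more than the bound. This is a simplified illustration, not the paper's framework or the SZ3 implementation; the function names and the toy predictors are assumptions, and a real codec would predict from already-reconstructed values and entropy-code the integers.

```python
import numpy as np

def quantize_residuals(data, prediction, error_bound):
    """Map residuals to integer codes using bins of width 2 * error_bound."""
    residual = data - prediction
    return np.round(residual / (2.0 * error_bound)).astype(np.int64)

def reconstruct(codes, prediction, error_bound):
    """Invert the quantization; the pointwise error is at most error_bound."""
    return prediction + codes * (2.0 * error_bound)

# Toy 1-D signal: a random walk (hypothetical stand-in for a scientific field).
rng = np.random.default_rng(0)
data = np.cumsum(rng.normal(size=1000))
eb = 0.05

# A weak predictor (always zero) versus a stronger one (previous true value).
# Using true previous values is a simplification made for this sketch only.
weak_pred = np.zeros_like(data)
strong_pred = np.concatenate([[0.0], data[:-1]])

weak_codes = quantize_residuals(data, weak_pred, eb)
strong_codes = quantize_residuals(data, strong_pred, eb)
weak_recon = reconstruct(weak_codes, weak_pred, eb)
strong_recon = reconstruct(strong_codes, strong_pred, eb)
```

Both reconstructions respect the same error bound; what differs is the distribution of the integer codes handed to the entropy coder, which is where the abstract locates the decisive effect.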

[LG-30] When Language Representations Interact: Separability and Cross-Lingual Effects in LLMs ICML2026

Link: https://arxiv.org/abs/2606.14347
Authors: Boris Marinov, Angira Sharma, Christian Schroeder de Witt, Philip Torr, Anisoara Calinescu, Jialin Yu
Categories: Machine Learning (cs.LG)
Comments: Trustworthy AI for Good (AI4Good) Workshop @ ICML 2026

Abstract:Large language models exhibit strong multilingual capabilities; however, their internal representations are difficult to interpret. Understanding how language representations interact is important for ensuring reliable behavior in multilingual systems. Recent work has shown that causal-geometric structure can explain how certain concepts are encoded as approximately linear and separable directions, but whether this framework extends to multilingual models, where language identity is correlated and hierarchical, is underexplored. We apply causal-geometric analysis to multilingual LLMs, studying 28 bilingual contrasts across three models, allowing us to analyze when languages behave as approximately independent factors and when structured dependencies persist. We find evidence that language concepts admit stable linear representations that are largely separable under a covariance-adjusted (causal) inner product, with structured deviations reflecting linguistic similarity. Moreover, languages within the same family (such as Germanic or Romance) exhibit a simplex-like geometric structure, suggesting hierarchical organization. These results extend causal-geometric interpretability to multilingual settings and provide insight into how separability and similarity may coexist in multilingual LLM representations, motivating interpretability analyses that diagnose when and how structured dependencies between concepts can be anticipated. This has implications for trustworthy deployment, as residual structure between languages may lead to unintended cross-lingual effects when models are monitored or intervened upon.

[LG-31] More with LESS – Local Scene Representations for Tactile Imaging

Link: https://arxiv.org/abs/2606.14344
Authors: Zohar Rimon, Elisei Shafer, Tal Tepper, Daniel Kozin, Alon Malka, Roy Holland, Aviv Tamar
Categories: Machine Learning (cs.LG)
Comments: RSS 2026

Abstract:Tactile imaging seeks to reconstruct the internal structure of soft objects through touch sensing, with applications in medical diagnosis and robotic manipulation. Recent self-supervised learning approaches have shown promising results, but rely on global, unstructured representations and robot-controlled sensing, limiting generalization and practical use. We propose Local Encoder for Spatial Sensing (LESS), an object-centric tactile representation that exploits the local nature of touch. The tactile scene is modeled as a grid of recurrent encoders with local receptive fields, whose states are fused to reconstruct 2D or 3D images of internal structure. This compositional design enables strong generalization: models trained on single-inclusion phantoms accurately image objects with multiple inclusions and varying sizes. The local structure further supports spatial uncertainty estimation. In addition, we enable hand-held tactile imaging via external pose tracking and human-like palpation data, and extend tactile imaging to full 3D reconstruction.

[LG-32] Riemannian Metric Matching for Scalable Geometric Modeling of Distributions ICML2026

Link: https://arxiv.org/abs/2606.14334
Authors: Jacob Bamberger, Adam Gosztolai, Pierre Vandergheynst, Michael Bronstein, Iolo Jones
Categories: Machine Learning (cs.LG); Differential Geometry (math.DG)
Comments: ICML 2026 (Oral)

Abstract:High-dimensional datasets often concentrate near low-dimensional structures, but estimating their geometry from samples typically relies on graphs and kernels that scale poorly with dataset size and dimension. We propose Riemannian metric matching: a denoising probabilistic framework for learning the Riemannian geometry of data using neural networks. Specifically, we learn the carré du champ operator, which, using diffusion geometry, gives us access to the Riemannian geometry toolkit for downstream machine learning and statistical tasks. Our key observation is that the carré du champ operator can be formulated as a conditional expectation over random perturbations of the data, which can be exploited for sample-wise training and constant-cost, amortized inference without explicit kernel construction. Empirically, metric matching rivals or improves the accuracy of k-NN-based diffusion geometry estimators, while enabling amortized inference that is up to 400× faster, and supports graph-free geometric analysis on high-dimensional images where nearest neighbors break down.

[LG-33] Beyond a Single Explanation of the Adam–SGD Gap

Link: https://arxiv.org/abs/2606.14259
Authors: Chenxiang Zhang, Rustem Islamov, Enea Monzio Compagnoni, Jun Pang, Aurelien Lucchi, Antonio Orvieto
Categories: Machine Learning (cs.LG)
Comments: preprint

Abstract:Prior work has identified several factors that can contribute to the performance gap between Adam and SGD, spanning data aspects, architecture design, and optimization properties. Yet these explanations are often studied in isolation, leaving their relative importance unclear. In this work, we revisit these hypotheses through a controlled empirical study across vision, language, genomics, and graph tasks, spanning modern and classical architectures, and carefully designed training setups. Our results suggest that no single factor consistently explains the Adam–SGD gap. For instance, the Adam advantage can (1) persist under a uniform vocabulary distribution yet nearly disappear under a heavy-tailed one; (2) reverse in favor of SGD in softmax-attention models; and (3) become larger under soft architectural modifications, e.g., when ReLU is replaced by a GeLU nonlinearity. This suggests that the gap arises from nontrivial data and architecture interactions, rather than from a single common factor. Yet, we observe a pattern across our settings: a crossover batch size at which the relative advantage shifts from SGD to Adam as the batch size scales. These empirical results are captured by our theoretical gap model, which predicts this batch-size-dependent crossover. Our perspective helps reconcile several existing hypotheses while offering practical insights across domains.

[LG-34] Where Black-box Drug-Target Interaction Prediction Models Look: Cross-Method Explainability

Link: https://arxiv.org/abs/2606.14245
Authors: Ali Vefghi,Zahed Rahmati,Mohammad Akbari
Subjects: Machine Learning (cs.LG)

Abstract:Drug-target interaction (DTI) and affinity (DTA) predictors increasingly achieve strong benchmark scores, yet their internal use of sequence, fingerprint, and graph features often remains opaque. We present an interpretability audit of the BridgeDPI architecture on three different datasets including Gao, Human, and this http URL. This study combines gradient-based attributions – integrated gradients, saliency, layer-wise relevance propagation, SmoothGrad, and SmoothGrad-IG – with feature-wise occlusion ablation and strict intersection consensus across methods to reduce single-explainer bias. We summarize sensitivity and signed effects at raw inputs, at the bridge similarity scaffold, and through the graph convolution, including edge-level sensitivities and targeted edge removals. The results show that explainability is most informative when treated as model criticism: it reveals modality dominance, padding and special-token artifacts, dataset-dependent cooperative versus suppressive effects across layers, and chemistry-consistent fragment and composition motifs where methods agree. These analyses do not substitute for structural or experimental ground truth, yet they can provide testable hypotheses for downstream validation in computational drug discovery pipelines. More broadly, applying modern XAI to contemporary DTI/DTA models is still an early pass over the rich structure implicit in trained weights and data – yet even this first layer of scrutiny already helps researchers relate predictions to drug- and target-side representations and to prioritize external validation.

[LG-35] Implicit Variational Rejection Sampling

Link: https://arxiv.org/abs/2606.14235
Authors: Jian Xu,Shigui Li,Wei Chen,Jiacheng Li,Zhiqi Lin,Delu Zeng,Xinghao Ding,John Paisley,Qibin Zhao
Subjects: Machine Learning (cs.LG)

Abstract:Variational Inference (VI) is a fundamental inference technique in Bayesian machine learning for approximating complex posterior distributions. Traditional VI often relies on the mean-field factorization, which can inadequately capture true posterior complexity. Recent advancements have leveraged neural networks to model implicit distributions, offering increased flexibility. However, the practical constraints of neural network architectures still produce inaccuracies. In this paper, we propose a method called Implicit Variational Rejection Sampling (IVRS), which integrates implicit distributions with rejection sampling to improve the posterior approximation. Our method uses neural networks to construct implicit proposal distributions, and rejection sampling with a discriminator network that estimates the density ratio between the implicit proposal and the true posterior for refining the approximation. Towards this end, we introduce the Implicit Resampling Evidence Lower Bound (IR-ELBO) as a metric to characterize the resampled distribution’s quality and derive a tighter variational lower bound. Experimental results demonstrate that our method outperforms traditional variational inference techniques.
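
The refinement step described in this abstract builds on textbook rejection sampling, which can be sketched as follows. This is not the authors' IVRS: it assumes a closed-form density ratio where IVRS would use a discriminator estimate, and the names `target_pdf`, `proposal_pdf`, `density_ratio`, and the bound `M` are hypothetical choices for illustration.

```python
import math
import random

SCALE = 2.0  # standard deviation of the illustrative proposal

def target_pdf(x):
    # Standard normal density, standing in for the true posterior.
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def proposal_pdf(x):
    # Wider normal N(0, SCALE^2), standing in for the implicit proposal.
    return math.exp(-0.5 * (x / SCALE) ** 2) / (SCALE * math.sqrt(2 * math.pi))

def density_ratio(x):
    # Assumption: exact ratio target/proposal. IVRS would estimate this
    # quantity with a discriminator network instead.
    return target_pdf(x) / proposal_pdf(x)

def rejection_sample(n, M=2.0, seed=0):
    # M must upper-bound the ratio; here the ratio peaks at x = 0 with value SCALE.
    rng = random.Random(seed)
    accepted, proposed = [], 0
    while len(accepted) < n:
        x = rng.gauss(0.0, SCALE)          # draw from the proposal
        proposed += 1
        if rng.random() < density_ratio(x) / M:
            accepted.append(x)             # keep with probability ratio / M
    return accepted, proposed

samples, proposed = rejection_sample(5000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"acceptance rate {len(samples) / proposed:.2f}, mean {mean:.2f}, variance {var:.2f}")
```

The accepted samples follow the target rather than the wider proposal, at the cost of discarding roughly a fraction 1 - 1/M of the draws.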

[LG-36] Learning the Context of Errors: Black-Box Online Adaptation of Time Series Foundation Models

Link: https://arxiv.org/abs/2606.14222
Authors: Xilin Dai,Yiding Liu,Hongjie Xia,Yifan Hu,Zewei Dong,Jiang-Ming Yang,Qiang Xu
Subjects: Machine Learning (cs.LG)

Abstract:The rapid evolution of Time Series Foundation Models (TSFMs) has advanced zero-shot forecasting across diverse domains. Inspired by the current form of Large Language Models, future TSFMs may be offered as commercialized, closed-source API services. However, many existing online adaptation methods still rely on white-box access for parameter fine-tuning or gradient backpropagation. This paradigm mismatch raises a question: In black-box online adaptation for TSFMs, what should we learn? We answer this with an insight: the predictive errors of the base model are conditioned on both the input and output of the base model (i.e., the context of errors). To validate this insight, we propose ORCA (Online Residual Contextual Adaptation). We conduct extensive experiments across 5 state-of-the-art TSFMs and 8 datasets to demonstrate the effectiveness of our approach. Furthermore, through ablation studies, we quantitatively analyze the impact of different adapter learning hypotheses on the final adaptation performance in black-box online adaptation. Code available at this https URL.
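
The idea of learning the base model's error from its input and output can be illustrated with a small online residual corrector. This is a sketch under stated assumptions, not ORCA itself: `base_model`, `ResidualAdapter`, the normalized-LMS update, and the synthetic series are all hypothetical stand-ins.

```python
import random

def base_model(x):
    # Hypothetical black-box forecaster: it can be called but not inspected or fine-tuned.
    return 0.5 * x

class ResidualAdapter:
    """Online linear model of the base model's error, conditioned on its input and output."""

    def __init__(self, step=0.5, eps=1e-8):
        self.w = [0.0, 0.0, 0.0]
        self.step, self.eps = step, eps

    def features(self, x, base_pred):
        return [x, base_pred, 1.0]

    def predict(self, x, base_pred):
        f = self.features(x, base_pred)
        return sum(wi * fi for wi, fi in zip(self.w, f))

    def update(self, x, base_pred, residual):
        # Normalized LMS step toward the residual that was just observed.
        f = self.features(x, base_pred)
        err = residual - self.predict(x, base_pred)
        norm = self.eps + sum(fi * fi for fi in f)
        self.w = [wi + self.step * err * fi / norm for wi, fi in zip(self.w, f)]

rng = random.Random(0)
adapter = ResidualAdapter()
x, base_abs_err, adapted_abs_err = 5.0, [], []
for t in range(500):
    y = 0.8 * x + 1.0 + rng.gauss(0.0, 0.1)   # synthetic ground truth
    p = base_model(x)
    corrected = p + adapter.predict(x, p)     # forecast issued before y is revealed
    base_abs_err.append(abs(y - p))
    adapted_abs_err.append(abs(y - corrected))
    adapter.update(x, p, y - p)               # learn from the revealed error
    x = y

base_mae = sum(base_abs_err[100:]) / len(base_abs_err[100:])
adapted_mae = sum(adapted_abs_err[100:]) / len(adapted_abs_err[100:])
print(f"base MAE {base_mae:.3f} vs adapted MAE {adapted_mae:.3f}")
```

The adapter only ever sees the base model's inputs, outputs, and realized errors, which is the access pattern a closed API would allow.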

[LG-37] Curvature-Informed Potential Energy Surface for Protein-Ligand Binding Affinity Prediction

Link: https://arxiv.org/abs/2606.14217
Authors: Peng-Fei Sun,Chuan-Xian Ren,Hong Yan
Subjects: Machine Learning (cs.LG); Biomolecules (q-bio.BM)

Abstract:Accurate prediction of protein-ligand binding affinity is essential for structure-based drug discovery. Recent geometric deep learning methods have achieved promising performance by representing protein-ligand complexes as three-dimensional graphs. However, most existing approaches mainly rely on static interaction geometry from a single bound conformation, while neglecting molecular flexibility and binding-induced conformational changes. To address this limitation, we propose a curvature-informed potential energy surface (CPES) graph neural network for protein-ligand binding affinity prediction, which incorporates physics-informed curvature representations to model conformational flexibility. CPES first derives curvature spectral descriptors from the Hessian of the potential energy surface evaluated at equilibrium configurations, whose eigenvalues define the local principal curvatures of the potential energy surface. It then uses spectral cross-attention to compare the unbound ligand and protein with the bound complex, thereby capturing binding-induced changes in conformational dynamics. In parallel, hierarchical protein-ligand interaction representations are learned from static structural features through geometry-aware message passing, soft clustering, and bidirectional cross-attention. Finally, CPES fuses the curvature-informed dynamic representations with static interaction representations for affinity regression. Extensive evaluations on multiple benchmark datasets demonstrate that CPES achieves improved predictive performance and offers physical interpretability.

[LG-38] LapidaryEngine: Fully Conversational Crystal Generation

Link: https://arxiv.org/abs/2606.14215
Authors: Yusei Ito,Yuta Suzuki,Tomoya Murata,Masaki Adachi
Subjects: Machine Learning (cs.LG)
Comments: 11 main pages, 5 main figures, and 1 table

Abstract:The emergence of Large Language Models (LLMs) has inspired the vision of generating bespoke crystal materials directly from natural-language instructions, enabling users to design materials through intuitive, conversational interaction. Existing text-to-crystal generative models represent important early steps toward this goal, but they suffer from two critical limitations: (i) restricted input formats that require highly structured descriptions (e.g., chemical formulas), and (ii) one-directional generation, where models can map text to crystal but cannot perform the inverse. These limitations prevent fully conversational workflows and hinder alignment with users’ inherently ambiguous and evolving desiderata. We address these challenges with LapidaryEngine, the first model to support fully conversational crystal generation. LapidaryEngine accepts free-form natural-language requests and performs iterative refinement and editing in a dialogue-like manner. The key innovation is a pivot representation, a third, intermediate form that enables bidirectional translation between text and crystal structures despite the absence of direct paired datasets. Leveraging this pivot allows robust interpretation of user feedback and precise structural control. We demonstrate LapidaryEngine across diverse tasks, including insulator discovery, stability optimization, compositional modification, and structural editing, showcasing its ability to align generated materials with user intent in an interactive manner.

[LG-39] Structured Noise Adaptation for Sequential Bayesian Filtering with Embedded Latent Transfer Operators

Link: https://arxiv.org/abs/2606.14195
Authors: Naichang Ke,Pongpisit Thanasutives,Yoshinobu Kawahara
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: Accepted by TMLR

Abstract:Kalman filters based on the Embedded Latent Transfer Operators (ELTO) emerge as novel statistical tools for sequential state estimation. However, a critical limitation stems from their use of simplified noise models, which fail to dynamically adapt to non-stationary processes. To address this limitation, we introduce an ELTO-based Bayesian filtering approach with a new structured parameterization for the filter’s noise model. This parameterization enables structured noise adaptation, which couples the data-driven learning of an optimal time-invariant noise model with dynamic parameter adaptation that responds to changes in dynamics within non-stationary processes. Empirical results show that our structured noise adaptation improves the filter’s dynamic state estimation performance in noisy, time-varying environments.

[LG-40] DRIVE: Distributional and Retrieval-Augmented Bidding with Value Evaluation ICML2026

Link: https://arxiv.org/abs/2606.14192
Authors: Miduo Cui,Haochen Wang,Shangqin Mao,Xun Yang,Qianlong Xie,Xingxing Wang,Xuri Ge,Ying Zhou,Zhiwei Xu
Subjects: Machine Learning (cs.LG)
Comments: Accepted to ICML 2026

Abstract:Auto-bidding is a core component of real-time advertising systems, where decisions must optimize long-term performance under budget and cost constraints, while online exploration is prohibitively risky. Offline reinforcement learning and, more recently, Transformer-based sequence modeling have shown promise for learning bidding policies from logged data, but their unimodal and purely parametric formulations often collapse multiple effective bidding strategies into suboptimal averaged actions and perform unreliably under sparse or long-tail traffic. To mitigate these limitations, we propose DRIVE (Distributional and Retrieval-Augmented Bidding with Value Evaluation), a unified Transformer-based framework that decouples candidate action generation from decision making for offline auto-bidding. DRIVE combines distributional action modeling, retrieval-augmented candidate generation from high-quality historical decisions, and value-based evaluation to select the most promising bid at inference time. Extensive experiments on AuctionNet and additional offline reinforcement learning benchmarks demonstrate that DRIVE consistently improves bidding performance and generalizes well across multiple Transformer-based methods.

[LG-41] Zeta: Dual Whitening for Matrix Optimization via Coordinate-Adaptive Preconditioning

Link: https://arxiv.org/abs/2606.14187
Authors: Kaiwen Chen,Shuhai Zhang,Qiuwu Chen,Zimo Liu,Linxiao Li,Ying Sun,Yuchen Li,Yifan Zhang,Bo Han,Mingkui Tan
Subjects: Machine Learning (cs.LG)

Abstract:Large-scale neural network training increasingly relies on matrix-aware optimizers that exploit the structure of weight parameters beyond element-wise adaptation. However, existing matrix-aware methods such as Muon have an underappreciated vulnerability: their core operation, Newton-Schulz iteration, depends critically on input conditioning, yet the raw momentum matrices exhibit severe coordinate-wise scale heterogeneity. In this paper, we first verify this scale heterogeneity through a chi-square uniformity test, showing that intra-matrix scale imbalance is prevalent across Transformer layers and that coordinate whitening effectively corrects it. Motivated by this finding, we propose Zeta, a dual whitening optimizer that applies coordinate whitening and spectral whitening in a strictly ordered pipeline. The ordering is not a tunable choice but follows from a mathematical dependency: coordinate whitening establishes the statistical isotropy that spectral whitening requires to function reliably. We further prove that this dual pipeline strictly reduces orthogonalization error relative to pure spectral methods by improving the condition number of the input. Empirically, Zeta matches or surpasses strong baselines across language modeling (0.6B to 8B parameters), mixture-of-experts architectures, and vision tasks, demonstrating that resolving scale imbalance before orthogonalization leads to faster convergence and better generalization. Code is available at this https URL.
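
The dependence of Newton-Schulz iteration on input conditioning can be seen on a 2×2 toy problem. This is a sketch, not the Zeta or Muon implementation: it uses the classic cubic Newton-Schulz iteration, and `normalize_rows` is a hypothetical stand-in for coordinate whitening, whose exact form the abstract does not give.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frobenius(A):
    return math.sqrt(sum(v * v for row in A for v in row))

def newton_schulz(G, steps=5):
    # Scale so every singular value is at most 1, then iterate X <- 1.5 X - 0.5 X X^T X.
    norm = frobenius(G)
    X = [[v / norm for v in row] for row in G]
    for _ in range(steps):
        XXtX = matmul(matmul(X, transpose(X)), X)
        X = [[1.5 * x - 0.5 * y for x, y in zip(rx, ry)] for rx, ry in zip(X, XXtX)]
    return X

def orthogonality_error(X):
    # Frobenius distance between X X^T and the identity.
    XXt = matmul(X, transpose(X))
    n = len(XXt)
    return frobenius([[XXt[i][j] - (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)])

def normalize_rows(G):
    # Hypothetical stand-in for coordinate whitening: rescale each row to unit norm.
    out = []
    for row in G:
        n = math.sqrt(sum(u * u for u in row))
        out.append([v / n for v in row])
    return out

well = [[2.0, 1.0], [1.0, 3.0]]
skewed = [[2.0, 1.0], [0.001, 0.003]]   # second row of `well` scaled by 1e-3

err_well = orthogonality_error(newton_schulz(well))
err_skewed = orthogonality_error(newton_schulz(skewed))
err_rescaled = orthogonality_error(newton_schulz(normalize_rows(skewed)))
print(err_well, err_skewed, err_rescaled)
```

With the same five steps, the row-scaled matrix stays far from orthogonal because its smallest singular value grows by only about 1.5× per step, while rescaling the rows first restores fast convergence.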

[LG-42] Robin-Neumann Coupling of PINN and FEM Solvers: A Steklov-Poincaré View with Application to Fluid-Structure Interaction with Contact

Link: https://arxiv.org/abs/2606.14181
Authors: Mikel Landajuela
Subjects: Numerical Analysis (math.NA); Machine Learning (cs.LG)

Abstract:Physics-informed neural networks (PINNs) are meshless and carry moving geometry and topology change through resampling of collocation points; the finite-element method (FEM) is the workhorse for boundary-fitted discretisations. Coupling the two across a shared interface promises the best of both, yet existing PINN-FEM schemes are validated only empirically. We put the coupling on a domain-decomposition footing: viewing each solver as a Steklov-Poincaré (trace-to-flux) operator, we transfer the classical Dirichlet-Neumann (DN) divergence diagnosis and its Robin-Neumann (RN) cure, including a closed-form, sweep-free interface impedance, and prove a PINN-specific contraction theorem: a trained network realises only a perturbed Steklov operator with a per-step training residual, and RN still contracts, with no shared-eigenbasis hypothesis, to a floor set by the achieved training loss. Because a PINN has no stiffness matrix, we introduce a Fourier-mode interface probe that recovers the network’s resolvable Steklov eigenvalues to within 0.5% and doubles as a diagnostic of the network’s spectral cap. The theory predicts measured PINN-FEM contraction rates to within 7% on 1D and 2D Poisson couplings, and a two-slab analogue of the large-added-mass regime shows RN’s per-mode impedance matching winning decisively where tuned scalar relaxation saturates. We demonstrate the framework on a Stokes/rigid-disc problem with Alart-Curnier contact: the meshless PINN fluid absorbs the topology change at contact by collocation exclusion alone, no remeshing and no cut cells, and the static-equilibrium contact reaction matches the submerged weight to 0.4% under mesh refinement. We quantify remaining limitations: the warm-started PINN drifts off the Stokes manifold over long horizons, and matched FEM-FEM benchmarks attribute pre-impact squeeze-film signatures to PINN under-resolution.

[LG-43] Machine Learning for Biomedical Raman Spectroscopy: From Spectral Acquisition to Clinical Translation

Link: https://arxiv.org/abs/2606.14169
Authors: Bogdan Oancea,Ana Maria Seciu-Grama,Nicoleta Siminea,Laura Mihaela Stefan,Alice Stoica,Joel Sjoberg,Marian Necula,Ana-Maria Prelipcean,Corneliu Ovidiu Vrancianu,Eduard Milea,Andrei Păun,Ion Petre,Mihaela Păun
Subjects: Machine Learning (cs.LG)
Comments: 52 pages, 2 figures

Abstract:Raman spectroscopy provides label-free, chemically specific characterization of biological systems and has become an important tool for cancer diagnosis, molecular subtyping, microbiological identification, and intraoperative decision support. Biomedical Raman spectra are, however, high-dimensional, noisy, and affected by fluorescence background, acquisition variability, and biological heterogeneity, making robust computational analysis essential. This review examines the role of machine learning across the biomedical Raman spectroscopy pipeline, from preprocessing and signal correction to unsupervised structure discovery, supervised diagnosis and molecular stratification, representation and transfer learning, explainability, biomarker discovery, and multimodal integration with imaging, pathology, and molecular profiling. Emphasis is placed on the use of machine learning not only for diagnostic classification, but also for biologically interpretable and clinically actionable analysis. We also discuss the main barriers to clinical translation, including limited dataset sizes, inter-instrument variability, inconsistent preprocessing, insufficient external validation, reproducibility concerns, and limited sharing of software, data, and metadata. We argue that progress will require methodological advances together with standardization, robust validation, explainability, and deployment-ready analytical frameworks. By integrating methodological, biomedical, and translational perspectives, this review outlines key directions for developing reliable and clinically deployable Raman-AI systems.

[LG-44] Curvature-Guided Geometric Representation for Protein-Ligand Binding Affinity Prediction

Link: https://arxiv.org/abs/2606.14159
Authors: Shuai Li,Chuan-Xian Ren,Yuhao Li,Ziqi Huang,Yue Pan,Mingzhe Tang,Hong Yan
Subjects: Machine Learning (cs.LG); Biomolecules (q-bio.BM)

Abstract:Protein-ligand binding affinity (PLA) prediction is critical in drug discovery. Despite the notable advancements in machine learning-based approaches, existing methods struggle to jointly characterize local geometric organization and globally coordinated cross-molecular interactions, limiting their ability to model complex binding mechanisms. Here, we propose RicciBind, a geometric representation framework that integrates curvature-guided hierarchical structure learning with optimal transport (OT)-based cross-domain alignment to model molecular interactions. Specifically, RicciBind leverages Ricci curvature to capture local interaction tightness within molecular structures, enhancing structural awareness and organizing atomic interactions into curvature-aware hierarchical representations. An OT-based cluster matching mechanism then aligns protein and ligand clusters across heterogeneous domains under geometric constraints, enabling globally consistent correspondences and revealing higher-order interaction patterns beyond local neighborhoods. By coupling curvature-guided structure encoding with OT-driven cross-domain alignment, RicciBind effectively models complex interaction semantics and substantially improves both the accuracy and interpretability of binding affinity prediction. Extensive experiments demonstrate that RicciBind achieves superior predictive performance and generalization across PLA benchmarks and virtual screening tasks. Ablation studies further confirm the essential role of Ricci curvature in enhancing molecular interaction representations.

[LG-45] Trust but Verify: Mitigating Medical Hallucinations via Post-Hoc Adversarial Auditing and Multi-Agent Feedback Loops

Link: https://arxiv.org/abs/2606.14149
Authors: Muhammad Osama,Maheera Amjad,Zartasha Mustansar,Arslan Shaukat,Muhammad U. S. Khan
Subjects: Machine Learning (cs.LG)

Abstract:Large Language Models (LLMs) are increasingly deployed in healthcare settings, yet their tendency to hallucinate poses risks when clinical decisions are involved. This study examines whether LLMs recommend recently banned or withdrawn pharmaceuticals when answering clinical questions and tests an agent-based method for reducing such errors. We developed a five-agent “Trust but Verify” system using a single LLM backbone. To measure regulatory knowledge obsolescence, we created an adversarial dataset of 103 clinical MCQs where historically correct answers now refer to banned substances. This scale ensures statistical significance across various therapeutic classes. We evaluated three open-access model families (GPT-OSS, Llama-3, Falcon-3) under vanilla and agentic conditions. Performance was measured via pointwise score, label accuracy, Hallucination Error Rate (HER), and Component Fidelity (CF) score. We also observed clinical safety regression in proprietary models. In default configurations, all models showed high hallucination rates, consistently selecting banned drugs that matched training data patterns. Our proposed agentic architecture reduced HER by approximately 53% across models. Pointwise scores shifted from -0.25 (unsafe recommendation) toward 0.0 (appropriate refusal). The safety audit intercepted dangerous outputs even when models’ parametric knowledge favored the banned substance. The proposed multi-agent framework offers a model-agnostic method for enforcing regulatory compliance that prioritizes patient safety over fluent text generation. Our work demonstrates a practical approach for deploying autonomous AI systems in safety-critical healthcare settings. It shows how real-time regulatory data can be integrated into LLM pipelines to support clinical decision-making.

[LG-46] Decoupled Latent Optimization of Diffusion Models for Full Waveform Inversion

Link: https://arxiv.org/abs/2606.14139
Authors: Chen Min,Zheng Ma
Subjects: Machine Learning (cs.LG)
Comments: 35 pages, 14 figures

Abstract:Full waveform inversion (FWI) recovers subsurface velocity from seismic recordings by solving a severely ill-posed, nonconvex PDE-constrained optimization. Classical regularizers stabilize the inversion but fail to reproduce realistic geological structures; recent diffusion-prior methods improve realism at the cost of a fragile trade-off between data fidelity and prior consistency. We propose Decoupled Latent Optimization (DLO), which relaxes the standard latent-optimization formulation into a quadratic-penalty objective over an auxiliary physical variable and a latent variable. The data-fidelity gradient acts in physical space, the diffusion sampler contributes only through a decoded prior sample, and the standard smoothed-velocity initialization of classical FWI is preserved. On the OpenFWI benchmark, DLO outperforms classical regularizers and existing diffusion-based methods under clean, noisy, and missing-trace acquisitions. The prior, trained on 70×70 OpenFWI models, transfers directly to the Marmousi and Overthrust benchmarks, where DLO recovers intricate fault structures and remains robust to initialization smoothing and measurement noise.

[LG-47] DTVEM-RE: A Hierarchical Random-Effects Extension of the Differential Time-Varying Effect Model for Person-Specific Multi-Lag Estimation in Intensive Longitudinal Data

Link: https://arxiv.org/abs/2606.14116
Authors: Amartya Bhattacharya
Subjects: Machine Learning (cs.LG); Methodology (stat.ME)

Abstract:The Differential Time-Varying Effect Model (DTVEM) of Jacobson et al. (2019) is a popular tool for finding the best time lag in intensive longitudinal data, but it assumes everyone shares the same lag structure. The original authors named fixing this as future work, and it clashes with the premise of modern clinical research, which is that people differ. We present DTVEM-RE, an extension that lets each person have their own lag coefficients, with two versions of the confirmatory step: a discrete-time hierarchical Bayesian VAR in Stan, which pools across people and gives calibrated uncertainty, and a continuous-time per-person Ornstein-Uhlenbeck model in ctsem, which handles unevenly spaced beeps directly. We report four results. A simulation shows the Bayesian version recovers the between-person spread tau_a with bias below 0.01 and coverage of 90 to 93 percent. On the Fisher et al. (2017) EMA dataset (N=40), person-specific lag-1 effects vary by an order of magnitude across three mood items, the Bayesian and GAMM estimates agree closely (r=0.87 to 0.92), and DTVEM-RE gives the best one-step-ahead prediction among four discrete-time methods. A multi-lag version shows all nine tau_k values have credible intervals excluding zero, and the lag where people differ most changes across items, something lag-1-only methods like mlVAR cannot detect. Finally, the two versions agree almost exactly on person-specific lag-1 estimates (r = 0.995), differing only as shrinkage predicts. DTVEM-RE is, to our knowledge, the first person-specific implementation of DTVEM-style lag detection, and it contains standard DTVEM as a special case.

[LG-48] Lyapunov-Based Sample Complexity Analysis for Weakly-Coupled MDPs COLT

Link: https://arxiv.org/abs/2606.14095
Authors: Tianhao Wu,Matthew Zurek,Weina Wang,Qiaomin Xie
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Probability (math.PR); Machine Learning (stat.ML)
Comments: Conference on Learning Theory (COLT) 2026

Abstract:We study the sample complexity of learning in average-reward weakly-coupled Markov decision processes (WCMDPs) and Restless Bandits (RBs) under a generative model. Naive reduction to a tabular MDP leads to high complexity bounds as the state-action space is exponentially large in the number of arms N. By exploiting the weakly coupled structure, we show that near-optimal policies can be learned with sample and computational complexities that are polynomial in N. Specifically, we analyze the plug-in approach, which applies an efficient planning algorithm to an empirical model estimated from data. For fully heterogeneous WCMDPs, we establish the first finite-sample PAC guarantee with polynomial complexity and an O(1/√N) optimality gap. For homogeneous RBs, we further prove that a smaller optimality gap is achievable under mild structural assumptions. A primary technical contribution of our work is a novel Lyapunov-based analysis framework. Unlike classical approaches that rely on the difficult-to-control bias function, our framework uses an explicitly constructed Lyapunov function along with a drift transfer technique between the true and empirical models. A key step of independent interest in our framework is a fine-grained perturbation analysis for the underlying linear programming (LP) relaxation, which provides a general tool for analyzing LP-based policies and weakly-coupled systems.

[LG-49] Deep Spectral Learning of Embedded Latent Transfer Operators for Stochastic Dynamical Systems UAI2026

Link: https://arxiv.org/abs/2606.14079
Authors: Ryogo Tanaka,Yoshinobu Kawahara
Subjects: Machine Learning (cs.LG)
Comments: Accepted at the 42nd Conference on Uncertainty in Artificial Intelligence (UAI 2026)

Abstract:We propose a spectral learning method for stochastic nonlinear dynamical systems represented with embedded latent transfer operators in deep feature spaces. We instantiate the method as Deep Spectral Encoder (DSE), an operator-based latent state-space model in which a time-invariant neural encoder implements learnable nonlinear feature maps from observations, and these features define Markovian latent states whose temporal evolution and observation mapping are described by the transfer and observation operators, respectively. Functional canonical correlation analysis in a learnable Galerkin-projected feature space provides state coordinates from past and future observations, and the two linear operators are estimated on the state coordinates as ridge-regularized closed-form solutions that coincide with Galerkin projections of the associated covariance operators. On this representation, we generalize sequential Bayesian filtering and Koopman spectral mode decomposition in feature space. Experiments on several scenarios show stable and superior performance with sequential Bayesian filtering and dynamic mode decomposition baselines even under noise and partial observability.

[LG-50] Decompose Sparsely Where You Should Absorb Densely Where You Should No

Link: https://arxiv.org/abs/2606.14040
Authors: Ruixuan Deng,Zehao Jin,Zekun Wang,Zihan Dong
Subjects: Machine Learning (cs.LG)

Abstract:Sparse autoencoders (SAEs) are typically trained to reconstruct the entire residual stream through a sparse dictionary, implicitly assuming that all activation content is amenable to sparse, monosemantic decomposition. We question this assumption and hypothesize that activations contain a low-rank, dense component that is computationally important to the model yet inherently unsuitable for sparse representation, which serves as a major source of the persistent dense latents widely observed in trained SAEs. To test this, we add a small rank-r linear bottleneck in parallel with standard SAEs (BatchTopK and Matryoshka), allowing dense structure to be absorbed before sparse reconstruction. On Gemma-2-2B layer 12, a rank-24 bottleneck reduces dense latent count by up to 84% while improving sparse probing and targeted probe perturbation on both architectures at matched sparsity. The absorbed component is (i) structurally identifiable as the top principal components and outlier dimensions; (ii) causally necessary, with removing it raising next-token cross-entropy by 7.5×, far exceeding the 2.8× from removing the geometrically near-identical top-24 PCA directions; and (iii) redundantly encoded by sparse dictionaries, with ablating 787 maximally aligned sparse features raising cross-entropy by only 2.9× and ablating 2,048 topic-aligned features leaving MMLU topic classification virtually unchanged, whereas removing the scaffold drops it from 98.7% to chance. Together, our findings identify a compact, semantically informative and causally important component of residual stream activations (which we term a computational scaffold) that standard sparse dictionaries represent inefficiently, suggesting that the scope of sparsity-based interpretability methods warrants careful re-examination.

[LG-51] Utility-Constrained Policy Optimization

Link: https://arxiv.org/abs/2606.14029
Authors: Mehrdad Moghimi,Bernardo Avila Pires
Subjects: Machine Learning (cs.LG)

Abstract:Constrained MDPs (CMDPs) are a widely adopted framework for incorporating safety into RL agents; however, the framework does not support risk-sensitive constraints. This can be problematic: For example, CMDPs allow for optimal solutions that, in order to satisfy the risk-neutral constraints, mix infrequent catastrophic behaviors and frequent, overly conservative ones. Moreover, prior empirical results suggest that enforcing stricter, risk-sensitive constraints can improve performance even under risk-neutral evaluation. The natural framework to incorporate risk-sensitive constraints is utility-constrained MDPs (UCMDPs), but no practical solutions for this problem existed. In this work, we introduce a simple yet powerful methodology for UCMDPs and constrained RL. Besides allowing for risk-sensitive constraints, our framework does not require us to fix constraint limits in advance of training the agent, provided that a sensible range is known. This increases policy flexibility and, in practice, allows for adjustments to these limits at no extra training cost. Besides benefiting from the generality of the framework, our agent shows strong performance in practice, consistently matching or outperforming existing baselines in several Safety Gymnasium benchmark tasks.

[LG-52] PostDeg: Placement Beats Parameterization in LayerNorm GNNs

Link: https://arxiv.org/abs/2606.14022
Authors: Yash Tomar,Aryav Das
Subjects: Machine Learning (cs.LG)
Comments: Yash Tomar and Aryav Das contributed equally to this work

Abstract:LayerNorm-based GNNs routinely erase the topology signals (degree, centrality, k-core) that node-selection policies should depend on, but the literature has not located where in the residual block the erasure happens. We answer that question: a positive per-node scalar inserted before LayerNorm is divided out up to a stabilizer term, while the same scalar inserted after LayerNorm reaches the score head as representation magnitude. The surviving slot is the post-LayerNorm position. We instantiate it with PostDeg, a parameter-free post-LayerNorm inverse-degree scale, and pre-register four falsifiers (graphwise scalars, extra LayerNorm, expressive same-slot capacity, backbone-agnostic source) that would reject the rule. PostDeg gains +3.5%/+2.5%/+5.6% over the LN backbone on influence maximization, network dismantling, and maximum independent set, with 10/10 paired-seed wins per task; none of the four falsifiers fires. The takeaway is that placement, not parameterization, carries the gain – a small invariance check that generalizes to any positive topology scalar in any normalized residual stack.
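
The placement argument in this abstract can be checked numerically on a single node vector. This is a minimal sketch assuming LayerNorm without learnable affine parameters; the vector `h`, the degree value, and the helper names are made up for illustration and are not taken from the paper's code.

```python
import math

def layer_norm(h, eps=1e-5):
    # Plain LayerNorm over one feature vector, with eps as the stabilizer term.
    mean = sum(h) / len(h)
    var = sum((v - mean) ** 2 for v in h) / len(h)
    return [(v - mean) / math.sqrt(var + eps) for v in h]

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

h = [0.3, -1.2, 2.0, 0.7]   # hypothetical node representation
degree = 8
s = 1.0 / degree            # positive per-node scalar (inverse degree)

plain = layer_norm(h)
pre = layer_norm([s * v for v in h])    # scalar applied before LayerNorm
post = [s * v for v in layer_norm(h)]   # scalar applied after LayerNorm

print(f"plain {l2_norm(plain):.4f}, pre {l2_norm(pre):.4f}, post {l2_norm(post):.4f}")
```

The pre-LayerNorm output has almost the same magnitude as the unscaled one, differing only through `eps`, whereas the post-LayerNorm output carries the factor `s` in its magnitude.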

[LG-53] An Attention-based Model for Robust Forecasting with Missing Modality

Link: https://arxiv.org/abs/2606.13970
Authors: Zhitian Zhang, Wenjie Zi, Yunduz Rakhmangulova, Saghar Irandoust, Hossein Hajimirsadeghi, Thibaut Durand
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: Work originally done in 2023

Abstract:Learning with missing modalities is a fundamental challenge in multimodal robot learning, as real-world robotic systems often operate in environments with incomplete sensor data. Attention-based models are appealing for processing multimodal data because they can handle multiple modalities with a single backbone network. However, most multimodal models assume that all modalities are available during both training and inference, limiting their applicability in robotic perception and decision-making. In this paper, we introduce a multimodal model designed to handle missing modalities during both training and inference. The model is formulated as a conditional variational autoencoder (CVAE) and incorporates a transformer-based architecture that leverages attention mechanisms to learn a unified, fixed-dimensional representation, even when some modalities are missing. We show that our proposed model can be trained with missing modalities while approximating a robust representation of all modalities. We evaluate our approach on five multimodal datasets across two robot learning tasks: human trajectory prediction and robot manipulation forecasting. Experimental results demonstrate that our model effectively learns from incomplete data and is superior to prior multimodal fusion approaches.

[LG-54] Can Machine Learning Forecast Rice Yields in Data-Constrained Settings? Satellite Climate Data, National Crop Statistics, and Lessons from Sierra Leone

Link: https://arxiv.org/abs/2606.13959
Authors: Ibrahim Denis Fofanah
Subjects: Machine Learning (cs.LG)
Comments: 32 pages, 7 figures. Code and data: this https URL

Abstract:Sierra Leone’s agriculture operates with almost no data-driven decision support, and no published machine learning study has examined the country’s crop yields. We ask whether rice yield can be forecast from data Sierra Leone currently has. Using 25 years of FAOSTAT production data (2000-2024) for nine major crops, we train XGBoost, Gradient Boosting, and Random Forest under a strict anti-leakage protocol with expanding-window walk-forward evaluation across seven held-out years, benchmarked against naive persistence. No model trained on crop statistics alone outperforms persistence. Augmenting with free satellite climate data (CHIRPS rainfall, NASA POWER temperature) reverses this result: a climate-only XGBoost reduces forecast error by one third (RMSE 284 vs 428 kg/ha), a gain that holds for a linear model and is robust to excluding the anomalous 2018 season. Early-season (May-June) rainfall is the dominant predictor, implying seasonal yield risk is observable months before harvest. No model anticipated the 2018 collapse, whose origins were institutional rather than climatic. We translate the findings into policy recommendations for Sierra Leone’s Feed Salone Strategy, with a fully open-source pipeline.
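
The evaluation protocol in this abstract (expanding-window walk-forward forecasts benchmarked against naive persistence) can be sketched as follows. This is an illustrative sketch on synthetic data, not the paper's pipeline: the function name `walk_forward_rmse`, the single rainfall feature, and the ordinary-least-squares model are assumptions standing in for the paper's XGBoost models and climate features.

```python
import numpy as np

def walk_forward_rmse(x, y, n_test):
    """Expanding-window walk-forward RMSE for a linear model and naive persistence.

    For each of the last n_test years, fit only on strictly earlier years,
    then predict that single held-out year.
    """
    model_err, persist_err = [], []
    for t in range(len(y) - n_test, len(y)):
        # Design matrix with intercept, built from years before t only (no leakage)
        A = np.column_stack([np.ones(t), x[:t]])
        coef, *_ = np.linalg.lstsq(A, y[:t], rcond=None)
        model_pred = coef[0] + coef[1] * x[t]
        persist_pred = y[t - 1]   # naive persistence: last observed yield
        model_err.append(model_pred - y[t])
        persist_err.append(persist_pred - y[t])
    rmse = lambda e: float(np.sqrt(np.mean(np.square(e))))
    return rmse(model_err), rmse(persist_err)

# Synthetic stand-in data (NOT the paper's data): 25 years, yield driven by rainfall
rng = np.random.default_rng(0)
rain = rng.normal(size=25)
yield_kg = 1500 + 300 * rain + rng.normal(scale=50, size=25)
model_rmse, persistence_rmse = walk_forward_rmse(rain, yield_kg, n_test=7)
```

Because each fit sees only years before the one being predicted, a feature that genuinely drives yield lowers the error relative to persistence, which is the comparison the abstract reports.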

[LG-55] Smoothing Dark Areas in Molecular Latent Diffusion

Link: https://arxiv.org/abs/2606.13955
Authors: Xi Wang, Jiahan Li, Yuxuan Xia, Yingcheng Wu, Shaoyi Zheng, Shengjie Wang
Subjects: Machine Learning (cs.LG)

Abstract:Latent diffusion is a promising framework for scalable 3D molecular generation, but it requires a latent space that remains smooth, valid, and navigable beyond posterior samples. Existing molecular VAEs, however, are typically learned through reconstruction-based objectives, which do not guarantee such a latent space. We show that this leads to dark areas: regions of latent space that are reachable during diffusion sampling but decode to disconnected or chemically invalid molecules. Unlike in image generation, molecular decoding requires strict structural and chemical precision, so even small latent perturbations can produce catastrophic failures. We therefore propose TopVAE, a topology-optimized VAE that reduces dark areas by making the decoder internalize structural and chemical constraints during training, eliminating the need for test-time chemical correction. TopVAE greatly improves off-posterior robustness, and when paired with a standard DiT, achieves 77% lower FCD-3D on QM9, the highest VC, 52% lower FCD-3D on GEOM-Drugs, and 1.29× more stable and connected molecules on zero-shot scaffold inpainting.

[LG-56] Side-Channel Attacks Bypass Protection in 3D Printers

Link: https://arxiv.org/abs/2606.13952
Authors: Eric Yocam, Varghese Vaidyan, Micah Flack, Gurcan Comert, Judith L. Mwakalonge
Subjects: Cryptography and Security (cs.CR); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments: 11 pages, 6 figures, 4 tables

Abstract:Active Motor Noise Cancellation (AMNC) ships in commercial fused deposition modeling (FDM) 3D printers as a hardware countermeasure against acoustic side-channel attacks that target intellectual property (IP). We present the first empirical evaluation of a deployed AMNC countermeasure, using a public dataset of synchronized acoustic and vibration recordings from two AMNC-equipped Bambu Lab printers across 12 object classes. AMNC fully neutralizes the acoustic channel: classification accuracy is indistinguishable from the 8.33% random baseline. The vibration channel, which AMNC does not target, still leaks. With summary statistics the leak is coarse and amplitude-driven (vibration accuracy approximately 31% pooled, 36-47% within-printer), while the waveform shape carries essentially nothing (frequency-only features at chance). A full-sequence temporal model that ingests the ordered evolution of the print raises accuracy to approximately 61%, and an order-shuffling control (approximately 33%) shows that a substantial component is genuinely sequential and tied to print progression. The leak is device-specific: a classifier trained on one printer transfers near chance to the other. We conclude that AMNC is an acoustic-only defense: vibration remains a partial, geometry-correlated side channel it does not address, but one that does not, on this dataset, support full geometric reconstruction; reconstruction-grade attacks would require the magnetic or power channels AMNC also leaves untouched. We release all code.

[LG-57] SpikF-GO: Spiking Fourier Graph Operators for Multivariate Time Series Forecasting KDD2026 ECML

Link: https://arxiv.org/abs/2606.13901
Authors: Jafar Bakhshaliyev, Niels Landwehr
Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: 23 pages, 2 figures, 11 tables. Accepted for presentation at ECML PKDD 2026. Code: this https URL

Abstract:Spiking Neural Networks (SNNs) have emerged as an energy-efficient alternative to conventional neural networks, demonstrating strong performance in computer vision and robotics. More recently, SNNs have been applied to time series forecasting (TSF), with methods exploring spiking temporal backbones, spike-compatible positional encodings, Fourier-domain processing, and redesigned neuron dynamics. However, existing SNN forecasting approaches process variables independently, lacking explicit mechanisms for modeling inter-variable dependencies. This is a critical limitation in multivariate settings, where cross-variable correlations carry substantial predictive information. We propose Spiking Fourier Graph Operators (SpikF-GO), which addresses this gap by combining a hypervariate graph formulation in which every scalar observation becomes a graph node with spike-driven spectral processing. SpikF-GO introduces a Hard Concrete frequency gate for learnable sparse frequency selection and a Complex LIF gate that applies independent spiking neurons to real and imaginary Fourier components, preserving binary, event-driven computation throughout the spectral domain. We further present a variant incorporating Central Pattern Generator-based positional encodings for stronger long-range temporal modeling. Evaluated on eight benchmarks under a unified experimental protocol, SpikF-GO achieves the best average rank among all SNN methods and outperforms its ANN counterpart, FourierGNN, at reduced energy cost. SpikF-GO maintains competitive accuracy even at substantially smaller embedding dimensions, thereby achieving significant energy reductions. To our knowledge, this is among the first works to bring graph-based multivariate modeling into the spiking domain for TSF and the first to provide a unified comparison across SNN forecasting architectures under a common experimental protocol.

[LG-58] A Longitudinal Attribute-Conditioned Neural Network for Modeling Health-State Transition Probabilities in Temporally Irregular Data: The LANTERN Framework

Link: https://arxiv.org/abs/2606.13880
Authors: Bright Kwaku Manu, Beckett Sterner, Petar Jevtic
Subjects: Machine Learning (cs.LG); Risk Management (q-fin.RM)
Comments: 35 pages, 17 figures

Abstract:Accurate estimation of long-term care transition probabilities is central to disability insurance pricing, reserving, and solvency assessment. Classical actuarial multi-state models commonly rely on Markov, semi-Markov, or proportional-hazard specifications, which provide a direct connection to cohort projection but may be restrictive for irregular longitudinal health data with nonlinear aging patterns and heterogeneous covariate histories. This paper develops a well-calibrated estimator of multi-state transition probabilities for irregular longitudinal health data. The model learns from individual health history, incorporates the time elapsed between observations, and conditions transition probabilities on demographic and socioeconomic attributes. It produces a valid probability distribution over the next observed health state, with four possible states: healthy, mild disability, severe disability, and death. Individual probabilities are aggregated by age group and origin state to form transition matrices compatible with actuarial cohort projection. Using longitudinal data from the Health and Retirement Study, we compare the proposed estimator with logistic regression, gradient-boosted trees, a recurrent neural network, and a last-state persistence benchmark. The evaluation considers probabilistic accuracy, endpoint discrimination and calibration for severe disability and death, risk concentration, and transition matrix error after aggregation. The proposed estimator improves severe disability discrimination relative to logistic regression and gradient-boosted tree benchmarks, maintains strong calibration, and yields the lowest transition matrix error among the evaluated models in the held-out test analysis. Results show that a structured machine learning estimator can support long-term care transition modeling when judged by calibration and projection fidelity, beyond discrimination.
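
The aggregation step, in which individual probabilities are grouped by age group and origin state to form transition matrices, might look like the sketch below. The abstract does not specify the aggregation rule, so plain averaging is an assumption here, as are the function name `aggregate_transition_matrices`, the three age bands, and the synthetic predictions.

```python
import numpy as np

STATES = ["healthy", "mild disability", "severe disability", "death"]

def aggregate_transition_matrices(probs, origin, age_group, n_groups):
    """Average per-individual next-state probabilities into one matrix per age group.

    probs:     (n, 4) predicted distribution over the next observed state
    origin:    (n,) index of each individual's current state
    age_group: (n,) index of each individual's age group
    Rows with no observations are left as NaN.
    """
    k = probs.shape[1]
    mats = np.full((n_groups, k, k), np.nan)
    for g in range(n_groups):
        for s in range(k):
            mask = (age_group == g) & (origin == s)
            if mask.any():
                mats[g, s] = probs[mask].mean(axis=0)
    return mats

# Synthetic stand-in predictions (NOT Health and Retirement Study data)
rng = np.random.default_rng(0)
n = 900
probs = rng.dirichlet(np.ones(len(STATES)), size=n)   # each row sums to 1
origin = rng.integers(0, 3, size=n)                   # living origin states only
age_group = rng.integers(0, 3, size=n)                # three hypothetical age bands
mats = aggregate_transition_matrices(probs, origin, age_group, n_groups=3)
```

Averaging valid probability vectors yields rows that still sum to one, so each populated row of `mats[g]` is a valid transition distribution; the row for the death state stays NaN because no individual starts there in this synthetic sample.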

[LG-59] Muon^p: Muon with Fractional Spectral Powers

Link: https://arxiv.org/abs/2606.13867
Authors: Yihe Dong, Will Sawin
Subjects: Machine Learning (cs.LG)

Abstract:Muon is an increasingly widely used optimizer that replaces a gradient $G = USV^\top$ with its polar factor $UV^\top$, thereby flattening the singular spectrum. However, full flattening discards singular-value information that may matter for adaptation. We introduce Muon$^p$, a Muon-style optimizer that instead uses fractional spectral-power updates $US^pV^\top$ for rational $p \in (0,1)$, interpolating between Muon and gradient descent. To make it practical, we prove that fractional spectral powers cannot be computed by any fixed univariate polynomial iteration, and furthermore derive low-degree odd bivariate recurrences that approximate $US^pV^\top$ using only matrix multiplications, preserving Muon’s matrix-multiplication-only structure and compute complexity. We show that Muon$^p$ maximizes the linear improvement in loss under the Schatten $q$-norm for $q = 1 + \frac{1}{p}$. Empirically, Muon$^p$ is especially effective for finetuning: on billion-scale models, Muon$^p$ improves validation perplexity and downstream task performance. We further analyze when Muon$^p$ is less suitable, through the lens of spectral geometry. Our results reveal important insights on when preserving the singular spectrum can bring significant gains, and introduce a principled way to achieve them.
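
To make the interpolation concrete, the fractional spectral-power update U S^p V^T can be computed with an explicit SVD. This is only a reference computation under stated assumptions: the paper itself approximates the update with matrix-multiplication-only recurrences, which are not reproduced here, and the function name `spectral_power` and the random matrix `G` are illustrative. The endpoints p = 1 and p near 0 fall outside the paper's open interval and are included only to show the two limits.

```python
import numpy as np

def spectral_power(G, p):
    """Return U S^p V^T via an explicit SVD (reference computation only)."""
    U, S, Vt = np.linalg.svd(G, full_matrices=False)
    return (U * S**p) @ Vt   # scale each column of U by the matching singular value

rng = np.random.default_rng(0)
G = rng.normal(size=(8, 5))   # hypothetical gradient matrix

U, S, Vt = np.linalg.svd(G, full_matrices=False)
polar = U @ Vt                # Muon's target: every singular value set to 1

near_muon = spectral_power(G, 1e-6)   # p near 0 approaches the polar factor
half = spectral_power(G, 0.5)         # intermediate: singular values become sqrt(S)
identity = spectral_power(G, 1.0)     # p = 1 recovers the raw gradient

half_singular_values = np.linalg.svd(half, compute_uv=False)
```

Smaller `p` flattens the spectrum more, so `p` acts as a dial between keeping all singular-value information (gradient descent direction) and discarding it (Muon).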

[LG-60] Temporally Consistent Graph Q-Networks for Intelligent Network Control

Link: https://arxiv.org/abs/2606.13848
Authors: Zacharias Veiksaar, Maxime Bouton
Subjects: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
Comments: 7 pages, 5 figures. Accepted to the 6G AI-RAN Workshop at IEEE INFOCOM 2026. The final published version will be available via IEEE Xplore

Abstract:Mobile networks continue to grow in complexity and next generation networks are expected to support both increasing traffic loads and more diverse services. As network complexity rises, optimizing antenna parameters under dynamic or changing objectives becomes increasingly challenging. We propose a novel multi-agent reinforcement learning (MARL) algorithm for high-level control and orchestration of mobile networks. The Temporally Consistent Graph Q-Network (TC-GQN) algorithm learns a self-predicting representation of the whole network that is task-independent and aggregates information from all base-stations. A graph neural network is trained using a global reward function to assign coordinated local actions based on the learned encoding of the global network state. We evaluate the algorithm in a simulated environment to orchestrate an energy-saving feature across multiple sectors and multiple carriers under different quality of service (QoS) constraints. The proposed algorithm outperforms state-of-the-art graph-based baselines and a competitive rule-based controller by improving hardware sleep time while maintaining QoS. Moreover, the learned representation enables rapid adaptation to changing intents.

[LG-61] Approximating Whittle-Matérn Fields over Discretized Manifolds

Link: https://arxiv.org/abs/2606.13827
Authors: Srinivas Nambirajan
Subjects: Numerical Analysis (math.NA); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: First draft. 16 pages. 2 figs. May likely change depending upon feedback

Abstract:Markovian Whittle-Matérn fields have been convergently approximated by discrete Gauss Markov Random Fields (GMRFs) with sparse precision matrices using a Finite Element approximation of the two-parameter family of SPDEs $(\kappa^2 - \Delta)^{\alpha/2} u = \mathcal{W}$, with $\kappa \in \mathbb{R}$ and $\alpha \in \mathbb{N}$. Using recent developments in the analysis of Discrete Exterior Calculus (DEC), we present a different, yet closely related, convergent GMRF approximation to these Matérn fields over complete, boundaryless Riemannian manifolds discretized as well-centered simplicial complexes. This convergent method (i) is agnostic to $\alpha$ and $\kappa$ and thus allows a universal approximation scheme for the precision and covariance matrices of the entire $(\alpha, \kappa)$-family of GMRFs, so they may be inferred rather than guessed; (ii) inherently models pointwise and piecewise-smoothed measurements of a random field and approximates both equally well; and (iii) is computationally independent of the interpolants used: it suffers no overhead if one convergent interpolant is replaced with another suitable interpolant over the same mesh. Furthermore, we show that, on discretizations that are well-connected in a precise sense and volume-concentrated, the precision matrices are spectral functions of a graph Laplacian. We provide a low-rank approximator to the family of such Matérn GMRFs and mention a use case: reducing the number of measurements needed to model the GMRF by compressed sensing.
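
The statement that the precision matrices are spectral functions of a graph Laplacian can be illustrated with a toy case. The sketch below assumes the simplest such form, Q = (kappa^2 I + L)^alpha with the combinatorial Laplacian of a path graph standing in for the negative Laplace operator; the paper's DEC construction on simplicial complexes is not reproduced, and the function names and parameter values are hypothetical.

```python
import numpy as np

def path_graph_laplacian(n):
    """Combinatorial Laplacian L = D - A of a path graph with n nodes."""
    A = np.zeros((n, n))
    idx = np.arange(n - 1)
    A[idx, idx + 1] = 1.0
    A[idx + 1, idx] = 1.0
    return np.diag(A.sum(axis=1)) - A

def matern_precision(L, kappa, alpha):
    """Precision matrix Q = (kappa^2 I + L)^alpha for integer alpha >= 1 (assumed form)."""
    n = L.shape[0]
    return np.linalg.matrix_power(kappa**2 * np.eye(n) + L, alpha)

L = path_graph_laplacian(10)
kappa, alpha = 0.5, 2
Q = matern_precision(L, kappa, alpha)

lap_eigs = np.linalg.eigvalsh(L)   # ascending eigenvalues of the Laplacian
q_eigs = np.linalg.eigvalsh(Q)     # ascending eigenvalues of the precision matrix
```

In this toy setting `Q` shares the Laplacian's eigenvectors and its eigenvalues are `(kappa**2 + lap_eigs)**alpha`, so one eigendecomposition of `L` serves every choice of the two parameters, which is the sense in which they can be inferred rather than fixed in advance.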

[LG-62] A Stationarity-and-Coupling Criterion for Training-Free Time-Lagged Spectral Embeddings of Multivariate Time Series

Link: https://arxiv.org/abs/2606.13823
Authors: Siddharth Pal, Viktoria Rojkova
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
Comments: 25 pages, 2 figures, 10 tables

Abstract:We study training-free fixed-length descriptors for multivariate time series and ask not merely whether such a descriptor performs well, but when it can be expected to work at all. Our object of study is $D(\tau)$, built from a time-lagged correlation matrix truncated at the Marchenko-Pastur edge so that only signal-bearing eigenvalues survive and classified by cosine similarity to class centroids with zero learned parameters. The central contribution is not the descriptor but a falsifiable applicability criterion for it. Working from a stationary Gaussian VAR(1) model, we argue that $D(\tau)$ separates two classes when the signals are approximately stationary and the class information lives in their cross-channel temporal coupling rather than in marginal per-channel power. We derive, semi-formally, three consequences: a distinguishability condition, why the static ($\tau = 0$) covariance collapses to chance, and why a stationary but power-discriminated paradigm defeats the descriptor. The criterion is operational: a two-part pre-flight test – an augmented Dickey-Fuller stationarity check and a power-baseline saturation check – predicts applicability before any training. We validate both halves on a mixed assortment. On four paradigms that satisfy the criterion (Sleep-EDF, BCI-IV-2a, MIT-BIH, ESC-50) the descriptor is competitive with strong baselines at a fraction of their cost, reaching $88.5 \pm 4.5\%$ under 20-subject leave-one-subject-out on Sleep-EDF on a single CPU thread. On three that violate it – non-stationary ERPs, and financial-volatility and wearable-stress regimes that are power-discriminated – it fails exactly as the pre-flight predicts, and these negatives are the more informative half. We are explicit that $D(\tau)$ is not the most accurate representation; its value is a compact, training-free embedding whose domain of validity is known in advance.

[LG-63] Attention-Based Estimation of the Individual Treatment Benefit Probability under Dose Variation

Link: https://arxiv.org/abs/2606.13821
Authors: Lev V. Utkin, Andrei V. Konstantinov, Stanislav K. Kogan, Natalya M. Verbova, Maksim I. Goriunov
Subjects: Machine Learning (cs.LG)

Abstract:Estimating the probability that a treatment outperforms a control for an individual patient, called the Individual Probability of Treatment Benefit (IPTB), offers a clinically intuitive alternative to population-average metrics. However, existing methods for IPTB estimation are largely confined to binary treatment settings, despite the prevalence of dose-varying interventions in clinical practice. We propose a general framework for IPTB estimation with ordinal outcomes under discrete dose assignments, called Dose-AIPTB (Dose Attention-based IPTB). Our approach recasts the problem as binary classification over the unobserved sign of the individual treatment effect, constructing pseudo-labels from covariate-similar pairwise comparisons and aggregating them via attention mechanisms or Nadaraya-Watson kernel regression. This formulation naturally accommodates multiple discrete dose levels, extending beyond the binary treatment paradigm. Through numerical experiments on real-world and synthetic data under covariate shift, varying sample sizes, and heterogeneous outcomes, we demonstrate that attention-based aggregation consistently outperforms kernel alternatives. The framework provides a foundation for personalized dose selection grounded in individual-level benefit probabilities. Codes implementing the model are publicly available at this https URL.
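
The kernel-based aggregation mentioned in this abstract can be sketched with a Gaussian-kernel Nadaraya-Watson estimator. This shows only the aggregation step under stated assumptions: the pseudo-label construction from pairwise comparisons is not reproduced, and the function name `nadaraya_watson`, the bandwidth, and the two synthetic covariate clusters are hypothetical.

```python
import numpy as np

def nadaraya_watson(x_query, x_train, labels, bandwidth=1.0):
    """Kernel-weighted average of binary pseudo-labels with a Gaussian kernel."""
    sq_dist = ((x_query[:, None, :] - x_train[None, :, :]) ** 2).sum(axis=-1)
    weights = np.exp(-sq_dist / (2.0 * bandwidth**2))
    return (weights @ labels) / weights.sum(axis=1)

# Synthetic stand-in data: two covariate clusters with opposite pseudo-labels
rng = np.random.default_rng(0)
x_train = np.concatenate([
    rng.normal(-2.0, 0.3, size=(20, 1)),   # patients whose pseudo-label is 0 (no benefit)
    rng.normal(2.0, 0.3, size=(20, 1)),    # patients whose pseudo-label is 1 (benefit)
])
labels = np.concatenate([np.zeros(20), np.ones(20)])

x_query = np.array([[-2.0], [2.0]])
estimates = nadaraya_watson(x_query, x_train, labels)
```

Each estimate is a weighted fraction of nearby pseudo-labels equal to one, so it lies in [0, 1] and can be read as a benefit probability for the queried covariates. Normalizing the Gaussian weights is the same operation as a softmax over negative scaled squared distances, which is one way to see the link to attention-based aggregation.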

[LG-64] Uncertainty Estimation and Generalization Bounds for Modern Deep Learning

Link: https://arxiv.org/abs/2606.13818
Authors: Luis A. Ortega
Subjects: Machine Learning (cs.LG)
Comments: PhD Thesis, Autonomous University of Madrid

Abstract:This thesis investigates how Bayesian principles can deepen our understanding of modern deep learning systems. While neural networks achieve remarkable predictive performance, their ability to generalize and to quantify uncertainty remains only partly understood. This thesis approaches this challenge from both methodological and theoretical angles: unifying Bayesian inference, function-space modeling, and large-deviation theory under a common probabilistic perspective. On the methodological side, the thesis introduces the Deep Variational Implicit Process (DVIP), a scalable Bayesian framework that extends implicit processes to deep architectures. Complementing this, two post-hoc methods – the Variational Linearized Laplace Approximation (VaLLA) and the Fixed-Mean Gaussian Process (FMGP) – are proposed to equip pretrained deterministic networks with calibrated uncertainty estimates. The theoretical contributions focus on one of the central open questions in modern machine learning: why do large, over-parameterized neural networks generalize so well? To address this, the thesis develops a unified probabilistic framework that connects three key mechanisms – diversity, smoothness, and stochasticity – within the language of PAC-Bayesian and large-deviation theory.

[LG-65] FlowMo-WM: A World Model with Object Momentum and Hidden Ambient Drift

Link: https://arxiv.org/abs/2606.13817
Authors: Yitao Jiang, Luyang Zhao, Muhao Chen, Devin Balkcom
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)

Abstract:World models in robot learning predict future states from visual observations and actions, enabling agents to reason about the consequences of their controls. However, many action-conditioned models are evaluated in settings where motion is dominated by immediate control, whereas aquatic surface vehicles and other real-world objects continue moving under inertia and are displaced by hidden ambient drift, such as water currents or wind. We propose FlowMo-WM, an end-to-end trainable visual world model that infers object-centric motion state and a predictive long-history context associated with hidden drift from image-action histories without direct supervision of flow fields. FlowMo-WM factorizes image-action history into a short-history latent state, trained to summarize object-centric motion, and a longer-history context, trained to summarize slowly varying exogenous influences. A zero-context residual transition separates action-conditioned base dynamics from context-dependent drift effects during latent rollout. In simulated aquatic surface-vehicle environments with diverse hidden flows, disturbances, and randomized vehicle dynamics, FlowMo-WM improves long-horizon rollout accuracy over representative action-conditioned latent world models. Prediction-time context ablations, in which the inferred context is zeroed or shuffled during rollout, show that the ambient context is important for stable prediction under hidden drift, while frozen linear probes characterize information encoded in the learned factors.

[LG-66] Neural Slack Variables for Shape Constraints

Link: https://arxiv.org/abs/2606.13803
Authors: Ruben Wiedemann, Antoine Jacquier, Lukas Gonon
Subjects: Machine Learning (cs.LG)

Abstract:Enforcing functional inequality constraints such as monotonicity and convexity in neural networks is a fundamental challenge in many industrial and scientific applications. Classical one-sided penalty methods, along with primal-dual methods gated by complementary slackness, provide constraint gradients only at violated locations, resulting in fragile satisfaction. Architectures that guarantee feasibility by construction, on the other hand, remain largely limited to elementary cases and impose additional inductive biases. We introduce neural slack variables, a deep learning native primal-side approach that converts constraint enforcement into a regression problem by coupling the primary network with a jointly learned auxiliary network. The auxiliary network serves as a valid target for the primary network’s constraint quantities, inducing feasibility and regularity. Neural slack variables achieve zero measured violations on dense-grid monotonicity and convexity test cases, where penalty and primal-dual baselines leave residual violations, and enable arbitrage-free learning of volatility surfaces, an open industrial challenge in quantitative finance.

[LG-67] Neural Variability Enhances Artificial Network Robustness

Link: https://arxiv.org/abs/2606.13801
Authors: Robin Preble, Praveen Venkatesh, Stefan Mihalas, Kameron Decker Harris
Subjects: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)

Abstract:Neural responses in cortex exhibit substantial trial-to-trial variability in response to repeated stimuli, while peripheral sensory neurons respond far more consistently, leading many to wonder whether stochasticity may carry meaning. Existing work has argued that noise and signal correlations may be optimized for discrimination in animals, whereas artificial neural network (ANN) studies have shown similar benefits of noise in machine learning tasks, although most ANN work has neglected the effects of correlations. Here we investigate whether correlated noise improves the robustness of artificial neural networks to adversarial attacks and naturalistic image modifications. Using the covariance of activations under modified versus clean inputs, we find that structured noise may significantly improve network robustness. Robustness to naturalistic image modifications benefits most from structure, but this structure transfers poorly across modification types. In contrast, noise structure from adversarial attacks can generalize to other kinds of attacks. These results suggest that structured noise in ANN activations generally improves robustness, establishing a biologically plausible strategy for creating robust artificial neural networks that only relies on local information.

[LG-68] The Program Is Still There: A Conservation Law for Program Discovery

Link: https://arxiv.org/abs/2606.13799
Authors: Jorge Miguel Silva
Subjects: Computational Complexity (cs.CC); Machine Learning (cs.LG)
Comments: 9 pages main text and 33 pages supporting information. Engine source and full sweep data: this https URL, archived at DOI: https://doi.org/10.5281/zenodo.20634984

Abstract:Finding the shortest program that generates a sequence is uncomputable, and for six decades that fact has been mistaken for a wall around finding any generating program. It is not a wall but a price, and this paper measures it. For every algorithm that learns about a candidate program only through its score, a class spanning Levin search, evolutionary methods, simulated annealing, and the cross-entropy method, we define the coupling width of a search problem and prove an unconditional worst-case lower bound, exponential in that width with base one less than the domain size. From it follows a conservation law: structural knowledge injected into a search trades one for one against the search it removes, and their sum can never fall below the length of the program sought. Levin’s 1973 upper bound and the lower bound proved here are the two ends of one conserved quantity, closing on each other as the instruction set grows. The only escape is to read a candidate’s structure rather than its score, and its price, which we prove for generic targets, is incompleteness. A deterministic engine built on this theory recovers a generating program, certified by compressing its data and predicting an unseen continuation, for 2,383 of 3,914 sequences across four independent populations, including 244 of the 256 elementary cellular automata, with measured discovery cost rising along program length more than an order of magnitude inside the score-oracle worst case.

[LG-69] Diffusion Policy Optimization without Drifting Apart

Link: https://arxiv.org/abs/2606.13795
Authors: Haozhe Jiang, Haiwen Feng, Pieter Abbeel, Jiantao Jiao, Angjoo Kanazawa, Nika Haghtalab
Subjects: Machine Learning (cs.LG)
Comments: Project page: this http URL

Abstract:RL post-training has become increasingly pivotal for improving diffusion policies, but existing diffusion policy-gradient methods are often unstable and cannot achieve reliable policy improvement. We identify the cause as the double-drift phenomenon: optimizing a variational surrogate can let the ELBO separate from the true log-likelihood, which then makes the resulting proxy policy gradient misaligned with the true policy gradient of expected return. We propose DiPOD, a diffusion policy optimization framework that maintains tight-bound behavior throughout training by interleaving self-distillation with policy-improving gradient updates. This leads to a simple and practical algorithm: augmenting each diffusion policy-gradient update with an on-policy ELBO regularizer. Across diffusion language model post-training and continuous-control diffusion policies, DiPOD substantially stabilizes training and reaches higher rewards than previous methods.

[LG-70] D2H-AD: A Hybrid Model Utilizing Hyperdimensional Computing for Advanced Anomaly Detection

Link: https://arxiv.org/abs/2606.13754
Authors: Ghazal Ghajari, Elaheh Ghajari, Ashutosh Ghimire, Saeid Ataei, Faris Alsulami, Fathi Amsaad
Subjects: Machine Learning (cs.LG)

Abstract:Anomaly detection is a fundamental component of intelligent systems with applications in healthcare, cybersecurity, smart grids, and IoT environments. Although conventional machine learning and deep learning methods have demonstrated effectiveness in identifying anomalies, they often rely on large labeled datasets, incur high computational costs, and face scalability challenges in edge and high-dimensional settings. This paper presents D2H-AD, a novel anomaly detection framework based on Hyperdimensional Computing (HDC), a brain-inspired paradigm that represents information using high-dimensional distributed vectors. Unlike existing HDC-based methods, D2H-AD integrates distance-based similarity and density-aware encoding within a unified framework, improving anomaly representation and detection performance. Ablation studies show that hyperdimensional encoding alone yields up to 5.4% higher ROC-AUC than applying the same density-distance scoring directly in the original feature space. Furthermore, D2H-AD consistently outperforms five established baselines, namely HDAD, ODHD, One-Class SVM, Isolation Forest, and Autoencoders, across all evaluated datasets. The framework is lightweight, interpretable, and computationally efficient, making it suitable for resource-constrained and real-time applications. We validate D2H-AD on five benchmark datasets and demonstrate superior F1-score and ROC-AUC performance, together with robustness to class imbalance, noise, and data complexity. In addition to improved accuracy, D2H-AD offers scalability, a small memory footprint, and low-latency operation enabled by binary computations and a compact design. These properties make it particularly attractive for TinyML and edge AI deployments. The proposed framework highlights the potential of HDC for accurate, interpretable, and energy-efficient anomaly detection in dynamic environments.

[LG-71] FedSPC: Shared Parameter Correction for Personalized Federated Learning IJCAI2026 IJCAI’26

Link: https://arxiv.org/abs/2606.13748
Authors: Kannanthodath Induchoodan Ajay Menon, Christian Prehofer, Yunfei Xu, Toru Hirano
Subjects: Machine Learning (cs.LG)
Comments: Accepted for presentation at FL@FM-IJCAI’26, in conjunction with IJCAI 2026. 9 pages

Abstract:Personalized federated learning (PFL) is one of the important approaches in federated learning for addressing statistical heterogeneity while enabling client-specific adaptation. Many PFL methods split the model into shared and personalized parameters, which are jointly trained on each client. However, this creates an optimization issue: shared parameters are updated by clients optimizing different local objectives, which can lead to inconsistent shared updates and weaken the shared representation. To address this problem, we propose Federated Shared Parameter Correction (FedSPC), a modular correction method for PFL. FedSPC applies control-variate correction only to the shared parameters of a given PFL method, while leaving personalized parameters unchanged. It can be integrated into three common PFL settings: shared feature extractors, shared classifiers, and fully shared models with local regularization. Experiments on CIFAR-100 and Tiny-ImageNet with ViT, ResNet-34, and VGG-11 show that FedSPC improves performance across representative PFL methods, including FedPer, FedRep, FedBABU, LG-FedAvg, and Ditto.
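The abstract above says the correction is a control variate applied only to the shared parameters. A minimal sketch of that idea follows, assuming a SCAFFOLD-style corrected gradient `g - c_local + c_global`. The abstract does not give the exact update rule, so the function `local_step`, its arguments, and the learning rate are hypothetical illustrations, not the authors' implementation.

```python
from typing import List, Tuple


def local_step(
    shared: List[float],
    personal: List[float],
    grad_shared: List[float],
    grad_personal: List[float],
    c_global: List[float],
    c_local: List[float],
    lr: float = 0.1,
) -> Tuple[List[float], List[float]]:
    """One local SGD step on a client (illustrative, not the paper's code).

    Assumption: the shared block uses a SCAFFOLD-style control-variate
    correction, i.e. the gradient g is replaced by g - c_local + c_global.
    The personalized block is updated with its plain gradient.
    """
    new_shared = [
        w - lr * (g - cl + cg)
        for w, g, cl, cg in zip(shared, grad_shared, c_local, c_global)
    ]
    new_personal = [w - lr * g for w, g in zip(personal, grad_personal)]
    return new_shared, new_personal


# Toy usage: the client's own drift estimate (c_local) cancels its gradient,
# so the shared weight moves along the global estimate (c_global) only.
shared_out, personal_out = local_step(
    shared=[1.0], personal=[1.0],
    grad_shared=[2.0], grad_personal=[2.0],
    c_global=[0.5], c_local=[2.0],
)
```

In this toy call the shared weight moves by `lr * 0.5` while the personalized weight moves by `lr * 2.0`, which shows how the correction touches only the shared block.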

[LG-72] BigPower: Hierarchical Source-Level Module Power Estimation for CPUs with Large Language Models

Link: https://arxiv.org/abs/2606.13747
Authors: Honghua Zhu,Chunjie Luo,Jianfeng Zhan
Categories: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Comments: 12 pages, 10 figures

Abstract:Accurate power estimation is important for understanding and optimizing CPU power behavior, yet practical workflows often rely on simulation-derived information or post-silicon analysis. In this work, we present BigPower, a hierarchical source-level surrogate model for fine-grained module-level power estimation during CPU design. BigPower leverages large language model-based representations together with architectural hierarchy, module connectivity, configuration parameters, and workload context to estimate module-level power consumption directly from source-level design information, without requiring additional simulation during inference. Experimental results on the open-source XiangShan processor family demonstrate practical fine-grained power estimation across diverse configurations and workloads, offering an efficient alternative to conventional simulation-based workflows.

[LG-73] High-Frequency Pricing at Scale for E-Commerce

Link: https://arxiv.org/abs/2606.13741
Authors: Stefan Birr,Tobias Huelden,Mones Raslan,Adele Gouttes,Andreas Schmitt,Mateusz Koren,Johannes Stephan,Robert Streek,Manuel Kunz,Tim Januschowski
Categories: Machine Learning (cs.LG)
Comments:

Abstract:This paper presents the design, development, and implementation of a specialized forecast-then-optimize algorithmic pricing tool for sales campaigns in fashion e-commerce. Sales events present unique challenges for pricing including volatile demand patterns, rapid pricing decisions, and the need to balance short-term revenue with long-term profitability. We describe our approach combining daily-resolution demand forecasting using gradient-boosted trees with a multi-objective optimization framework that maximizes both long-term profit and net merchandise value for more than 5 million articles. Our solution addresses key limitations of existing weekly-granularity systems by implementing a forecast-then-optimize architecture that reduces pricing decision time from hours to minutes. We validate our approach through 23 A/B tests across 12 markets during 2023-2024 sales campaigns at Zalando, one of Europe’s leading online fashion retailers. Experimental results demonstrate that the new pricing system achieves approximately 6% higher profit while maintaining equivalent performance on sales and revenue compared to the previous manual-algorithmic hybrid approach. Based on these results, the algorithm was successfully deployed to production and now handles the majority of algorithmic pricing decisions for sales campaigns at the company.

[LG-74] Efficient On-Device Diffusion LLM Inference with Mobile NPU

Link: https://arxiv.org/abs/2606.13740
Authors: Tuowei Wang,Yanfan Sun,Ju Ren
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Diffusion large language models (dLLMs) accelerate generation by denoising multiple tokens in parallel, making them attractive for latency-sensitive mobile inference. However, repeated denoising introduces substantial computation on smartphones. Mobile neural processing units (NPUs) offer high-throughput dense matrix computation, but efficiently exploiting them remains challenging: token commitment shrinks per-block effective workloads, token revision complicates KV cache reuse, and limited NPU-visible address space incurs costly remapping and data transfer overheads. In this paper, we propose the first NPU-aware inference framework for accelerating dLLMs on smartphones. The framework aligns block-wise dLLM inference with the execution characteristics of mobile NPUs through three techniques. (1) Multi-Block Speculative Decoding fills the shrinking workload in late-stage current-block decoding with speculative future-block tokens. (2) Dual-Path Progressive Revision keeps committed tokens revisable until stable and refreshes unstable tokens through a CPU-side path without stalling dense NPU execution. (3) Swap-Optimized Memory Runtime compacts NPU-visible address layouts and overlaps data staging with NPU computation to reduce remapping and transfer overheads. We implement the system as an end-to-end framework and evaluate it across diverse hardware platforms and dLLM workloads. It reduces LLaDA-8B generation latency by 17x-42x over the CPU baseline with prefix KV cache reuse, while preserving generation quality.

[LG-75] How Task Structure Limits Multi-Agent Success: An Information-Theoretic Analysis

Link: https://arxiv.org/abs/2606.13733
Authors: Shi Pan,Ming Luo
Categories: Information Theory (cs.IT); Machine Learning (cs.LG)
Comments:

Abstract:Multi-agent systems (MAS) were expected to overcome the limitations of single-agent systems (SAS) through collaboration. However, under typicality conditions on the task’s constraint graph and bounded inter-agent communication, we prove that the success probability of a MAS is closely tied to the connectivity of task constraints, where each agent has limited information-processing capacity. Specifically, the success probability decays exponentially with an information bottleneck that emerges from partitioning the task’s constraint graph among agents. We define this quantity as the minimum cut cost $C_{\min}$ of the potential constraint graph of each task. This information-theoretic bound applies to both open systems with external feedback and closed systems without. We validate our theory on both synthetic experiments and real-world empirical data from SWE-bench submissions. From our framework, effective MAS design should incorporate task-inherent constraints alongside engineering optimization, and when $C_{\min}$ is high, practitioners should restructure tasks rather than simply scaling agents or communication.
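To make the minimum cut cost concrete, here is a small brute-force sketch. The abstract does not give the formal definition, so this is an assumption: the cut cost of a partition is taken as the total weight of constraint edges whose endpoints land on different agents, and the minimum is over all assignments in which no agent holds more than `capacity` subtasks. The names `cut_cost` and `min_cut_cost` are hypothetical.

```python
from itertools import product


def cut_cost(edges, assignment):
    """Total weight of constraint edges whose endpoints sit on different agents."""
    return sum(w for u, v, w in edges if assignment[u] != assignment[v])


def min_cut_cost(nodes, edges, n_agents, capacity):
    """Brute-force minimum cut cost over capacity-respecting assignments.

    Assumption (not from the paper): each agent can hold at most
    `capacity` subtasks. Exponential in len(nodes); for toy graphs only.
    """
    best = None
    for labels in product(range(n_agents), repeat=len(nodes)):
        if any(labels.count(a) > capacity for a in range(n_agents)):
            continue
        cost = cut_cost(edges, dict(zip(nodes, labels)))
        if best is None or cost < best:
            best = cost
    return best


nodes = ["a", "b", "c", "d"]
# Loosely coupled task: a chain of constraints a-b-c-d.
chain = [("a", "b", 1), ("b", "c", 1), ("c", "d", 1)]
# Tightly coupled task: every pair of subtasks shares a constraint.
clique = [(u, v, 1) for i, u in enumerate(nodes) for v in nodes[i + 1:]]

chain_cost = min_cut_cost(nodes, chain, n_agents=2, capacity=2)
clique_cost = min_cut_cost(nodes, clique, n_agents=2, capacity=2)
```

With two agents of capacity two, the chain can be split across a single constraint, while every split of the clique severs four. Under the abstract's claim, the clique is the task to restructure rather than to staff with more agents.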

[LG-76] Cluster LOCO: Feature Importance For Interpreting Clusters

Link: https://arxiv.org/abs/2606.14592
Authors: Claire M. He,Genevera I. Allen
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME)
Comments: 36 pages, 12 figures

Abstract:Clustering is widely used for exploratory analysis and scientific discovery, driving insights from market segmentation to biological data analysis, but its outputs can be difficult to interpret, audit, and reproduce as modern datasets become increasingly large and complex. Reliable use of clustering requires understanding which features drive the discovered structure, yet feature-level explanations for clustering remain scarce compared with methods in supervised learning. Furthermore, existing clustering feature importance scores are often tied to specific algorithms and data assumptions. To address these challenges, we propose Cluster LOCO (Leave-One-Covariate-Out), a family of model-agnostic feature importance scores for clustering. Cluster LOCO is built on feature occlusion and clustering generalizability, defined as whether cluster labels learned on one subset of the data can be accurately predicted on held-out samples. For any chosen clustering algorithm, Cluster LOCO quantifies a feature’s importance by measuring how much its removal degrades generalizability. We first introduce Cluster LOCO-Split, which relies on data splitting, and then extend it to Cluster LOCO-MP, a minipatch ensemble-based version designed for large-scale data. Across synthetic simulations and an application to cell-type discovery in single-cell transcriptomics, we show that Cluster LOCO more reliably recovers informative features than existing clustering feature importance methods.

[LG-77] Free Heavy-Tailed Lunch for Muon: A Theoretical Justification of Empirical Success

Link: https://arxiv.org/abs/2606.14560
Authors: Florian Hübler,Thomas Pethick,Suvrit Sra
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:Non-Euclidean optimisation methods with matrix-valued updates, such as Muon and Scion, have recently shown strong empirical performance for training Transformer models, yet their theoretical advantages over Euclidean methods remain poorly understood. We address this gap in the heavy-tailed non-convex regime, where stochastic gradients have bounded $p$-th central moments, $p \in (1,2]$. We show that certain non-Euclidean methods achieve optimal sample complexity under stronger stationarity measures, while Euclidean methods incur additional dimension-dependent costs. As a consequence, for $m \times n$ matrices, Muon finds an $\varepsilon$-stationary point in nuclear norm within $\mathcal{O}\left(\min\{m, n\} \frac{\Delta_1 L}{\varepsilon^2} \left(\frac{\sigma}{\varepsilon}\right)^{\frac{p}{p-1}}\right)$ samples, absorbing heavy-tailed noise without extra dimension dependence, unlike Euclidean methods. We further prove this sample complexity, including its dimension dependence, is optimal for all first-order methods under nuclear-norm stationarity. Experiments on large language models support our theory. Surprisingly, our results suggest that other Schatten geometries beyond the spectral geometry of Muon can perform competitively in certain settings.

[LG-78] Beyond the Training Distribution: Evaluating Predictions Under Distribution Shift and Selection Bias

Link: https://arxiv.org/abs/2606.14506
Authors: Annie Ulichney,Amanda Coston
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Comments:

Abstract:Understanding how a prediction model will perform in a new environment before deployment is essential to preventing harm when algorithms inform decision-making. Two common sources of model performance degradation are (i) covariate shift, where the target covariate distribution differs from the source, and (ii) selective labels, where the observability of outcomes depends on historical decisions. We study pre-deployment model evaluation under the joint presence of covariate shift and labeling of outcomes selectively based on observed features. In particular, we present a double machine learning procedure for estimating the target risk of an arbitrary black-box prediction model under a general loss function. We show identification of this estimand under standard assumptions and derive a bias-corrected estimator based on the influence function of the target risk. Finally, we evaluate our estimator through experiments using the eICU electronic health records database, showing that it tracks the true target risk more accurately than methods that address either selective labels or covariate shift alone, as well as baselines that combine standard plug-in approaches.

[LG-79] Machine-learned particle flow as a foundation model for collider physics

Link: https://arxiv.org/abs/2606.14373
Authors: Farouk Mokhtar,Joosep Pata,Michael Kagan,Javier Duarte
Categories: High Energy Physics - Experiment (hep-ex); Machine Learning (cs.LG); High Energy Physics - Phenomenology (hep-ph); Data Analysis, Statistics and Probability (physics.data-an); Instrumentation and Detectors (physics.ins-det)
Comments: 15 pages, 11 figures

Abstract:The workflow from particle collision to physics analysis passes through a series of reconstruction steps that are traditionally modular and disconnected, with no shared representation linking low-level detector data to high-level analysis tasks. We show that casting event reconstruction as a machine learning problem naturally produces such a shared representation. We repurpose a machine learning model trained for particle-flow reconstruction (MLPF) to perform three distinct analysis tasks: jet flavor identification, jet energy regression, and missing momentum regression. By appending the per-particle latent representations learned during reconstruction as additional input features, we substantially improve over baselines that use kinematic features alone. We further demonstrate that a single linear layer trained using only the latent representations achieves competitive performance against state-of-the-art baseline architectures, and outperforms the baseline for missing momentum regression with approximately 35 times fewer parameters. These results demonstrate that the latent representations learned during reconstruction encode essential physics information needed for downstream analysis, establishing MLPF as a foundation model and offering a concrete step toward an end-to-end pipeline from detector data to physics analysis.

[LG-80] Recovery thresholds for hidden weighted sparse graphs

Link: https://arxiv.org/abs/2606.14335
Authors: Zhe Hou,Jingcheng Liu
Categories: Statistics Theory (math.ST); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: 34 pages, 4 figures

Abstract:Recovering structural information from noisy high-dimensional data is a fundamental task in statistical inference. We investigate the recovery thresholds for a graph hidden in a randomly weighted complete graph. Specifically, an unknown graph $H^* \in H_n$ is chosen uniformly at random and hidden in a complete graph on $n$ vertices as follows: the weight of an edge $e \in H$ is distributed independently according to $P_n$; otherwise the weight is distributed independently according to $Q_n$. The goal is to recover almost all of $H$ from these edge weights. Assuming a local Lipschitzness of the Rényi divergence between distributions $P_n$ and $Q_n$, and a mild density condition for the graphs $H_n$, we give a unified characterization of the information-theoretic limit for recovering almost all of $H$ (also known as almost exact recovery). Our characterization connects the KL divergence between $P_n$ and $Q_n$ to the logarithm of the first moment threshold of $H$ in the Erdős-Rényi random graph model $G(n,p)$. Our lower bound also extends to the task of partial recovery, in which only a constant $\lambda$-fraction of $H$ needs to be recovered. Last but not least, for certain Bernoulli and Exponential regimes, and for Gaussian distributions, we are able to show an All-or-Nothing (AoN) threshold phenomenon at the exponential scale.

[LG-81] Nonlocal Bayesian Modeling of Continuous Spatio-Temporal Dynamics UAI2026

Link: https://arxiv.org/abs/2606.14313
Authors: Jaeyeong Lee,Heeyoung Kim
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: Accepted at UAI 2026

Abstract:Real-world spatio-temporal forecasting must handle irregular time points, spatially sparse observations, and the need for uncertainty quantification. This setting is often further compounded by nonlocal interactions (long-range spatial coupling). Modeling continuous-space, continuous-time nonlocal dynamics naturally leads to infinite-dimensional integro-differential equations (IDEs), making principled Bayesian inference intractable. We propose the NonLocal Bayesian Spatio-Temporal model (NLBST), a hierarchical Bayesian framework for continuous spatio-temporal fields that learns explicit nonlocal coupling while retaining tractable inference. NLBST represents the latent field via a coordinate-based spatial basis expansion and models the coefficient process with a continuous-time ODE whose learnable linear operator corresponds to a Galerkin reduction of a nonlocal IDE; a Neural ODE residual captures additional nonlinear dynamics. A linear-Gaussian observation model enables Kalman-style sequential updates under missing and irregular observations, while the spatial basis representation enables inductive prediction at unmeasured locations without retraining. Global parameters are learned via variational inference, and uncertainty is handled through a Bayesian hierarchy. Experiments on synthetic and real-world datasets demonstrate strong forecasting and spatial generalization with well-calibrated uncertainty, yielding substantial gains over baselines in strongly nonlocal and partially observed regimes.

[LG-82] Operator Calculus for Population-Based Optimization: A Mean-Field Convergence Theory

Link: https://arxiv.org/abs/2606.14289
Authors: Pekka Malo,Lauri Viitasaari,Patrik Nummi,Antti Suominen,Ankur Sinha,Olli Tahvonen
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Numerical Analysis (math.NA); Machine Learning (stat.ML)
Comments: 71 pages, 4 figures, 2 tables; ancillary files contain Python code reproducing the numerical experiments

Abstract:Population-based and distributional optimization methods, from evolution strategies and consensus-based optimization to covariance-matrix adaptation and stochastic gradient methods viewed as distributional dynamics, are widely used for nonconvex or black-box problems, yet their convergence analyses remain fragmented across algorithm-specific techniques. We introduce an operator calculus in which a broad class of such methods, after choosing an appropriate state space and, where necessary, augmenting the state by memory or strategy variables, is described as a composition of three elementary operators (mutation, selection, and recombination) acting on probability measures. Under explicit stability and regularity conditions, the composite operator admits a pre-generator whose continuous-time limit is a transport-reaction-jump (TRJ) PDE that preserves the operator splitting. On this foundation we establish a modular Lyapunov principle. If a state-space Lyapunov function both dissipates under the full generator and controls the relevant search-space gauges, then the state-space Lyapunov functional and the induced search errors decay exponentially. The additive generator structure allows dissipation estimates to be assembled operator by operator, providing a toolkit for certifying convergence of composite mean-field algorithms.

[LG-83] Gradient boosting for extremes: sampling theory and application to insurance

Link: https://arxiv.org/abs/2606.14268
Authors: Stéphane Lhaut,Olivier Lopez
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 36 pages, 10 figures

Abstract:We develop a statistical learning theory for gradient boosting applied to the estimation of covariate-dependent Generalized Pareto (GP) distributions in the context of Peaks-over-Threshold modeling. After an orthogonal reparametrization of the GP likelihood that diagonalizes its Fisher information matrix, we cast the estimation problem within the Empirical Risk Minimization (ERM) framework and derive non-asymptotic error bounds for the boosting estimator. Our analysis accounts for three distinct sources of error in the process: statistical fluctuations, the approximation bias inherent to the asymptotic nature of the GP model (controlled under second-order regular variation), and the approximation error associated with the finite number of boosting iterates, making explicit the resulting bias-variance trade-off. We illustrate the practical benefits of the reparametrization through simulations, showing that it significantly reduces gradient correlation during training and improves convergence stability. The methodology is applied to a medical malpractice insurance dataset from the Texas Department of Insurance, comprising over 18,000 closed claims. The gradient boosting approach yields a good fit for the tail of settlement cost distributions and reveals that the number of days to settlement is the dominant predictor of tail heaviness, consistent with earlier findings in the reserving literature.

[LG-84] Hybrid Uncertainty Sensitivity Analysis Based on the HSIC for High-Dimensional Responses with Aleatory–Epistemic Separation

Link: https://arxiv.org/abs/2606.14053
Authors: Shijie Zhong,Jiangfeng Fu,Pengfei Wei
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 19 pages, 7 figures

Abstract:Quantifying the influence of hybrid aleatory and epistemic uncertainties on high-dimensional system responses remains a major challenge in global sensitivity analysis (GSA). Existing Hilbert–Schmidt Independence Criterion (HSIC)-based approaches are primarily restricted to single-output settings and lack a rigorous decomposition of heterogeneous uncertainty sources and their interactions. To address this limitation, a novel double-space tensor-product RKHS framework is proposed for sensitivity analysis under hybrid uncertainty. By constructing factorized kernels over both the latent input space and the multidimensional output space, a concurrent double Möbius inversion is derived to orthogonally decompose the global dependence measure into pure aleatory effects, pure epistemic effects, and their interaction contributions. The resulting dimension-wise sensitivity indices preserve the uncertainty attribution structure across all output dimensions. To satisfy the independence assumptions required by the decomposition, an auxiliary-variable representation based on the inverse probability integral transform is introduced, enabling the treatment of hierarchical uncertainties and Copula-induced correlations within a unified latent space. A fully vectorized single-loop implementation is further developed to avoid the computational burden of nested Monte Carlo simulation. Statistical significance and estimation uncertainty are quantified through permutation testing and Bootstrap confidence intervals. Numerical studies on a modified multi-output Ishigami function and an aerodynamic pressure-field problem demonstrate the accuracy, scalability, and practical applicability of the proposed framework.

[LG-85] Anytime-Valid Confirmation of Label-Shift Corrections ICML2026

Link: https://arxiv.org/abs/2606.14028
Authors: Seungjin Choi
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: ICML 2026 Workshop on Hypothesis Testing

Abstract:In small-batch scientific deployments, labeled target outcomes may be too scarce for reliable shift estimation even when unlabeled target inputs are available. We address the complementary setting where the practitioner has a pre-specified label-shift correction from domain knowledge and asks whether incoming labeled outcomes support it. We show that the per-observation likelihood ratio between a label-shift-corrected predictive and the source predictive is a conditional e-value, so its running product is a nonnegative martingale and Ville’s inequality yields an anytime-valid confirmation rule. The log martingale equals the cumulative negative log-predictive density (NLPD) gap between the source and the corrected predictive, converting routine model monitoring into a formal sequential test. Rejection means the incoming data support the posited correction relative to the source predictive, but it is not a precise estimate of the degree of shift. Closed forms are available for GP sources with Gaussian label-shift ratios. GP regression simulations validate Type I control, finite-sample power, miscalibration sensitivity, and the small-batch advantage of a reliable prior over label-based re-estimation.
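The confirmation rule described above can be sketched in a few lines: accumulate the log likelihood ratio of the corrected predictive against the source predictive, and stop once it crosses $\log(1/\alpha)$. By Ville's inequality this has false-confirmation probability at most $\alpha$ when the data follow the source predictive. The sketch below assumes fixed univariate Gaussian predictives for simplicity, whereas the paper works with input-conditional GP predictives. The function names are hypothetical.

```python
import math


def gauss_logpdf(y, mean, std):
    """Log density of N(mean, std^2) at y."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (y - mean) ** 2 / (2 * std ** 2)


def sequential_confirmation(ys, source, corrected, alpha=0.05):
    """Anytime-valid test of 'corrected' against 'source' (illustrative).

    `source` and `corrected` are (mean, std) pairs, an assumption made for
    this sketch. The running sum equals the cumulative NLPD gap between the
    source and the corrected predictive. Returns (stopping_time, log_martingale);
    stopping_time is None if the threshold log(1/alpha) is never crossed.
    """
    threshold = math.log(1.0 / alpha)
    log_m = 0.0
    for t, y in enumerate(ys, start=1):
        log_m += gauss_logpdf(y, *corrected) - gauss_logpdf(y, *source)
        if log_m >= threshold:
            return t, log_m
    return None, log_m


# Outcomes centred on the corrected mean: each adds about 0.5 to the log martingale.
t_shifted, _ = sequential_confirmation([1.0] * 10, source=(0.0, 1.0), corrected=(1.0, 1.0))
# Outcomes centred on the source mean: the log martingale drifts downward.
t_null, final_log_m = sequential_confirmation([0.0] * 10, source=(0.0, 1.0), corrected=(1.0, 1.0))
```

Because the threshold holds at every time step simultaneously, the practitioner may check after each new labeled outcome without any multiple-testing penalty.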

[LG-86] Geometric Domain Adaptation via Optimal Transport for Linear Regression in R2

Link: https://arxiv.org/abs/2606.14023
Authors: Brian Britos,Mathias Bourel
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Comments:

Abstract:Optimal Transport has recently become a powerful method for domain adaptation by aligning source and target distributions. We study a supervised domain adaptation problem where source and target domains are related by a rotation, a translation, or a homothety in $\mathbb{R}^2$. We prove that the optimal transport map recovers the underlying map when using a $p$-norm cost with $p \geq 2$. Based on this insight, we develop a method combining K-means and optimal transport to estimate the underlying map, enabling adaptation of linear regression models when target data is scarce. Simulations demonstrate improved performance over baseline methods. Rather than relying on highly expressive deep learning architectures, we focus on classical machine learning models to emphasize interpretability and theoretical insight. This perspective allows us to explicitly characterize the role of optimal transport in recovering geometric transformations such as rotations, translations, and homotheties. Our contributions include a theoretical result linking optimal transport and rotations, translations, and homotheties in $\mathbb{R}^2$, and a practical method for adaptation in linear regression offering both conceptual clarity and applied value in domain adaptation tasks in this space.
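The recovery claim can be checked on a toy case. For two equal-size point clouds with uniform weights, optimal transport reduces to finding the cheapest one-to-one matching. The sketch below brute-forces that matching under the squared Euclidean cost (an assumed instance of the abstract's cost family) for a pure translation, and reads the map off the matched pairs. It does not reproduce the paper's K-means step, and `ot_matching` is a hypothetical helper.

```python
from itertools import permutations


def sq_dist(a, b):
    """Squared Euclidean distance between two points in the plane."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def ot_matching(source, target):
    """Brute-force discrete OT between equal-size, uniformly weighted point sets.

    Tries every permutation, so it is only meant for a handful of points.
    Returns the list of (source_point, matched_target_point) pairs.
    """
    n = len(source)
    best_perm = min(
        permutations(range(n)),
        key=lambda perm: sum(sq_dist(source[i], target[perm[i]]) for i in range(n)),
    )
    return [(source[i], target[best_perm[i]]) for i in range(n)]


source = [(0, 0), (1, 0), (0, 2), (3, 1), (2, 3)]
shift = (5, -1)
# Target cloud: translated copy of the source, stored in reversed order so the
# correspondence is not given away by the indexing.
target = [(x + shift[0], y + shift[1]) for x, y in reversed(source)]

pairs = ot_matching(source, target)
displacements = {(t[0] - s[0], t[1] - s[1]) for s, t in pairs}
```

Every matched pair differs by the same vector, so `displacements` collapses to the single translation that generated the target cloud.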

[LG-87] XRDiff: Crystal Structure Prediction from Powder X-Ray Diffraction Data Using Diffusion Models

Link: https://arxiv.org/abs/2606.14003
Authors: Nofit Segal,Mingda Li,Benjamin Kurt Miller,Rafael Gómez-Bombarelli
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments:

Abstract:Determining the crystal structure of a material from its powder X-ray diffraction (PXRD) pattern is a central challenge in materials science. PXRD is an accessible and widely used characterization technique, yet recovering the atomic structure from diffraction data requires solving an underdetermined inverse problem due to the loss of phase information. Generative modeling can provide a prior over atomic structure and learn the mapping from PXRD patterns to crystal structures via simulated structure-spectrum pairs. We present XRDiff, a diffusion model that recovers crystal structures from PXRD given either the stoichiometry or, in a more challenging setting, the elemental constituents and total number of atoms in the unit cell. We evaluate on datasets where each stoichiometry has multiple polymorphs and all polymorphs of a given composition are held out together, ensuring that high performance reflects genuine use of the diffraction signal. XRDiff achieves strong structure recovery rates on simulated benchmarks, indicating that the model learns a spectrum-to-structure mapping precise enough to differentiate between polymorphs. To address generalization to experimental data, we compare a full-spectrum encoding against an encoding based on peak descriptors. The peak-based encoding generalizes substantially better, outperforming even a model trained on full spectra with augmentations fitted to the experimental noise distribution. These results demonstrate that representations robust to the noise and artifacts present in real-world PXRD offer a practical and scalable path toward closing the simulation-to-experiment gap, enabling zero-shot crystal structure solution from experimental PXRD with full or partial chemical composition input.

[LG-88] A General Framework for Decision Trees via Bregman Divergences

Link: https://arxiv.org/abs/2606.13984
Authors: Mathias Bourel
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Comments:

Abstract:Decision trees are one of the fundamental tools in statistical learning due to their interpretability, flexibility, and their ability to adapt to nonlinear structures. Among them, the Classification and Regression Trees, introduced by Breiman, Friedman, Olshen, and Stone in 1984, became one of the most influential algorithms and remains one of the most widely used methods for classification and regression problems. On the other hand, Bregman divergences, introduced by Lev Bregman in 1967 in the context of convex optimization, provide a broad family of loss functions that naturally generalize the squared Euclidean distance. This family includes, among others, the Kullback-Leibler divergence, the Poisson divergence, and the Itakura-Saito divergence, as well as several losses associated with distributions belonging to the exponential family. Moreover, Bregman divergences possess a rich geometric structure and deep connections with convex analysis and information geometry. In this work, we propose a generalization of the CART paradigm based on Bregman divergences, thereby obtaining a broader family of decision trees adapted to different statistical models and underlying geometries. Although algorithms such as CART or classical implementations such as rpart incorporate different impurity criteria, these are usually introduced in an ad hoc manner for each specific model. In contrast, the Bregman divergence approach provides a unified framework that allows these criteria to be derived and interpreted from common convex and geometric principles. Beyond the algorithmic construction, we also investigate theoretical properties of these trees. In particular, we study how properties of the generating convex function – such as strong convexity or smoothness – influence impurity gains between parent and child nodes, as well as stability and consistency properties of the estimator.
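As a rough illustration of the unified view, the sketch below defines a scalar Bregman divergence $D_\varphi(x, y) = \varphi(x) - \varphi(y) - \varphi'(y)(x - y)$ for three standard generators. It scores a node by the total divergence of its responses to their mean (the mean minimizes this quantity for any Bregman divergence) and picks the single-feature threshold with the largest impurity decrease. The helper names and the exhaustive split search are assumptions made for illustration, not the paper's algorithm.

```python
import math

# Generator phi and its derivative for three classical Bregman divergences.
GENERATORS = {
    "squared": (lambda t: t * t, lambda t: 2.0 * t),
    "poisson": (lambda t: t * math.log(t) - t, lambda t: math.log(t)),
    "itakura_saito": (lambda t: -math.log(t), lambda t: -1.0 / t),
}


def bregman(name, x, y):
    """D_phi(x, y) = phi(x) - phi(y) - phi'(y) * (x - y)."""
    phi, dphi = GENERATORS[name]
    return phi(x) - phi(y) - dphi(y) * (x - y)


def impurity(values, name):
    """Total divergence of the node's responses to the node mean."""
    mean = sum(values) / len(values)
    return sum(bregman(name, v, mean) for v in values)


def best_split(xs, ys, name):
    """Return (threshold, gain) maximizing the impurity decrease on one feature."""
    pairs = sorted(zip(xs, ys))
    parent = impurity([y for _, y in pairs], name)
    best = (None, 0.0)
    for k in range(1, len(pairs)):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # no threshold separates equal feature values
        left = [y for _, y in pairs[:k]]
        right = [y for _, y in pairs[k:]]
        gain = parent - impurity(left, name) - impurity(right, name)
        if gain > best[1]:
            best = ((pairs[k - 1][0] + pairs[k][0]) / 2, gain)
    return best


xs = [1, 2, 3, 4]
ys = [1.0, 1.0, 5.0, 5.0]  # positive responses, as the Poisson and Itakura-Saito generators need
splits = {name: best_split(xs, ys, name) for name in GENERATORS}
```

With the `"squared"` generator the impurity is the usual sum of squared deviations, so the CART regression criterion appears as one member of the family. Swapping the generator changes the loss without changing the tree-growing code.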

[LG-89] Adaptive Nucleus Truncation for Long-Form Reasoning

Link: https://arxiv.org/abs/2606.13982
Authors: Ousmane Amadou Dia
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Abstract:Sampling plays an important role in long-form language-model reasoning. Over thousands of decoding steps, small changes in the candidate token set can compound into different reasoning trajectories, stability profiles, and final answers. Existing truncation methods such as top-$p$, min-$p$, and fixed top-$n\sigma$ sampling improve over unrestricted sampling, but they rely on fixed thresholds that cannot adapt to changes in entropy, task difficulty, training stage, or generation budget. We introduce Adaptive Nucleus Truncation Sampling (ANTS), which extends top-$n\sigma$ sampling from a fixed decoding rule into an adaptive rollout-control mechanism for long-form generation. ANTS selects standardized neighborhoods around the maximum logit before temperature scaling, adapts the truncation width using an entropy-conditioned controller, and retains a no-truncation fallback arm to stabilize training when truncation becomes unsafe. On a 33B-total / 4B-active sparse Mixture-of-Experts reasoning model, ANTS improves average performance over percentage-based benchmarks by +1.9, +3.8, and +5.2 points at 8K, 16K, and 32K generation budgets, respectively. The strongest gains appear on instruction following and mathematical reasoning, with IFBench improving by more than 10 points at 32K and AIME 2025 improving by 7 points. Code generation reveals an important budget interaction. On Codeforces, ANTS trails the baseline at 8K, but reverses this gap and substantially improves ELO at 16K and 32K. These results suggest that sampler design should be treated not just as a decoding hyperparameter, but as part of how we stabilize and scale long-budget reasoning.
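For context, the fixed top-$n\sigma$ rule that ANTS builds on can be sketched as follows: keep every token whose raw logit lies within $n$ standard deviations of the maximum logit, then apply temperature and renormalize over the kept set. This shows only the fixed baseline. The entropy-conditioned controller and the fallback arm are not specified in the abstract and are not attempted here, and the function names are hypothetical.

```python
import math


def top_n_sigma_mask(logits, n=1.0):
    """Keep tokens whose raw logit is within n standard deviations of the max."""
    mean = sum(logits) / len(logits)
    sigma = math.sqrt(sum((z - mean) ** 2 for z in logits) / len(logits))
    threshold = max(logits) - n * sigma
    return [z >= threshold for z in logits]


def truncated_probs(logits, n=1.0, temperature=1.0):
    """Softmax over the kept tokens; the mask is computed before temperature scaling."""
    keep = top_n_sigma_mask(logits, n)
    scaled = [z / temperature for z in logits]
    top = max(s for s, k in zip(scaled, keep) if k)  # subtract for numerical stability
    weights = [math.exp(s - top) if k else 0.0 for s, k in zip(scaled, keep)]
    total = sum(weights)
    return [w / total for w in weights]


logits = [5.0, 4.5, 1.0, 0.0, -1.0]
mask = top_n_sigma_mask(logits, n=1.0)
probs = truncated_probs(logits, n=1.0, temperature=0.7)
```

Because the mask depends only on the raw logits, changing the temperature reshapes the probabilities inside the kept set but never changes which tokens survive. An adaptive scheme would then vary the width `n` from step to step.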

[LG-90] Classification of Astronomical Spectra Using PCA-Compressed Flux and Inverse-Variance Features

Link: https://arxiv.org/abs/2606.13978
Authors: Bruno Santos Meneses Barreto, Marcio Eisencraft
Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
Comments: This manuscript has been submitted to the Simpósio Brasileiro de Telecomunicações e Processamento de Sinais (SBrT) and is currently under peer review

Abstract:This paper evaluates a signal-processing and supervised-learning pipeline for classifying SDSS DR17 astronomical spectra into stars, galaxies, and quasars. Each spectrum is represented by its measured flux and inverse-variance information, combining spectral shape with a wavelength-dependent reliability profile. After resampling onto a common logarithmic wavelength grid, the flux and inverse-variance vectors are standardized and separately compressed using principal component analysis. The resulting components are concatenated and used to train several classifiers. The best performance was obtained with the LightGBM gradient-boosting classifier, reaching 94.6% accuracy and 92.1% balanced accuracy on the test set.

[LG-91] Binary Black Hole Parameter Estimation with Hybrid CNN-Transformer Neural Networks

Link: https://arxiv.org/abs/2606.13941
Authors: Panagiotis N. Sakellariou, Spiros V. Georgakopoulos, Sotiris Tasoulis, Vassilis P. Plagianakos
Categories: General Relativity and Quantum Cosmology (gr-qc); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
Comments: Accepted manuscript. 12 pages, 10 figures

Abstract:The detection of gravitational waves has revolutionized our ability to explore fundamental aspects of the Universe. Traditionally, modeled gravitational-wave signals have been identified using template-based matched filtering, followed by coincidence analysis across multiple detectors in the signal-to-noise ratio time series. Recent advances in Machine Learning and Deep Learning have sparked growing interest in their application to both signal detection and parameter estimation. In this study, a hybrid Deep Learning strategy is proposed that leverages the effectiveness of Transformer encoders alongside well-established Convolutional Neural Network architectures in an attempt to estimate the intrinsic and extrinsic parameters of non-precessing binary black hole systems. The primary focus of this work is point estimation, producing single best-fit values for each parameter rather than full posterior distributions. This method is evaluated on both simulated signals embedded in Gaussian noise and real gravitational-wave events, and it demonstrates strong predictive performance and robustness across key astrophysical parameters.

[LG-92] Direct/adaptive-mixture phase-gradient learning for neural-network quantum states with complex phase structure

Link: https://arxiv.org/abs/2606.13912
Authors: Yi-Ran Xue, Rui Wang, Baigeng Wang, Chenan Wei
Categories: Disordered Systems and Neural Networks (cond-mat.dis-nn); Strongly Correlated Electrons (cond-mat.str-el); Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Quantum Physics (quant-ph)
Comments: 24 pages, 8 figures

Abstract:Neural-network quantum states (NQS) are a leading variational tool for quantum many-body physics, yet their optimization is fragile whenever the ground state carries a non-trivial sign or complex phase structure, a situation generic to gauge fields, broken time-reversal symmetry, and fermionic statistics. We trace this fragility to the stochastic estimator of the phase gradient rather than to network expressiveness. The phase sector of the Monte Carlo energy gradient is a noisy score-function estimator; differentiating the local energy instead yields a direct estimator that is unbiased for the same phase force, has far lower variance, and requires only a separated amplitude–phase ansatz. Demonstrated on a 100-site flux ladder, a small network trained this way reaches 0.89% median error, where tuned standard baselines plateau at 1.8% and wider or deeper standard-gradient networks degrade from 8.4% to 24.6%. The advantage carries over to chiral XXX chains: the direct estimator again converges to a markedly lower error than the standard one, across $\alpha$ and size; it grows with flux and vanishes in zero-flux controls. An adaptive-mixture of the two estimators is provably never worse in variance than the better endpoint at the optimal mixing coefficient, with seed-resolved diagnostics tracing much of the gain to eliminating failed runs. Estimator design thus emerges as a first-class lever for complex-valued neural quantum states.
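The claim that a mixture is never worse in variance than the better endpoint follows from a generic identity for combining two unbiased estimators of the same quantity, sketched below. This is a textbook-style illustration under stated assumptions, not the paper's estimator: the samples are made up, and `optimal_mixing` is a hypothetical helper that minimizes the empirical variance of `lam * a + (1 - lam) * b`.

```python
import statistics

def optimal_mixing(a, b):
    """Variance-minimizing weight lam for the mixture lam * a + (1 - lam) * b."""
    va = statistics.pvariance(a)
    vb = statistics.pvariance(b)
    ma, mb = statistics.mean(a), statistics.mean(b)
    cab = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)
    denom = va + vb - 2 * cab  # equals the variance of (a - b)
    if denom == 0:
        return 0.5  # the two estimators differ only by a constant; any weight works
    return (vb - cab) / denom

def mixture(a, b, lam):
    """Per-sample mixture of the two estimators."""
    return [lam * x + (1 - lam) * y for x, y in zip(a, b)]

# Hypothetical per-sample values of two estimators of the same gradient component.
a = [1.2, 0.7, 1.5, 0.4, 1.1, 0.9]       # noisier estimator
b = [1.05, 0.95, 1.1, 0.9, 1.0, 1.02]    # lower-variance estimator
lam = optimal_mixing(a, b)
var_mix = statistics.pvariance(mixture(a, b, lam))
```

Since `lam = 0` and `lam = 1` are both admissible, the minimum over `lam` cannot exceed either endpoint's variance. In practice the weight would be estimated from finite samples, so the guarantee holds for the empirical moments used to fit it.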

[LG-93] Multi-Variable Stellar Parameter Estimation Using Residual Multitask Neural Networks

Link: https://arxiv.org/abs/2606.13868
Authors: Bruno Santos Meneses Barreto, Marcio Eisencraft
Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
Comments: This manuscript has been submitted to the Congresso Brasileiro de Automática (CBA) and is currently under peer review

Abstract:We present an end-to-end pipeline for estimating stellar parameters from Sloan Digital Sky Survey Data Release 12 spectra using a fully connected multitask neural network with residual blocks, whose hyperparameters are tuned via Bayesian optimization. The preprocessing pipeline includes per-spectrum standardization, RobustScaler normalization of the target variables – effective temperature $T_{\mathrm{eff}}$, metallicity $[\mathrm{Fe/H}]$, and surface gravity $\log g$ – and data augmentation via Gaussian noise injection. On a held-out test set, the model achieved Mean Absolute Errors (MAE) of $59.76~\mathrm{K}$ for $T_{\mathrm{eff}}$, $0.103~\mathrm{dex}$ for $[\mathrm{Fe/H}]$, and $0.130~\mathrm{dex}$ for $\log g$. Normalized against the full-scale range of each parameter, these results represent range-normalized errors between 1% and 3%, achieved with a highly efficient model complexity of approximately 540,000 trainable parameters. These results demonstrate that a compact residual multitask architecture, combined with principled signal preprocessing, provides a parameter-efficient solution for nonlinear parameter estimation in large-scale spectral datasets. In particular, the proposed model achieves competitive performance with substantially lower complexity than deeper neural network baselines.
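The range-normalized error mentioned in the abstract is, on a plain reading, the MAE divided by the full span of the target. The sketch below shows that arithmetic on made-up temperatures. The data and function names are assumptions for illustration only; the abstract does not state the parameter ranges the authors used.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def range_normalized_mae(y_true, y_pred):
    """MAE divided by the full-scale range (max - min) of the true values."""
    span = max(y_true) - min(y_true)
    return mean_absolute_error(y_true, y_pred) / span

# Hypothetical effective temperatures in kelvin, not taken from the paper.
teff_true = [4500.0, 5200.0, 5800.0, 6400.0, 7000.0]
teff_pred = [4550.0, 5150.0, 5860.0, 6380.0, 6960.0]

mae = mean_absolute_error(teff_true, teff_pred)    # in kelvin
nmae = range_normalized_mae(teff_true, teff_pred)  # dimensionless fraction
```

Dividing by the span makes errors on quantities with different units (kelvin versus dex) comparable on one percentage scale.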

[LG-94] Closed-loop discovery of out-of-distribution processing protocols by evolutionary search and uncertainty-aware learning

Link: https://arxiv.org/abs/2606.13859
Authors: Yu Liu, Stanislav Udovenko, Ching-Che Lin, Jaegyu Kim, Lane W. Martin, Susan Trolier-McKinstry, Sergei V. Kalinin
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)

Abstract:Many materials and chemical systems exhibit history-dependent responses, where functional outcomes are governed not only by final-state variables but by the time-dependent sequence of fields, temperatures, or chemical potentials applied during operation. Discovering new processing protocols is therefore a high-dimensional search problem in which the control variable is an entire waveform or sample history, and conventional strategies either remain confined to conservative interpolative families or become prohibitively measurement intensive. Here, a closed-loop workflow is introduced that couples evolutionary search over a compact waveform representation with uncertainty-aware deep kernel learning to generate, rank, and experimentally validate candidate protocols. Applied to ferroelectric thin films, with the scanning-probe tip-bias waveform as the protocol and the nonlinear electromechanical response as the reward, the workflow discovers waveform families that enhance nonlinearity by de-aging the film. Spatially resolved before/after measurements show that the best-performing waveforms selectively activate pre-existing, weakly pinned domain-wall segments, whereas the worst drive long-range irreversible switching. This framework reframes protocol tuning as out-of-distribution discovery, generalizable to synthesis and annealing trajectories, battery formation protocols, and other high-dimensional control problems.

[LG-95] Scalable Deep Unfolding of Conic Optimizers

Link: https://arxiv.org/abs/2606.13825
Authors: Alex Oshin, Rahul Vodeb Ghosh, Evangelos A. Theodorou
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)

Abstract:Deep unfolding (DU) accelerates iterative optimizers by introducing learnable components and training them through unrolled iterations, but extending DU to the large-scale semidefinite programs (SDPs) common in robotics has remained limited. Unrolling a full-update conic solver such as COSMO exposes two obstacles that prior work on learned conic solvers has not: backpropagating through the per-iteration linear-system solve incurs memory quadratic in the problem size once the coefficient matrix is formed explicitly, and backpropagating through the positive semidefinite (PSD) cone projection becomes numerically unstable when eigenvalues coincide. We address the first obstacle with a matrix-free implicit differentiation rule that operates entirely through matrix-vector products, reducing memory from $O(n^2)$ to $O(n)$ and enabling backpropagation at scales where direct factorization runs out of memory. We address the second with a backward rule based on the Dalečkii–Krein representation of the Fréchet derivative, which remains well-defined under repeated eigenvalues. Together these make it possible to learn lightweight hyperparameter policies and warm-starts for a full-update conic solver. We evaluate on nonlinear covariance steering problems solved via sequential convex programming (SCP), as well as standalone SDPs and second-order cone programs ranging from max-cut and Lovász $\vartheta$ SDPs to robust estimation and control problems. The learned policies outperform state-of-the-art solvers across all problems, and can provide up to a $50\times$ speedup depending on the class. When used as a subroutine in SCP, the learned approach delivers over a $30\times$ speedup compared to COSMO.

[LG-96] Recursively Trained Diffusion Models: Limiting Collapse Distribution and Spectral Characterization

Link: https://arxiv.org/abs/2606.13796
Authors: Naïl B. Khelifa, Richard E. Turner, Ramji Venkataramanan
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Recursive training of generative models on their own outputs can lead to model collapse, a compounding drift away from the true data distribution. Existing theoretical works bound finite-round error accumulation in the context of diffusion models, but two questions remain open: what distribution does the recursion converge to, and how fast? We answer both, isolating a mechanism distinct from imperfect learning: even with perfect score estimation and exact sampling, the early stopping of the reverse diffusion (required for numerical stability) drives a progressive drift away from the data distribution. We prove that this recursion converges geometrically to a unique limiting distribution, which admits a closed-form characterization as an infinite mixture of increasingly Gaussian-smoothed versions of the data distribution. A Hermite spectral decomposition of this limit reveals that recursive training acts as a low-pass filter: higher-order modes, which encode fine non-Gaussian structure, are attenuated much more strongly than coarse modes. This spectral picture motivates annealed truncation schedules that progressively shrink truncation times across retraining rounds; we prove that any schedule converging to 0 asymptotically eliminates recursive compounding. Finally, we show our idealized characterization is robust: in the presence of discretization and score estimation errors, the learned distribution remains in a Wasserstein-2 ball around the ideal limit, with mode-dependent contraction rates that contract high-order errors faster than low-order ones. We validate the theory on synthetic Gaussian mixtures and CIFAR-10.

[LG-97] Conformal calibration and look-elsewhere effect in anomaly detection for new-physics searches

Link: https://arxiv.org/abs/2606.13780
Authors: Jack Y. Araz, Michael Spannowsky
Categories: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); Machine Learning (stat.ML)
Comments: 22 pages, 15 figures, 3 tables. Comments welcome

Abstract:Machine-learned anomaly detection is reshaping searches for new physics, but it has outrun the statistics used to interpret it. A raw anomaly score has no calibrated meaning, a model that scans many regions inflates the look-elsewhere effect, and the asymptotic significances the field relies on are blind to the background mismodelling that anomaly detectors are especially prone to. We propose a calibration layer, built on conformal prediction, that turns any anomaly score into a defensible significance with distribution-free, finite-sample guarantees. Conformal prediction converts scores into valid local p-values, weighted and Mondrian variants repair the sideband-to-signal-region exchangeability failures that resonant searches suffer, and a Gross-Vitells step carries the result through to a look-elsewhere-aware global significance. The layer does two things at once. It exposes miscalibration that the standard pipeline cannot see, and it corrects it without retraining the detector. On public LHC Olympics data, a classifier develops a substructure-mass correlation that makes sideband-calibrated background p-values anti-conservative. Taken at face value, this manufactures a $\sim 46\sigma$ excess from background sculpting alone, which the label-free weighted correction removes, restoring an honest null. When run as a blind wide-mass bump hunt, the standard asymptotic and unweighted procedures fabricate $\gtrsim 10\sigma$ excesses and $\approx 5\sigma$ excesses even in signal-free windows, while the conformal layer raises no false alarms and its global false-positive rate is verified on background-only pseudoexperiments. The result is an auditable, detector-agnostic path from an uncalibrated score to a trials-factor-aware significance, ready to be folded into experimental anomaly searches.
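The step from an anomaly score to a local p-value can be sketched with the standard unweighted conformal p-value, which ranks a test score against background-only calibration scores. This is a minimal illustration assuming exchangeable calibration data and "higher score means more anomalous". It does not cover the weighted or Mondrian variants or the Gross-Vitells step described in the abstract, and the scores and function name are hypothetical.

```python
def conformal_p_value(calibration_scores, test_score):
    """Unweighted conformal p-value: (1 + #{calibration >= test}) / (n + 1)."""
    n = len(calibration_scores)
    at_least_as_extreme = sum(1 for s in calibration_scores if s >= test_score)
    return (1 + at_least_as_extreme) / (n + 1)

# Hypothetical anomaly scores from a background-only calibration sample.
background = [0.10, 0.20, 0.25, 0.30, 0.40, 0.45, 0.50, 0.60, 0.70]

p_typical = conformal_p_value(background, 0.35)  # score in the bulk of the background
p_extreme = conformal_p_value(background, 0.95)  # score above every calibration point
```

The smallest attainable p-value is `1 / (n + 1)`, so the size of the calibration sample caps the local significance this construction can report.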

[LG-98] LoMC: Localized Multidirectional Correction for Refusal Suppression in Routed Foundation Models

Link: https://arxiv.org/abs/2606.13709
Authors: Yan Hong, Kedong Xiu, Wei Li, Jun Lan, Huijia Zhu, Shuheng Zhou, Zhongcai Lyu, Weiqiang Wang, Jianfu Zhang
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:We study controlled post-training refusal suppression in routed MoE and hybrid-MoE foundation models, aiming to increase non-refusal target-response behavior while preserving general capability under a compact intervention footprint. Existing broad direction-based edits can perturb general-purpose computation, whereas support-only expert edits often lack sufficient capacity to correct heterogeneous refusal representations. To address this limitation, we introduce Localized Multidirectional Correction (LoMC), a support-gated intervention framework that follows a support-then-correction execution order: it first identifies a compact edit support, then aggregates prototype correction directions into layer-wise correction directions, and finally applies rank-one layer-wise correction only within the selected support. By using the edit support as a structural gating constraint, LoMC increases correction capacity without expanding the intervention scope. Experiments on text-only and multimodal safety benchmarks across four routed backbones show that LoMC substantially improves non-refusal target-response behavior while maintaining general capability under a compact intervention footprint.
