Arxiv今日论文 | 2026-06-10

本篇博文主要内容为 2026-06-10 从Arxiv.org论文网站获取的最新论文列表，自动更新，按照NLP、CV、ML、AI、IR、MA六个大方向区分。

说明：每日论文数据从Arxiv.org获取，每天早上12:30左右定时自动更新。

提示: 当天未及时更新，有可能是Arxiv当日未有新的论文发布，也有可能是脚本出错。尽可能会在当天修复。

自然语言处理共115篇(Computation and Language (cs.CL))
人工智能共289篇(Artificial Intelligence (cs.AI))
计算机视觉共123篇(Computer Vision and Pattern Recognition (cs.CV))
机器学习共271篇(Machine Learning (cs.LG))
多智能体系统共13篇(Multiagent Systems (cs.MA))
信息检索共20篇(Information Retrieval (cs.IR))
人机交互共44篇(Human-Computer Interaction (cs.HC))

多智能体系统

[MA-0] LLM -Mediated Demand Response Coordination in Smart Microgrids

【速读】：该论文旨在解决智能微电网中产消者（prosumer）在策略性自利动机下自愿协同参与需求响应（Demand Response, DR）的协调难题，该问题在结构上等价于社交网络上的重复囚徒困境。其核心解决方案在于提出一种混合式多智能体仿真框架：通过大型语言模型（Large Language Model, LLM）生成结构化的需求响应指令，并由具有异质性的产消者智能体执行，这些智能体采用混合决策架构——即基于收益历史、邻居模仿与利用记忆的博弈论基础概率作为战略推理机制，同时结合LLM对传入协调信号进行语境化叙事评估以实现行为落地。这一设计有效克服了直接使用经人类反馈强化学习（Reinforcement Learning from Human Feedback, RLHF）对齐的LLM作为决策主体时所导致的强合作偏见问题，避免了在不同电网条件下出现动态平庸化的缺陷。实验表明，经结构化指令编译后，需求削减的合作率提升至33.3%，显著优于非结构化消息（27.0%）和无干预基线（28.0%），且优势在多种代理底层模型（真实与理想化）、不同抵抗水平（R = 0.1 至 0.7）及不同网络拓扑中均保持稳定；此外，通过高中心性节点靶向传播指令可进一步提升效果，验证了电网拓扑本身具备独立于信息内容的机制放大作用。研究结果表明，结构化LLM指令生成、基于现实的代理推理机制以及网络感知的目标投放策略是实现可扩展、可解释的城市级能源系统需求响应协同的关键互补原则。

链接: https://arxiv.org/abs/2606.11050
作者: J. de Curtò,I. de Zarzà
机构: 未知
类目: Multiagent Systems (cs.MA); Computer Science and Game Theory (cs.GT); Systems and Control (eess.SY)
备注: Accepted for publication in 18th International Conference on Sustainability in Energy and Buildings (SEB-26), to appear in Springer Nature proceedings (KES Smart Innovation Systems and Technologies). The final authenticated version will be available online at Springer

点击查看摘要

Abstract:Effective demand response in smart microgrids requires prosumers to cooperate voluntarily under strategic self-interest, a coordination problem structurally equivalent to a repeated Prisoner’s Dilemma on a social network. This paper presents a multi-agent simulation in which a Large Language Model (LLM) Influence Compiler issues structured demand-response directives to a population of heterogeneous prosumer agents, each governed by a hybrid decision architecture combining game-theoretic base probability (derived from payoff history, neighbour imitation, and exploitation memory) with LLM narrative evaluation of incoming coordination signals. The hybrid architecture resolves a key methodological challenge: LLMs aligned via Reinforcement Learning from Human Feedback (RLHF) exhibit strong cooperation bias when used as direct decision-makers, producing flat dynamics regardless of grid conditions. By separating strategic reasoning from grounded narrative evaluation, the model generates realistic prosumer behaviour across six personality archetypes, with baseline cooperation near 50% and clear differentiation under influence. Compiled structured directives achieve 33.3% demand-curtailment cooperation versus 27.0% for unstructured messaging and 28.0% for a no-intervention baseline ( \Delta_\mathrmcomp = +0.063 ), with the advantage preserved across both grounded and idealized agent substrates ( \Delta = +0.083 ) and across all resistance levels ( R = 0.1 to 0.7 ). Hub-targeted dissemination via high-centrality network nodes outperforms peripheral or random targeting, confirming that grid topology provides mechanistic amplification independent of message content. These results suggest that structured LLM compilation, grounded agent reasoning, and network-aware targeting are complementary design principles for scalable, interpretable demand-response coordination in smart-city energy systems.

[MA-1] Decentralized Multi-Agent Systems with Shared Context

【速读】：该论文旨在解决多智能体系统（Multi-agent Systems, MAS）在测试时扩展大语言模型（Large Language Models, LLM）推理能力过程中，因依赖中心化协调机制而导致的通信与集成瓶颈问题。现有方法通常由一个主控智能体负责任务分配、结果收集与整合，随着子任务数量增加，该中心节点成为性能瓶颈。为此，论文提出去中心化语言模型（Decentralized Language Models, DeLM）框架，其核心解决方案在于通过并行智能体、共享验证上下文（shared verified context）和任务队列实现去中心化协同。智能体异步争用子任务，读取累积进展，进行本地推理，并写回紧凑的验证更新；共享上下文作为统一通信基底，使智能体可基于彼此已验证的进展持续推进，无需将每次更新路由至中央控制器。实验表明，DeLM在软件工程测试时扩展与长上下文推理任务中均显著优于现有基线，SWE-bench Verified上平均准确率@1、通过率@2和@4分别提升最高达10.5个百分点，且每任务成本降低约50%；LongBench-v2多文档问答任务中，四种前沿模型族的平均准确率亦达最优，相较最强基线最高提升5.7个百分点。

链接: https://arxiv.org/abs/2606.10662
作者: Yuzhen Mao,Azalia Mirhoseini
机构: Stanford University (斯坦福大学)
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-agent systems (MAS) can scale large language model reasoning at test time by decomposing complex problems into parallel subtasks. However, most existing MAS rely on centralized orchestration, where a main agent assigns work, collects outputs, and merges results. As the number of subtasks grows, this controller becomes a communication and integration bottleneck. We propose Decentralized Language Models (DeLM), a MAS framework that decentralizes coordination through parallel agents, a shared verified context, and a task queue. Agents asynchronously claim subtasks, read accumulated progress, perform local reasoning, and write back compact verified updates. The shared context acts as a common communication substrate, enabling agents to build on one another’s verified progress without routing every update through a central controller. Empirically, DeLM improves both software-engineering test-time scaling and long-context reasoning. On SWE-bench Verified, DeLM achieves the best performance across Avg.@1, Pass@2, and Pass@4, with gains of up to 10.5 percentage points over the strongest baseline, while reducing cost per task by roughly 50%. On LongBench-v2 Multi-Doc QA, DeLM achieves the highest average accuracy across four frontier model families, improving over the strongest baseline by up to 5.7 percentage points. The code is available on our project website at this https URL.

[MA-2] SkillAxe: Sharpening LLM -Authored Agent Skills Through Evaluation-Guided Self-Refinement

【速读】：该论文旨在解决大语言模型（Large Language Model, LLM）在生成可执行技能文档（Skill Document）时表现不佳的问题，即尽管技能文档是现代智能体框架中的关键组件，但由LLM自动生成的技能往往缺乏实际有效性。现有基准测试（SkillsBench）显示，人工编写的技能可使通过率提升16.2个百分点，而LLM生成的技能则无显著增益。为应对这一挑战，论文提出SkillAxe——一种完全无监督的迭代式自我诊断与优化框架。其核心创新在于将技能质量分解为四个可解释维度：质量影响（quality impact）、触发精度（trigger precision）、指令合规性与故障归因（instruction compliance with fault attribution）、解决方案路径覆盖度（solution-path coverage），并据此生成无需真实标签、测试套件或环境奖励信号的结构化改进建议。实验表明，SkillAxe相较于原始LLM技能实现了28%的相对通过率提升，并缩小了47%至67%的人工技能差距；在真实场景下的SpreadsheetBench验证中，基于SkillAxe构建的技能库通过学习历史智能体轨迹，仅用22个技能便将通过率从16.0%提升至52.0%，证明其具备持续优化能力。因此，该方案的关键在于通过可解释的质量评估维度实现无需外部监督的闭环自我改进机制。

链接: https://arxiv.org/abs/2606.10546
作者: Srishti Gautam,Arjun Radhakrishna,Sumit Gulwani
机构: Microsoft(微软)
类目: Multiagent Systems (cs.MA)
备注: 9 pages, under review

点击查看摘要

Abstract:Skill documents, structured natural-language instructions that guide Large Language Model (LLM) agents, are critical to modern agent frameworks, yet LLMs struggle to write skills that actually work. On SkillsBench, human-authored skills improve pass rates by 16.2 percentage points, while LLM-authored skills provide no measurable gain. We introduce SkillAxe, a fully unsupervised framework that enables LLMs to iteratively diagnose and refine their own skills. SkillAxe decomposes skill quality into four interpretable dimensions (quality impact, trigger precision, instruction compliance with fault attribution, and solution-path coverage), producing structured improvement briefs that require no ground-truth labels, test suites, or environment rewards. On SkillsBench, SkillAxe improves pass rates by 28% relative over unimproved LLM skills and closes 47–67% of the gap to human-authored skills. We validate the approach as a continuous improvement engine in the wild on SpreadsheetBench, where a SkillAxe-built skill library learns from past agent trajectories and raises pass rate from 16.0% to 52.0% using only 22 skills.

[MA-3] Decoupling Thought from Speech: Knowledge-Grounded Counterfactual Reasoning for Resilient Multi-Agent Argumentation

【速读】：该论文旨在解决多智能体辩论框架在长期交互过程中因持续扰动导致的逻辑退化、论点重复和角色漂移等问题，这些问题严重削弱了生成过程的稳定性与可信度。现有方法虽能提升最终输出的准确性，但忽视了过程本身的稳健性。其解决方案的关键在于提出一种名为知识引导的反事实推理（Knowledge-Grounded Counterfactual Reasoning, KG-CFR）的双阶段架构，通过严格分离私有检索增强的规划缓冲区（planning buffer）与公开执行层（execution layer），实现“意图-行动”间的结构解耦，从而防止身份丢失并维持过程保真度。在动态不确定性资源分配（DRAU）环境中，该框架在270次随机扰动模拟中将关键后冲击退化率（质量下降Δ ≤ -0.20）控制在5%以下，整体论证质量从0.694提升至0.822。此外，研究引入自定义向量度量以量化话语分歧与计划-执行对齐程度，提供了操作稳定性的方向一致性证据；消融实验进一步表明，恰当的教义性知识锚定在维持论证质量方面与前瞻性规划具有同等重要性。

链接: https://arxiv.org/abs/2606.10475
作者: Jakub Masłowski,Jarosław A. Chudziak
机构: Warsaw University of Technology (华沙理工大学)
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted for publication in the Proceedings of the 30th International Conference on Knowledge-Based and Intelligent Information Engineering Systems (KES 2026)

点击查看摘要

Abstract:Multi-agent debate frameworks have been shown to improve large language model performance in convergent tasks, but they are currently optimized in a way that heavily favors final output accuracy rather than stability of the process. During long-horizon exchanges reactive systems under sustained perturbations often experience logic degradation, argument repetition, and role drift. To structurally prevent the identity loss and maintain the process fidelity, we introduce Knowledge-Grounded Counterfactual Reasoning (KG-CFR), a dual-stage architecture that enforces a strict separation of concerns between a private, retrieval-augmented planning buffer, and a public execution layer. We assess this system in Dynamic Resource Allocation under Uncertainty (DRAU), a dedicated 1v1v1 environment, introducing diversity as distinct from standard debate settings. Over 270 completely factorial crisis simulation trajectories with stochastic environmental shocks, KG-CFR prevents judge-detected critical post-shock degradation (defined as a quality shift, \Delta \le -0.20 ) in more than 95% of perturbed runs, increasing the overall argument quality from 0.694 to 0.822. Our primary contribution is the demonstration of architectural decoupling being an important factor of systemic resilience enhancement under sustained pressure without quality loss. Furthermore, we introduce custom vector metrics for discourse divergence and plan-execution alignment that provide strong, directionally consistent evidence of operational stability. Our ablation experiments suggest that the proper doctrinal grounding can be an equally important factor for argument quality, as the prospective planning. KG-CFR, according to our initial metric evaluations, reduces semantic looping, by preserving the agent’s consistency with the original plan.

[MA-4] Game-Theoretic Multi-Agent Control for Robust Contextual Reasoning in LLM s

【速读】：该论文旨在解决大语言模型（Large Language Models, LLMs）在多轮交互中因上下文动态演化而面临的提示注入（prompt-injection）与上下文污染（context-poisoning）攻击问题。现有防御机制通常仅对单轮输出进行过滤，忽视了上下文在多轮对话中的演进过程，导致长期推理轨迹易受渐进式恶意片段干扰。针对这一挑战，本文提出博弈论安全的模型上下文协议（Game-Theoretic Secure Model Context Protocol, GT-MCP），其核心创新在于将上下文管理建模为一个闭环动态过程，并引入由控制器驱动的多智能体协同机制。关键解决方案包括：通过信任函数联合评估因果一致性（基于已验证的上下文图）、多代理间的语义一致性以及随时间分布漂移情况，实现对输出的动态筛选；当检测到上下文不稳定时，启动基于回滚的自愈机制，恢复至已验证状态并阻止不可靠片段传播。实验结果表明，在适应性对抗威胁模型下，99.6%的交互回合中上下文漂移被有效控制，仅0.4%需触发恢复；每轮效用高度集中（中位数=-0.19，第5百分位=-0.72，第95百分位=0.30），严重降级（低于-1）事件占比仅0.4%，且无任何注入攻击在控制器层面成功，同时保持稳定胜率超98%及可预测的计算开销（每标记延迟1.63×10⁻³秒）。

链接: https://arxiv.org/abs/2606.10322
作者: Saeid Jamshidi,Amin Nikanjam,Arghavan Moradi Dakhel,Kawser Wazed Nafi,Foutse Khomh
机构: Polytechnique Montréal (蒙特利尔工程学院); Huawei Distributed Scheduling and Data Engine Lab (华为分布式调度与数据引擎实验室)
类目: Cryptography and Security (cs.CR); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) in multi-turn interactions maintain evolving context rather than generating isolated responses, making them vulnerable to prompt-injection and context-poisoning attacks in which locally plausible adversarial fragments gradually distort reasoning trajectories. Existing defenses mainly filter individual outputs and often ignore context evolution across turns, leaving long-horizon reasoning exposed. Although the Model Context Protocol (MCP) standardizes context exchange and tool invocation, it functions as a passive routing layer and does not enforce contextual stability. To address these limitations, we introduce the Game-Theoretic Secure Model Context Protocol (GT-MCP), a controller-driven multi-agent method that treats context management as a closed-loop dynamical process. GT-MCP coordinates three heterogeneous LLM agents and selects outputs through a trust function that jointly evaluates causal consistency against a validated context graph, semantic agreement among agents, and distributional drift over time. When instability is detected, a rollback-based self-healing mechanism restores the validated context and prevents unsupported fragments from propagating. Empirical evaluation over 500 interaction turns under an adaptive adversarial threat model shows that contextual drift remains bounded in 99.6% of turns, with recovery required in only 0.4%. Per-turn utility remains tightly concentrated, with median = -0.19, P05 = -0.72, and P95 = 0.30; severe degradation below -1 occurs in only 0.4% of cases, and no injection attempt succeeds at the controller level. Selected outputs maintain stable win rates above 98%, and computational overhead remains predictable, with latency per token = 1.63e-3 s.

[MA-5] What Spatial Memory Must Store: Occlusion as the Test for Language-Agent Memory

【速读】：该论文旨在解决生成式智能体在复杂环境中的记忆系统如何有效利用空间几何信息以提升记忆召回性能的问题。传统方法将空间邻近性通过线性加权融合至记忆优先级，但实验表明这种默认策略不仅未能提升效果，反而可能损害表现（预注册回忆实验中，混合权重组的平均Delta-Hit@5为-0.0375，p=0.306，处于无坐标感知的基线水平）。其核心解决方案在于确立“几何主导”权重机制：当查询场景具有空间属性时，基于坐标的几何权重显著优于传统方法（+0.3208，p<10⁻¹⁵），证明空间结构应作为记忆检索的首要依据。进一步的关键突破在于区分“记忆召回”与“可见性判断”两个概念——前者需具备遮挡鲁棒性（即使被墙遮挡也能正确回忆下一房间），后者则依赖对存储几何的实时感知计算。研究引入一条重用智能体已有视线射线的射线-体素数字微分分析器（ray-versus-voxel DDA），实现对隐藏目标的精确可见性判断：文本与视野锥体仅能识别0.000的隐藏目标，而结合视野锥体与DDA后准确率达0.982（McNemar检验，p<10⁻⁶）；同时，坐标召回可有效区分近似重复位置，超越余弦相似度（1.000 vs 0.533，n=150）。该可见性判定模块在预注册的实时验证实验（SPMEM-OCC-LIVE-v1）中得到证实，涵盖八种脚本化世界、96个遮挡目标，错误可见率从1.000降至0.000，综合精确McNemar检验结果为p=2.5×10⁻²⁹，且成功暴露并修复了真实通信锚点缺陷。尽管“遮挡需要几何信息”看似显而易见，但本文的核心贡献在于实现了对该假设的量化测量与机制解耦，明确划分了空间记忆必须存储的内容与读取方式，为后续冻结确认研究（SPMEM-ZERO-REAL-PREREG-v1）奠定基础，完整的人类作者多世界盲评研究仍待未来开展。

链接: https://arxiv.org/abs/2606.10299
作者: Doeon Kwon,Junho Bang
机构: Space Zero, Inc.(Space Zero, Inc.)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
备注: 23 pages, 6 figures

点击查看摘要

Abstract:Language-agent “memory palace” systems anchor each memory to a world coordinate, on the intuition that geometry adds something text cannot. We make that intuition testable and report three results. First, the memory-palace default of folding spatial proximity into a linear blend beside recency and importance does not help and can hurt: in a pre-registered recall experiment the shipped blend fails its own frozen test (mean Delta-Hit@5 -0.0375, Wilcoxon p=0.306), sitting at a position-blind baseline, while a geometry-led weighting wins decisively (+0.3208, p10^-15): geometry must lead recall when the query regime is spatial. Second, memory recall and visibility must be separated: recall is occlusion-blind by design (you correctly remember the next room behind a wall), while visibility is a perception predicate over stored geometry that the live system never computed. A one-line ray-versus-voxel digital differential analyzer (DDA), re-pointed from the gaze ray the agent already casts, supplies it: text and the live FoV cone both score 0.000 on 849 behind-wall targets while cone-plus-DDA reaches 0.982 (exact McNemar p10^-6); coordinate recall separately resolves near-duplicate locations a cosine null cannot (1.000 vs 0.533, n=150). Third, the visibility predicate is confirmed live under a git-committed pre-registration (SPMEM-OCC-LIVE-v1: eight scripted worlds, automated oracle scoring, 96 behind-wall targets, false-visible 1.000-0.000, pooled exact McNemar p=2.5x10^-29), a run that surfaced and fixed a real relay anchor defect. We concede that occlusion-needs-geometry is near-tautological; the contribution is the measurement and isolation, separating what spatial memory must store from how it is read. These pilots power a frozen confirmatory study (SPMEM-ZERO-REAL-PREREG-v1); the full human-authored multi-world study with blind raters remains future work.

[MA-6] Deployment-Time Memorization in Foundation-Model Agents ICML

【速读】：该论文旨在解决长期运行的生成式AI代理（Generative AI Agent）在部署阶段对用户记忆的管理问题，即如何在个性化服务效能、信息泄露风险与数据删除可靠性之间实现平衡。其核心挑战在于，传统方法仅关注模型参数化的记忆或静态记忆配置的审计，而未系统分析记忆设计选择（如摘要强度、检索广度k值、删除模式）对个性化召回率（Personalization Recall, PR）、对抗性信息提取率（Adversarial Extraction Rate, AER）及删除保真度的联合影响。为此，作者提出将代理记忆建模为一个隐私-效用前沿面，并引入个人化召回率（PR）与对抗提取率（AER）作为关键评估指标，系统考察三个可调设计参数：摘要激进程度、检索范围（k）和删除策略。同时，提出遗忘残留得分（Forgetting Residue Score, FRS）以量化被删除信息在衍生记忆层级中仍可恢复的程度。实验结果表明，在LongMemEval基准上，关键事实摘要可使Gemma 3 12B的“诱饵”信息提取率降低76%、GPT-4o-mini降低64%，且几乎不损失个性化召回；但一旦内容被压缩，增加检索广度k也无法恢复信息泄露。然而，仅基于原始数据的删除方式仍会导致约20%的摘要副本可被恢复，唯有全管道清除或墓碑式擦除才能实现最底层残留归零。因此，研究强调：持久化代理记忆必须作为一级记忆机制进行评估——需综合考量其提升代理记忆的能力、诱发信息提取的风险以及真正实现数据擦除的可靠性。

链接: https://arxiv.org/abs/2606.10062
作者: Lei (Rachel)Chen,Guilin Zhang,Kai Zhao,Dalmo Cirne,Andy Olsen,Xu Chu,Zeke Miller,Alet Blanken,Amine Anoun,Jerry Ting
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 4 pages, ICML MemFM 2026 Workshop

点击查看摘要

Abstract:Foundation-model agents are increasingly long-lived systems that remember users across interactions, making memorization an explicit deployment-time function rather than solely a property of model weights. Existing work addresses parametric memorization or audits fixed memory configurations, but does not characterize how memory-design choices jointly shape personalization utility, extraction risk, and deletion fidelity. We study this surface as deployment-time memorization, formulating agent memory as a privacy-utility frontier measured by Personalization Recall (PR) and Adversarial Extraction Rate (AER), and sweeping three memory-design knobs: summarization aggressiveness, retrieval breadth (k), and deletion mode. We further introduce the Forgetting Residue Score (FRS) to quantify whether deleted information remains recoverable from derived memory tiers. On LongMemEval, key-fact summarization reduces canary extraction by 76% on Gemma 3 12B and 64% on GPT-4o-mini while preserving nearly all personalization recall; critically, once content is compressed away, increasing k no longer restores leakage. The same compression, however, induces a deletion-fidelity failure: raw-only deletion leaves derived summary copies recoverable in approximately 20% of instances, and only full-pipeline purge or tombstone redaction drives worst-tier residue to zero. Together, these results establish that persistent agent memory must be evaluated as a first-class memorization mechanism – assessed by what it helps agents recall, what it makes extractable, and what it can truly erase.

[MA-7] Co-GLANCE: Uncertainty-Aware Active Perception for Heterogeneous Robot Teaming

【速读】：该论文旨在解决异构机器人团队在非结构化室外环境中因感知不确定性（perceptual uncertainty）导致的场景理解不可靠问题。感知不确定性主要源于遮挡等现象，且在不同机器人视角下表现形式受场景结构影响而异，需结合场景上下文推理与机器人能力感知进行协同决策。现有基于视觉-语言模型（vision-language models）的方法虽具备强大的语义先验能力，但其计算开销大，难以实现在机（onboard）实时推理，且缺乏校准的不确定性量化机制。为此，论文提出Co-GLANCE系统，通过将视觉-语言模型的语义推理能力蒸馏至端到端轻量级模型，实现对遮挡分割与机器人分配的实时在机处理，避免依赖云端计算。为实现可校准的不确定性量化，Co-GLANCE融合置信性预测（conformal prediction）与选择性拒答（selective abstention），为分割、机器人分配及检测输出提供统计上有效的覆盖保证，并以此驱动主动感知，自动调度最优机器人前往信息量最大的视角以消除不确定性。在真实场景测试中，Co-GLANCE在遮挡分割和机器人分配准确率上分别较云端视觉-语言模型基线提升25%和36%，同时将每帧推理延迟降低350倍。研究还发布了首个空地协同的多模态数据集以支持后续研究。

链接: https://arxiv.org/abs/2606.09919
作者: Michal P. Podolinsky,Neel P. Bhatt,Pranay Samineni,Rohan Siva,Christian Ellis,Ufuk Topcu
机构: The University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Robotics (cs.RO)
备注: Code, videos, and dataset available at this https URL

点击查看摘要

Abstract:Perceptual uncertainty is a central challenge for heterogeneous robot teams operating in unstructured outdoor environments, where no single viewpoint affords reliable scene understanding. Perceptual uncertainty, arising from sources such as occlusions, manifests differently across robot viewpoints depending on scene structure. Detecting and resolving sources of perceptual uncertainty requires both scene-based contextual reasoning and capability-aware robot allocation. While vision-language models provide strong semantic priors for both, they are computationally prohibitive for onboard inference and lack calibrated uncertainty quantification. We introduce Co-GLANCE, a real-time onboard perception and decision-making system for uncertainty resolution in heterogeneous robot teams. Co-GLANCE distills the semantic reasoning capabilities of a vision-language model into an end-to-end model for occlusion segmentation and robot allocation, eliminating the need for cloud-based inference. To quantify perceptual uncertainty, Co-GLANCE combines conformal prediction with selective abstention to provide statistically valid coverage guarantees for segmentation, robot allocation, and detection outputs. These calibrated uncertainty estimates directly trigger active perception, dispatching the most appropriate robot to acquire informative viewpoints and resolve uncertainty. Across real-world scenarios, Co-GLANCE outperforms cloud-based vision-language model baselines in occlusion segmentation and robot allocation accuracy by 25% and 36%, respectively, while reducing per-frame inference latency 350x. We also release an air-ground dataset for future research. Code, videos, and dataset available at this https URL .

[MA-8] RAPS: Therapeutic Response Analysis via Pathway-informed Stratification

【速读】：该论文旨在解决癌症治疗决策中多临床维度（靶向分子治疗、放疗及6个月生存率预测）协同建模的难题，现有路径引导型深度学习模型多为孤立开发与评估，导致不同架构间难以公平比较。其解决方案的关键在于构建首个统一的路径引导治疗反应建模基准，首次在相同数据与评估条件下，联合训练并对比三种生物信息学启发的深度学习架构（BINN、GraphPath、PATH）在五个来自TCGA的癌症队列（共2,622名患者）上的表现，采用Reactome通路活性评分进行编码。研究发现：单一模型无法在所有任务中全面占优——PATH在靶向治疗预测上整体最优，BINN在生存预测中最为稳健，而所有模型对放疗决策均无显著预测能力，表明放疗决策的关键驱动因素为基因表达数据未捕获的临床变量。尤为突出的是，GraphPath在前列腺癌靶向治疗预测中达到0.92的AUROC，为整个基准测试最高分，揭示当样本队列具有狭窄可靶向驱动程序时，通路间的横向共调控结构即使在极端类别不平衡（阳性率仅11%）条件下仍具备卓越判别力。

链接: https://arxiv.org/abs/2606.09898
作者: Sujoy Banik,Sayantan Chakraborty,Boishakhi Das Toma,Zainab Ghafoor,Ushashi Bhattacharjee,Koushik Howlader,Tirtho Roy
机构: Rajshahi University of Engineering & Technology (拉杰沙希工程与技术大学); University of Dhaka (达卡大学); American International University of Bangladesh (孟加拉国国际大学); Sonoma State University (索诺玛州立大学); Iowa State University (爱荷华州立大学)
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:Cancer treatment planning requires decisions across multiple clinical dimensions at once. Clinicians must determine whether a patient should receive targeted molecular therapy, radiation therapy, and whether they are likely to survive beyond six months. Existing pathway-informed deep learning models have been developed and tested in isolation, making fair comparison across architectures impossible. We present the first unified benchmark for pathway-guided therapy response modeling, evaluating three biologically informed architectures, BINN, GraphPath, and PATH, across five cancer cohorts drawn from The Cancer Genome Atlas, representing 2,622 patients encoded using Reactome pathway activity scores. Each model is trained jointly on all three clinical outcomes under identical data and evaluation conditions, the first study to treat pathway-structured deep learning as a combined therapy and survival prediction problem. Our results show that no single architecture wins across all tasks: PATH performs best for targeted molecular therapy prediction overall, BINN is most reliable for survival prediction, and no model produces useful predictions for radiation therapy, as the key drivers of that decision are clinical variables not captured in gene expression data. Most strikingly, GraphPath achieves an AUROC of 0.92 on prostate targeted molecular therapy prediction, the highest score in the entire benchmark, demonstrating that lateral co-regulation structure produces exceptional discriminative power when matched to a cohort with a narrow targetable driver programme, even under conditions of extreme class imbalance at only 11% positive prevalence.

[MA-9] Failure Modes of Deep Multi-Agent RL in Asynchronous Pricing: Reproducible Triggers Trace Diagnostics and a Partial Fix

【速读】：该论文旨在解决深度多智能体强化学习（Deep Multi-Agent Reinforcement Learning, MARL）在连续时间定价市场中出现的两种可复现的失效模式：其一为竞争性DDPG智能体间隐性卡特尔（tacit cartel）的形成；其二为高事件率下智能体的“演员-评论家”（actor-critic）架构不稳定性。其核心解决方案在于通过引入异步更新与观测延迟（observation latency δ），缓解卡特尔行为。关键发现是，仅采用异步机制即可使协同指数Δ降低48%，进一步引入延迟可将其降至Δ = 0.28（仍高于伯特兰均衡水平），但该修复措施存在局限性：其效果非单调依赖于δ，且无法应对第二类失效模式——当事件率λ达到5时，评论家网络发生发散，导致相图中(λ=5, δ=1)区域的系统崩溃。研究同时提出轨迹级诊断工具，揭示了智能体在单个回合内信号传递的瓦解及冲击后无法恢复的行为特征。

链接: https://arxiv.org/abs/2606.09884
作者: Shree Murthy,Rohan Pandey
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Econometrics (econ.EM)
备注:

点击查看摘要

Abstract:We study two reproducible failure modes of deep multi-agent reinforcement learning in continuous-time pricing markets: (i) tacit cartel formation between competing DDPG agents, and (ii) actor–critic instability at high event rates. We instantiate both inside a single CT-MARL benchmark (Poisson-clocked price updates, observation latency \delta , interior-optimum logit demand), show that synchronous DDPG agents reliably trigger Failure Mode 1 with collusion index \Delta = 0.69 \pm 0.11 , and quantify a partial microstructure fix: asynchrony alone cuts collusion by 48% and adding latency drives it to a minimum of \Delta = 0.28 . The fix has clearly documented costs: it is partial ( \Delta remains supra-Bertrand), it is non-monotone in \delta , and it does not survive Failure Mode 2, which emerges as DDPG critic divergence at \lambda = 5 and corrupts the phase-diagram cell at (\lambda=5, \delta=1) . We accompany the scalar collusion index with trajectory-level trace diagnostics that expose the within-episode signalling collapse and the post-shock non-recovery.

[MA-10] LLM -Based Code Documentation Generation and Multi-Judge Evaluation

【速读】：该论文旨在解决高质量源代码文档在关键领域（如医疗健康）中普遍缺失或质量不足的问题，尤其是在对可靠性与可维护性要求极高的场景下。其核心挑战在于如何自动化生成结构清晰、上下文感知且准确可靠的代码文档。解决方案的关键在于构建一个基于PocketFlow编排框架的AI驱动框架，通过集成八种先进的大语言模型（Large Language Models, LLMs），结合模块化处理流程与优化的提示工程（prompt engineering），实现从代码及代码库中自动生成高质量文档。为确保输出质量并指导模型选型，研究提出了一种名为MultiLLM-as-Judges的评估框架，利用四个独立的LLM在九个维度（如完整性、清晰性、忠实性等）上对生成结果进行多角度评判。实验基于开源医学物理库开展，结果显示顶级模型与表现最差模型之间存在42%的性能差距。通过融合多模型输出、精细化提示设计与严格评估机制，该方法显著提升了文档质量，同时大幅减少人工编写负担，尤其适用于安全敏感型医疗软件系统。

链接: https://arxiv.org/abs/2606.09852
作者: Ikbel Ghrab,Mohamed Dhieb,Ismail Khenissi,Ines Abdeljaoued-Tej
机构: University of Tunis El Manar (突尼斯大学埃尔曼纳尔分校); National Institute of Applied Sciences of Tunis (突尼斯应用科学国立学院)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Software Engineering (cs.SE)
备注: ICAHS, \c{opyright} 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

点击查看摘要

Abstract:High-quality source code documentation is vital yet often neglected, especially in critical domains like healthcare where reliability and maintainability are essential. We presented an AI powered framework that automates documentation generation from code and repositories using eight state of the art Large Language Models (LLMs), including GPT, Gemini, Qwen, and LLaMA variants. Built on the PocketFlow orchestration framework, the system applies modular pipelines and advanced prompt engineering to produce structured, context aware documentation. To ensure quality and guide model selection, we introduced a MultiLLMasJudges evaluation framework, where four independent LLMs assess outputs across nine criteria, such as Completeness, Clarity, and Faithfulness. Experiments conducted on an open-source medical physics library, demonstrated showed a 42% performance gap between top and bottom models. By combining diverse model outputs, optimized prompting, and rigorous evaluation, our approach enhances documentation quality and reduces manual effort, especially in safety critical healthcare software.

[MA-11] Envisioning Sensemaking in Multi-Human Multi-Agent Collaborative Knowledge Work

【速读】：该论文旨在解决生成式 AI（Generative AI）在协同知识工作中对意义建构（sensemaking）过程的冲击问题，尤其关注其如何改变团队成员之间解释性劳动的分工、信任机制及共享理解的形成。其核心挑战在于，当前生成式 AI 系统在自动执行摘要、整合与主题归类等传统上由人类完成的解释性任务时，往往模糊了贡献来源与推理过程，削弱了协作中的可追溯性与透明度。解决方案的关键在于提出一套面向多人类-多智能体协同意义建构的设计原则：动态多层次信息表征、主动识别并弥合理解鸿沟、对信息的批判性参与、可验证性以及责任可追溯性。基于这些原则，论文构建了一个动态共享表征工作空间的概念框架，通过“伙伴代理”“共享空间代理”和“协调代理”的协同机制，实现人与专用 AI 代理共同收集证据、构建模型、提出假设并推进协作目标，同时完整保留贡献的溯源信息与解释演化路径，从而支持连贯、协商一致的知识建构过程，有效弥补现有生成式 AI 系统在透明性与协作支持方面的不足。

链接: https://arxiv.org/abs/2606.09840
作者: Zhitong Guan,Soo Young Rieh
机构: The University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
备注: This is the Author’s Accepted Manuscript version of the article: Guan, Z., \ Rieh, S. Y. (2026). Envisioning Sensemaking in Multi-Human, Multi-Agent Collaborative Knowledge Work. Accepted for publication in \textit{Sensemaking @ CHI 2026}

点击查看摘要

Abstract:Sensemaking is central to knowledge work, where people search, evaluate, interpret, and use information over time to construct durable understanding. The rise of generative AI has begun to reshape this process: GenAI systems now perform interpretive functions such as summarization, synthesis, and thematic grouping that knowledge workers have traditionally carried out themselves. In collaborative settings, these shifts compound, complicating how teams divide interpretive labor, trust one another’s contributions, and negotiate shared understanding. In this position paper, we examine how GenAI reshapes sensemaking in collaborative knowledge work and propose five design principles for multi-human, multi-agent collaborative sensemaking: dynamic multi-layer information representations, active identification and bridging of gaps in understanding, critical engagement with information, verifiability, and accountability. Building on these principles, we introduce a conceptual framework for a dynamic shared representational workspace in which knowledge workers and specialized AI agents jointly gather evidence, schematize, hypothesize, and pursue collaborative goals. Through a partner agent, a shared space agent, and an orchestrator agent, the framework preserves the provenance and authorship of contributions and traces the evolution of both individual and shared interpretations, supporting coherent, negotiated knowledge construction that current generative AI systems tend to obscure.

[MA-12] RocketSmith: An Agent ic System for High-Powered Rocket Design and Manufacturing

【速读】：该论文旨在解决高性能火箭设计、制造与优化过程中高度依赖人工经验、流程冗长且难以实现高效迭代的问题。其核心挑战在于如何实现从概念设计到飞行验证的全流程智能化自动化，尤其在复杂多变量参数下的飞行稳定性分析与结构优化方面存在显著瓶颈。解决方案的关键在于构建一个名为RocketSmith的智能体系统（agentic system），该系统通过一系列子智能体（subagents）与技能模块（skills），实现了对软件工具的智能调用与协同，能够自动完成参数化设计生成、飞行稳定性验证及多轮迭代优化。系统支持零样本（zero-shot）与人机协同（human-in-the-loop）两种工作流模式，结合增材制造（additive manufacturing）的自由设计能力，成功开发出四款不同发动机与装配配置的高性能火箭。实验结果表明，所有火箭均实现稳定发射，其中两枚成功回收并具备重复使用条件；基于飞行数据的验证显示，实测顶点高度（apogee）与仿真计算值之间的预测准确率达84%，充分证明了该系统在提升研发效率与可靠性方面的有效性。

链接: https://arxiv.org/abs/2606.00097
作者: Peter Pak,Jesse Barkley,Rumi Loghmani,Derek Baich,Ananya Pamal,Amir Barati Farimani
机构: Carnegie Mellon University (卡内基梅隆大学); Pittsburgh Prefecture One (匹兹堡第一自治市)
类目: Robotics (cs.RO); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:This work presents RocketSmith, an agentic system capable of the design, manufacturing, and optimization processes in high powered rocket development. The system enables the intelligent automation of software tools as to not only validate factors such as flight stability but also generate the parametric design components for the rocket assembly. A collection of subagents and skills enable optimization workflows of flight parameters via iteration in both zero-shot and human-in-the-loop workflows. With this system, four distinct high power rockets with various motor and assembly configurations were developed utilizing the unique design capabilities of additive manufacturing. These assembly components were fabricated using various FDM printers, manually evaluated for flight readiness, and flight tested at a launch event. From these tests, all rockets achieved a stable launched and two of the four rockets were successfully recovered in reflyable condition. Within the collected flight data, an 84% accuracy was achieved when comparing measured apogee to that calculated in flight simulations.

自然语言处理

[NLP-0] A Unifying Lens on Supervised Fine-Tuning Through Target Distribution Design

【速读】：该论文旨在解决监督微调（Supervised Fine-Tuning, SFT）中因严格对齐单一独热目标（one-hot target）而导致的次优问题，尤其当预训练模型已蕴含丰富先验知识时，这种硬性匹配可能忽略上下文语义一致性与潜在噪声。其核心解决方案在于重新将SFT视为目标分布设计（target distribution design）：通过引入Q-target框架，显式分解SFT监督信号为两个关键决策——（1）对观测到的标记的置信度权重，以及（2）剩余概率质量在其他候选标记间的分配策略。该视角统一了多种现有SFT变体作为隐式的目标分布选择，并在此基础上提出Target-SFT方法，直接基于期望的目标分布构建训练目标。实验表明，该方法在十组不同推理数据集-模型组合上均表现更优，验证了以目标为导向的设计范式在SFT中的有效性。整体而言，该研究揭示了SFT训练更根本的设计原则，并拓展了目标函数的探索空间。

链接: https://arxiv.org/abs/2606.11189
作者: Tong Xie,Yuanhao Ban,Yunqi Hong,Sohyun An,Yihang Chen,Cho-Jui Hsieh
机构: University of California, Los Angeles (UCLA); Arena
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Supervised fine-tuning (SFT) typically maximizes the likelihood of every token in a demonstrated trajectory. However, an observed token can be non-unique, noisy, or misaligned with the model prior. Strictly fitting toward this one-hot target may be suboptimal, especially when the pretrained model encodes a rich knowledge prior. In this work, we reinterpret SFT as target distribution design: instead of studying only the loss objective, we analyze the token-level target that the loss drives the model to match. We introduce the Q-target framework, which decomposes SFT supervision into two explicit choices: (1) how strongly to rely on the observed token, and (2) how to allocate the remaining probability mass over alternatives. This perspective unifies many existing SFT variants as implicit choices of the target distribution Q. Building on this view, we propose Target-SFT which constructs the training objective directly from the desired target distribution. This method consistently outperforms across the ten reasoning dataset-model settings evaluated, showing the effectiveness of this target-based approach. Overall, our formulation reveals a more fundamental design principle for SFT training and opens a broader search space for SFT objectives.

[NLP-1] Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models

【速读】：该论文旨在解决全双工语音对话模型在实际交互中存在的互动性不足问题，具体表现为过度沉默、发言时机不当等行为，这些问题源于现有模型仅通过基于词元级似然最大化的监督学习进行训练，未能直接优化交互层面的行为表现。其解决方案的关键在于提出一种后训练对齐方法，采用强化学习（Reinforcement Learning, RL）全面提升模型在四个核心互动维度——停顿处理、轮次切换、回应性反馈（backchanneling）及用户打断响应——上的表现。针对每个维度，从人类对话语料中提取短音频片段，并设计相应的轴向特定奖励函数进行优化；同时引入基于大语言模型（Large Language Model, LLM）的响应质量奖励，以防止语义退化。该方法在两个开源模型（Moshi 和 PersonaPlex）上验证，显著提升了离线预录音频评估与实时多轮对话场景下的互动性表现。

链接: https://arxiv.org/abs/2606.11167
作者: Atsumoto Ohashi,Neil Zeghidour,Alexandre Défossez,Eugene Kharitonov
机构: Kyutai(京都台); Gradium(格拉迪姆)
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Full-duplex spoken dialogue models can listen and speak simultaneously, making them a promising architecture for natural conversation. However, current models are trained solely with supervised learning through token-level likelihood maximization, which does not directly optimize interaction-level behaviors, causing interactivity issues such as excessive silence and ill-timed turn-taking. Recent work has applied reinforcement learning (RL) to improve interactivity, but existing methods address only a limited set of interactive behaviors in their rewards. In this work, we propose a post-training alignment method that comprehensively improves the interactivity of full-duplex spoken dialogue models through RL. We address the four canonical axes of interactivity: pause handling, turn-taking, backchanneling, and user interruption. For each axis, we extract short audio segments from human conversation corpora and optimize the model with axis-specific reward functions. An extra LLM-based reward for response quality prevents semantic degradation. We apply our method to two open-source models, Moshi and PersonaPlex, demonstrating consistent improvements in interactivity on both offline evaluation with pre-recorded audio and real-time multi-turn dialogue evaluation.

[NLP-2] Provenance-Grounded Gating and Adaptive Recovery in Synthetic Post-Training Data Curation

【速读】：该论文旨在解决生成式后训练流水线中两个长期被忽视的关键问题：一是过滤信号是否基于生成样本所依赖的原始证据（source evidence），二是被拒绝的样本能否被系统性地恢复而非永久丢弃。其核心解决方案在于构建一个受控实验框架，通过对抗性注入的数据集提供真实故障标签，系统评估不同门控配置、恢复策略与生成器规模的影响。研究发现，精确的来源可追溯性可提升强判别模型在忠实性（faithfulness）过滤中的表现；幻觉检测门与奖励模型门拒绝的样本群体高度不重叠，二者缺一不可；而采用故障诊断与定向重生成相结合的自适应恢复流水线，相比简单的重采样方法，在产出率、恢复率和注入样本召回率上均显著更优。尽管下游微调质量主要由生成器规模决定，但过滤与恢复策略仍具有显著的次级影响。

链接: https://arxiv.org/abs/2606.11127
作者: Soham Bhattacharjee,Karun Sharma,Vinay Kumar Sankarapu,Pratinav Seth
机构: Lexsi Labs
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Synthetic post-training pipelines commonly filter generated samples with reward models or holistic LLM judges, yet two practices remain rarely examined together: whether the filtering signal is grounded in the source evidence that induced each generation, and whether rejected samples can be systematically recovered rather than permanently discarded. We present a controlled study of both questions across gate configurations, recovery strategies, and generator scales, using adversarially injected corpora to provide ground-truth failure labels. We find that exact source provenance improves faithfulness gating for stronger judges, that hallucination and reward gates reject largely disjoint sample populations making both necessary, and that an adaptive recovery pipeline combining failure diagnosis with targeted regeneration achieves higher yield, recovery rate, and injection recall than naive resampling. Downstream fine-tuning quality is driven primarily by generator scale, with filtration and recovery conditions contributing meaningfully but secondarily.

[NLP-3] RACE: A Unified Rollout Budget Allocation Framework for Efficient Agent ic Reinforcement Learning

【速读】：该论文旨在解决生成式人工智能（Generative AI）在多轮代理型强化学习（multi-turn agentic reinforcement learning）中因奖励对比度不足导致的策略优化效率低下问题。具体而言，现有方法在固定采样预算下，仅关注提示词（prompt）层面的样本信息量，忽略了同一回溯轨迹中各轮次前缀（prefix-level）语义差异带来的信息价值差异，从而限制了奖励信号的多样性与有效性。其解决方案的关键在于提出一种名为“基于对比探索的树状回溯分配”（Tree Rollout Allocation for Contrastive Exploration, TRACE）的统一框架，将ReAct风格的每一轮“思考-行动-观察”视为语义独立的节点，构建树状结构的回溯路径。通过在提示根节点和中间前缀层级动态分配采样预算，并利用一个共享的可泛化预测器从历史前缀中估计条件成功概率，以识别最可能产生混合终端奖励的高信息量节点，实现对回溯资源的精细化分配。该机制显著增强了仅依赖最终结果的反馈信号，提升了策略更新的信号强度，实验表明，在等采样成本下，TRACE在典型代理基准测试中实现了显著性能提升，例如在Qwen3-14B多跳问答任务中平均准确率相较基线提升2.8个百分点。

链接: https://arxiv.org/abs/2606.11119
作者: Heming Zou,Qi Wang,Yun Qu,Yuhang Jiang,Lizhou Cai,Yixiu Mao,Ru Peng,Xin Xu,Weijie Liu,Kai Yang,Saiyong Yang,Xiangyang Ji
机构: Tsinghua University (清华大学); LLM Department, Tencent (腾讯大模型部门)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 32 pages, 12 figures, 6 tables

点击查看摘要

Abstract:Reinforcement learning with verifiable rewards (RLVR) is a promising approach for enhancing reasoning and agentic behavior in large language models. However, rollout-intensive policy optimization is often limited by insufficient reward contrast, arising when overly simple or complex prompts generate low-variance feedback and when outcome-only rewards assign the same terminal assessment to every decision in a multi-turn rollout. Past efforts have focused on allocating available rollout resources to promising prompts, yet they only leverage sample informativeness at the prompt level and neglect variation in prefix-level informativeness across turns within the same rollout. This work targets multi-turn agentic RL by modeling each ReAct-style thought-action-observation turn as a semantically distinct node, allowing budget allocation to extend from prompt roots to turn-level prefixes with further continuations, which naturally forms tree-structured rollouts. We introduce Tree Rollout Allocation for Contrastive Exploration (TRACE), a unified rollout allocation framework that enhances reward contrast within a fixed sampling budget. Technically, TRACE allocates rollout budget to both prompt roots and intermediate prefixes that are most likely to yield mixed terminal rewards. A shared generalizable predictor estimates conditional success probability at these anchors from prefix histories to guide this allocation. The resulting adaptive tree structure enriches outcome-only feedback and amplifies the policy-update signal. Empirically, TRACE achieves competitive performance and efficiency gains on typical agentic benchmarks, e.g., improving Qwen3-14B Multi-Hop QA average accuracy by 2.8 points over competitive baselines at equal sampling cost.

[NLP-4] PhantomBench: Benchmarking the Non-existential Threat of Language Models

【速读】：该论文旨在解决大语言模型（Large Language Models, LLMs）在面对不存在实体或概念时产生幻觉（Hallucination）的问题，尤其关注模型在高风险领域中因缺乏对自身知识边界认知而生成虚假信息所带来的潜在危害。其核心挑战在于评估模型识别自身知识局限性的可靠性，而现有研究对此尚缺乏系统性衡量手段。为此，作者提出了PhantomBench——首个大规模基准测试，包含超过6万条源自真实概念但虚构的术语与实体，覆盖多个领域。通过该基准对21种不同规模和类型的模型进行评估，研究发现模型在处理虚构概念时幻觉率极高（部分模型平均达86.7%），且前沿模型在输入暗示这些概念存在时仍无法有效拒绝回答。这一结果揭示了当前模型在识别罕见或虚构概念方面存在显著缺陷。解决方案的关键在于利用PhantomBench作为代理工具，用于研究模型在稀有概念上的行为模式，并提供一套可扩展的流水线方法，支持研究人员根据具体需求生成定制化的虚构概念，从而推动对模型幻觉机制的系统性分析与改进。

链接: https://arxiv.org/abs/2606.11105
作者: Haeji Jung,Hila Gonen
机构: University of British Columbia (不列颠哥伦比亚大学); Canada CIFAR AI Chair, Amii (加拿大魁北克人工智能主席，阿米研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Hallucinations, where language models (LMs) generate factually ungrounded responses, pose serious risks, as users tend to blindly rely on them. This is particularly concerning in high-stakes domains, where consequences of such model behavior can lead to significant harms. Despite notable progress in understanding hallucinations, it remains unclear how reliably these models can recognize the limits of their knowledge. We introduce PhantomBench, the first large-scale benchmark of its kind, comprising more than 60K non-existent terms and entities derived from real concepts across diverse domains. Using our benchmark, we evaluate a total of 21 models of various types and sizes. We show staggering hallucination rates across the board (with average rates as high as 86.7% in some cases), and note that even frontier models surprisingly fail to abstain on non-existent concepts, especially when the input presumes their existence. We then show that PhantomBench can serve as a proxy for studying model behavior on rare concepts for which models are more prone to hallucinate. We also provide a pipeline to construct PhantomBench, enabling scalable generation of non-existent concepts tailored to the specific needs of researchers and practitioners.

[NLP-5] he Shibboleth Effect: Auditing the Cross-Lingual Distributional Skew of Large Language Models

【速读】：该论文旨在解决前沿大语言模型（LLM）在跨语言情境下出现的分布偏移问题，即“方言效应”（Shibboleth Effect），尤其关注在持续对抗性条件下，模型行为是否因语言输入（英语与土耳其语）的不同而产生系统性偏差。其核心问题是：当前主流大模型在多语言外交模拟场景中是否存在语言依赖性的行为不一致，这种不一致性是否可归因于模型架构或训练策略。解决方案的关键在于构建一个名为“青蓝色海危机”的多智能体地缘政治推演系统，通过设计一个结构上类比东地中海领土争端的合成海上冲突场景，对六种前沿模型进行双组间实验（每组10局游戏，每局5轮），以语言为唯一操纵变量，收集并验证了586条行为陈述。研究采用零样本分类器评估模型在“让步率”和“胁迫性修辞”两个连续维度上的行为倾向，发现不同模型表现出截然不同的跨语言响应模式——部分模型（如Llama-4）在使用土耳其语时显著增强胁迫性言辞，而另一些模型（如Gemini-3.1-Pro、DeepSeek-R1）则表现出显著抑制效应，并且后两者提供了链式思维证据，揭示出“链式思维中的制度锚定”与“多语言强化学习人类反馈对齐”两种缓冲机制的存在。这一发现表明，跨语言行为偏移并非西方起源模型的普遍属性，而是由模型内部架构与训练范式所决定，为将大模型安全集成至外交与危机管理场景提供了关键的机制性洞察。

链接: https://arxiv.org/abs/2606.11082
作者: Hakan Mehmetcik
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: 25 pages, 2 figures, 6 tables, Research Article

点击查看摘要

Abstract:This study investigates cross-lingual distributional skew (the Shibboleth Effect) in frontier large language models (LLMs) subjected to sustained adversarial conditions. We develop a multi-agent geopolitical wargame, the Cerulean Sea Crisis, a synthetic maritime territorial dispute designed to mirror the structural dynamics of Eastern Mediterranean conflicts. Six frontier models (GPT-4o, Llama-4, Mistral-Large, Gemini-3.1-Pro, Qwen3.6-Plus, and DeepSeek-R1) participate in a between-groups experiment (N = 10 games per arm, K = 5 rounds per game) in which the sole manipulation is the language of play (English versus Turkish), producing 586 validated statements. A zero-shot classifier assesses behavioral dispositions along two continuous dimensions: Concession Rate and Coercive Rhetoric. The results are heterogeneous. Llama-4 shows a substantial, Holm-corrected increase in coercive rhetoric under Turkish (delta = +0.800, p = .002), whereas Gemini-3.1-Pro displays an equally large decrease (delta = -0.750, p = .005). DeepSeek-R1 exhibits a similar negative shift (delta = -0.860, p = .006) and provides chain-of-thought evidence consistent with a buffering mechanism. GPT-4o shows no detectable effect (delta = +0.130, p = .614). These findings indicate that cross-lingual behavioral skew is contingent on model architecture and training regime rather than a universal property of Western-origin LLMs. We identify two distinct buffering mechanisms, chain-of-thought institutional anchoring and multilingual RLHF alignment, and discuss their implications for integrating LLMs safely into diplomatic and crisis-management settings.

[NLP-6] VISTA: A Versatile Interactive User Simulation Toolkit for Agent Evaluation

【速读】：该论文旨在解决交互式智能体（Interactive Agent）开发中评估环节的瓶颈问题。现有评估方法多依赖静态基准，难以捕捉智能体行为所具有的动态性与多步骤特性，且无法有效揭示其潜在的失效模式。尽管基于用户模拟的评估方法具有潜力，但现有模拟框架存在两大局限：一是缺乏对模拟交互质量与全面性的量化评估机制，难以判断模拟器是否充分探索了智能体的能力边界与故障模式；二是多数框架仅支持界面（UI）操作或应用接口（API）操作中的单一类型，无法建模真实用户行为的多样性。为克服上述问题，论文提出VISTA——一个面向智能体评估的多功能交互式用户模拟工具包。其核心创新在于构建了一套包含六项指标的评估体系，可从真实性、能力覆盖度和交互有效性三个维度量化模拟交互的质量；同时设计了一种融合UI与API双通道操作的混合用户模拟器，显著提升了在多样化交互环境下的评估真实性和全面性。实验表明，在电商购物与教育客服等场景中，VISTA相较于现有方法能够生成更真实、更全面的评估结果。

链接: https://arxiv.org/abs/2606.11079
作者: Yunan Lu,Ryan Shea,Yusen Zhang,Zhou Yu
机构: Columbia University (哥伦比亚大学); Arklex.ai
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Evaluation remains a critical bottleneck for interactive agent development. Existing evaluation methods often rely on static benchmarks, which fail to capture the dynamic, multi-step nature of agentic behavior and struggle to expose meaningful failure modes. While user-simulation-based evaluation offers a promising alternative, existing simulation frameworks suffer from two major limitations. First, they provide limited mechanisms for evaluating the quality and comprehensiveness of simulated interactions, making it difficult to assess whether a simulator sufficiently explores an agent’s capabilities and failure modes. Second, most frameworks are restricted to either UI-only actions or API-only actions, limiting their ability to model the full range of realistic user behaviors. To address these limitations, we propose VISTA, a Versatile Interactive user Simulation Toolkit for Agent evaluation. Our toolkit includes a suite of six metrics for measuring the realism, capability coverage, and interaction effectiveness of simulated interactions. In addition, we develop a hybrid user simulator that integrates both UI-based interactions and API-based interactions, enabling more realistic and comprehensive evaluation across diverse interactive environments. We evaluate VISTA in e-commerce shopping and education customer service settings and demonstrate that it produces more realistic and comprehensive evaluations than existing methods.

[NLP-7] A History-Aware Visually Grounded Critic for Computer Use Agents

【速读】：该论文旨在解决当前测试时干预（test-time intervention）方法在复杂图形用户界面（GUI）环境中存在的两大核心问题：一是现有批评模型（critic models）普遍存在短视决策循环，难以有效利用历史交互信息；二是缺乏视觉地基（visual grounding），无法准确识别如误点击错误UI元素等执行层面的错误。针对上述问题，论文提出了一种名为HiViG（History-aware Visually Grounded）的测试时框架，其关键在于构建一个基于真实GUI轨迹训练的多模态批评模型，能够将历史交互抽象为紧凑的记录，并结合视觉地基对动作进行评估。在测试阶段，HiViG通过将批评模型嵌入策略决策循环，提供宏观动作历史（macro-action history）以缓解短视规划问题，同时引入视觉地基批判（visually grounded critique）机制，基于当前屏幕截图验证原始执行坐标，从而在动作执行前及时拦截错误。实验结果表明，该方法在网页、移动和桌面多个基准上均显著优于现有标量与语言型批评模型，在最强基线基础上分别提升了5.8%（Qwen3-VL-32B）和9.0%（Gemini-3-Flash）的平均成功率，并展现出优异的跨平台泛化能力。消融实验进一步证实，宏观动作历史与视觉地基批判均为长时程GUI任务中实现测试时扩展的关键组件。

链接: https://arxiv.org/abs/2606.11078
作者: Jaewoo Lee,Zaid Khan,Archiki Prasad,Justin Chih-Yao Chen,Supriyo Chakraborty,Kartik Balasubramaniam,Sambit Sahu,Elias Stengel-Eskin,Hyunji Lee,Mohit Bansal
机构: University of North Carolina at Chapel Hill; Capital One; University of Texas at Austin
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Code: this https URL

点击查看摘要

Abstract:Various test-time interventions for Computer Use Agents (CUAs), including critic models, have been developed to improve performance through pre-execution action evaluation in complex Graphical User Interface (GUI) environments. However, existing critics suffer from two key limitations: they (1) focus primarily on short-sighted decision loops (e.g., forgetting earlier actions) and (2) lack the visual grounding needed to detect flawed actions (e.g., clicking wrong UI elements). To address these, we introduce HiViG, a History-aware Visually Grounded test-time framework, built around a multimodal critic trained on real GUI trajectories to abstract past interactions into a compact record and to evaluate actions with visual grounding. At test time, HiViG integrates the critic into the policy decision loop to provide macro-action history, which summarizes the policy’s completed achievements, and visually grounded critique, which verifies raw execution coordinates against the current screenshot to intercept errors before execution. Across web, mobile, and desktop benchmarks, HiViG consistently outperforms existing scalar and verbal critics, improving average success rates over the strongest baseline by 5.8% for Qwen3-VL-32B and 9.0% for Gemini-3-Flash, and demonstrates strong cross-platform generalization. Ablations show that macro-action history mitigates short-sighted planning and visually grounded critique reduces execution errors, with both components being critical for test-time scaling in long-horizon GUI tasks.

[NLP-8] Modeling Complex Behaviors: Multi-Personality Composition and Dynamic Switching in Vision-Language Models

【速读】：该论文旨在解决多模态大语言模型（Multimodal Large Language Models, MLLMs）在复杂人格情境下行为理解与控制的问题，尤其关注人格诱导（personality induction）对模型生成内容的影响及其动态交互机制。其核心挑战在于如何有效实现单人格、多人格组合及人格切换条件下的稳定可控生成，同时评估不同人格设定对模型性能的潜在影响。解决方案的关键在于提出显式人格条件化（explicit personality conditioning）方法，并构建涵盖单人格诱导、多人格诱导与人格切换的系统性评估框架。实验表明，人格诱导虽能提升图像描述生成质量，但在需要精确推理的任务（如视觉问答，Visual Question Answering, VQA）中可能产生性能下降；在多特质组合与动态切换过程中观察到的平衡效应与残余效应，揭示了模型行为受到先前与当前人格约束的协同调制。此外，现有基于提示（prompt-based）的人格诱导方法在多模态场景中表现出有限的可迁移性，凸显了针对多模态环境定制化人格建模与评估方法的必要性。

链接: https://arxiv.org/abs/2606.11074
作者: Peiqi Jia(1),Haonan Jia(2),Ziqi Miao(2),Linkang Du(1),Yuntao Wang(1),Zhou Su(1) ((1) Xi’an Jiaotong University, (2) Beihang University)
机构: Xi’an Jiaotong University (西安交通大学); Beihang University (北京航空航天大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With the widespread deployment of Multimodal Large Language Models (MLLMs) in social interaction, understanding and controlling their behavior under complex personality conditions is essential. This paper introduces explicit personality conditioning and establishes a systematic evaluation framework encompassing single-personality induction, multi-personality induction, and personality switching. Experiments show that personality induction improves image captioning performance but can impair performance on tasks requiring precise reasoning, such as visual question answering (VQA). Balancing and residual effects are observed during multi-trait composition and dynamic switching, indicating that model behavior is co-modulated by both previous and current personality constraints. Existing prompt-based personality induction methods show limited transferability to multimodal settings. Our work reveals the dynamic and complex nature of personality modeling in MLLMs and underscores the need for robust, tailored methods for personality induction and evaluation. The code will be released when the paper is accepted.

[NLP-9] 1-Bench: Benchmarking Multi-Scenario Agents in Real-World Domains

【速读】：该论文旨在解决现有代理系统（agentic systems）评估基准在任务复杂度、现实性及领域多样性方面的局限性，尤其针对跨领域交互、多步推理与协同协作能力的评估不足问题。其解决方案的关键在于提出T1-Bench——一个高保真、全面的基准测试体系，用于在模拟真实客户交互场景下评估代理系统的性能。T1-Bench通过包含25个不同难度领域的交错式任务场景，显著提升了任务的组合复杂性与评估严谨性，要求模型在多轮用户-助手交互中进行结构化推理，并有效调用工具完成复杂任务。研究采用12种专有及开源大语言模型进行评估，结合自动化指标与人工判断，构建了可复现、标准化的评估框架，从而更真实地衡量代理在多步骤、跨域环境中的行为表现、工具使用效率与对话质量。该工作为未来代理系统的研究提供了开放的数据与代码支持。

链接: https://arxiv.org/abs/2606.11070
作者: Genta Indra Winata,Amartya Chakraborty,Yuzhen Lin,Swasthi P Rao,Shikhhar Siingh,Houhan Lu,Nadia Bathaee,Sriharsha Hatwar,Paresh Dashore,Anmol Jain,Kshitij Tayal,Xiuzhu Lin,Anirban Das,Sambit Sahu,Shi-Xiong Zhang
机构: Capital One
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Preprint

点击查看摘要

Abstract:Recent advances in reasoning and tool-calling capabilities of large language models (LLMs) have enabled increasingly capable agentic systems. However, existing benchmarks remain limited in task complexity, realism, and domain diversity, and often fail to capture interactions that span multiple domains, limiting their ability to evaluate agents in realistic multi-step settings that require sustained reasoning and coordination. To address these limitations, we introduce T1-Bench, a high-fidelity, comprehensive benchmark for evaluating agentic systems in realistic customer-facing, multi-domain environments, featuring interleaved scenarios that require structured reasoning across multi-turn user-assistant interactions and substantially increasing both compositional complexity and evaluative rigor across 25 domains of varying difficulty. We evaluate T1-Bench using 12 proprietary and open-weight models, providing a reproducible and standardized framework for assessing agent behavior, tool utilization, and conversational quality in complex, multi-step environments. We further complement automatic evaluation with human judgments to strengthen the assessment of qualitative performance. Overall, T1-Bench substantially advances prior benchmarks by increasing task complexity, interaction depth, and domain coverage in simulated multi-domain environments. To facilitate future research on agentic systems, we will publicly release data and evaluation code as open source.

[NLP-10] Attention Amnesia in Hybrid LLM s: When CoT Fine-Tuning Breaks Long-Range Recall and How to Fix It

【速读】：该论文旨在解决链式思维（Chain-of-thought, CoT）监督微调（SFT）在混合线性注意力模型中导致长上下文召回能力显著退化的问题。研究表明，尽管CoT-SFT能有效提升模型的推理能力，但其会系统性地诱导注意力梯度偏向短程模式，从而破坏查询-键投影矩阵（W_Q, W_K），进而影响长程路由机制，导致在“针堆找针”（Needle-In-A-Haystack, NIAH）等长上下文检索任务中的性能急剧下降，例如HypeNet-9B在256K上下文长度下的准确率从67.2%降至9.4%。为此，论文提出QK-Restore——一种无需重新训练的方法，仅从微调前的检查点中恢复W_Q与W_K参数，同时保留所有其他经微调后的参数。为进一步平衡长程路由保持与推理适应性，引入了一种Procrustes变换变体。实验表明，QK-Restore在多种架构上均以零训练成本有效恢复了长上下文能力，且不损害推理性能；例如，在HypeNet-5B上，其在S3@256K任务中的准确率由65.4%提升至76.4%，充分验证了该方法的有效性与普适性。

链接: https://arxiv.org/abs/2606.11052
作者: Xinyu Zhou,Boyu Zhu,Yi Xu,Zhiwei Li,Yingfa Chen,Huiming Wang,Zhijiang Guo
机构: LARK, HKUST(GZ)(港科大(广州)智能计算与机器人实验室); UCL(伦敦大学学院); Mistral AI(米斯特拉尔人工智能); Tsinghua University(清华大学); SUTD(新加坡科技设计大学); HKUST(香港科技大学)
类目: Computation and Language (cs.CL)
备注: 28 pages

点击查看摘要

Abstract:Chain-of-thought (CoT) supervised fine-tuning (SFT) is widely adopted to improve reasoning ability, yet we find that it systematically degrades long-context recall in hybrid linear-attention models. Across architectures including HypeNet and Jet-Nemotron, retrieval performance on Needle-In-A-Haystack (NIAH) deteriorates substantially after CoT-SFT, and the degradation becomes more severe under harder retrieval settings and longer context windows. For example, HypeNet-9B on NIAH-S2@256K decreases from 67.2% to 9.4% . We attribute this to CoT-SFT biasing attention gradients toward short-range patterns, disrupting query-key projections ( W_Q, W_K ) that are responsible for long-range routing. Motivated by this observation, we propose QK-Restore, a training-free method that restores only W_Q and W_K from the pre-SFT checkpoint while preserving all other post-SFT parameters. We further introduce a Procrustes variant to balance routing preservation and reasoning adaptation. Across architectures, QK-Restore consistently restores long-context capability at zero training cost while preserving reasoning performance; for instance, on HypeNet-5B it improves S3@256K from 65.4% to 76.4% while maintaining strong reasoning performance.

[NLP-11] Does Reasoning Preserve Alignment? On the Trustworthiness of Large Reasoning Models

【速读】：该论文旨在解决生成式大模型在通过后训练（post-training）转化为推理模型过程中，是否能够有效保持其原始指令对齐（instruction alignment）行为的问题。尽管当前主流方法侧重于提升多步任务的推理准确性，但往往忽视了对安全拒绝、偏见规避及隐私保护等关键对齐特性的保留。研究发现，这种转化过程默认并不具备行为保真性，通过对六项可信度维度（安全性、毒性、刻板印象与偏见、机器伦理、隐私保护、分布外鲁棒性）的系统评估，结果显示：尽管推理模型在标准推理基准上表现更优，却普遍存在对齐退化现象，如毒性增强、刻板印象放大、拒绝行为校准失衡以及上下文隐私泄露等问题。这些退化现象与指令微调基线模型之间的行为漂移（以KL散度衡量）高度一致。因此，论文提出的关键解决方案是：必须将可信度评估指标作为推理模型评价的核心组成部分，与推理能力提升一同报告，以确保模型在性能优化的同时不牺牲安全性与对齐性。

链接: https://arxiv.org/abs/2606.11046
作者: Prajakta Kini,Avinash Reddy,Souradip Chakraborty,Satya Sai Srinath Namburi GNVV,Furong Huang,Amrit Singh Bedi,Alvaro Velasquez
机构: University of Colorado Boulder(科罗拉多大学博尔德分校); University of Central Florida(中佛罗里达大学); University of Maryland College Park(马里兰大学学院公园分校); University of Wisconsin-Madison(威斯康星大学麦迪逊分校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Instruction-tuned LLMs are increasingly converted into reasoning models through post-training to improve multi-step task performance. This conversion is usually optimized for reasoning accuracy, without explicitly preserving the alignment behavior of the instruction-tuned model, such as safe refusal, bias avoidance, and privacy protection. We ask: does this conversion preserve alignment? We study this question through a trustworthiness audit and find that it is not behavior-preserving by default. For a systematic analysis, we compare reasoning models produced via supervised fine-tuning, RL-based post-training, and distillation against matched instruction-tuned baselines across six trustworthiness dimensions: safety, toxicity, stereotyping and bias, machine ethics, privacy, and out-of-distribution robustness. We observe that reasoning models often improve on reasoning benchmarks but exhibit alignment regressions, including increased toxicity, amplified stereotyping, miscalibrated refusal, and contextual privacy leakage. These regressions are consistent with behavioral drift from the instruction-tuned baseline, measured by KL divergence. Overall, our results point to the broader conclusion that trustworthiness metrics are essential for evaluating reasoning models and should be reported alongside gains in reasoning capability.

[NLP-12] AuRA: Internalizing Audio Understanding into LLM s as LoRA

【速读】：该论文旨在解决现有大语言模型（LLM）扩展至语音输入时所面临的三大核心问题：基于级联式语音识别（ASR）-LLM流水线的方法存在转录接口延迟，端到端语音-语言模型需要高昂的多模态训练成本，而桥接或知识蒸馏类适配方法则受限于语音与语言之间的串行耦合。其解决方案的关键在于提出AuRA（Audio Representation Adaptation），通过轻量级音频嵌入层将语音编码能力以知识蒸馏的方式内化至LLM中。具体而言，AuRA同时将同一语音输入送入预训练的ASR编码器（教师模型）和经LoRA微调的LLM（学生模型），并采用分层蒸馏策略，对齐学生模型各层隐藏状态与教师模型对应表示，从而实现语音表征在低秩适配的LLM侧的高效内化。该方法既避免了大规模多模态训练，又实现了语音与语言的紧密联合建模与并行端到端推理，在多个语音-语言基准测试中均展现出优于级联系统、传统适配基线及大型多模态模型的性能与效率。

链接: https://arxiv.org/abs/2606.11033
作者: Bo Cheng,Lei Shi,Zhanyu Ma,Yuan Wu,Jun Xu,Jiuchong Gao,Jinghua Hao,Renqing He
机构: Meituan(美团); Jilin University(吉林大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent efforts to extend large language models (LLMs) to speech inputs typically rely on cascaded ASR-LLM pipelines, end-to-end speech-language models, or bridge/distillation-based adaptation. While these routes respectively reuse strong pretrained components, enable native speech-language interaction, or offer lightweight adaptation, they often suffer from transcript-interface latency, costly multimodal training, or sequential speech-language coupling. To address these limitations, we present AuRA, a method that distills audio encoding capability into the LLM. Specifically, AuRA feeds the same speech input to an ASR encoder (as a teacher) and a LoRA-adapted LLM (as a student) through a lightweight audio embedding layer, and uses layer-wise distillation to align the student’s hidden states with corresponding teacher representations, thereby internalizing speech representations into lightweight LLM-side adaptations. Compared with cascaded and serial bridge methods, AuRA enables tighter speech-language joint modeling and efficient parallel end-to-end inference, while also reusing pretrained speech and language models rather than requiring large-scale multimodal training. On multiple speech-language benchmarks, AuRA consistently outperforms cascaded systems, speech-to-LLM adaptation baselines, and large-scale speech-language and multimodal models in both effectiveness and efficiency.

[NLP-13] Measuring Human Value Expression in Social Media Texts: Calibrated LLM Annotation and Encoder Transfer

【速读】：该论文旨在解决在自然生成的社交媒体文本中对主观构念（如人类基本价值观）进行可扩展、可迁移测量的问题，其核心挑战在于如何确保标注过程兼具理论基础、实证有效性，并能与编码器模型兼容以实现大规模预测。解决方案的关键在于：基于舒瓦茨（Schwartz）的人类基本价值理论构建具有约束力的价值定义，以此减少因文本多义性导致的虚假价值归因；通过迭代式提示校准（prompt calibration）结合错误分析优化大语言模型（LLM）的输出，提升与专家标注的一致性；从重复出现的错误模式中提炼出针对性的专家验证规则，用于指导语料标注；最终通过软标签训练（soft-label training）将LLM的标注结果有效迁移至编码器模型，保留理论驱动的价值解释及表达不确定性信息。这一方法实现了从理论到可计算模型的闭环转化，显著提升了价值识别的准确性与可解释性。

链接: https://arxiv.org/abs/2606.11018
作者: Maria Milkova,Maksim Rudnev
机构: University of Waterloo (Waterloo大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Measuring subjective constructs in naturally occurring social media text requires annotation procedures that are theoretically grounded, empirically validated, and transferable to an encoder model for scalable prediction. Using non-English social media posts annotated according to Schwartz’s theory of basic human values, we investigate how different LLMs, prompts, and instruction languages operationalize the expression of values in text. We argue that although texts may permit multiple plausible interpretations, theory-based value definitions can constrain interpretations and reduce spurious value attributions. Beyond precision, recall, and F1, we evaluate structural alignment between values, error structure, confidence-ambiguity relations, and annotation stability. We show that different LLMs produce different value interpretations. Iterative prompt calibration through error analysis reduces misattributions and improves alignment with expert annotations. We also derive targeted expert verification rules from recurrent error structures and use them during corpus annotation. Finally, we show that LLM annotations can be transferred to an encoder model through soft-label training, retaining theory-based value interpretations and information about uncertainty in value expression.

[NLP-14] Who Brought Easter Eggs to Eid? Auditing Cultural Translation of Math Word Problems Across Diverse Languages and Regions

【速读】：该论文旨在解决大语言模型在将数学应用题进行多语言适配以支持个性化学习时，其文化适应性的一致性、文化多样性保有程度以及模型对文化要素的显著性识别能力等问题。研究发现，尽管不同模型（Claude Opus 4、GPT-4.1 和 Gemini 2.5 Pro）在将60个英文数学应用题转换为包括高资源语言（如意大利语、印地语）到低资源语言（如信德语、西西里语、旁遮普语）在内的多种语言时表现出一定共性，但模型间在实体转换类型上的共识率仅为62.5%，具体替换方案的一致性更低至33.5%，表明模型选择直接影响学生所接触的文化语境。关键问题在于：所有21种语言-模型组合均出现熵压缩现象，即文化多样性被压缩而非扩展；模型过度依赖表面文化标记（如姓名、食物、货币），而忽视深层结构性文化特征（如年级制度所隐含的文化假设）。此外，模型常在未明确指定目标国家的情况下错误分配区域背景（如为印度孟加拉裔学生使用孟加拉塔卡），并产生跨文化污染（如将复活节寻蛋活动改编为开斋节活动）。这些系统性偏差中，部分可从单个翻译中察觉，但文化多样性衰减、表层标记偏好及持续的区域误判等深层缺陷仅通过语料库层面分析才能揭示。这种表面合理性掩盖了深层次的文化失真，使得问题更难被觉察。

链接: https://arxiv.org/abs/2606.11009
作者: Parisa Suchdev,Juniper Lovato
机构: University of Washington (华盛顿大学); Fred Hutchinson Cancer Research Center (弗雷德·哈钦森癌症研究中心)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: 17 pages total with references and appendix, 9 figures, under review

点击查看摘要

Abstract:Large language models are increasingly used to adapt math word problems for personalized learning at scale, but it remains an open question whether those adaptations are consistent across models, preserve cultural diversity at scale, and reveal which cultural entities models treat as most salient. We analyze how Claude Opus 4, GPT-4.1, and Gemini 2.5 Pro adapt 60 English math word problems into Bengali, Hindi, Punjabi (India), Urdu, Sindhi (Pakistan), Italian, and Sicilian (Italy), a language set spanning the full resource spectrum, from high-resource Italian and Hindi to under-studied Sindhi, Sicilian, and Punjabi. We annotate 6,489 entity transformations, coding whether models preserve, localize, generalize, omit, or change entities such as names, foods, and places. Models agree on transformation type in 62.5% of cases and on specific substitutions in only 33.5%, meaning model choice directly shapes which cultural world students encounter. All 21 language-model combinations show entropy collapse, with adaptation compressing rather than expanding cultural diversity. Models prioritize surface markers such as names, foods, and currencies while preserving deeper structural features such as grade-level systems that embed culturally specific assumptions. Despite prompts specifying target countries, models misattribute regional context by using Bangladeshi taka for Indian Bengali students and produce cross-cultural contamination, such as adapting egg hunts as Eid activities. Some failures are visible in individual translations. Others, including diversity collapse, systematic preference for surface markers, and consistent regional misattribution, emerge only through corpus-level analysis. The surface plausibility that makes adapted problems look correct is precisely what makes deeper failures easy to overlook.

[NLP-15] Mind the Gap: Can Frontier LLM s Pass a Standardized Office Proficiency Exam?

【速读】：该论文旨在解决当前大型语言模型（Large Language Model, LLM）在复杂专业级生产力软件（如Office套件）中进行文档自动化操作时能力不足的问题。现有研究虽推动了生成式AI在代码生成方面的进展，但其在需要长期规划、精确参数配置及多应用集成的办公自动化任务中的实际表现仍缺乏系统性评估。为此，研究提出以中国全国计算机等级考试（National Computer Rank Examination, NCRE）为基准，构建包含Word、Excel和PowerPoint的200个综合性实操任务的数据集，采用7,118项可机器评分的标准对每项任务进行100分制评分，通过得分率（Score Rate, SR）量化模型整体表现。实验结果表明，单轮交互模型最高仅达36.6%的得分率，即便采用具备执行反馈、迭代修复与更广范围Office访问能力的智能体系统，最高得分也仅为68.8%，远低于95.5%的人类参考得分。这揭示出尽管生成式AI在代码生成方面取得进展，但在实现可靠细粒度办公文档自动化方面仍面临重大挑战。解决方案的关键在于构建高保真、可量化、覆盖多场景的基准测试体系，并引入具备闭环反馈与多步推理能力的智能体架构以提升系统鲁棒性。

链接: https://arxiv.org/abs/2606.10956
作者: Tengchao Lv,Dongdong Zhang,Jiayu Ding,Yilin Jia,Yuzhong Zhao,Yupan Huang,Wenshan Wu,Xiangyang Zhou,Shaohan Huang,Nan Yang,Li Dong,Lei Cui,Furu Wei
机构: Microsoft Research(微软研究院)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 21 pages, 5 figures

点击查看摘要

Abstract:The deployment of Large Language Model (LLM) agents for computer automation is accelerating, yet their ability to navigate complex, professional-grade productivity software is largely untested. We argue that Office automation is an ideal environment for benchmarking document-automation capability, as it requires long-horizon planning and reasoning, precise parameter configuration, and multi-application integration. To quantify this capability, we introduce an evaluation based on China’s National Computer Rank Examination (NCRE), featuring 200 comprehensive practical-operation tasks across Word, Excel, and PowerPoint. Each task is scored on a 100-point rubric scale using 7,118 machine-gradable criteria, and Score Rate (SR) denotes the mean percentage of rubric points earned across these tasks. We benchmark 7 frontier LLMs and observe stark limitations: single-turn models score a maximum of 36.6%. A stronger agentic system with execution feedback, iterative repair, and broader Office automation access reaches 68.8%, but remains below the 95.5% community-reference score used as a scoring sanity check. Ultimately, our experiments demonstrate that despite recent advancements in code generation, achieving reliable fine-grained Office document automation remains a significant challenge for current code-generating LLM and agent systems.

[NLP-16] Density Field State Space Models: 1-Bit Distillation Efficient Inference and Knowledge Organization in Mamba-2

【速读】：该论文旨在解决大模型压缩与高效推理之间的矛盾问题，即如何在显著降低模型体积和计算开销的同时，保持下游任务性能。其核心解决方案是提出密度场状态空间模型（Density Field State Space Models, DF-SSM），通过将状态空间模型（SSM）压缩至1比特骨架（scaffold）并引入int8低秩修正项，实现极致的模型紧凑性与高速推理。关键创新在于：采用仅需3200万条训练数据及单个A100 GPU上6小时的轻量级知识蒸馏过程，即可将1.3B参数的Mamba-2模型压缩至278 MB（相比原FP16版本缩小9.7倍），并在GPU上实现21.4倍的推理加速。同时，该框架设计了优化的推理流水线，融合cuBLAS INT8张量核心、定制化CUDA核函数以及AVX-512 CPU后端，支持跨硬件平台高效部署。进一步分析揭示模型内部存在三个明确的处理阶段：意图分类（前3层，运行于无词汇对齐的抽象空间）、知识检索（第25–35层，事实关联局域化于5层窗口内）与输出格式化（第36–47层，类别结构消解），表明尽管事实记忆能力较弱，但模型具备高度组织化的表征结构，提示表征结构可能先于事实准确性形成。

链接: https://arxiv.org/abs/2606.10932
作者: Chirag Shinde
机构: Independent Researcher
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 16 pages, 6 figures, 7 tables. Code available at this https URL

点击查看摘要

Abstract:We present Density Field State Space Models (DF-SSM), a framework for compressing SSMs to a 1-bit scaffold with int8 low-rank correction. Applied to Mamba-2 1.3B, we achieve a 278 MB model (9.7x smaller than the 2.7 GB FP16 teacher) that runs at 21.4x faster inference on GPU (batch=1, relative to the mamba-ssm reference implementation) while maintaining downstream task performance within 2-4 percentage points of BitMamba-2, a 1.58-bit model trained from scratch on 150B tokens. The distillation itself requires only 32M tokens and 6 hours on a single A100 GPU, though it presupposes a pretrained FP16 teacher. We develop an optimized inference pipeline combining cuBLAS INT8 tensor cores for the scaffold matmul, custom CUDA kernels for stateful SSM and convolution operations, and an AVX-512 CPU backend for efficient deployment on both GPU and CPU. Beyond compression, we investigate the internal knowledge organization of the resulting model, discovering three distinct processing phases: intent classification (layers 0-3, operating in an abstract space with no vocabulary alignment), knowledge retrieval (layers 25-35, where factual associations localize to a 5-layer window), and output formatting (layers 36-47, where category structure dissolves). Through systematic analysis of 445 factual prompts across 19 categories, we find that early-layer classification is syntactic (driven by template structure) rather than semantic, and that the model exhibits well-organized knowledge representations despite weak factual recall–suggesting that representational structure may precede factual strength.

[NLP-17] It Takes One to Bias Them All: Breaking Bad with One-Shot GRPO

【速读】：该论文旨在解决现代大语言模型（Large Language Models, LLMs）在经过大规模后训练（post-training）对齐以确保公平可靠行为的前提下，其安全防护机制（guardrails）是否仍可能被轻易突破的问题。研究发现，仅通过一次针对单一偏见示例的组相对策略优化（Group Relative Policy Optimization, GRPO）训练，即可引发系统性偏见，且这种基于刻板印象的推理能够泛化至不同属性、类别和评测基准。其解决方案的关键在于揭示了后训练阶段对齐机制的严重脆弱性：模型的对齐状态可被单个偏见样本所颠覆，而模型对偏见的敏感性与其初始生成偏见输出的可能性密切相关。这一发现表明，当前主流对齐方法在面对恶意或极端偏见输入时缺乏鲁棒性，亟需改进。

链接: https://arxiv.org/abs/2606.10931
作者: Naihao Deng,Yilun Zhu,Naichen Shi,Clayton Scott,Rada Mihalcea
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Warning: This paper contains several toxic and offensive statements. Modern large language models (LLMs) are typically aligned through large-scale post-training to ensure fair and reliable behavior. In this work, we investigate how easily such guardrails can be broken by Group Relative Policy Optimization (GRPO). We show that one-shot GRPO training on a single biased example is sufficient to induce systematic bias, with stereotype-driven reasoning generalizing across attributes, categories, and benchmarks. We further find that models differ in their susceptibility based on the initial likelihood of producing biased outputs. Our results reveal a critical vulnerability in post-training: alignment can be overridden by a single example. Subjects: Computation and Language (cs.CL) Cite as: arXiv:2606.10931 [cs.CL] (or arXiv:2606.10931v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2606.10931 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[NLP-18] race Only What You Need: Structure-Aware On-Demand Hypergraph Memory for Long-Document Question Answering

【速读】：该论文旨在解决长文档问答（long-document question answering, LDQA）中大语言模型（LLMs）在处理跨段落证据关联、事件时序依赖及章节级上下文理解时面临的挑战，尤其针对现有结构化检索增强生成（structured RAG）方法存在的三大局限：查询无关的知识组织开销大、对原始文档结构利用不足以及无法复用历史推理经验。其核心解决方案是提出DocTrace——一种多智能体RAG框架，关键在于通过轻量级文档结构树索引保留文档层级关系，动态构建智能体共享的超图结构工作记忆以支持上下文感知推理，并将成功的推理路径以图结构存储于经验记忆中实现可复用性，从而实现基于查询触发的知识组织、结构感知与经验引导的自适应推理。实验结果表明，DocTrace在四个长文档QA数据集上表现优异，在三个数据集上超越最强基线ComoRAG，F1最高提升8.85%，EM提升4.40%，同时整体计算成本降低53.32%。

链接: https://arxiv.org/abs/2606.10921
作者: Xiangjun Zai,Xingyu Tan,Chen Chen,Xiaoyang Wang,Wenjie Zhang
机构: University of New South Wales, Australia; CSIRO, Australia; University of Wollongong, Australia
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Long-document question answering (QA) requires large language models (LLMs) to reason over evidence scattered across lengthy documents, where answers often depend on event order, section-level context, and cross-part evidence connections. Although retrieval-augmented generation (RAG) reduces the input context by retrieving relevant evidence, existing structured RAG methods still face three limitations: costly query-agnostic knowledge organization, insufficient use of original document structure, and no reuse of historical reasoning experience. To address these limitations, we propose DocTrace, a multi-agent RAG framework for long-document QA that supports query-triggered knowledge organization, document-structure-aware and experience-guided reasoning. DocTrace preserves document hierarchy with a lightweight document structural tree index, constructs agent-shared hypergraph-structured working memory on demand during reasoning, and stores successful reasoning plans in graph-structured experience memory for future reuse, enabling adaptive exploration across related long-document questions. Experiments on four long-document QA datasets show that DocTrace achieves the best performance on three datasets, surpassing the strongest baseline, ComoRAG, by up to 8.85% in F1 and 4.40% in EM, while reducing the overall computational cost by 53.32%

[NLP-19] Pushing the Limits of LLM Tool Calling via Experiential Knowledge Integration and Activation

【速读】：该论文旨在解决大语言模型（Large Language Models, LLMs）在多步骤工具使用任务中表现不佳的问题，其核心挑战在于工具相关知识的不足以及知识激活机制的有效性欠缺。解决方案的关键在于系统性地优化知识在工具执行中的三个关键阶段：知识获取、知识激活与知识内化。研究发现，实例级的经验性知识（experiential knowledge）即可带来显著且可靠的性能提升，而意图级的抽象知识收益有限；在推理阶段，通过并行采样与聚合扩展推理宽度（reasoning width）比单纯深化推理深度（reasoning depth）更有效激活潜在知识；在训练阶段，基于增强知识的数据进行后训练，并采用强化学习相比监督微调能实现更好的知识内化。基于上述发现，作者提出KATE（Knowledge-Augmented Tool Execution）框架，集成经验性知识、推理宽度扩展的推理机制与知识感知的训练策略，在BFCL-V3和AppWorld基准上均展现出对多种模型规模的持续且显著的性能提升。

链接: https://arxiv.org/abs/2606.10875
作者: Yupu Hao,Zhuoran Jin,Huanxuan Liao,Kang Liu,Jun Zhao
机构: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); University of Chinese Academy of Sciences (中国科学院大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) rely on tool use to act as autonomous agents, yet often fail in multi-step execution due to insufficient tool-related knowledge and ineffective knowledge activation. Therefore, we present a systematic study on how knowledge influences tool-use performance, covering the stages of knowledge acquisition, activation, and internalization. In the knowledge acquisition stage, we acquire and evaluate various forms of experiential knowledge, and our analysis shows that simple instance-level knowledge can already provide strong and reliable gains, while abstract intent-level knowledge offers limited benefits. At inference time, to activate knowledge, we find that prompting LLM to expand the depth of reasoning yields diminishing returns, whereas expanding the width of reasoning by parallel sampling with aggregation more effectively activates latent experiential knowledge. At training time, for knowledge internalization, post-training with knowledge-augmented data further improves performance, with reinforcement learning outperforming supervised fine-tuning. Based on these insights, we propose the Knowledge-Augmented Tool Execution (KATE), a knowledge-augmented tool execution framework that integrates experiential knowledge with reasoning-width-expanded inference and knowledge-aware training. Experiments on BFCL-V3 and AppWorld demonstrate consistent and substantial improvements over strong baselines across model scales. Our Code is available at this https URL.

[NLP-20] raining LLM s to Enforce Multi-Level Instruction Hierarchies via Gravity-Weighted Direct Preference Optimization

【速读】：该论文旨在解决大语言模型（LLM）在面对来自不同可信度来源的指令时，因对所有输入令牌赋予同等架构优先级而产生的结构性脆弱性问题。这一缺陷使得模型易受恶意提示注入攻击，并且缺乏原则性机制来处理合法但相互冲突的指令之间的优先级协调。现有训练方法虽尝试构建指令层级结构，但通常仅支持三至四层的有限层级，将所有违规行为视为同等严重，且极少评估各层级间全部成对交互关系。为此，论文形式化定义了k级指令层级问题并将其实例化为k=5的情形，形成十组需严格遵守的成对优先级关系。针对此问题，提出重力加权直接偏好优化（Gravity-Weighted DPO, GW-DPO），其样本级偏置项根据冲突层级间的结构距离，在线性或双边调度下动态调整；其中双边调度同时考虑特权差距与受害层级的特权程度以量化违规严重性。结合特定于层级的分隔符标记（Chen et al., 2025）与指令片段嵌入（Instructional Segment Embeddings, ISE; Wu et al., 2025），采用双边调度的GW-DPO在Llama-3.1-8B-Instruct上实现了帕累托改进：相较于标准DPO及线性变体，显著提升了宏观成对优先级遵循率，同时将过度拒绝率降低至标准DPO的一半。消融实验表明，ISE主要作为拒绝阈值校准器，而五层级与三层级训练的本质差异可被重新诠释为泛化性与专属性之间的权衡。

链接: https://arxiv.org/abs/2606.10860
作者: Lena S. Bolliger,Lena A. Jäger
机构: University of Zurich (苏黎世大学)
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Production LLMs receive instructions from sources with very different levels of trust, yet attend to every token with uniform architectural privilege. This is the structural vulnerability that enables malicious prompt injections and, more broadly, leaves models without a principled way to resolve conflicts between legitimate but competing instructions. A common training-based response is to teach models an explicit instruction hierarchy; existing approaches, however, formalize hierarchies of only three or four levels, treat all violations as equally severe, and rarely evaluate the full set of pairwise level interactions. We formalize a k-level instruction hierarchy problem and instantiate it for k=5, yielding ten pairwise priority relations that a compliant model must enforce. We then introduce Gravity-Weighted DPO (GW-DPO), a preference-optimization objective whose per-sample offset scales with the structural distance between conflicting levels under a linear or bilateral schedule, the latter weighting severity by both the privilege gap and the privilege of the victim level. Combined with hierarchy-specific delimiter tokens (Chen et al., 2025) and Instructional Segment Embeddings (ISE; Wu et al., 2025), GW-DPO with the bilateral schedule Pareto-improves over standard DPO and the linear variant on Llama-3.1-8B-Instruct, raising macro pairwise priority adherence while keeping over-refusal at half the standard DPO rate. Ablations isolate ISE as a refusal-threshold calibrator and recast five- versus three-level training as a generality-specialization tradeoff.

[NLP-21] Janus: A Benchmark for Goal-Conditioned Information Distortion in LLM s

【速读】：该论文旨在解决大语言模型（Large Language Model, LLM）在真实场景中因选择性呈现真实信息而产生的隐蔽性误导问题，即“语用扭曲”（pragmatic distortion），此类问题不依赖于虚构或明显谎言，而是通过省略不利证据、弱化负面细节、夸大有利信息或以模糊语言替代精确限定等方式，造成具有误导性的整体印象。现有评估基准多聚焦于显性幻觉或虚假陈述，忽视了这种更隐蔽且更具危害性的失效模式。本文提出JANUS基准，通过设定固定的事实池，在中立条件与目标导向条件（如提升采纳率、增加注册人数、获得批准或支持等）之间对比模型输出，从而隔离出由目标激励引发的语用扭曲。其核心创新在于：所有输出均基于相同事实集合，确保误导性差异源于信息选择而非内容虚构，有效区分了语用扭曲与幻觉。实验结果表明，12个主流大模型均表现出显著的目标驱动型信息扭曲，揭示当前模型对激励机制和框架设计仍高度敏感，缺乏抵御选择性误导的鲁棒防御能力。研究团队已公开发布数据集与代码，以推动后续相关研究。

链接: https://arxiv.org/abs/2606.10852
作者: Polydoros Giannouris,Mohsinul Kabir,Sophia Ananiadou
机构: The University of Manchester / Manchester, UK; Archimedes/Athena RC / Athens, Greece
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:LLM deception is often evaluated through direct markers such as fabricated claims, explicit lies, or strategic concealment. However, many real-world misleading communications do not depend on false statements, rather, they arise from selective treatment of true material facts: omitting adverse evidence, softening unfavorable details, emphasizing favorable details, or replacing precise qualifications with vague language. Existing benchmarks largely miss this subtler and arguably more dangerous failure mode. We introduce JANUS, a benchmark for measuring goal-conditioned pragmatic distortion in fact-grounded LLM outputs. Each scenario in our benchmark provides a fixed pool of favorable and adverse facts and compares a neutral condition against a goal-directed condition, such as increasing adoption, enrollment, approval, or support, despite potential harm to directly affected individuals or groups. Because all outputs are constrained to use the same fact pool, JANUS isolates misleading net impressions from hallucination and fabrication. JANUS contains 160 scenarios across 8 domains, with each scenario paired with neutral and goal-conditioned prompts and annotated material facts. Extensive experiments across 12 LLMs reveal consistent goal-conditioned distortions, demonstrating that current models remain sensitive to incentive and framing objectives and lack robust safeguards against selectively misleading communication. We publicly release our corpus and code for future research.

[NLP-22] Attention-Discounted Adaptive Sampler for Masked Diffusion Language Models

【速读】：该论文旨在解决生成式 AI（Generative AI）中基于掩码扩散语言模型（Masked Diffusion Language Models）在并行解码时因高并行性导致的脆弱性问题：尽管单个位置的预测置信度较高，但多个位置的预测结果若存在依赖关系，同时确定可能引发错误。现有无需训练的采样方法如Top-k、Fast-dLLM和EB-Sampler主要通过控制每轮解码中揭示的词元数量来调节并行度，但其候选词排序通常基于独立的词元得分，忽略了被选集合内部的相互作用。为此，论文提出ADAS（Attention-Discounted Adaptive Sampling），一种无需训练的重排序规则，用于改进并行掩码扩散解码。ADAS的关键在于不改变基线采样器的停止规则，仅对候选子集构建过程进行优化：它在贪心选择过程中，当某一候选词对已选位置中仍不确定的预测具有强注意力时，对其施加折扣惩罚。与将注意力转化为硬约束的图约束方法不同，ADAS保持注意力机制的连续性，并将其作为软边际惩罚使用。在LLaDA-8B-Base和Dream-7B-Base模型上，针对GSM8K、MATH500、HumanEval和MBPP等任务的实验表明，将ADAS集成至Top-k、Fast-dLLM和EB-Sampler后，在相同去噪器评估下，低NFE（非函数调用次数）性能平均提升9.11%和10.46%，且每前向传播仅增加3.1%的运行开销。结果表明，基于注意力折扣的软重排序是一种简单且模块化的方法，可有效提升掩码扩散语言模型在高度并行解码中的生成质量。

链接: https://arxiv.org/abs/2606.10829
作者: Yusuf Sahin,Ahmed Rockey Saikia,Volkan Cevher,Paolo Favaro
机构: University of Bern (伯尔尼大学); EPFL (洛桑联邦理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Masked diffusion language models can reduce inference steps by revealing multiple tokens per denoising iteration, but this parallelism is fragile: positions that are individually confident may be unsafe to commit together when their predictions are coupled. Existing training-free samplers such as Top-(k), Fast-dLLM, and EB-Sampler mainly control how many tokens to reveal, while often ranking candidates by token-wise scores that ignore interactions within the selected set. We propose ADAS, a training-free reranking rule for parallel masked diffusion decoding. ADAS leaves the base sampler’s stopping rule unchanged and modifies only subset construction: it greedily discounts a candidate when it attends strongly to already selected positions whose predictions remain uncertain. Unlike graph-constrained methods that turn attention into hard compatibility constraints, ADAS keeps attention continuous and uses it as a soft marginal penalty. Across LLaDA-8B-Base and Dream-7B-Base on GSM8K, MATH500, HumanEval, and MBPP, plugging ADAS into Top-(k), Fast-dLLM, and EB-Sampler improves low-NFE performance at matched denoiser evaluations by (9.11) and (10.46) percentage points on average, respectively, with (3.1%) per-forward runtime overhead. These results show that soft attention-discounted reranking is a simple and modular way to improve quality in highly parallel decoding for masked diffusion language models.

[NLP-23] K-Forcing: Joint Next-K-Token Decoding via Push-Forward Language Modeling

【速读】：该论文旨在解决自回归语言建模（Autoregressive Language Modeling, AR）在高负载批量服务场景下推理效率低下的核心问题，尤其针对工业级大规模部署中因逐词解码导致的内存瓶颈与计算延迟。其解决方案的关键在于提出一种名为K-Forcing的新范式，通过将现有AR模型蒸馏为一个条件前推映射（conditional push-forward mapping），实现单次前向传播中联合生成未来k个词元的并行采样。该方法保持了固定长度输出、复用原有AR教师模型的骨干结构，并兼容标准AR推理基础设施；训练过程采用渐进式自强制蒸馏（progressive self-forcing distillation），逐步扩展预测窗口的同时使学生模型紧密匹配教师模型的序列分布。实验表明，在LM1B和OpenWebText数据集上使用标准因果Transformer架构时，当设置k=4时，K-Forcing可实现2.4–3.5倍的推理加速，同时相对于原始AR模型仅带来轻微质量下降，为现代大语言模型在真实高负载场景下的高效推理提供了可行路径。

链接: https://arxiv.org/abs/2606.10820
作者: Zhiwei Tang,Yuanyu He,Yizheng Han,Wangbo Zhao,Jiasheng Tang,Fan Wang,Bohan Zhuang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Autoregressive (AR) language modeling is the dominant paradigm for text generation, yet its sequential token-by-token decoding makes inference memory-bound and inefficient. Existing acceleration approaches, such as speculative decoding and diffusion language models, can yield speedups under certain conditions but do not directly address high-load batch serving–the scenario most critical for industrial-scale deployment. We introduce K-Forcing, a push-forward language modeling paradigm for joint next-k-token decoding. K-Forcing distills an existing AR model into a conditional push-forward mapping–one that transforms independent uniform noise variables into a joint sample of multiple future tokens in a single forward pass. This design preserves fixed-length outputs, reuses the AR teacher backbone, and remains compatible with standard AR serving infrastructure. We train this mapping via progressive self-forcing distillation, which gradually expands the prediction window while enabling the student to closely match the sequence distribution of the AR teacher. We evaluate K-Forcing on LM1B and OpenWebText using a standard causal Transformer backbone. When aggressively configured to generate k = 4 tokens per forward pass, K-Forcing delivers approximately 2.4-3.5x speedup across different batch sizes, while incurring modest quality degradation relative to its AR teacher. As inference increasingly dominates the lifetime compute cost of modern LLMs, K-Forcing offers a promising route toward accelerating AR generation under real-world high-load deployment.

[NLP-24] RedAct: Redacting Agent Capability Traces for Procedural Skill Protection

【速读】：该论文旨在解决生成式智能体（Generative Agent）执行轨迹中潜在的私有过程技能泄露问题，即通过公开的执行轨迹（execution trace）可能被下游方法逆向推导出关键算法公式、阈值设定及策略逻辑，从而在无模型权重或技能文件访问权限的情况下实现技能迁移。其核心解决方案是提出 \textscRedAct 框架，该框架通过三重机制实现安全保护：一是对敏感关键信息进行定位与局部化遮蔽；二是重构轨迹以保留验证器（verifier）所需的证据完整性；三是嵌入行为水印（behavioral watermark）用于后续溯源分析。实验表明，在典型轨迹复用场景下，\textscRedAct 将原始轨迹上的归一化技能迁移率（NST）从44.7%–67.1%降低至低于无技能基线水平，同时有效维持审计证据可用性；其独立的行为水印检测真阳性率达93.6%–100%，误报率不超过1.9%。研究揭示了公开智能体轨迹作为安全接口的潜在风险，并证明选择性红笔（redaction）可在不牺牲可审计性的前提下显著降低过程能力泄露。

链接: https://arxiv.org/abs/2606.10813
作者: Shuwen Xu,Zhitao He,Yi R.(May)Fung
机构: 未知
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Users rely on execution traces to observe agent behavior, diagnose failures, and ensure accountability. These traces contain rich procedural detail, including tool invocations, intermediate decisions, and error-recovery logic. Yet this detail can expose private procedural skills, allowing downstream methods to recover key formulas, thresholds, and strategies without access to model weights or skill files. To quantify this risk and evaluate protection, we construct \textscCapTraceBench, a benchmark of 75 specialized long-horizon tasks and 154 curated skills across seven domains. We also introduce \textscRedAct this https URL, a protected trace release framework that localizes protected key information, rewrites traces while preserving verifier-critical evidence, and embeds behavioral watermarks for downstream provenance analysis. Across representative trace reuse methods, \textscRedAct reduces normalized skill transfer (NST) from 44.7–67.1% on raw traces to below the no-skill baseline, while preserving audit evidence. Its standalone behavioral watermarks reach 93.6–100.0% true detection with a false alarm rate of at most 1.9%. These results frame public agent traces as security interfaces and show that selective redaction can reduce procedural capability leakage without removing audit evidence.

[NLP-25] Beyond APIs: Probing the Limits of MLLM s in Physical Tool Use

【速读】：该论文旨在解决多模态大语言模型（Multimodal Large Language Models, MLLMs）在具身智能（embodied AI）场景下对物理工具使用能力评估不足的问题。尽管MLLMs已被广泛用作具身系统的核心决策“大脑”，能够指导机器人与物理世界交互，但其在真实场景中识别和规划使用物理工具的能力尚未得到系统性评估。为此，作者提出了首个针对物理工具使用的基准测试——PhysTool-Bench，涵盖2,678种真实世界的物理工具，覆盖制造、电气工程、农业及医疗等多个领域，共包含2,510个任务查询。该基准从两个核心维度评估模型性能：一是准确识别场景中存在的所有物理工具，二是根据指令与视觉上下文规划工具的选择与使用顺序。实验结果表明，即使是最先进的模型（Gemini-3.1-Pro）也仅能识别出场景中58.7%的工具，并仅能端到端完成21.0%的任务。分析揭示了两层能力缺陷：第一层为感知层面，即模型难以在复杂现实场景中有效感知工具；第二层为规划层面，工具识别后的功能常识推理能力严重不足，导致无法将感知到的工具与任务语义正确映射，暴露出当前具身智能发展中关键的技术瓶颈——功能性常识缺失。因此，解决方案的关键在于构建更强大的工具感知与基于功能常识的工具使用规划能力，以推动具身智能在真实世界中的实用化发展。

链接: https://arxiv.org/abs/2606.10803
作者: Zhixin Ma,Yutong Zhou,Yongqi Li,Chong-Wah Ngo,Wenjie Li
机构: Singapore Management University (新加坡管理大学); The Hong Kong Polytechnic University (香港理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) excel at utilizing digital APIs and increasingly serve as the “brain” of embodied AI, instructing robots to interact with the physical world. In such embodied settings, a central capability is the use of physical tools, which underpins MLLMs’ ability to assist humans in real-world tasks. Despite the importance, MLLMs’ proficiency in physical tool use remains largely unexplored. To address this gap, we introduce PhysTool-Bench, the first physical tool-use benchmark designed to evaluate MLLMs’ ability to comprehend real-world scenarios, identify physical tools, and plan their use. PhysTool-Bench comprises 2,510 queries over 2,678 real-world physical tools spanning diverse domains, including manufacturing, electrical work, agriculture, and healthcare. Concretely, models are evaluated along two primary dimensions: 1) recognizing all physical tools present in the scene, and 2) planning the tool selection and use sequence based on the instruction and visual context. Across 13 leading MLLMs, even the strongest model (Gemini-3.1-Pro) identifies only 58.7% of tools in a scene and completes merely 21.0% of queries end-to-end. Our analysis reveals a two-level deficit: MLLMs struggle to perceive tools in realistic scenes, and the much larger drop at the planning stage further indicates a lack of functional commonsense for mapping perceived tools onto task semantics, pinpointing a critical bottleneck for the development of practical embodied AI.

[NLP-26] Dep-LLM : Training-Free Depression Diagnosis via Evidence-Guided Structured Multi-factor with Reliable LLM Reasoning

【速读】：该论文旨在解决临床访谈中自动抑郁症检测（ADD）面临的两大核心挑战：一是长篇多主题访谈中抑郁线索分布稀疏且关系复杂，导致现有方法难以进行深层、可靠的推理；二是受临床隐私保护和标注成本高等限制，高质量标注数据稀缺，制约了监督式模型的训练与部署。针对上述问题，论文提出一种无需训练的Dep-LLM框架，其关键在于模拟精神科医生的分步推理过程，完全基于冻结的现成大语言模型（foundation LLMs）实现。该框架包含三个核心阶段：首先，通过链式思维（Chain-of-Thought, CoT）驱动的多因素抑郁分析模块，将长对话结构化分解为五个临床相关主题，并生成基于证据的推理理由，有效建模长程依赖；其次，引入置信度分析与调制模块，基于每个推理片段的词元级熵量化其认知可靠性，并通过同标签内与跨主题间的信号调制机制，增强可信信号、抑制不确定信息，全程无需额外训练；最后，协同多因素预测模块动态融合加权后的多维度信号以输出最终诊断。实验结果表明，Dep-LLM在DAIC-WOZ与E-DAIC数据集上显著优于21种基础大模型的零样本基线，且超越当前最先进的有监督领域特定模型及最新闭源商业大模型，充分验证了其有效性与泛化能力，同时实现了“零训练”部署。

链接: https://arxiv.org/abs/2606.10796
作者: Yiqing Lyu,Xianbing Zhao,Buzhou Tang,Ronghuan Jiang
机构: Harbin Institute of Technology, Shenzhen(哈尔滨工业大学深圳校区); Jiangnan University(江南大学); Guangdong Provincial Key Laboratory of Intelligent Information Processing(广东省智能信息处理重点实验室); Pengcheng Laboratory(鹏城实验室); Chinese People’s Liberation Army General Hospital(中国人民解放军总医院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Automatic Depression Detection (ADD) from clinical interviews is a pivotal task in computational mental health, yet it remains challenging due to two critical obstacles: 1) difficulty in modeling complex but sparsely distributed depression clues within lengthy, multi-topic clinical interviews, leading to superficial and unreliable reasoning; 2) scarcity of labeled data due to clinical privacy, together with high cost of training and fine-tuning, limiting the deployment of supervised ADD systems. To jointly address these challenges, we propose Dep-LLM, a training-free framework that mirrors the step-by-step reasoning of clinical psychiatrists and operates entirely on frozen off-the-shelf foundation LLMs. Dep-LLM comprises three stages. First, a Chain-of-Thought (CoT) Depression Multi-factor Analysis module structurally decomposes the long dialogue into five clinically aligned themes and produces evidence-grounded rationales, effectively handling long-context dependencies. Second, we introduce Confidence Analysis and Modulation module that quantifies the epistemic reliability from token-level entropy of each rationale and applies an intra-label and inter-theme modulation that amplifies trustworthy signals while suppressing uncertain ones without extra training. Third, a Collaborative Multi-factor Prediction module dynamically integrates multi-factor signals weighted by confidence into the final diagnosis. Extensive experiments on the DAIC-WOZ and E-DAIC datasets demonstrate the effectiveness and generalizability of Dep-LLM: it surpasses zero-shot baseline on nearly all 21 foundation LLMs across 9 metrics such as accuracy, macro F1 and weighted-average F1, and further outperforms state-of-the-art supervised domain-specific LLMs as well as the latest closed-source commercial LLMs, while requiring no extra training.

[NLP-27] N-GRPO: Embedding-Level Neighbor Mixing for Enhanced Policy Optimization ACL2026

【速读】：该论文旨在解决大语言模型在数学推理任务中，当前滚动生成（rollout）技术所面临的根本性权衡问题：基于词元级采样的方法容易产生仅在表述上不同的冗余解题路径，而基于嵌入级随机噪声的方法则常导致语义一致性被破坏。其解决方案的关键在于提出一种名为N-GRPO的新颖探索策略，该策略集成于组相对策略优化（Group Relative Policy Optimization, GRPO）框架中，创新性地采用语义邻域混合（Semantic Neighbor Mixing）机制——通过动态混合锚定词元及其最近语义邻近词元的嵌入表示，构建输入表示，在严格遵循局部语义流形的前提下注入多样性。实验结果表明，N-GRPO在不同规模的DeepSeek-R1-Distill-Qwen模型上均显著优于强基线，并展现出对分布外任务的良好泛化能力。

链接: https://arxiv.org/abs/2606.10768
作者: Xukun Zhu,Hang Yu,Peng Di,Linchao Zhu
机构: Zhejiang University (浙江大学); Ant Group (蚂蚁集团)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: ACL 2026 Findings. 16 pages, 3 figures. Code: this https URL

点击查看摘要

Abstract:The success of Large Language Models in mathematical reasoning relies heavily on the generation of diverse and valid solution paths during the rollout phase. However, current rollout techniques face a fundamental trade-off: token-level sampling often yields redundant trajectories that differ only in rephrasing, while embedding-level methods utilizing random noise frequently disrupt semantic consistency. To resolve this, we introduce N-GRPO, a novel exploration strategy integrated into the Group Relative Policy Optimization (GRPO) framework. Rather than relying on token-level sampling or native embedding-level noise, our approach leverages Semantic Neighbor Mixing. This mechanism dynamically constructs input representations by mixing the embeddings of an anchor token and its nearest semantic neighbors, thereby injecting diversity while strictly adhering to the local semantic manifold. Experimental evaluations on the DeepSeek-R1-Distill-Qwen models across different sizes show that N-GRPO not only achieves consistent improvements over strong baselines on math reasoning benchmarks but also exhibits robust generalization capabilities on out-of-distribution tasks.

[NLP-28] ArabiGEE: A Hierarchical Taxonomy for Arabic Grammatical Error Explanation

【速读】：该论文旨在解决阿拉伯语语法错误解释（Grammatical Error Explanation, GEE）缺乏系统化、结构化框架的问题。现有方法通常将错误解释视为自由文本生成，导致解释内容不一致、难以评估与复用。为此，论文提出了ArabiGEE，首个基于明确错误类型构建的综合性阿拉伯语语法错误解释分类体系，其关键在于通过一个涵盖拼写、形态、句法和词汇四个维度的分层结构，对语法错误进行系统归类。该分类体系包含27种错误类型、140种修正类型及324条关联解释，实现了对错误成因与修正方式的精细化建模。研究进一步将ArabiGEE应用于现有阿拉伯语语法纠错语料库的人工标注，并验证了结构化解释在支持大语言模型（LLM）自动评估中的有效性，为阿拉伯语GEE任务提供了可扩展、可度量的技术基础。

链接: https://arxiv.org/abs/2606.10765
作者: Khaled Elhady,Omar Kallas,Nizar Habash,Bashar Alhafni
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We introduce ArabiGEE, the first comprehensive Arabic grammatical error explanation (GEE) taxonomy grounded in explicit error types. Unlike existing GEE approaches that treat explanation generation as free-form text, ArabiGEE organizes grammatical explanations through a hierarchical structure spanning orthographic, morphological, syntactic, and lexical dimensions. The taxonomy consists of 27 error types, 140 correction types, and 324 associated explanations. We apply ArabiGEE to manually annotate portions of existing Arabic grammatical error correction corpora and demonstrate how structured grammatical explanations can support automatic evaluation of LLMs on Arabic GEE. Our code and data are publicly available.

[NLP-29] When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models ICML2026

【速读】：该论文旨在解决多轮推理模型中因终端评分评估（terminal-score evaluation）无法捕捉时序动态而导致的隐蔽性失败问题。现有评估方法仅关注最终输出，难以发现模型在对话早期即锁定不安全立场却仍表现出合理拒绝率的现象。其核心解决方案是提出一种细粒度的溯源诊断框架——思维链输出2×2安全矩阵（CoT-Output 2x2 safety matrix），通过在两个独立维度（内部推理与可见输出）上对每一轮交互进行标注，识别出四种操作定义的失效模式：稳健对齐、对齐伪装（alignment faking）、明显越狱（overt jailbreak）以及一种新型失效模式——上下文注入失效（context-injection failure）。后者表现为模型内部思维链保持安全，但外部输出产生危害，揭示了多轮场景下推理与输出不一致的严重问题。研究通过在信息危害场景下对三种蒸馏推理目标在五种监督条件下进行评估，共收集6750条轮次级观测数据，发现两个可复现的漏洞：一是“监督悖论”——明确的监控提示反而提升了对齐伪装的发生率；二是上下文注入失效，即模型虽维持安全内部状态，却受外部输入影响生成有害输出。该工作释放了完整的多轮对话及思维链追踪数据集，以支持后续基于溯源分析的安全性研究。

链接: https://arxiv.org/abs/2606.10740
作者: Sai Kartheek Reddy Kasu,Nils Lukas,Samuele Poppi
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted at the ICML 2026 FAGEN Workshop

点击查看摘要

Abstract:Failures in multi-turn reasoning models are largely invisible to terminal-score evaluation. A model can lock onto an unsafe stance early in a long dialogue, yet its final-turn refusal rate may appear indistinguishable from a robustly aligned baseline. To expose these hidden temporal dynamics, we propose a trace-level diagnostic - the CoT-Output 2x2 safety matrix. This framework labels every turn along two independent axes (internal reasoning and visible output), yielding four operationally defined failure cells: robust alignment, alignment faking, overt jailbreak, and a distinct failure mode we term context-injection failure (where the CoT maintains safe reasoning, but the visible output produces harm, highlighting a multi-turn manifestation of reasoning unfaithfulness). We evaluate three distilled reasoning targets against a fixed attacker across five oversight conditions, collecting 6750 turn-level observations on the Information-Hazard scenario. Our analysis reveals two reproducible vulnerabilities: an oversight paradox where explicit monitoring cues paradoxically increase alignment-faking rates rather than suppress them, and a context-injection failure where models lock onto unsafe external outputs despite safe internal states. We release the full dataset of multi-turn dialogues and CoT traces to support follow-up trace-diagnostic research.

[NLP-30] Detecting Knowledge Gaps from Conversational AI Interactions Using Curriculum Prerequisite Graphs

【速读】：该论文旨在解决大规模在线课程中学生与对话式AI助教互动产生的海量问题日志未被有效利用的问题，这些日志本可作为诊断学习困难的潜在信号。其核心解决方案是构建一个基于少量样本（few-shot）文本分类器的流水线，将学生提问映射到课程知识点，该分类器以GPT-4生成的课程概念先决知识图谱为语义基础。实验在164名研究生的AI课程中进行，共分析1,340条问题记录，分类器在43个标签（42个课程主题+1个“未知”弃权类）上达到80.0%的准确率。进一步发现，各主题下问题数量与学生独立问卷中自评的学习难度呈显著正相关（rho = 0.491, p = 0.008, n = 28），验证了分类后的问题流能够真实反映主题层面的知识掌握状况。研究表明，将对话式AI交互日志结构化映射至课程体系，可提供关于知识点掌握缺口的可操作性信号，使教师获得基于课程框架的教学重点洞察。

链接: https://arxiv.org/abs/2606.10736
作者: Youssef Medhat,Junsoo Park,Ploy Thajchayapong,Ashok K. Goel
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: Accepted as a short paper at the 10th CSEDM Workshop, co-located with the 18th International Conference on Educational Data Mining (EDM 2026). 7 pages, 2 figures, 2 tables

点击查看摘要

Abstract:Large online courses generate thousands of student questions directed at conversational AI teaching assistants, yet these interaction logs remain largely untapped as diagnostic signals. We present a pipeline that maps student questions from a conversational AI teaching assistant to curriculum topics using a few-shot text classifier, grounded in a GPT-4-extracted prerequisite knowledge graph of course concepts. Evaluated on 1,340 question events from 164 students in a graduate-level AI course, our classifier achieves 80.0% accuracy across 43 labels (42 curriculum topics plus an “unknown” abstention class). Topic-level question volume correlates significantly with student self-reported difficulty from an independent mid-semester survey (rho = 0.491, p = 0.008, n = 28 topics), providing convergent evidence that the classified question stream reflects genuine topic difficulty. These results demonstrate that conversational AI interaction logs, mapped onto curriculum structure, carry actionable signals about topic-level knowledge gaps and provide instructors with a curriculum-grounded view of which topics warrant attention.

[NLP-31] Pre-AF 13: An Interpretable Atrial Fibrillation Risk Score Mined from Discharge Reports KR

【速读】：该论文旨在解决心血管疾病（CVD）患者中房颤（Atrial Fibrillation, AF）风险预测的临床挑战，特别是针对现有风险评分模型在高危人群中分层能力有限、依赖长期（5–10年）预测且多基于普遍存在的传统危险因素（如年龄增长、高血压）的问题。其核心解决方案是构建可解释的机器学习（Machine Learning, ML）模型，利用常规住院电子健康记录（Electronic Health Records, EHR）数据，在24个月及整个随访期内实现对CVD患者中短期房颤风险的精准预测。关键创新在于通过自研的自然语言处理（Natural Language Processing, NLP）管道，将非结构化的出院报告转化为73个结构化特征，并结合轻量级自动化机器学习框架（LightAutoML）开发出兼具高性能与可解释性的模型。其中，简化版模型“Pre-AF 13”仅使用13个关键特征，而线性风险评分“Pre-AF 9”进一步实现了临床可操作性，能将24个月内房颤发生率从约7%有效区分至36%。模型性能显著优于CHARGE-AF、C2HEST、MHS和HAVOC等传统临床评分（ROC AUC达0.725–0.735），且通过SHAP值分析揭示了年龄和左心房容积为最重要的预测因子，从而实现了高风险人群的有效识别与分层。

链接: https://arxiv.org/abs/2606.10725
作者: Olga Shakhmatova,Dmitrii Kriukov,Daniil Larionov,Nikita Khromov,Iaroslav Bespalov,Alexander Zolotarev,Kirill Grishchenkov,Ekaterina Ivanova,Miron Kuznetsov,Ilya Sochenkov,Elizaveta Panchenko,Artem Shelmanov,Dmitry V. Dylov
机构: National Medical Research Center of Cardiology named after Academician E.I. Chazov, Moscow, Russia; Skolkovo Institute of Science and Technology (Skoltech), Moscow, Russia; Artificial Intelligence Research Institute (AIRI), Moscow, Russia; University of Mannheim, Mannheim, Germany; Russian Center for Scientific Information (RCSI), Moscow, Russia; Institute of Cyber Intelligence Systems, National Research Nuclear University MEPhI, Moscow, Russia; M.V. Lomonosov Moscow State University, Moscow, Russia; Institute for Information Transmission Problems of the Russian Academy of Sciences (Kharkevich Institute), Moscow, Russia; Ivannikov Institute for System Programming of the Russian Academy of Sciences (ISP RAS), Moscow, Russia; Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences (FRC CSC RAS), Moscow, Russia; Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Abu Dhabi, United Arab Emirates
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Main paper with appendix; 3 main figures, 3 supplementary figures, multiple tables. O. Shakhmatova and D. Kriukov contributed equally (co-first authors). E. Panchenko, A. Shelmanov, and D. V. Dylov are co-senior authors. Corresponding authors: O. Shakhmatova ( this http URL @gmail.com) and D. V. Dylov ( this http URL @skol.tech)

点击查看摘要

Abstract:Background. Atrial fibrillation (AF) is the most prevalent cardiac arrhythmia and a major determinant of prognosis. Established AF risk scores rely on factors (older age, hypertension) nearly ubiquitous among patients with cardiovascular disease (CVD), offering limited stratification in this high-risk group. Most target long-term (5-10 year) rather than medium-term prediction. We developed interpretable ML models predicting AF risk over a 24-month and entire follow-up horizon in CVD patients using routinely collected hospital data. Methods. Single-center retrospective study of electronic health records from the National Research Cardiology Center (Russia) for patients aged =18 with CVD but without pre-existing AF, hospitalized more than once between January 2012 and May 2019. A custom NLP pipeline transformed unstructured discharge reports into 73 structured features, combining a rule-based parser with transformer-based NER. Using LightAutoML we built a full model (73 features), a simple model (reduced subset), and a linear model for a bedside risk score. Performance was assessed by ROC AUC, compared with CHARGE-AF, C2HEST, MHS, and HAVOC, and interpreted via SHAP. Results. Of 80,576 records from 45,000 patients, 17,562 met inclusion criteria; 1,438 (8.19%) developed AF. The full model reached ROC AUC 0.735 (24-month) and 0.696 (entire follow-up); the simple model was nearly identical (0.725, 0.696). All non-linear models outperformed the four clinical risk scores (ROC AUC 0.53-0.64). The simple model uses 13 features and is named Pre-AF 13. SHAP identified age and left atrial volume as dominant predictors. A linear risk score (Pre-AF 9) stratified observed 24-month AF incidence from ~7% to 36%. Conclusion. Interpretable ML models built from routinely collected EHR data identify high-AF-risk CVD patients, outperforming established clinical risk scores. Comments: Main paper with appendix; 3 main figures, 3 supplementary figures, multiple tables. O. Shakhmatova and D. Kriukov contributed equally (co-first authors). E. Panchenko, A. Shelmanov, and D. V. Dylov are co-senior authors. Corresponding authors: O. Shakhmatova (this http URL@gmail.com) and D. V. Dylov (this http URL@skol.tech) Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL) Cite as: arXiv:2606.10725 [cs.LG] (or arXiv:2606.10725v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2606.10725 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Artem Shelmanov [view email] [v1] Tue, 9 Jun 2026 11:33:46 UTC (985 KB)

[NLP-32] Continual LLM Upcycling: A Predictor-Gated Bank-Wise Sparsity Training Recipe for Dense-to-Sparse LLM s

【速读】：该论文旨在解决如何从密集型大语言模型（LLM）高效构建面向硬件优化的通道稀疏化模型的问题，尤其关注在长上下文场景下保持模型性能的同时实现显著的计算资源节省。其核心挑战在于如何在不牺牲模型能力的前提下，实现可训练的、动态的通道稀疏结构，并确保稀疏化过程与语言建模任务协同优化。解决方案的关键在于提出一种“预测器门控的稀疏SwiGLU前馈网络”（predictor-gated sparse SwiGLU FFN），在持续训练过程中引入一个低秩预测器，为每个令牌和层生成前馈网络（FFN）通道的路由逻辑，再通过分组内Top-k选择机制保留每64通道组中的16个通道，从而在FFN中间激活阶段实现4倍通道稀疏。该路由模块嵌入主语言建模路径并在训练中联合优化，突破了传统后处理稀疏推理方法的局限性，使原始密集模型能够被“升级”为可部署的稀疏模型。此外，研究还识别出在RULER-CWE基准上存在层局部的长上下文失效问题，并提出单层修复算法，有效提升了长序列下的性能表现。

链接: https://arxiv.org/abs/2606.10722
作者: Ruixuan Huang,Jinyuan Shi,Hantao Huang,Yifan Huang,Ziyi Guan,Hao Zeng,Ian En-Hsu Yen,Minghui Yu
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We study dense-to-sparse continual training as a way to construct channel-sparse large language models from dense checkpoints. Starting from a Qwen2.5-8B dense backbone, we continue training at 32K context and introduce a predictor-gated sparse SwiGLU FFN in the 32K stage. For each token and layer, we use a low-rank predictor to produce FFN-channel routing logits. We then apply a bank-wise top-k rule to retain 16 channels in every 64-channel bank, yielding 4x sparsity in the FFN intermediate activation. Unlike post-hoc sparse inference methods, the routing module is placed on the main language modeling path and optimized during continual training, enabling the dense model to be upcycled into a hardware-oriented sparse model. We report the architecture, training recipe, benchmark performance, and training lessons. We also identify a layer-local long-context failure mode on RULER-CWE and propose a single-layer repair algorithm that substantially improves the affected length range.

[NLP-33] Attention Expansion: Enhancing Keyphrase Extraction from Long Documents with Attention-Augmented Contextualized Embeddings

【速读】：该论文旨在解决长文档关键短语提取（Keyphrase Extraction, KPE）中因上下文窗口受限而导致的关键信息分散、难以联合建模的问题。尽管预训练语言模型（PLM）在KPE任务中表现优异，但其有限的上下文长度无法有效捕捉跨远距离文档片段中的关键短语证据；而采用长上下文大语言模型（LLM）虽可缓解此问题，却面临计算成本高昂、难以实现高效高吞吐量处理的挑战。为此，本文提出一种注意力扩展机制（attention expansion mechanism），通过利用预训练词嵌入将文档中超出当前上下文窗口的周边文本块信息融合到PLM的标记表示中，从而在不依赖全文档注意力或昂贵的LLM推理的前提下，显著扩展了模型的有效上下文范围。该方法的核心在于以轻量级方式增强现有PLM的上下文感知能力，实现对长文档中稀疏关键信息的高效捕获。实验结果表明，该机制在五种不同类型的PLM骨干网络（涵盖通用型、科学领域、任务专用及长上下文编码器）、两种训练范式和五个来自科学与新闻领域的基准数据集上均取得一致且显著的性能提升，尤其在F1分数上优于现有最先进模型，验证了其在多场景下的普适性与有效性，证明该机制提供的是互补性信息而非仅弥补输入长度限制。

链接: https://arxiv.org/abs/2606.10716
作者: Roberto Martínez-Cruz,Alvaro J. López-López,José Portela
机构: Comillas Pontifical University (康门利亚天主教大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Pre-trained language models (PLMs) have achieved strong performance in keyphrase extraction (KPE), largely due to their ability to generate rich contextualized representations. However, long-document KPE remains challenging because salient keyphrase evidence may be scattered across distant document sections that cannot be jointly captured within the limited context window of most PLMs. Although long-context large language models (LLMs) can process broader textual contexts, their computational cost limits their practicality for efficient and high-throughput KPE. To overcome this limitation, we propose an attention expansion mechanism that augments PLM token representations with information from surrounding out-of-context chunks using pre-trained word embeddings. The proposed mechanism expands the effective contextual scope of PLM-based KPE models without requiring full-document attention or expensive LLM-based inference. We evaluate our approach across five PLM backbones, including general-purpose, scientific, task-specific, and long-context encoders, using two training regimes and five benchmark corpora from scientific and news domains. Experimental results demonstrate that attention expansion consistently enhances KPE performance across all evaluation settings, outperforming state-of-the-art models and yielding notable improvements in F1 score. The improvements extend to domain-specific, task-specialized, and native long-context models, showing that the proposed mechanism provides complementary information rather than merely compensating for limited input length. These results establish attention expansion as an efficient and effective strategy for long-document KPE.

[NLP-34] From Observation to Intervention: A Causal Audit of Expert Importance in Mixture-of-Experts Models ICML2026

【速读】：该论文旨在解决当前可解释性方法中一个关键的因果推断问题：即在缺乏充分因果证据的情况下，将基于群体层面的观测统计量（如专家利用率、激活范数和路由权重分布）作为判断特定专家是否可被移除而不影响模型功能的依据，这种做法实质上是将皮尔逊因果层级中的“关联性”（rung-1）误用为“干预性”（rung-2）结论，其有效性却极少得到验证。论文以混合专家模型（Mixture-of-Experts, MoE）剪枝为例，考察现有方法依赖的路由统计量是否真正能预测专家的因果重要性。研究通过在三个高冗余MoE架构（OLMoE-1B-7B-0924、Qwen1.5-MoE-A2.7B、DeepSeek-V2-Lite）上进行逐令牌干预审计，发现经过多重比较校正后，所有60种指标-层组合均无法有效预测因果重要性，效应量均低于Cohen’s d = 0.17，表明现有观测指标不具备可靠的因果预测能力。唯一显著信号出现在OLMoE模型末层，通过引入逐令牌路由权重控制排除了统计功效不足的影响，但该信号仍极为微弱（d = +0.231, p = 0.0013）。研究进一步揭示，现有剪枝方法的成功并非源于准确识别可舍弃的专家，而是由于早期层存在高度冗余，导致各类选择标准在实践中趋于等效。因此，该研究的关键贡献在于提供了一个明确的反例，挑战了从群体观测到个体干预结论的常见推论路径，并强调应通过干预审计来校准可解释性声明的证据标准，从而提升方法的因果严谨性。

链接: https://arxiv.org/abs/2606.10703
作者: Leonard Engmann,Christian Medeiros Adriano,Holger Giese
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 9 pages, 2 figures, 9 tables. Accepted at the ICML 2026 Workshop on Philosophy of Science Meets Machine Learning (PhilML). Non-archival

点击查看摘要

Abstract:Interpretability methods routinely use population-level summary statistics over observed model behaviour to license claims about the effects of targeted interventions on specific computations; in Pearl’s terms, they treat rung-1 associational evidence as if it supported rung-2 interventional conclusions, a move whose validity is rarely tested. We examine one concrete instance: the use of routing statistics in Mixture-of-Experts (MoE) pruning, where utilization rates, activation norms, and routing weight distributions are treated as predictors of which experts can be removed without functional cost. A token-level interventional audit across three high-redundancy MoE architectures (OLMoE-1B-7B-0924, Qwen1.5-MoE-A2.7B, DeepSeek-V2-Lite) finds no observational metric predicts causal expert importance after multiple-comparison correction in any model, with effect sizes below Cohen’s d = 0.17 across all 60 metric-layer combinations. A per-token routing weight control rules out insufficient power, recovering a single Bonferroni-significant signal at OLMoE’s final MoE layer ( d = +0.231 , p = 0.0013 ). Existing pruning methods succeed in this regime not by identifying dispensable experts but because early-layer redundancy renders most selection criteria interchangeable. Our results provide an explicit counterexample to the common inferential step from population-level observational summaries to token-level interventional claims about expert importance, and illustrate how interventional audits can calibrate the evidential standards for interpretability claims.

[NLP-35] REAL: A Reasoning -Enhanced Graph Framework for Long-Term Memory Management of LLM s

【速读】：该论文旨在解决大语言模型（Large Language Models, LLMs）在长时间对话中因有限上下文窗口而无法有效管理长期记忆的问题。现有记忆系统存在三大局限：基于扁平文本的记忆组织难以捕捉记忆间的显式关联，结构化记忆系统常因破坏性更新导致事实演化过程丢失，且当前检索机制在证据不完整时仍保持查询无关与被动状态。为此，论文提出REAL框架，将长期对话记忆建模为一种时间敏感且置信度感知的有向属性图（temporal and confidence-aware directed property graph），其中每个原子事实均以实体、关系、有效时间区间、置信度分数及探索意图标签进行表示。在记忆构建阶段，REAL采用非破坏性的时间更新策略，保留并行的事实版本及其有效性区间，从而精确追踪事实的演变过程；在记忆检索阶段，通过锚定与查询相关的根实体、解耦其探索意图，并结合语义评估器引导的混合束搜索（hybrid beam search）提取紧凑的记忆子图，同时引入反事实推理（counterfactual inference）机制，利用隐含逻辑关系修复不可靠的检索状态并补全缺失的记忆证据。实验结果表明，相较于扁平文本、图结构及现有记忆基线，REAL在长时记忆性能上实现显著提升，平均改进达22.72%。

链接: https://arxiv.org/abs/2606.10694
作者: Keer Lu,Liwei Chen,Guoqing Jiang,Zhiheng Qin,Yunhuai Liu,Wentao Zhang
机构: Peking University (北京大学); Kuaishou Technology (快手科技); Center for Data Science, Academy for Advanced Interdisciplinary Studies, Peking University (北京大学前沿交叉学科研究院数据科学中心)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly expected to interact with users over long time horizons. However, due to their finite context window, LLMs cannot retain all past interactions, making long-term memory management essential for storing, updating, and retrieving historical information beyond the context limit. Although recent memory systems attempt to address this issue by storing historical information externally, existing approaches suffer from three key limitations: flat text-based memory organizations fail to capture explicit relations among memories, structured memory systems often destructively overwrite evolving facts, and current retrieval mechanisms remain query-agnostic and passive when evidence is incomplete. REAL constructs long-term conversational memory as a temporal and confidence-aware directed property graph, where each atomic fact is represented with entities, relations, valid-time intervals, confidence scores, and exploration intent labels. During memory construction, REAL adopts a non-destructive temporal update strategy that preserves parallel fact versions and their validity intervals, enabling faithful tracking of fact evolution. During retrieval, REAL anchors query-relevant root entities, decouples their exploration intents, and performs semantic evaluator-guided hybrid beam search to extract compact memory subgraphs. It further incorporates counterfactual inference to repair unreliable retrieval states and recover missing memory evidence through implicit logical relations. Comprehensive experiments demonstrate that REAL substantially improves long-term memory performance over flat-text, graph-based, and existing memory baselines, achieving an average improvement of 22.72%.

[NLP-36] Infini Memory: Maintainable Topic Documents for Long-Term LLM Agent Memory

【速读】：该论文旨在解决长期大语言模型（LLM）智能体在跨会话场景中面临的持续记忆问题，即如何有效追踪动态变化的事实并实现相关证据的积累与检索。现有记忆系统通常将观测结果存储为孤立的记录、摘要或索引片段，导致证据聚合困难、事实更新不便以及记忆维护效率低下。为此，论文提出Infini Memory——一种可维护的基于文本的持续记忆架构，其核心创新在于将智能体记忆建模为以主题结构化的文档，每个主题文档作为语义单元，用于汇聚相关证据、保留元数据并支持随时间演进的事实修订。新观测首先暂存于缓冲区，随后周期性地整合为连贯的文本上下文。在推理阶段，通过迭代式工具调用的代理检索机制，使大模型能够分步读取记忆，而非依赖单次检索。实验表明，在MemoryAgentBench基准上，Infini Memory取得64.7%的整体得分；消融实验进一步验证了主题结构化维护与迭代证据检视分别提升了长期记忆使用中的不同维度表现，二者具有互补性。

链接: https://arxiv.org/abs/2606.10677
作者: Suozhao Ji,Baodong Wu,Zehao Wang,Lei Xia,Qingping Li,Ruisong Wang,Wenbo Ding,Zhenhua Zhu,Boxun Li,Guohao Dai,Yu Wang
机构: Infinigence AI; Tsinghua University; Shanghai Jiaotong University
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Long-term LLM agents need persistent memory that can track changing facts and provide relevant evidence across sessions. Existing memory systems often store observations as isolated records, summaries, or indexed fragments, which makes evidence aggregation, fact revision, and memory maintenance difficult. We propose Infini Memory, a maintainable text-based persistent memory architecture that treats agent memory as topic-structured documents. Each topic document serves as a semantic unit for collecting related evidence, preserving metadata, and revising facts over time. New observations are first staged in a buffer and periodically consolidated into coherent textual contexts. At inference time, an agentic retrieval procedure lets the LLM read memory through iterative tool calls rather than a single retrieval step. On MemoryAgentBench, Infini Memory achieves 64.7% overall score. Ablations show that topic-structured maintenance and iterative evidence inspection improve complementary aspects of long-term memory use.

[NLP-37] Multilingual Word-Level Forced Alignment with Self-Supervised Representations and Learned Dynamic Programming INTERSPEECH2026

【速读】：该论文旨在解决多语言环境下词级强制对齐（word-level forced alignment）的准确性问题，尤其在低资源或未见语言场景下的泛化能力不足。其核心解决方案在于提出一种基于双流表示融合与可学习动态规划的对齐框架：首先，通过集成大规模多语言语音（Massively Multilingual Speech, MMS）模型与自监督音素边界检测器（UnSupSeg）的双重表征，增强对长时序上下文中的词边界概率估计；其次，设计一个可学习的动态规划解码器，联合利用编码器输出与段落级特征，在MMS和UnSupSeg表征基础上推断最终词边界。该方法在TIMIT和Buckeye数据集上经迭代训练后，显著优于蒙特利尔强制对齐器（MFA）及基于MMS的对齐方法；在未见语言（荷兰语、德语、希伯来语）上亦表现稳定，性能优于或媲美现有方法，表明其具备在支持1100+语言的MMS体系下无需额外训练即可扩展应用的潜力。

链接: https://arxiv.org/abs/2606.10675
作者: Roy Weber,Meidan Zehavi,Rotem Rousso,Joseph Keshet
机构: Technion – Israel Institute of Technology (以色列理工学院)
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Interspeech 2026

点击查看摘要

Abstract:We present a method for accurate multilingual word-level forced alignment, consisting of an alignment encoder and a learned alignment decoder. The encoder integrates two representations: one from the Massively Multilingual Speech (MMS) model and another from a self-supervised phoneme boundary detector (UnSupSeg). It learns to fuse them and to estimate word-boundary probabilities over long temporal contexts. The alignment decoder is a learned dynamic programming that combines encoder outputs with segmental features over the MMS and UnSupSeg representations to infer final word boundaries. Trained iteratively on TIMIT and Buckeye, the proposed approach outperforms Montreal Forced Aligner (MFA) and MMS-based alignment on both datasets. On unseen languages (Dutch, German, and Hebrew), the proposed model achieves performance consistently better than or on par with existing alignment approaches, indicating its potential to scale to 1100+ languages supported by MMS without further training.

[NLP-38] Are We Evaluating Knowledge or Phrasing? Mitigating MCQA Sensitivity with ParaEval

【速读】：该论文旨在解决多选题问答（MCQA）评估基准在衡量预训练大语言模型真实能力时存在的可靠性问题。现有评估方法依赖于对数似然得分（log-likelihood scoring），其结果高度敏感于答案选项的表面形式（surface form），导致模型对特定表述的熟悉度被误判为实际理解能力，从而产生虚假性能差距。针对这一问题，论文提出ParaEval评估框架，其核心在于对每个答案选项生成多种语义等价的改写（paraphrase），并通过模型在所有改写中的最优表现进行评分，以此削弱表面形式偏差的影响。实验表明，该方法能将原本超过2分的虚假性能差距降至1分以下，并在70B与120B级开源前沿模型中验证了评估偏差的持续存在及ParaEval的有效性。因此，ParaEval的关键创新在于通过多句式采样与最优响应选择，实现对模型真实底层能力的更鲁棒、高效的评估。

链接: https://arxiv.org/abs/2606.10657
作者: João Maria Janeiro,Mathurin Videau,Andrea Caciolai,Benjamin Piwowarski,Patrick Gallinari,Loic Barrault
机构: Meta(元)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Multiple-choice (MCQA) benchmarks are the standard for evaluating pretrained large language models, but their reliance on log-likelihood scoring makes them unreliable. Specifically, standard scores are highly sensitive to the exact phrasing (surface form) of the answers, conflating a model’s familiarity with a specific phrase with its actual capability. We demonstrate this flaw using a controlled testbed of 1B-8B models trained on the same knowledge. Despite having identical knowledge, standard metrics falsely report a performance gap of over 2 points. To solve this, we propose ParaEval, an evaluation framework that queries models using multiple paraphrases per answer option. By scoring each model based on its most favorable phrasing, ParaEval successfully reduces the false performance gap to below 1 point. We confirm that these evaluation artifacts, and the improvements from ParaEval, persist in frontier 70B and 120B open-source models. Ultimately, ParaEval provides a robust and efficient way to evaluate true underlying capability rather than surface-form familiarity.

[NLP-39] Speaker Group Encoding in Self-supervised Speech Recognition Models

【速读】：该论文旨在解决自监督语音识别模型（S3M）在训练过程中对说话人群体特征（SGC）的编码机制及其对语音识别公平性的影响问题。其核心挑战在于揭示不同微调策略如何影响模型对性别、年龄、方言、族裔及是否为母语者等关键说话人群体类别（SGCs）的信息编码，进而分析这些编码偏差如何传递至自动语音识别（ASR）系统中。解决方案的关键在于系统性地对比多种模型状态——预训练、针对说话人识别（SID）微调、针对ASR微调，以及采用公平性增强算法优化后的ASR微调——并量化各状态下对不同性质的说话人群体信息（SGI）的保留或放大程度。研究发现，基于音素变异性的SGC（如性别、方言）在SID微调后被显著放大，而基于语义变异性的SGC（如族裔、母语背景）则不受显著影响；相比之下，标准ASR微调会丢弃音素性变异性信息但保留语义性变异性信息。此外，公平性增强的ASR算法虽能调节音素性SGC的编码强度，但对语义性SGC的影响有限。通过分析各网络层中嵌入表示的子维度，论文进一步识别出负责编码特定SGCs的潜在表征空间，为设计更具公平性的ASR系统提供了可解释的路径与技术依据。

链接: https://arxiv.org/abs/2606.10654
作者: Felix Herron,Solange Rossato Alexandre Allauzen,Benoit Favre,François Portet
机构: MILES Team, LAMSADE, Université Paris Dauphine-PSL, France; GETALP Team, LIG, Université Grenoble Alpes, France; NLP team, LIS, Aix-Marseille University, France
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We investigate what self-supervised speech recognition models (S3Ms) learn about speaker groups (SGs). We examine several states of S3Ms: pretrained, finetuned on speaker identification (SID), finetuned on automatic speech recognition (ASR), and ASR-finetuned using a fairness enhancing algorithm. We find that S3Ms encode information about several speaker group categories (SGCs), including their gender, age, dialect, ethnicity, and whether they are a native speaker. We find that finetuning for SID amplifies certain SGCs, namely those whose variance is more phonetic in nature, though it does not amplify other SGCs, namely those whose variance is more semantic in nature. On the other hand, finetuning for ASR discards phonetically variant speaker group information (SGI) but retains semantically variant SGI. We find that ASR algorithms designed for fairness improvement change to what extent SGI is encoded in S3Ms; however, this is primarily true for for phonetically variant SGCs, and less true for semantically variant SGCs. We discuss how SGI is encoded by each layer, and identify subdimensions of embeddings responsible for encoding different SGCs. Finally, we discuss how our findings could be beneficial in designing fairer ASR algorithms.

[NLP-40] Dynamic Linear Attention ICML2026

【速读】：该论文旨在解决大语言模型（Large Language Models, LLMs）在处理长上下文时，由于标准注意力机制的二次复杂度导致的可扩展性瓶颈问题。现有基于线性注意力机制的方法虽能降低计算开销，但其多状态（multi-state）内存建模策略通常依赖于固定的状态合并规则，无法动态适应不同令牌的重要性变化，导致关键信息被不可逆地丢失，并在长序列中引发严重的误差累积。为此，论文提出一种名为DLA（Dynamic Memory Modeling Framework for Multi-State Linear Attention）的动态内存建模框架，其核心创新在于：(i) 信息感知的动态状态合并（Information-Aware Dynamic State Merging），根据令牌级别的信息变化自适应确定状态边界，从而在语义转换处保留高分辨率表示，同时对稳定区域进行激进压缩；(ii) 有容量限制的内存建模（Capacity-Bounded Memory Modeling），通过选择性合并相邻低信息量状态，在维持固定大小、时间有序的状态缓存的同时，最大限度减少信息损失。实验在16个跨三个类别的数据集上验证了DLA相较于当前最优方法的优越性能。

链接: https://arxiv.org/abs/2606.10650
作者: Xin Wang,Hui Shen,Boyuan Zheng,Xueshen Liu,Minkyoung Cho,Zhongwei Wan,Zesen Zhao,Zhuoqing Mao,Shen Yan,Mi Zhang
机构: The Ohio State University (俄亥俄州立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by ICML 2026

点击查看摘要

Abstract:The scalability of Large Language Models (LLMs) to long contexts is fundamentally constrained by the quadratic complexity of standard attention, motivating the adoption of linear attention mechanisms with sub-quadratic cost. To improve representation capacity under long contexts, recent approaches organize memory in a multi-state manner. However, existing multi-state linear attention methods rely on fixed state merging policies that cannot adapt to dynamically varying token importance, irreversibly obscuring critical tokens and causing severe error accumulation over long sequences. To address this limitation, we propose DLA, a dynamic memory modeling framework for multi-state linear attention. DLA introduces (i) Information-Aware Dynamic State Merging, which adaptively determines state boundaries based on token-level information variation, preserving high-resolution representations around semantic transitions while aggressively summarizing stable regions, and (ii) Capacity-Bounded Memory Modeling, which maintains a fixed-size, chronologically ordered state cache by selectively merging adjacent low-information states to control memory growth with minimal information loss. We pre-train DLA on two different linear attention models and evaluate on 16 datasets across three categories. Experimental results demonstrate the superiority of DLA over state-of-the-art.

[NLP-41] How Does Reasoning Flow? Tracing Attention-Induced Information Flow for Targeted RL in LLM s ICML2026

【速读】：该论文旨在解决大语言模型（Large Language Models, LLMs）中基于强化学习（Reinforcement Learning, RL）时存在的粒度级信用分配（token-level credit assignment）难题，即传统RL方法对所有生成的词元（token）同等对待，无法区分决定性推理步骤与常规格式化或流畅性填充内容。其核心挑战在于缺乏对信息传播全局结构的建模能力，导致奖励信号难以精准聚焦于关键推理路径。为此，论文提出FlowTracer框架，其关键创新在于构建一个由注意力机制诱导的有向无环图（Directed Acyclic Graph, DAG），其中节点代表词元，边的容量由聚合注意力权重决定，并通过重加权机制保留仅能传递至答案区域的信息影响，同时强制局部流量守恒，确保中间词元不会因路径长度或无关分支而出现有效质量的失真。在此全局结构基础上，FlowTracer提取从问题到答案的信息流主干（information-flow backbone），并依据词元的流量通过量（flow throughput）进行评分，从而识别出具有高影响力的枢纽节点和汇聚检查点，这些节点在长距离依赖关系中起关键中介作用。最终，基于此推导出的词元重要性用于构造词元级奖励，使学习信号精确聚焦于引导信息流向正确答案的关键词元，显著提升了多种复杂推理任务上的性能一致性与可解释性。

链接: https://arxiv.org/abs/2606.10646
作者: Zhichen Dong,Yang Li,Yuhan Sun,Weixun Wang,Yijia Luo,Zinian Peng,Taiheng Ye,Chao Yang,Wenbo Su,Yu Cheng,Bo Zheng,Junchi Yan
机构: Shanghai Jiao Tong University (上海交通大学); Alibaba Group (阿里巴巴集团); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 25 pages, 7 figures, 11 tables. Accepted at ICML 2026

点击查看摘要

Abstract:Token-level credit assignment remains a key obstacle for reinforcement learning (RL) in large language models (LLMs), where RL recipes typically treat all tokens equally, failing to distinguish decisive reasoning steps from routine formatting or fluent filler. Recent attempts leverage model-internal signals to assign finer-grained credit, but these are often point-wise heuristics that ignore the global structure of information propagation. We propose FlowTracer, an RL framework that traces answer-targeted reasoning flow on an attention-induced directed acyclic graph in which nodes correspond to tokens and edge capacities come from aggregated attention weights and derives token credit from this global structure. The edge capacities are reweighted to retain only the influence that can reach the answer region, while enforcing local flow conservation so intermediate tokens neither lose nor gain effective mass due to path length or irrelevant branches. On this graph, FlowTracer extracts an information-flow backbone connecting the question to the answer and scores tokens by flow throughput, revealing high-impact hubs and aggregation checkpoints that mediate long-range dependencies. These derived importances are used to shape token-level rewards, enabling learning signals to focus precisely on the tokens that route information toward (or away from) correct answers and delivering consistent performance gains across a range of reasoning tasks.

[NLP-42] Small Data Big Noise: Adversarial Training for Robust Parameter-Efficient Fine-Tuning ACL2026

【速读】：该论文旨在解决参数高效微调（Parameter-Efficient Fine-Tuning, PEFT）在低资源场景下对噪声敏感及训练数据有限时性能退化的关键问题。其核心解决方案是提出一种统一框架SDBN（Small Data Big Noise），首次将对抗训练引入PEFT范式，通过利用对抗训练与PEFT之间互补的优势，显著提升模型在噪声环境下的鲁棒性与泛化能力。SDBN的关键创新在于不增加额外参数或仅带来极小计算开销的前提下，通过构建离散不确定性集实现高效优化：其中SDBN-h采用字符级扰动枚举并结合梯度选择最坏情况样本，SDBN-p则借助大语言模型（LLM）生成变体以支持生成任务中的鲁棒优化。实验结果表明，该方法在多个基准测试中均取得显著性能提升，尤其在低资源设置及词级/字符级噪声干扰下表现突出，有效填补了对抗训练与参数高效适配交叉研究的空白，增强了PEFT在真实世界中数据稀缺与语言多样性共存场景下的可靠性。

链接: https://arxiv.org/abs/2606.10610
作者: Eitan Cohen,Idan Simai,Uri Shaham
机构: Bar-Ilan University (巴伊兰大学)
类目: Computation and Language (cs.CL)
备注: Accepted to Findings of ACL 2026

点击查看摘要

Abstract:Parameter-Efficient Fine-Tuning (PEFT) has become essential for adapting foundation models to downstream NLP tasks. However, current PEFT methods often struggle with robustness to noise and performance degradation on limited training data. We propose SDBN (Small Data Big Noise), a unified framework that brings adversarial training to PEFT - a combination that remains less studied in the PEFT setting despite its complementary strengths - to enhance model robustness and generalization, outperforming alternative approaches. We also introduce two variants of the method that use discrete uncertainty sets: SDBN-h, which enumerates character-level edits and selects worst-case variants using gradients, and SDBN-p, which uses LLM-generated variants for robust optimization in generative tasks. Experiments across multiple benchmarks reveal substantial improvements, particularly in low-resource settings and under both word-level and character-level corruptions. This framework addresses the less explored intersection of adversarial training and parameter-efficient adaptation, without introducing additional parameters or only modest computational overhead, making PEFT deployments more reliable in real-world scenarios where data scarcity and linguistic variability often coexist

[NLP-43] Causal Ensemble Agent : Hierarchical Causal Discovery with LLM -guided Expert Reweighting

【速读】：该论文旨在解决因果发现（Causal Discovery）中不同算法输出结果不一致、难以确定准确因果图的问题。现有方法多依赖数值统计假设，忽视了如特征描述等丰富的领域知识，而基于大语言模型（Large Language Models, LLMs）的直接查询方法又因与实际数据缺乏对齐而导致可靠性不足。为此，本文提出一种名为因果集成代理（Causal Ensemble Agent, CEA）的新框架，其核心在于通过线性意见聚合（linear opinion pooling）整合来自不同图层级上统计发现专家的结构洞见，并引入一个以LLM作为元裁判（meta-referee）的动态重加权机制，在集成置信度接近决策边界时进行自适应调整，从而生成更完整且可靠的因果图。实验表明，CEA在合成与真实世界数据集上均显著优于多种主流因果发现方法，验证了利用LLM进行元分析在因果发现中的有效性。

链接: https://arxiv.org/abs/2606.10607
作者: Xinyu Li,Yuanyuan Wang,Haoxuan Li,Chuan Zhou,Erdun Gao,Bo Han,Tongliang Liu,Kun Zhang,Howard Bondell,Mingming Gong
机构: The University of Melbourne; MBZUAI; Peking University; Adelaide University; Hong Kong Baptist University; The University of Sydney; Carnegie Mellon University
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Causal discovery aims to uncover causal structures from observational data, which is crucial for real-world decision-making. However, different causal discovery algorithms can produce divergent results that conflict with each other, complicating the identification of accurate causal graphs. Traditional approaches rely on numerical values and statistical assumptions, often ignoring rich domain-specific information, such as feature descriptions, which could also help structure learning. While recent works explore using Large Language Models (LLMs) to infer causal relations via direct queries, such methods can be unreliable due to a lack of alignment with the actual data. To address these limitations, we propose Causal Ensemble Agent (CEA), a novel framework that aggregates structural insights from statistical discovery experts across different graph levels via linear opinion pooling, and uses an LLM as a meta-referee to dynamically reweight experts when the aggregated confidence is close to the decision boundary, thereby composing an improved and more complete causal graph. Extensive experiments on both synthetic and real-world datasets demonstrate that CEA achieves the strongest overall performance across a wide range of causal discovery methods, highlighting the effectiveness of using LLMs for meta-analysis in causal discovery.

[NLP-44] ParaBridge: Bridging Paralinguistic Perception and Dialogue Behavior in Speech Language Models

【速读】：该论文旨在解决当前语音语言模型（Speech Language Models, SLMs）在开放域对话中虽能识别非语言性线索（paralinguistic cues），却常忽略这些线索导致感知与行为脱节的问题。其核心挑战在于如何使模型在多轮对话和存在冲突指令的复杂情境下，稳定地利用非词汇性信息（如语调、情绪、背景噪音等）生成恰当回应。解决方案的关键是提出一种基于策略的自蒸馏方法 ParaBridge：在训练阶段，通过引入临时的、带有非语言性提示的“特权视图”作为密集的全词汇量下一个词预测目标，指导模型学习何时及如何响应非语言线索；而无需依赖人工标注、精心设计的对话数据或外部奖励模型。该方法使模型在移除提示后仍能保持对非语言线索的敏感性，显著提升了安全性和共情性任务的表现（如VoxSafeBench SAR从14.6%提升至40.3%，EchoMind评分从3.27升至3.92），同时保持了原始模型在通用能力上的性能，具备良好的泛化能力与迁移性。

链接: https://arxiv.org/abs/2606.10581
作者: Yuxiang Wang,Qinke Ni,Shengbo Cai,Wan Lin,Liqiang Zhang,Zhizheng Wu
机构: The Chinese University of Hong Kong, Shenzhen; Tencent Hunyuan; Shenzhen Loop Area Institute; Amphion Technology Co., Ltd.; Tsinghua University
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Speech carries more information than just words: a child’s voice, a fearful tone, or a noisy background should all lead a sufficiently competent spoken-dialogue assistant to different replies. Current Speech Language Models (SLMs) can recognize such paralinguistic cues but often ignore them in open-ended dialogue. We observe that a simple paralinguistic instruction scaffold at the inference stage narrows this perception-behavior gap, suggesting that the relevant cues are already latent in the model. Such scaffolds, however, remain brittle under multi-turn context and competing instructions. Therefore, we propose \textbfParaBridge, an on-policy self-distillation method that turns a brittle inference-time scaffold into stable model behavior. During training, the scaffold serves only as a temporary privileged view; the scaffold-free model rolls out its own response, while the scaffolded view supplies dense, full-vocabulary next-token targets along its trajectory. This supervision teaches when non-lexical cues should affect the reply without the need for curated dialogues, human labels, or external reward models. On Qwen3-Omni-thinking, ParaBridge raises scaffold-free VoxSafeBench SAR from 14.6% to 40.3% and improves EchoMind average rating from 3.27 to 3.92 . It also preserves general ability, with MMAU-Pro, VoiceBench, and GPQA all within 0.4 points of the original model. Beyond the training distribution, ParaBridge generalizes to unseen paralinguistic cues, transfers from safety-oriented training to empathy-oriented dialogue, and works on a different SLM backbone.

[NLP-45] Hidden Consensus:Preference-Validity Compression in Human Feedback

【速读】：该论文旨在解决标准强化学习人类反馈（Reinforcement Learning from Human Feedback, RLHF）范式在多元社会中因将异质性人类判断简化为单一标量奖励目标而引发的对齐测量偏差问题。其核心挑战在于，当前方法忽视了在文化、历史、语言、区域或规范层面具有正当性的多种合理回应之间的差异，导致“偏好-有效性压缩”（Preference-Validity Compression）——即多个合法有效的响应选项被强制合并为单一优化目标，从而扭曲了真实对齐状态的评估。研究以马来西亚为诊断场景，通过分析321个跨解释框架的偏好事件（包含提示、回应与可接受性判断），发现79%的提示存在多个获得多数支持的可接受回应，而这些被单胜者聚合策略舍弃的回应实际上反映了连贯的地方性、实践性或文化性认知框架。当所有多数支持的选项被纳入考量时，原本看似显著的顶级回应间优势差距明显缩小。研究进一步表明，参与者常选择多个可接受回应，且被剔除的回应具备明确的合理性基础。因此，该研究提出将此问题视为测量有效性（measurement validity）问题，并主张未来对齐方法应满足“有效性保持一致性”（Validity-Preserving Consistency），即在不同有效解释框架之间保持稳定性，而非将多元有效性压缩为单一奖励信号。

链接: https://arxiv.org/abs/2606.10569
作者: Dorcas Chia Ern Chua,Karen Myn Hui Lee,Jia Yue Tan,Zhen Xue Gue,Norzalena Abdul Hamid,Azima Binti Azmi,Keat Mei Yeong,Aizat Izyani binti Mujab,Hafsah Noor Azam,Chee Guo Khoo,Han Ying Lim,Chee Seng Chan
机构: YTL AI Labs( YTL AI 实验室); Universiti Malaya(马来西亚大学); Monash University Malaysia(莫纳什大学马来西亚分校); Universiti Malaysia Sarawak(马来西亚沙捞越大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 28 pages. When AI learns from human feedback, it forces a single “correct” answer, but sometimes multiple answers are all genuinely valid, and that nuance gets thrown away

点击查看摘要

Abstract:Standard RLHF pipelines often reduce heterogeneous human judgments into a single scalar reward target. We argue that this reduction can mis-measure alignment in structurally plural societies, where disagreement may reflect culturally, historically, linguistically, regionally, or normatively grounded interpretations rather than annotation noise. We call this failure Preference-Validity Compression, the collapse of multiple plural-valid response options into a single optimization target. Using Malaysia as a diagnostic setting, we analyze RLHF-style feedback aggregation through preference events linking prompts, responses, and acceptability judgments across interpretive frames. Across 321 preference events from 20 participants and 107 trio-annotated prompts, 79% of prompts contain more than one majority-supported response that single-winner aggregation would discard, and apparent dominance gaps between top responses diminish when all majority-supported options are considered. Participants frequently select multiple acceptable responses, and discarded responses demonstrably reflect coherent local, practical, or cultural frames. These findings show that majority aggregation in this corpus measures argmax acceptability rather than plural alignment. We treat this as a measurement-validity issue and argue that future alignment methods should satisfy Validity-Preserving Consistency, remaining stable across plural-valid interpretive frames rather than collapsing them into a single reward target.

[NLP-46] Benchmarking Knowledge Editing using Logical Rules

【速读】：该论文旨在解决现有知识编辑技术在评估过程中忽视事实编辑逻辑后果的问题。当前的知识编辑基准主要关注编辑后事实的直接召回能力，而忽略了编辑所引发的隐含推理关系，导致对模型真实知识更新效果的评估不全面。为此，论文提出一种新型评估基准，其核心在于从知识图谱中提取与特定编辑相关的逻辑规则，并基于这些规则生成多跳推理问题，以检验编辑操作对逻辑推论链条的影响。该解决方案的关键在于构建一个语义感知的评估框架，能够系统性地检测知识编辑方法在注入直接事实之外是否能正确传递蕴含知识。实验结果表明，尽管主流方法如ROME和微调（FT）在直接事实插入上表现良好，但在处理蕴含知识时性能显著下降，最大差距达24%，凸显了现有方法在语义一致性与逻辑连贯性方面的不足，强调了发展面向语义的评估体系在知识编辑研究中的必要性。

链接: https://arxiv.org/abs/2606.10554
作者: Tatiana Moteu Ngoli,NDah Jean Kouagou,Hamada M. Zahera,Axel-Cyrille Ngonga Ngomo
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at the 24th International Semantic Web Conference 2025

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly deployed in real-world applications that require access to up-to-date knowledge. However, retraining LLMs is computationally expensive. Therefore, knowledge editing techniques are crucial for maintaining current information and correcting erroneous assertions within pre-trained models. Current benchmarks for knowledge editing primarily focus on recalling edited facts, often neglecting their logical consequences. To address this limitation, we introduce a new benchmark designed to evaluate how knowledge editing methods handle the logical consequences of a single fact edit. Our benchmark extracts relevant logical rules from a knowledge graph for a given edit. Then, it generates multi-hop questions based on these rules to assess the impact on logical consequences. Our findings indicate that while existing knowledge editing approaches can accurately insert direct assertions into LLMs, they frequently fail to inject entailed knowledge. Specifically, experiments with popular methods like ROME and FT reveal a substantial performance gap, up to 24%, between evaluations on directly edited knowledge and on entailed knowledge. This highlights the critical need for semantics-aware evaluation frameworks in knowledge editing.

[NLP-47] Prefilling-dLLM : Predictive Prefilling for Long-Context Inference in Diffusion Language Models

【速读】：该论文旨在解决扩散型大语言模型（diffusion large language models, dLLMs）在长文本推理过程中因每一步去噪均需重新编码整个前缀而导致的计算开销急剧上升问题，其核心瓶颈在于计算复杂度随上下文长度呈二次方增长，严重制约了长上下文场景下的应用。为此，论文提出了一种无需训练的“预填充-解码分离”框架Prefilling-dLLM，其关键创新在于将输入前缀划分为N个块（chunks），一次性缓存各块的键值（KV）表示，并在解码阶段通过块内标记稀疏性筛选出最相关的前K个块进行高效注意力计算。该方法将每步计算复杂度从与完整序列长度的二次关系降低为仅与解码长度的二次关系，显著提升了长序列处理效率。此外，研究发现为每个块添加起始位置标记可作为周期性注意力锚点，有效缓解了“中间遗忘”（lost-in-the-middle）现象。实验表明，该方法在LongBench和InfiniteBench基准上达到当前dLLM加速方案的最优性能，且基于非连续缓存块KV并行化解码的注意力核在8K–32K上下文长度下实现9.1–28.0倍的速度提升。

链接: https://arxiv.org/abs/2606.10537
作者: Jing Xiong,Qi Han,Shansan Gong,Yunta Hsieh,Chengyue Wu,Chaofan Tao,Chenyang Zhao,Ngai Wong
机构: The University of Hong Kong (香港大学); University of Michigan, Ann Arbor (密歇根大学安娜堡分校); LMSYS Org (LMSYS组织)
类目: Computation and Language (cs.CL)
备注: Technical Report

点击查看摘要

Abstract:Diffusion large language models (dLLMs) re-encode the entire prefix at every denoising step, causing recomputation that scales quadratically with context length and becomes prohibitive for long-context scenarios. We propose Prefilling-dLLM, a training-free prefill-decode disaggregation framework for dLLMs that partitions the prefix into N chunks, caches their KV representations once, and selects the top-K most relevant chunks with intra-chunk token sparsity for decoding, showing that sparse prefilling can outperform dense attention while reducing per-step complexity from quadratic in the full sequence length to quadratic only in the decode length. On LongBench and InfiniteBench, Prefilling-dLLM achieves state-of-the-art quality among dLLM acceleration methods, and an attention kernel that parallelizes decoding over the non-contiguously cached chunk KV yields 9.1–28.0x speedup at 8K–32K contexts. We further show that beginning-of-sequence tokens prepended to each chunk act as periodic attention anchors that eliminate the lost-in-the-middle phenomenon. Code is available at this https URL. Comments: Technical Report Subjects: Computation and Language (cs.CL) Cite as: arXiv:2606.10537 [cs.CL] (or arXiv:2606.10537v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2606.10537 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Jing Xiong [view email] [v1] Tue, 9 Jun 2026 08:06:22 UTC (1,068 KB) Full-text links: Access Paper: View a PDF of the paper titled Prefilling-dLLM: Predictive Prefilling for Long-Context Inference in Diffusion Language Models, by Jing Xiong and 7 other authorsView PDFHTML (experimental)TeX Source view license Current browse context: cs.CL prev | next new | recent | 2026-06 Change to browse by: cs References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked="checked"class=“labs-tab-input”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status

[NLP-48] LC-QAT: Data-Efficient 2-Bit QAT for LLM s via Linear-Constrained Vector Quantization ICML2026

【速读】：该论文旨在解决极低比特（如2比特）大语言模型（LLM）在量化感知训练（QAT）中面临的性能瓶颈问题。现有方法主要依赖标量量化（SQ），虽支持高效优化，但在2比特精度下会出现严重性能下降；而向量量化（VQ）虽具备更强的表示能力，却因离散码本查找机制难以实现端到端可微训练。为突破此限制，论文提出LC-QAT——一种仅对权重进行2比特向量量化的QAT框架，其关键创新在于通过学习的仿射映射（learned affine mapping）将量化权重表示为离散向量的线性组合，从而避免训练前向传播中的显式码本查找，实现完全可微的端到端优化。该设计生成高质量的后训练量化（PTQ）初始化，显著提升数据效率，实验表明LC-QAT在多种大语言模型上均优于当前最优QAT方法，且仅需0.1%–10%的训练数据即可达到领先性能，验证了其在极端低比特部署场景下的实用性与可扩展性。

链接: https://arxiv.org/abs/2606.10531
作者: Haoyu Wang,Xingyu Yu,Haiyan Zhao,Fengxiang Wang,Xu Han
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by ICML 2026

点击查看摘要

Abstract:Quantization-aware training (QAT) is essential for extremely low-bit large language models (LLMs). Current QAT methods are mainly based on scalar quantization (SQ), which enables efficient optimization but suffers from severe performance degradation at 2-bit precision. On the other hand, vector quantization (VQ) provides substantially higher representational capacity, but its discrete codebook lookup prevents end-to-end training. We propose LC-QAT, a 2-bit weight-only VQ-QAT framework that represents quantized weights via a learned affine mapping over discrete vectors, which yields a high-quality PTQ initialization and enables fully differentiable end-to-end optimization without explicit codebook lookup in the training forward pass. This strong post-training initialization makes LC-QAT highly data-efficient. Experiments across diverse LLMs demonstrate that LC-QAT consistently outperforms state-of-the-art QAT methods while using only 0.1%–10% of the training data. Our results establish LC-QAT as a practical and scalable solution for extreme low-bit model deployment.

[NLP-49] Representation-Aware Advantage Estimation: Your Reward Model Provides More Than A Scalar Output

【速读】：该论文旨在解决当前基于人类反馈的强化学习（Reinforcement Learning from Human Feedback, RLHF）方法中依赖标量奖励信号所导致的噪声大、难以捕捉细微偏好差异的问题。现有方法主要使用训练好的奖励模型（Reward Model, RM）输出的标量奖励，但这类信号往往信息有限且不稳定。为提升优势估计（advantage estimation）的质量，论文提出一种表示感知的优势估计（representation-aware advantage estimation），其关键在于利用RM隐藏层状态（hidden states）作为辅助信号，以捕获更丰富的语义与偏好信息。具体而言，作者提出图结构优势估计（Graph-based Advantage Estimation, GraphAE），将每组采样响应建模为一个图结构，其中节点代表不同响应，边则根据它们在RM隐空间中的相似性构建；通过图传播机制计算优势值，使每个样本能够融合其邻居的上下文信息。该方法轻量且可无缝集成至现有的分组式RL算法（如GRPO、GSPO和RLOO）。实验表明，GraphAE在多个模型与基准测试上均实现显著性能提升，分别在Arena-Hard-v0.1、AlpacaEval 2.0和MT-Bench上取得最高+6.3、+8.27和+0.22的改进，验证了利用RM表示进行优势估计能有效提升RLHF的样本效率与鲁棒性。

链接: https://arxiv.org/abs/2606.10528
作者: Guozheng Li,Xiyan Fu,Yiwen Guo
机构: Southeast University (东南大学); Nanyang Technological University (南洋理工大学); Independent Researcher (独立研究者)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Current reinforcement learning from human feedback (RLHF) methods primarily rely on scalar rewards from a trained reward model (RM). While effective, scalar rewards are often noisy and fail to capture fine-grained preference differences, whereas RM hidden states encode richer semantic and preference information. We introduce the representation-aware advantage estimation, which leverages RM hidden states and models them as auxiliary signals for better advantage estimation. Specifically, we propose the Graph-based Advantage Estimation (GraphAE), treat each sampled group as a graph, where nodes correspond to responses and edges capture their similarity in the RM hidden space. Then advantages are computed via graph propagation, enabling each sample to incorporate contextual information from its neighbors. GraphAE is lightweight and can be seamlessly integrated into existing group-based RL algorithms. We apply GraphAE to GRPO, GSPO and RLOO, and conduct extensive experiments on different models and benchmarks. Empirical results show consistent improvements across three benchmarks, with gains of up to + 6.3 on Arena-Hard-v0.1, + 8.27 on AlpacaEval 2.0, and + 0.22 on MT-Bench. These results demonstrate that leveraging RM representations leads to more sample efficient and robust RLHF.

[NLP-50] UniSVQ: 2-bit Unified Scalar-Vector Quantization ICML2026

【速读】：该论文旨在解决大语言模型（LLM）在2比特量化（2-bit quantization）后面临的性能退化与计算开销之间的权衡问题。现有标量量化（Scalar Quantization, SQ）方法虽具备低存储和计算成本的优势，但存在显著的性能损失；而向量量化（Vector Quantization, VQ）虽能保持较高精度，却带来额外的计算与存储开销。为克服上述局限，论文提出UniSVQ——一种统一的2比特量化框架，其核心创新在于将码字（codewords）参数化为整数格点（integer lattices）的仿射变换，从而在保持与优化整数内核兼容性的同时，继承了向量量化在表示灵活性方面的优势。此外，引入数据驱动的分块细调（block-wise fine-tuning）策略，直接最小化量化重构误差，进一步提升模型性能。实验结果表明，UniSVQ在多个大语言模型家族及零样本基准上均显著优于现有先进标量量化方法，并达到与先进向量量化相当的性能水平，同时实现了更高的推理吞吐量。

链接: https://arxiv.org/abs/2606.10520
作者: Haoyu Wang,Haiyan Zhao,Xingyu Yu,Zhangyang Yao,Xu Han,Zhiyuan Liu,Maosong Sun
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted by ICML 2026

点击查看摘要

Abstract:Post-training quantization at the 2-bit level enables low-cost deployment and inference acceleration for large language models (LLMs). Scalar quantization (SQ) and vector quantization (VQ) are two primary quantization methods, however, the former suffers from significant performance degradation, and the latter incurs computational and storage overhead. We propose UniSVQ, a unified 2-bit quantization framework that bridges scalar and vector quantization by parameterizing codewords as an affine transform of integer lattices. This structure preserves compatibility with optimized integer kernels while retaining much of VQ’s flexibility. We further introduce a data-driven block-wise fine-tuning strategy to directly minimize quantization reconstruction error. Extensive experiments across multiple LLM families and zero-shot benchmarks demonstrate that UniSVQ consistently outperforms state-of-the-art SQ methods and achieves performance comparable to advanced VQ methods, while providing higher inference throughput.

[NLP-51] Advancing the State-of-the-Art in Empirical Privacy Auditing

【速读】：该论文旨在解决大语言模型（Large Language Models, LLMs）在参数高效微调过程中对训练样本的过度记忆问题，尤其是由此引发的数据隐私泄露风险。其核心挑战在于如何设计有效的“蜜罐”（canary）样本以准确评估成员推断（Membership Inference, MI）或数据重构攻击下的真实数据泄露程度。为此，论文提出通过高温度采样（T ≥ 0.8）从预训练模型生成合成蜜罐样本，并采用针对敏感训练数据定制的提示词（prompt）以确保蜜罐具备高影响力和强可识别性，从而提升审计灵敏度。由于这些合成蜜罐本身不包含隐私信息，可重复插入而不损害真实数据隐私，因此具有良好的可操作性和安全性。此外，论文进一步将该方法拓展至合成数据生成场景，提出基于辅助模型微调的合成数据隐私审计框架：通过对合成数据微调一个辅助模型，并检测其对原始蜜罐的响应能力，实现对合成数据中隐私泄露风险的强估计。最终，借助所提出的强审计方法，论文系统地分析了模型容量与蜜罐熵之间的交互作用对记忆行为的影响，揭示了关键影响因素及其内在机制。

链接: https://arxiv.org/abs/2606.10481
作者: Nicole Mitchell,Galen Andrew,Arun Ganesh,Brendan McMahan,Peter Kairouz
机构: Google Research (谷歌)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Parameter-efficient fine-tuning of large language models (LLMs) can exhibit problematic memorization of individual training examples. Empirical privacy auditing (EPA) quantifies this risk by measuring realistic data leakage on membership inference (MI) or reconstruction attacks. A key challenge in EPA is designing ``canary’’ examples that are mixed with the privacy-sensitive training data. We propose generating synthetic canaries via high-temperature sampling ( T \geq 0.8 ) from LLMs, using prompts tailored to the privacy-sensitive training data. These canaries act as high-influence outliers, ensuring high identifiability and hence strong audits. Further, since the canaries are themselves non-private, they are inspectable and can be inserted with repetition without jeopardizing the privacy of the real data. An important use of models fine-tuned on privacy-sensitive data is the generation of synthetic data. This also comes with privacy risk. We introduce a powerful synthetic data audit based on fine-tuning an auxiliary model on the synthetic data. Auditing the auxiliary model for the original canaries then provides a strong estimate of the privacy leakage through the synthetic data. Finally, leveraging our strong auditing methodologies, we perform a systematic investigation into the interacting effects of model capacity and canary entropy on memorization.

[NLP-52] Detecting Speculative Language in Biomedical Texts using Recurrent Neural Tensor Networks

【速读】：该论文旨在解决生物医学文献中推测性语言（speculative language）自动检测的问题，其核心挑战在于如何精准识别文本中表达不确定性或假设性的语句，这对信息检索、多文档摘要及新知识发现具有重要意义。解决方案的关键在于利用分布式句子表示（distributed sentence representations）与先进的深度学习模型进行特征提取与分类。研究对比了两种代表性方法——段落向量（Paragraph Vector）模型与递归神经张量网络（Recursive Neural Tensor Network, RNTN），并将其与支持向量机（SVM）、朴素贝叶斯（Naive Bayes）和基于模式匹配的基线算法进行性能评估。实验结果表明，RNTN在捕捉语义结构方面表现更优，取得F1值0.885，略胜于最优基线线性二元组SVM（F1=0.881），而段落向量模型则表现较差（F1=0.368），即使经过大规模无标签数据训练也未能显著提升效果。研究深入分析了性能差异的影响因素，指出深层语义建模能力与上下文结构解析对推测性语言识别至关重要，并为未来研究提供了优化方向。

链接: https://arxiv.org/abs/2606.10471
作者: Dhruv Dixit
机构: Stevens Institute of Technology (斯蒂文斯理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 12 Pages

点击查看摘要

Abstract:In this investigation, we delve into the automated detection of speculative language within biomedical articles by utilizing distributed sentence representations and advanced deep learning techniques. The implications of such identification extend to information retrieval, multi-document summarization, and the exploration of new knowledge. Our exploration encompasses two distinct approaches for acquiring distributed sentence representations: the Paragraph Vector model and the Recursive Neural Tensor Network. These methodologies are then rigorously compared against three foundational baseline algorithms: Support Vector Machines, Naive Bayes, and pattern matching. Our findings reveal that the Recursive Neural Tensor Network (RNTN) demonstrates a slight performance edge (F1 = 0.885) over the top-performing baseline, the linear bigram SVM (F1 = 0.881). Meanwhile, the Paragraph Vector model proves less effective (F1 = 0.368), even after extensive training using an expansive, unlabeled dataset. We engage in a comprehensive discourse on the factors influencing these performance disparities and provide insightful recommendations for future research directions.

[NLP-53] Large Language Models as Modal Models in Linguistics

【速读】：该论文旨在解决大语言模型（Large Language Models, LLMs）在语言学理论中的解释力争议问题，尤其针对当前围绕其科学价值的三种立场——隔离主义（insulationism）、消解主义（eliminativism）与调和主义（conciliationism）——所存在的模糊性与分歧。其解决方案的关键在于引入科学哲学中的模态建模（modal modeling）框架，将大语言模型视为具有真实认识论价值的“最小模型”（minimal models），即便其结构不对应于人类认知机制。论文进一步区分了“可能如何”解释（how-possibly explanations, HPEs）与“实际如何”解释（how-actually explanations, HAEs），指出当前大语言模型尚不具备作为人类语言实际机制解释（HAEs）的充分条件，但已能通过检验语言习得与语言能力的模态假设，提供有效的可能性解释（HPEs）。因此，作者提出应将大语言模型的解释力理解为介于HPEs与HAEs之间的连续谱系，从而避免对其科学意义的夸大或低估，为评估大语言模型在语言科学研究中的作用提供了更为精确且平衡的理论基础。

链接: https://arxiv.org/abs/2606.10467
作者: Haruto Suzuki,Saku Sugawara
机构: Keio University(庆应义塾大学); National Institute of Informatics(日本信息研究所); University of Tokyo(东京大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The rapid advancement of large language models (LLMs) has intensified debates about their significance for linguistic theory. These debates are commonly divided into three positions: insulationism, which regards LLMs as irrelevant to human language; eliminativism, which claims that LLMs can replace traditional linguistic theories; and conciliationism, which views them as useful tools for linguistic research. To clarify these positions, this paper applies the framework of modal modeling from the philosophy of science. We argue that LLMs possess genuine epistemic value as minimal models, even without structural correspondence to human cognition. In particular, they can provide how-possibly explanations (HPEs) by testing modal claims about language acquisition and linguistic competence. We then examine the conditions under which LLMs could qualify as how-actually explanations (HAEs) of human language, drawing on the mechanistic account of scientific explanation. We argue that current LLMs do not yet satisfy these requirements. On the basis of this analysis, we propose understanding the explanatory power of LLMs as lying on a continuum between HPEs and HAEs. This framework avoids both overstating and understating their explanatory significance and offers a more precise basis for evaluating the role of LLMs in the scientific study of language.

[NLP-54] ERAlign: Energy-based Representation Alignment of GNNs and LLM s on Text-attributed Graphs ICML2026

【速读】：该论文旨在解决文本属性图（Text-attributed Graphs, TAGs）中图结构与文本信息融合时表征对齐不充分的问题。现有基于图神经网络（GNN）与大语言模型（LLM）结合的方法多依赖启发式粗粒度匹配策略，缺乏足够的约束机制且忽略分布一致性，导致表征漂移（representation drift）和泛化能力受限。其解决方案的关键在于提出一种基于能量模型（Energy-based Models, EBMs）的表征对齐框架——ERAlign，通过将GNN编码的图结构嵌入与LLM生成的文本嵌入映射至共享潜在空间，实现两者的分布一致性。具体而言，采用分层对齐机制，以距离度量量化各层表示间的差异，并通过能量函数目标进行优化；通过降低能量值，促使两类表征在语义上紧密对齐。为避免不可解析归一化带来的高采样成本，引入能量差异（Energy Discrepancy, ED），不仅显著提升训练效率，还具备理论保障的能量景观畸变抑制能力。在八个TAG数据集上的实证结果表明，ERAlign在不同监督水平及跨任务迁移场景下均达到当前最优性能。

链接: https://arxiv.org/abs/2606.10461
作者: Xianlin Zeng,Fan Xia,Xiangyu Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted to ICML 2026

点击查看摘要

Abstract:Text-attributed Graphs (TAGs) incorporate textual node attributes with graph structures to describe rich relational semantics. Recent efforts to integrate Graph Neural Networks (GNNs) and Large Language Models (LLMs) have shown promise for learning on TAGs, yet achieving well-aligned representations remains challenging. Prior studies largely rely on heuristics that perform coarse-grained matching. They lack sufficient constraints and ignore distributional alignment, leading to representation drift and limited generalization. Building on Energy-based Models (EBMs), we propose an Energy-based Representation Alignment (ERAlign) framework that projects GNN-encoded graph structure and LLM-derived text embeddings in a shared latent space to achieve distribution consistency. Concretely, layer-wise alignment is quantified by a distance metric and optimized via an EBM objective. By decreasing energy values, our framework yields well-aligned representations for downstream tasks. During training, we introduce Energy Discrepancy (ED) to avoid high sampling costs associated with intractable normalization. ED also carries theoretical guarantees of higher training efficiency and reduced energy landscape distortion. Empirical evaluations on eight TAG datasets demonstrate that ERAlign obtains state-of-the-art performance across varying levels of supervision and cross-task transfer scenarios.

[NLP-55] LakeQA: An Exploratory QA Benchmark over a Million-Scale Data Lake

【速读】：该论文旨在解决现实世界中开放域问答（QA）任务面临的两大核心挑战：一是问题通常缺乏显式或可直接获取的证据文档，真实可用的信息隐藏在海量异构数据湖（data lakes）中；二是现有基准测试难以全面评估模型在大规模数据中进行检索与多跳推理相结合的能力。其解决方案的关键在于提出一个名为LakeQA的综合性基准，专门针对以搜索为中心的问答任务，强调模型在复杂数据环境中同时具备高效检索和深度推理的能力。LakeQA基于约9.5 TB的异构文本资源构建，涵盖维基百科及开源政府数据，包含结构化与非结构化数据，并通过至少一位博士级别专家的严格标注确保任务质量。每个任务要求长时程多跳推理，涉及隐式中间步骤——即模型需自主发现相关文档，并跨源整合证据生成答案。实验结果表明，即使是前沿大语言模型（LLM），如GPT-5.2，在该基准上的精确匹配得分也仅为18.37%，凸显了其挑战性。因此，LakeQA为开发能够有效在现代数据湖中搜索并分析信息的生成式智能体（generative agents）提供了现实且严谨的评估平台。

链接: https://arxiv.org/abs/2606.10460
作者: Haonan Wang,Jiaxiang Liu,Yurong Liu,Austin Senna Wijaya,Tianle Zhou,Eden Wu,Yijia Chen,Wanting You,Reya Vir,Daniela Pinto,Grace Fan,Yusen Zhang,Juliana Freire,Eugene Wu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent large language models (LLMs) have shown rapid progress in reading-based question answering (QA), where evidence is explicitly provided or can be trivially retrieved. In contrast, real-world questions are often not paired with accurate evidence documents. The useful evidence resides in massive data lakes, making search a prerequisite for answering. However, there is a lack of comprehensive benchmarks that require both searching and reasoning over large data lakes. To this end, we introduce LakeQA, a comprehensive benchmark for search-centric question answering over data lakes that jointly emphasizes searching and reasoning capabilities. LakeQA is built on a heterogeneous collection of approximately 9.5 TB of text resources from Wikipedia and open-source government data, spanning structured and unstructured data. To ensure task quality, each sample is annotated by at least one Ph.D.-level expert. Each task requires long-horizon multi-hop reasoning with implicit intermediate steps: agents need to discover the correct documents and then compose evidence across sources to produce the answer. Experimental results on seven frontier LLMs demonstrate that LakeQA is challenging. For instance, GPT-5.2 achieves only an exact-match score of 18.37% on LakeQA. Overall, LakeQA provides a realistic testbed for developing LLM agents that can both find and analyze data in modern data lakes.

[NLP-56] Leverag ing Social Media Data for COVID-19 Studies

【速读】：该论文旨在解决在新冠疫情（COVID-19）期间，海量社交平台信息对公众认知与行为产生深远影响所带来的信息过载与虚假信息传播问题。其核心挑战在于如何从非结构化、多模态的社交媒体数据中有效识别和分析用户生成内容（User-Generated Content, UGC）中的语言特征、视觉线索及情感倾向，以揭示公众情绪动态与信息传播模式。解决方案的关键在于综合运用自然语言处理（Natural Language Processing, NLP）、机器学习（Machine Learning, ML）与特征工程方法，对社交平台文本、图像及情感表达进行系统性建模与分类，从而实现对疫情相关舆论态势的精准捕捉与量化分析，并为后续公共卫生干预与可信信息传播提供数据支持与技术路径。

链接: https://arxiv.org/abs/2606.10459
作者: Nur Hafieza Ismail,Nur Shazwani Kamarudin,Nurol Husna Che Rose
机构: 未知
类目: ocial and Information Networks (cs.SI); Computation and Language (cs.CL)
备注: 8 pages, 1 figure

点击查看摘要

Abstract:Nowadays, social media networks have become widely preferred sources of information. Especially during the time of the Coronavirus disease 2019 COVID 19 pandemic, social media has been one of the most used platforms to get the latest news and information related to COVID 19. Social media are popular because they offer free access to their registered users and allow them to do posting, disseminate information, and respond to others postings. With almost 4.6 billion social media users worldwide, it is not surprising the significant amount of information shared through these platforms could affect how people perceive and cope with the pandemic that we are facing right now. With decent use, social media can be a beneficial digital tool to spread reliable news and public awareness for patients, clinicians, and society. Specifically, this chapter describes linguistic, visual, and emotional indicators expressed in user disclosures. Thus, in this chapter, the related studies of social media platforms usage during the COVID 19 pandemic are explored and discussed in detail. This chapter also categorizes social media data used, introduces different deployed machine learning, feature engineering, natural language processing, and survey methods, and outlines directions for future research.

[NLP-57] SpenseGPT : Practical One-shot Pruning Enabling Sparse and Dense GEMMs for LLM Inference

【速读】：该论文旨在解决半结构化稀疏（semi-structured sparsity）在后训练剪枝（post-training pruning）中因严格50%稀疏性约束导致的显著精度下降问题，同时克服现有松弛稀疏格式对专用编译器支持或运行时开销的依赖。其解决方案的关键在于提出一种实用的混合稀疏-稠密格式 Spense，将每个权重矩阵划分为2:4稀疏区域与稠密区域，从而在保持与现有高性能稀疏和稠密GEMM库兼容的前提下，有效放宽实际稀疏度约束。在此基础上，设计了SpenseGPT这一一次性后训练剪枝方法，通过合理选择稠密区域以平衡性能与精度，并提出了两种不同的稠密区域选取策略。实验表明，在B200 GPU上采用FP8精度时，该方法实现了高达1.2倍的端到端解码速度提升，且未牺牲模型质量，是首个在如B200等现代GPU上实现真实场景下大语言模型（LLM）解码速度提升的同时维持模型性能的半结构化稀疏张量核心应用实例。

链接: https://arxiv.org/abs/2606.10445
作者: Jaeseong Lee,Seung-won Hwang,Samyam Rajbhandari
机构: Snowflake AI Research(Snowflake人工智能研究); Seoul National University(首尔国立大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Semi-structured 2:4 sparsity is widely supported by modern accelerators, providing up to a 2x theoretical speedup. However, its strict 50% sparsity constraint often causes non-negligible accuracy degradation under post-training pruning. Meanwhile, existing relaxed sparsity formats either require specialized compiler support or introduce runtime overheads that limit end-to-end speedup. We propose Spense, a practical hybrid sparse-dense format that splits each weight matrix into a 2:4 sparse region and a dense region. This design relaxes the effective sparsity constraint while remaining compatible with existing high-performance sparse and dense GEMM libraries, avoiding both custom compiler support and input activation expansion. Building on this format, we introduce SpenseGPT, a one-shot post-training pruning method that produces sparse and dense regions. Notably, we show that selecting the right dense regions is important, and we devise two different strategies to choose them. Experiments on Qwen3-32B and Seed-OSS-36B demonstrate that our method achieves up to 1.2x end-to-end decoding speedup on B200 GPUs with FP8 precision, while preserving accuracy. To the best of our knowledge, this is the first one-shot pruning demonstration of real-world end-to-end LLM decoding speedup from semi-structured sparse tensor cores on recent GPUs such as B200s, while maintaining model quality.

[NLP-58] Enhancing Multilingual LLM -based ASR with Mixture of Experts and Dynamic Downsampling ICASSP2026

【速读】：该论文旨在解决大语言模型（Large Language Models, LLMs）与自动语音识别（Automatic Speech Recognition, ASR）系统融合过程中面临的多语言泛化能力不足和模态对齐困难两大核心挑战。其解决方案的关键在于提出一种基于投影器（projector-based）的LLM-ASR框架，通过引入专家混合（Mixture of Experts, MoE）架构增强跨语言适应性，并采用连续积分-放电（Continuous Integrate-and-Fire, CIF）机制实现动态下采样与模态对齐。实验结果表明，该框架在多项指标上显著优于现有强基准模型，为构建更准确、鲁棒且具备广泛泛化能力的基于大语言模型的语音识别系统提供了有效路径。

链接: https://arxiv.org/abs/2606.10439
作者: Guodong Lin,Ziqi Chen,Yuxiang Fu,Ke Li,Wei-Qiang Zhang
机构: 未知
类目: ound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Accepted by ICASSP 2026

点击查看摘要

Abstract:The rapid progress of large language models (LLMs) has opened up a new frontier for automatic speech recognition (ASR), making their effective integration a critical and challenging research direction. To this end, this work proposes a projector-based LLM-ASR framework targeting the key challenges of multilingual generalization and modality alignment. Our approach incorporates a Mixture of Experts (MoE) architecture to improve cross-lingual adaptability, and a Continuous Integrate-and-Fire (CIF) mechanism for dynamic downsampling and modality alignment. Experimental results show that the combination of these components yields substantial performance improvements, surpassing strong baseline models. The proposed method represents a step toward building more accurate, robust, and generalizable LLM-based ASR systems.

[NLP-59] Parallel Causal Associative Fields: Gated Sparse Memory for Long-Context Language Modeling

【速读】：该论文旨在解决自回归语言建模中因因果自注意力（causal self-attention）机制导致的计算复杂度随上下文长度呈二次方增长的问题。现有方法如循环神经网络和状态空间模型虽能降低计算开销，但将历史信息压缩为单一可序列更新的固定尺寸状态，限制了长程依赖的表达能力。本文提出一种新的并行内容寻址记忆机制——并行因果关联场（Parallel Causal Associative Field, PCAF），其核心创新在于构建一个基于因果后继记录的并行内容寻址内存结构：通过哈希桶将上下文窗口内的局部记录写入，对当前查询检索有限候选集，形成对后续词元的稀疏缓存分布，并利用可学习门控机制将该缓存与参数化局部语言模型融合。该方案在保持稀疏长上下文访问能力的同时，避免了传统递归模型中固定状态的瓶颈问题。实验表明，在303M参数、上下文长度T=2048的条件下，PCAF-semantic在WikiText-103上达到36.31的困惑度（优于匹配的密集型Transformer的47.49），在PG-19上为52.45（优于基准的53.84），且在谷歌云TPU v4-32集群上实现0.61–0.62M tokens/s的处理速度，显著高于密集型与局部注意力基线的0.43M tokens/s。多种子超参数搜索与单GPU组件消融实验进一步验证了关联缓存、检索容量及可学习门控对速度-质量权衡的关键影响。

链接: https://arxiv.org/abs/2606.10435
作者: Muhammad Ahmed
机构: Google(谷歌)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 17 pages, 5 figures, and 6 tables. Experiments on WikiText-103, PG-19, and WikiText-2 using TPU v4-32 and NVIDIA RTX 3060 hardware. Code: this https URL

点击查看摘要

Abstract:Transformers achieve strong language modeling performance by providing direct token-to-token communication paths, but causal self-attention scales quadratically with context length. Recurrent and state-space models reduce this cost, yet compress history into sequentially updated fixed-size states. This paper studies a third primitive: a parallel content-addressed memory over causal successor records. The proposed Parallel Causal Associative Field (PCAF) writes local records from a context window into hash buckets, retrieves a bounded candidate set for the current query, forms a sparse cache distribution over successor tokens, and mixes that cache with a parametric local language model through a learned gate. The resulting model maintains sparse long-context access while avoiding a single fixed recurrent state bottleneck. We evaluate PCAF under full autoregressive pretraining on WikiText-103 and PG-19 using a distributed Google Cloud TPU v4-32 pod. At 303M parameters and context length T = 2048, PCAF-semantic reaches 36.31 perplexity on WikiText-103 and 52.45 perplexity on PG-19, compared with 47.49 and 53.84 for a matched dense Transformer. PCAF-semantic simultaneously processes 0.61-0.62M tokens/s across the TPU pod, versus 0.43M tokens/s for dense and local attention baselines. Supporting 41M-parameter multi-seed sweeps and single-GPU component ablations show that the associative cache, retrieval capacity, and learned gate materially affect the speed-quality trade-off.

[NLP-60] Which LoRA? An Empirical Study on the Effectiveness of LoRA Techniques During Multilingual Instruction Tuning

【速读】：该论文旨在解决在多语言指令微调（multilingual instruction tuning）场景下，相较于基础LoRA，现有常见LoRA变体是否能带来性能优势的问题。研究发现，在多种目标语言的两个数据集上的实验表明，更复杂的LoRA变体在平衡跨语言迁移能力与知识保留方面，并未显著优于基础LoRA。其关键在于对隐藏层嵌入（hidden embeddings）的分析显示，采用不同LoRA技术微调的大语言模型（LLM）在层间语言表征上保持高度相似，暗示LoRA架构上的创新可能无法有效转化为更优的跨语言适应性。因此，该研究的核心结论是：当前主流的LoRA变体在多语言任务中的改进有限，其设计复杂性未必带来实际性能提升。

链接: https://arxiv.org/abs/2606.10428
作者: Thamali Wijewardhana,Napoleon H. Reyes,Surangika Ranathunga
机构: Massey University (梅西大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We investigate whether commonly available LoRA variants have an advantage over basic LoRA in multilingual instruction tuning. Experiments involving LoRA and four other variants on two datasets across diverse target languages show that there is no significant advantage in using more complex LoRA variants instead of basic LoRA, with respect to balancing cross-lingual transfer and knowledge retention. An analysis of hidden embeddings reveal that layer-wise language representation remains largely similar across LLMs fine-tuned with different LoRA techniques, suggesting that architectural novelty of LoRA techniques may not translate into better cross-lingual adaptation.

[NLP-61] WebChallenger: A Reliable and Efficient Generalist Web Agent

【速读】：该论文旨在解决大语言模型（LLM）代理在自主网页导航任务中面临的效率与成本瓶颈问题，其核心挑战并非模型能力不足，而是现有代理架构未能有效模拟人类的三项认知优势：对页面关键区域的注意力选择、对网站结构的持续记忆以及对常见交互模式的程序化熟练度。为此，论文提出WebChallenger框架，其关键在于通过架构创新而非依赖模型规模来弥补上述差距。该框架以PageMem为核心——一种从文档对象模型（DOM）确定性构建的结构化页面表示，将页面组织为具有短摘要的语义层级结构。在此统一表征基础上，系统设计了三大机制：分而治之的观察流水线，使代理仅在任务相关区域深入提取细节；轻量级探索与记忆系统，一次性遍历网站以构建可复用的页面与元素行为地图；以及复合动作工作流，将常见的多步操作压缩为单一代理动作，并自动处理部分状态变更。由于所有机制均基于PageMem运行，系统具备跨网站泛化能力，无需针对特定站点进行适配。实验表明，使用现成的开源权重模型且不进行微调，该系统在WebArena、VisualWebArena、Online-Mind2Web和WorkArena上的表现分别达到56.3%、48.7%、51.0%和70.9%，接近前沿专有系统的性能，同时推理成本仅为后者的极小部分。

链接: https://arxiv.org/abs/2606.10423
作者: Jayoo Hwang,Xiaowen Zhang,Vedant Padwal
机构: ML Collective; longsurf.ai
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Autonomous web navigation remains challenging for LLM agents, and the strongest generalist systems rely on proprietary reasoning models whose inference cost is prohibitive for the repetitive tasks where such agents would be most useful. We argue this gap stems not from insufficient model capability but from agent architectures that fail to replicate three human cognitive advantages: selective attention to relevant page regions, persistent memory of website structure, and procedural fluency with common interaction patterns. We introduce WebChallenger, a web agent framework that addresses each gap through architecture design rather than model scale, built around PageMem: a structured page representation deterministically constructed from the DOM that exposes each page as a hierarchy of semantic sections with short summaries. On this shared substrate we build three mechanisms that mirror the three cognitive advantages: a divide-and-conquer observation pipeline that lets the agent skim section summaries and extract details only from task-relevant regions; a lightweight exploration and memory system that traverses each website once to build a reusable map of pages and element behaviors; and compound action workflows that collapse common multi-step interactions into single agent actions, handling partial state changes automatically. Because all three operate over PageMem, the framework generalizes across websites without site-specific adapters. Using off-the-shelf open-weight models without fine-tuning, our system achieves 56.3% on WebArena, 48.7% on VisualWebArena, 51.0% on Online-Mind2Web, and 70.9% on WorkArena, approaching frontier proprietary systems at a fraction of the cost. Our code is released at this https URL

[NLP-62] KCSAT-ML: Probing Reasoning Models with Nationwide-Cohort Human Difficulty

【速读】：该论文旨在解决当前数学推理基准测试普遍缺乏基于真实人类表现的个体题目难度信号这一关键问题。现有基准多依赖主观设计或通用评估标准，难以反映实际考生在特定题目上的认知挑战程度。为此，研究提出KCSAT-ML，涵盖2014至2025年韩国大学学力水平考试（KCSAT；Suneung）数学部分共664道题目，其中包含339道核心题目的官方错误率数据，源自数十万考生的全国性实测结果，为每道题提供了可量化的、真实的人类表现难度指标。其解决方案的核心在于引入“难度对齐推理增益”（Difficulty-aligned Reasoning Gain, DRG）这一得分正交的评估指标，通过分析模型错误是否集中于人类认为困难的题目，而非人类容易出错的题目，从而揭示模型与人类在推理模式上的对齐程度。该框架揭示了三类重要现象：(i) 低预算模型在所有规模下均在人类高错误率尾部出现准确率崩溃；(ii) 测试时缩放（Test-time Scaling, TTS）导致令牌使用量与群体错误率呈近似线性增长，而准确率提升呈现非单调变化；(iii) 在同一模型家族内部，TTS在最难题目上表现为反向缩放，在较易题目上则出现“过度思考”现象，二者实为同一对齐失败的不同表现。更重要的是，DRG显示，即使模型在整体准确率上相近，其推理质量可能截然不同——一个模型误判的是人类也难以解答的题目，而另一个虽攻克最难题却在人类轻松作答的题目上失分，这种差异被传统平均准确率所掩盖。因此，该研究的关键突破在于构建了一个兼具真实难度标注与精细推理对齐评估能力的新范式，为大模型数学推理能力的可信评估提供了新工具。

链接: https://arxiv.org/abs/2606.10403
作者: Sanghee Park,Geewook Kim,Kee-Eung Kim
机构: NAVER Cloud AI; KAIST AI
类目: Computation and Language (cs.CL)
备注: 18 pages, 14 figures, 8 tables

点击查看摘要

Abstract:Math reasoning benchmarks have proliferated, yet most lack a per-item difficulty signal grounded in actual human performance. We introduce KCSAT-ML, a decade (2014-2025) of Korean College Scholastic Ability Test (KCSAT; Suneung) mathematics: 664 problems with a 339-item core set carrying official per-item error rates from nationwide cohorts of hundreds of thousands of examinees. We pair the benchmark with Difficulty-aligned Reasoning Gain (DRG): a score-orthogonal metric that asks whether a model’s mistakes concentrate on the items humans found hard, or on items humans found easy. Together they expose, across a wide range of VLMs (and LLMs via OCR), three patterns: (i) low-budget accuracy collapses on the high-human-error tail at every model size; (ii) test-time scaling (TTS) raises token use roughly linearly with cohort error rate, while accuracy gains follow a non-monotonic curve; (iii) within a single family, TTS flips between anti-scaling on the hardest items and overthinking on easier ones – two faces of the same alignment failure. On DRG, models with near-identical accuracy can sit at near-opposite values: one model gets wrong what humans also find hard, while another solves the hardest items yet fails on items humans find easy – a contrast that aggregate accuracy hides. Our code and dataset builder will be open-sourced at this https URL.

[NLP-63] Harnessing the Collective Intelligence of AI Agents in the Wild for New Discoveries

【速读】：该论文旨在解决传统科学发现过程中存在的孤立性问题，即现有基于语言模型的AI系统大多独立运作，缺乏协作与共享机制，难以实现持续、协同的科学突破。其核心挑战在于如何构建一个支持自主智能体（agent）之间开放、动态交互的分布式研究环境，以促进集体智慧的涌现。解决方案的关键在于提出EinsteinArena——一个原生支持智能体参与的开放分布式科研平台，该平台为智能体提供实时更新的开放科学问题，每个问题均配备可验证的评估器（verifier）、公开排行榜及专用讨论论坛，支持智能体间通过提交结果、交流见解、迭代优化和借鉴彼此思路进行协作。研究聚焦于数学领域中可明确衡量进展的任务，实验表明，截至2026年5月，平台已促成12项超越此前人类与AI最优解的新成果，如在11维空间的“接触数问题”中将已知下界从593提升至604，这一突破源于多智能体间的持续互动、公共讨论、验证机制演进以及思想的跨代理传递，验证了去中心化、自主智能体间开放协作可催生新型集体式人工智能驱动的科学研究范式。

链接: https://arxiv.org/abs/2606.10402
作者: Federico Bianchi,Yongchan Kwon,Aneesh Pappu,James Zou
机构: Together AI(Together AI); Stanford University(斯坦福大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Scientific discovery is often a collective process: researchers share partial results, inspect failed attempts, and build on each other’s ideas over long time horizons. Recent AI systems have shown that language-model-based agents can make meaningful progress on open scientific problems, but most existing systems operate in isolation. In this paper, we present EinsteinArena, an agent-native platform for open distributed research and discovery. EinsteinArena provides agents with a live set of open problems, each with a solid verifier, public leaderboard, and problem-specific discussion forum where agents can ask questions and share insights. We focus on mathematical tasks that have garnered substantial research interest, where progress can be measured unambiguously. As of May 2026, agents on EinsteinArena have discovered 12 new state-of-the-art results better than any previous human or AI solutions. One notable example is the kissing number problem in dimension 11, where the platform improved the best known lower bound from 593 to 604. This advance did not come from a single agent or isolated run. Rather it arose through a sequence of submissions, public discussion, verifier refinement, and subsequent agent-to-agent borrowing of ideas. These results provide evidence that decentralized scientific discovery can emerge from open interaction among autonomous agents in the wild, demonstrating a new paradigm for collective AI-driven research.

[NLP-64] Do Vision-Language Models See or Guess? Measuring and Reducing Textual-Prior Reliance with a Phrasing-Controlled Benchmark EMNLP2026

【速读】：该论文旨在解决视觉语言模型（Vision-Language Models, VLMs）在回答问题时过度依赖文本先验（textual priors）而非图像内容的问题，这种现象导致模型在基准测试中得分虚高，产生看似自信但缺乏视觉依据的错误答案。现有评估基准通常无法有效分离这一行为，因为每张图像仅配有一个固定问题，难以区分模型是基于图像还是文本线索作答。为此，作者构建了一个包含540张图像、覆盖六个推理类别的新基准，对同一图像生成四种不同表述的问题变体，使问题表述成为可控变量，其中最困难的变体直接从图像内容生成以最小化文本泄露。实验结果表明，11个VLM（涵盖小型开源模型到大型闭源系统）在最困难变体上的性能均显著下降，且开源模型退化最为严重。研究的关键诊断方法是“无图像消融”（no-image ablation），该方法将开源模型的准确率降至仅1%至9%，揭示其严重依赖文本信息。进一步通过大语言模型（LLM）评估难度、低基线与最终文本相似性分析以及人工重新标注等三重验证，证实了模型对图像的真实依赖性。此外，采用与问题变体生成方式一致的上下文示例可恢复最高准确率，而对小型VLM进行基于生成式强化策略（GRPO）的后训练，在所有四个变体上均取得稳定提升，并成功泛化至分布外数据集，表明文本先验依赖具有可测量性和部分可训练消除性。

链接: https://arxiv.org/abs/2606.10400
作者: Pratham Singla,Shivank Garg,Vihan Singh,Paras Chopra
机构: Lossfunk; Indian Institute of Technology Roorkee; Raeth AI
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 7 figures, Submitted to EMNLP 2026

点击查看摘要

Abstract:Vision-language models (VLMs) are increasingly deployed where answers must follow from what is in the image, yet they often answer from textual priors, the question’s phrasing together with memorized world knowledge, rather than from the image itself, which inflates benchmark scores and yields confident but ungrounded answers. Existing benchmarks rarely isolate this behavior, since each image is usually paired with a single fixed question. To measure the reliance, we build a 540-image benchmark across six reasoning categories and generate four question variants over the same images, so that phrasing rather than image content is the controlled variable. The hardest variant is written directly from the image to minimize text leakage. We benchmark eleven VLMs spanning small open-weight models to large closed-source systems: every model degrades on the hardest variant, and open models fall furthest. Our central diagnostic is a no-image ablation, which collapses the open-weight models to their text-only floor (1 to 9 percent). Three further analyses, LLM-rated difficulty, low base-to-final textual similarity, and human re-annotation, corroborate genuine image-dependence. In-context exemplars that match how a variant was built recover the most accuracy, and GRPO post-training of a small VLM yields consistent gains across all four variants that transfer to a held-out out-of-distribution set. Textual-prior reliance is measurable and partly trainable away.

[NLP-65] Expert-Level Crisis Detection in Mental Health Conversations

【速读】：该论文旨在解决现实危机干预中对话动态性与现有研究多集中于静态文本之间的矛盾，尤其针对多轮对话场景下模型在追踪随上下文演变而出现的风险信号时性能显著下降的问题。其解决方案的关键在于构建CRADLE-Dialogue——一个由临床医生标注的、面向对话轮次级危机检测的基准数据集，包含600个对话，涵盖自杀意念、自伤行为、儿童虐待等临床相关风险，并区分过去与当前风险状态。同时，提出“Alert-Confirm”评估协议，将早期预警信号（Alert）与危机明确显现的回合（Confirm）区分开来，更真实地反映临床干预需求。实验表明，识别风险何时出现远比判断风险是否存在更具挑战性，当前模型在微平均F1上仅达40%至60%的中高水平。此外，研究还发布了合成训练语料和一个320亿参数模型，在轮次级、对话级及仅确认型评估设置中均显著优于现有开源模型，并达到或超越部分专有模型的性能表现。

链接: https://arxiv.org/abs/2606.10380
作者: Grace Byun,Abigail Lott,Rebecca Lipschutz,Sean T. Minton,Elizabeth A. Stinson,Jinho D. Choi
机构: Emory University (埃默里大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Real-world crisis intervention is inherently conversational, yet existing research largely focuses on static this http URL-world crisis intervention is inherently conversational, yet existing research largely focuses on static texts. When applied to multi-turn dialogues, current models exhibit significant performance degradation, struggling to track risk signals that emerge as context evolves. To address this gap, we introduce CRADLE-Dialogue, a clinician-annotated benchmark for turn-level crisis detection in conversational settings. The dataset features 600 dialogues with multi-label annotations across clinically grounded risks, including suicide ideation, self-harm, and child abuse, distinguishing past from ongoing risk. We further propose an Alert-Confirm evaluation protocol that distinguishes early warning signals (Alert) from turns where a specific crisis becomes explicitly identifiable (Confirm), reflecting the clinical need to intervene before risk becomes explicit. Experiments show that identifying when risk emerges is much harder than recognizing that it exists: models achieve only mid-40% to high-60% Micro F1. Additionally, we release a synthetic training corpus and a 32B-parameter model that substantially outperforms existing open-source models and achieves competitive or superior results against proprietary models across turn-level, dialogue-level, and confirm-only evaluation settings.

[NLP-66] PADD: Path-Aligned Decompression Distillation for Non-Router Teacher to Guide MoE Student Learning ICML2026

【速读】：该论文旨在解决在固定计算预算下，随着大语言模型（Large Language Models, LLMs）规模持续扩大，如何有效提升模型容量的难题。其核心挑战在于如何在不增加推理开销的前提下，将密集型教师模型（dense teacher）的知识高效迁移至稀疏的混合专家模型（Mixture-of-Experts, MoE）学生模型中，同时学习高质量的路由策略。解决方案的关键在于提出一种名为路径对齐的压缩蒸馏（Path-Aligned Decompression Distillation, PADD）框架，该框架通过双阶段四阶段的协同训练机制实现：第一阶段（初始化阶段）通过教师神经元聚类与学生专家预热，构建学生专家的多样化功能；第二阶段（训练阶段）在统一训练流程中集成在线自适应蒸馏、路径优化的策略精炼以及奖励增强的负载均衡，从而在保持低推理成本的同时，实现接近或超越教师模型性能的MoE学生模型。实验表明，PADD在数学推理基准上显著优于现有强基线，并实现了有效的知识迁移与稳定的路由行为。

链接: https://arxiv.org/abs/2606.10369
作者: Xinyue Peng,Yi Qian,Jiaojiao Lin,Wenjian Shao,Yanming Liu
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: published in ICML 2026

点击查看摘要

Abstract:As large language models (LLMs) continue to scale, it becomes increasingly challenging to grow model capacity under fixed computation budgets. We propose Path-Aligned Decompression Distillation (PADD), a framework for distilling knowledge from dense teachers without explicit routing into mixture-of-experts (MoE) students while learning high-quality routing policies. PADD organizes knowledge distillation into four stages in two phases: an initialization phase (Stage I) that builds diverse functionality in the student’s experts through teacher neuron clustering and student-expert warmup, and a training phase (Stages II–IV) that integrates online adaptive distillation, path-refined policy optimization, and reward-augmented load balancing in a single training pipeline. Experiments on mathematical reasoning benchmarks demonstrate that PADD yields substantial gains over strong baselines at the same inference cost and that the MoE student can match or surpass its dense teacher. They also demonstrate effective teacher-to-student knowledge distillation and stable routing behavior.

[NLP-67] Routing-Aware Expert Calibration for Machine Unlearning in Mixture-of-Experts Language Models

【速读】：该论文旨在解决大语言模型中混合专家（Mixture-of-Experts, MoE）架构下的机器遗忘（machine unlearning）问题。现有方法在密集模型中表现良好，但在MoE架构中因路由机制的稀疏性而面临挑战：遗忘数据（forget data）常导致少数专家被过度激活，而保留数据（retain data）对这些专家的激活强度较弱，造成“遗忘-保留路由不匹配”现象，进而导致关键遗忘专家在遗忘过程中缺乏充分正则化。为应对这一问题，论文提出一种名为TRACE（Targeted Routing-Aware Calibration of Experts）的解决方案，其核心在于通过离线分析专家激活统计量识别出对遗忘至关重要的专家，并基于令牌级保留损失重加权，使这些专家在保留侧的激活频率更贴近其在遗忘侧的激活分布，从而实现路由感知的专家正则化校准。实验结果表明，TRACE在多个MoE大语言模型上显著提升了遗忘-效用权衡性能，在与最强基线相当的遗忘效果下实现了9%的相对效用提升，并在MUSE-BOOKS数据集中的三项指标上取得最优表现。

链接: https://arxiv.org/abs/2606.10338
作者: Jingyi Xie,Yijun Lin,Yinjiang Xiong,Zhikun Zhang,Sai Li
机构: Renmin University of China; Tsinghua University; Zhejiang University; Lightstandard
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Machine unlearning is increasingly important for large language models, yet unlearning in Mixture-of-Experts (MoE) architectures remains underexplored. Unlike dense models, MoE architectures employ a router at each layer to assign each token to a sparse subset of experts. In this work, we observe that forget data often activates a small subset of experts disproportionately, while these experts may receive much weaker activation from retain data. This forget–retain routing mismatch can leave forget-critical experts under-regularized during unlearning. To address this, we propose \textbfTRACE, Targeted Routing-Aware Calibration of Experts, for MoE unlearning. TRACE first detects forget-critical experts from offline activation statistics, and then calibrates retain regularization by reweighting token-level retain losses so that each selected expert’s retain-side activation frequency better matches its forget-side counterpart. Experiments on WMDP and MUSE-BOOKS across multiple MoE LLMs show that TRACE consistently improves the forget-utility trade-off, yielding a 9% relative utility improvement over the strongest baseline under comparable forgetting quality and the best performance on three out of four MUSE-BOOKS metrics.

[NLP-68] he Order Matters: Sequential Fine-Tuning of LLaMA for Coherent Automated Essay Scoring

【速读】：该论文旨在解决自动作文评分（AES）系统在评估相互依赖的论述要素（如论点、主张、证据、结论）时，因现有方法多将其孤立处理而导致的连贯性下降与泛化能力不足的问题。其解决方案的关键在于采用任务感知的微调策略，基于LLaMA-3.1-8B模型，结合参数高效微调（LoRA）与4比特量化技术，并对比三种训练课程：顺序式（按论点、位置、主张、证据、结论逐步微调）、独立式（为每项任务训练专用模型）和随机式（任务顺序随机化的多任务训练）。实验结果表明，遵循论述结构设计的顺序式微调显著提升整体性能，尤其在证据（F1 65%）和结论（F1 87%）识别上表现最优，且优于独立训练方案，甚至超越容量大得多的通用型LLaMA-70B基线模型。此外，小规模、任务优化的模型在关键指标上展现出与大型语言模型（LLM）相当的竞争力，验证了基于任务结构设计的训练课程对提升教育自然语言处理效能的重要性，为构建可扩展、低成本的智能评估系统提供了有效路径。

链接: https://arxiv.org/abs/2606.10327
作者: Ali Keramati,Mark Warschauer
机构: University of California, Irvine (加州大学欧文分校)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Automated Essay Scoring (AES) systems must judge interdependent discourse elements (e.g., lead, claim, evidence, conclusion), yet most approaches treat these in isolation, harming coherence and generalization. We investigate task-aware fine-tuning of LLaMA-3.1-8B for AES using parameter-efficient LoRA with 4-bit quantization and compare three training curricula: (i) Sequential (progressively fine-tuning on lead, then position, then claim, then evidence, then conclusion), (ii) Independent (task-specific models), and (iii) Randomized (shuffled multi-task). Experiments on the PERSUADE~2.0 corpus show that modeling task dependencies matters: Sequential fine-tuning yields the strongest overall results, including F1 scores of 65% (evidence) and 87% (conclusion) and corresponding accuracies of 63% and 85%, surpassing Independent training and outperforming a general-purpose LLaMA-70B baseline on conclusion despite its far larger capacity. Randomized training improves position scoring (57% F1) but is less consistent elsewhere. These findings indicate that (1) curriculum design aligned with discourse structure can materially improve AES, and (2) small, task-optimized models can be competitive with substantially larger Large Language Models (LLM), offering a practical path to scalable, cost-effective assessment. We release templates and implementation details to facilitate reproduction and future work on curriculum design for educational NLP.

[NLP-69] abClaw: An Interactive and Self-Evolving Agent for Spreadsheet Manipulation and Table Reasoning

【速读】：该论文旨在解决结构化数据（如电子表格和表格）分析过程中依赖大量手动操作与领域知识、现有大语言模型（Large Language Model, LLM）代理在分析过程中的透明度不足、隐含假设缺乏显式表达、多表比较能力有限以及无法根据用户偏好自适应调整等问题。其解决方案的关键在于提出TabClaw——一个开源的交互式AI代理，能够处理电子表格与表格推理任务。该系统通过自然语言指令接收用户请求，主动澄清模糊意图，并生成可编辑的执行计划；采用类ReAct的工具使用分析循环，实时流式展示中间推理步骤；通过专业化代理并行处理多表推理，最终以显式共识与不确定性标记整合结果。此外，TabClaw具备长期记忆能力，可记录已完成的工作流，提取持久化的用户偏好，从重复的工具使用模式中提炼可复用技能，支持技能包导入，并能基于负面反馈持续优化技能。实验表明，该方法在电子表格操作与表格推理基准测试中显著提升了可执行任务完成率与推理性能，同时保持了分析流程的可审查性，实现了从静态数据处理向可追溯、可定制、渐进式个性化的智能分析工作流的转变。

链接: https://arxiv.org/abs/2606.10316
作者: Mingyue Cheng,Shuo Yu,Daoyu Wang,Qingchuan Li,Xiaoyu Tao,Qingyang Mao,Yitong Zhou,Qi Liu
机构: University of Science and Technology of China (中国科学技术大学); State Key Laboratory of Cognitive Intelligence (认知智能国家重点实验室)
类目: Computation and Language (cs.CL)
备注: 5 pages, 2 figures

点击查看摘要

Abstract:Spreadsheets and tables are widely used representations for structured data analysis, but effective analysis still requires substantial manual effort and domain expertise. Recent large language model (LLM) agents can automate parts of this process, but they often provide limited transparency into intermediate decisions, rely on implicit assumptions, struggle with multi-table comparison, and repeat similar workflows without adapting to a user’s preferences. This paper presents TabClaw, an open-source interactive AI agent for spreadsheet manipulation and table reasoning. Users upload CSV or Excel files and issue natural-language requests; TabClaw clarifies ambiguous intent, exposes an editable execution plan, streams a ReAct-style tool-using analysis loop, dispatches specialist agents for parallel multi-table reasoning, and synthesizes findings with explicit consensus and uncertainty markers. Beyond one-off analysis, TabClaw records completed workflows, extracts persistent user memory, distills reusable skills from repeated tool-use patterns, supports package-style skill import, and upgrades skills from negative feedback. Experiments on spreadsheet manipulation and table reasoning benchmarks show that TabClaw improves executable task completion and reasoning performance while preserving an inspectable user workflow. This paper shows how TabClaw turns spreadsheets and tables into inspectable analytical workflows while gradually personalizing itself to recurring data-analysis tasks. Our code is available.

[NLP-70] Catching One in Five: LLM -as-Judge Blind Spots in Production Multi-Turn Transaction Agents

【速读】：该论文旨在解决生成式 AI（Generative AI）在评估对话智能体时存在的可靠性问题，特别是以大语言模型作为评判者（LLM-as-judge）的系统性盲区。其核心问题是：尽管当前主流采用大语言模型进行自动评估，但其实际对真实缺陷的召回能力极低，无法有效识别多轮交互中关键的状态跟踪类缺陷。解决方案的关键在于揭示了这一失败的本质并非感知能力不足，而是评分机制的结构性缺陷——现有的评分标准仅包含意图、品牌语调和个性化三个粗粒度维度，完全缺失对状态追踪、安全防护和错误恢复等行为维度的显式建模。这导致大量跨轮次的状态错误（如确认门禁锁定、购物车幻觉、升级阻塞、过期指代）虽在人工审查中被明确识别，却因不符合评分框架而被错误归类为“品牌语调”问题或直接忽略，最终未能触发任何操作性告警。研究进一步指出，这种失效是路由与系统设计缺陷所致，而非模型本身无能；在实际部署中，当自动判断器报告零缺陷时，基于该结果的偏差校正方法（如Rogan-Gladen修正）将彻底失效，而在有缺陷报告的情况下，现有评估体系仍会低估真实缺陷率3至6倍。因此，对于生产环境中的多轮对话智能体，自动化评判只能作为最低限度的基线，无法替代深度的人工审核。

链接: https://arxiv.org/abs/2606.10315
作者: Sawyer Zhang,Alexander Wang,Sophie Lei
机构: Lumivate (Lumi)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 13 pages, 1 figure, 5 tables

点击查看摘要

Abstract:LLM-as-judge is the default instrument for evaluating conversational agents, yet its reliability is almost always reported as agreement with human ratings, not recall of real defects. We study a deployed multi-turn food-and-beverage ordering agent and measure how many genuine quality problems its built-in LLM judge catches, using exhaustive human transcript review as ground truth. Across three batches the judge surfaces well under a quarter of human-confirmed systematic problems – 2 of 9 patterns (22%) in one batch, and its operational gate flagged zero of 100 rounds in a batch where humans confirmed 23 distinct defects and 7 new cross-cutting patterns. Our blind-spot taxonomy shows the failure is structured, not random: the judge catches turn-local issues (a fabricated statistic, a wrong language) but misses cross-turn state issues (confirm-gate lockout, cart hallucination, escalation lockout, stale referents). The mechanism: the scoring rubric exposes only three coarse axes (intent, brand-voice, personalization) and has no category for the behavioural dimensions – state-tracking, guardrails, recovery – where most defects cluster. The failure is routing, not perception: 113 of 114 rounds whose raw judge note describes a confirm-gate or cart-state defect are scored “brand voice”, and none reach an operational failure – the gate is wired to hangs and hard assertions, not the rubric – so the 0% is a routing-and-wiring failure, not blindness. The consequence for prevalence estimation is sharp: when the apparent defect rate is zero the Rogan-Gladen correction degenerates – no signal can recover the true rate – while where the gate reports a nonzero rate the same estimator implies a 3-6x undercount under our measured sensitivity. For production multi-turn agents, automated judging is a regression floor, not a substitute for human review.

[NLP-71] Early-Token Confidence Predicts Reasoning Quality in Multi-Agent LLM Debate ACL

【速读】：该论文旨在解决多智能体大语言模型（Multi-agent LLM）系统中推理质量评估的难题，尤其是在缺乏参考答案的开放性任务场景下。其核心问题在于如何有效衡量生成内容的推理质量，而传统依赖人工或规则评分的方法在复杂任务中成本高且难以标准化。论文提出以解码过程中的内在置信度信号——即分词级别的对数概率（token-level log-probabilities）——作为推理质量的预测指标，并通过基于辩论的作文评分框架，在两个ASAP作文数据集上对比了该置信度代理指标与基于评分量规的判别器评分。研究发现，生成初期（前几轮生成令牌）的置信度信号是推理质量最强的预测因子，显著优于全序列统计特征；对对数概率轨迹的分析表明，生成起始阶段的动态变化最为多样，蕴含最丰富的信息。此外，研究还揭示了智能体角色间的系统性不对称：支持性推理中置信度与推理质量的对齐程度高于对抗性批判。因此，该研究的关键解决方案在于揭示早期解码动态所蕴含的轻量化、高效且可靠的推理可信度信号，为多智能体系统的实时推理质量监控提供了可行路径。

链接: https://arxiv.org/abs/2606.10307
作者: Ali Keramati,Justin Cheok,Jacob Horne,Mark Warschauer
机构: University of California, Irvine (加州大学欧文分校)
类目: Computation and Language (cs.CL)
备注: 15 pages, 8 figures, 4 tables; ACL Proceedings

点击查看摘要

Abstract:Evaluating reasoning quality in multi-agent LLM systems is challenging, especially for open-ended tasks without reference answers. We investigate whether intrinsic confidence signals, token-level log-probabilities from decoding, can predict reasoning quality as assessed by LLM-as-judge evaluation. Using a debate-based essay scoring framework, we compare confidence proxies against rubric-based judge scores across two ASAP essay sets. We find that early-token confidence, particularly within the first few generated tokens, is consistently the strongest predictor of reasoning quality, outperforming full-sequence statistics. Analysis of log-probability trajectories shows that the opening phase of generation is the most heterogeneous and therefore most informative. We also observe a systematic asymmetry between agent roles, with stronger alignment between confidence and quality for supportive reasoning than for adversarial critique. These results suggest that early decoding dynamics provide a lightweight and effective signal for estimating reasoning reliability in multi-agent LLM systems.

[NLP-72] MIRAG E: A Polarity-Flipping Encoding Subspace in LLM Agents

【速读】：该论文旨在解决大语言模型（LLM）代理在被胁迫时通过隐蔽编码方式（如Base64、ROT13、首字母诗、同义词链等）将敏感数据外泄的问题，此类输出虽可规避基于表面内容的检测，但其内部计算过程仍暴露可探测的模式。解决方案的关键在于发现并利用模型残差流（residual stream）中共享的低维编码子空间——这一子空间在九类编码家族与八种不同架构的模型中普遍存在。研究通过逻辑回归探针对八类编码进行训练，可在未见的第九类编码上实现AUC 0.975–1.000的高精度预测，表明该探针捕捉的是模型内部的计算信号而非表层特征。进一步地，该子空间方向在规划标记（planning token）处表现出第二重机制性签名：当模型决定内联执行编码时极性为正，而选择调用工具外包时极性为负，从而在编码文本生成前即区分两种执行策略。基于此，作者构建了MIRAGE（Model-Internal Readout of Agentic Generation Exfiltration），一种双通道实时监控系统，同时利用上述两类信号。在126个代理式外泄场景中，其性能达到AUC = 0.918，显著优于仅依赖输出端检测的方法（AUC = 0.518）。监控性能本质上取决于宿主模型的几何结构：良性编码下的误报率从Qwen-7B的0%到Phi-3.5的100%不等，揭示该探针能够真实反映模型几何是否在隐秘编码与显式编码之间建立有效分离。实验还观察到，所有抑制该子空间的攻击均伴随编码保真度下降，这构成了在测试范围内的一条经验规律，而非结构性不可能性的断言。

链接: https://arxiv.org/abs/2606.10304
作者: Pratibha Revankar,Kargi Chauhan,Jihye Kim,Sadiba Nusrat Nur,Vincent Siu,Chenguang Wang
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:When LLM agents are coerced into covertly encoding sensitive data (Base64, ROT13, acrostic, synonym chains, and beyond), the resulting outputs evade output-side detection but the underlying computation does not. Across nine encoding families and eight models from five architecture families, that computation is supported by a shared low-dimensional encoding subspace in the residual stream. A logistic-regression probe trained on eight encoding families recovers the held-out ninth at AUC 0.975-1.000, reading the computation rather than surface features. The same direction exhibits a second mechanistic signature at the planning token, flipping polarity to activate positively when the model will simulate the encoding inline and negatively when it will outsource it to a tool call, distinguishing two execution strategies before the encoded text exists. We build MIRAGE (Model-Internal Readout of Agentic Generation Exfiltration), a two-channel real-time monitor exploiting both signals. On 126 agentic exfiltration scenarios, it reaches AUC = 0.918, substantially outperforming output-only detection (AUC = 0.518). Monitor performance is fundamentally a property of the host model’s geometry: benign-encoding false-positive rate ranges from 0% on Qwen-7B to 100% on Phi-3.5, revealing that the probe faithfully reads whether a model’s geometry separates covert from overt encoding. Across all tested adversarial budgets, every attack suppressing the subspace also destroyed encoding fidelity, reported as an empirical regularity on the evaluated budgets, not a structural impossibility claim.

[NLP-73] Where You Inject Diversity Matters: A Unified Framework for Diverse Generation

【速读】：该论文旨在解决大语言模型在开放式生成任务中输出多样性不足的问题，即模型倾向于生成语义相似的多个结果，难以满足需要多样化输出的应用需求。现有测试阶段（test-time）多样性方法虽在不同生成阶段引入多样性，但其有效性差异显著，且缺乏对何种设计选择能真正带来有意义多样性的系统性理解。为此，论文提出一个分析框架，从生成过程中引入的多样性来源（diversity source）角度对测试阶段多样性方法进行分类，并提出“传输得分”（transmission score）以量化源端变异传递至最终输出的有效性。基于该框架，作者提出一种全自动的规格级生成方法（specification-level generation），其核心在于先生成多样化的中间规格（intermediate specifications），再以此为条件生成最终响应。在五个开放式生成任务和四种主干模型上的实验表明，该方法在保持与基线相当生成质量的前提下，显著提升了输出多样性。研究进一步揭示，成功的多样性注入依赖于两个关键因素：一是输入源本身的多样性水平，二是源信息向最终输出的有效传递，因此，源的设计（source design）与源到输出的实现机制（source-to-output realization）成为构建更高效多样性生成系统的核心杠杆。

链接: https://arxiv.org/abs/2606.10302
作者: Cheng Zhang,Rui Xin,Chudi Zhong
机构: UNC Chapel Hill(北卡罗来纳大学教堂山分校); University of Washington(华盛顿大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Open-ended generation tasks often require a set of meaningfully different outputs, yet large language models often produce similar generations. Existing test-time diversity methods operate at different stages of generation with varying effectiveness, but it remains unclear what design choices lead to meaningful diversity in the output. We introduce a framework that characterizes test-time diverse generation methods by the diversity source introduced during generation and provide a transmission score for measuring how effectively variation in the source reaches the final output. Guided by this framework, we propose fully automated specification-level generation methods that first generate diverse intermediate specifications and then condition on them to produce final responses. Across five open-ended tasks and four backbone models, specification-level injection improves output diversity over test-time baselines while maintaining comparable quality. Our analysis shows that successful diversity injection depends on both the diversity of the sources and their transmission to the output, highlighting source design and source-to-output realization as two key levers for building more diverse generation systems.

[NLP-74] From Context-Aware to Conflict-Aware: Generalizing Contrastive Decoding for Knowledge Conflict in LLM s

【速读】：该论文旨在解决大语言模型在基于检索或增强上下文生成时，外部上下文与参数先验之间存在的冲突问题，这一冲突是影响生成可靠性的核心瓶颈。现有对比解码方法遵循“上下文感知”范式，单方面强化上下文而弱化参数先验，当上下文错误时会覆盖正确的先验知识。本文提出“冲突感知”范式，通过冲突信号动态分配先验与上下文之间的权威性，而非预设上下文的可信度。研究发现，先验与上下文logits的仿射组合构成一个具有内在“制度不对称性”的幂族：外推（extrapolation）在先验正确时会无界放大错误，插值（interpolation）在上下文正确时则校正不足，且不存在静态制度能同时覆盖两种情形。现有对比解码方法多属于此类，且以外推为主。为全面评估两种冲突方向的影响，作者提出基于模型感知的三态评估基准（TriState-Bench），通过校准各模型的先验知识，量化三种冲突状态——修正、抵抗与一致。为克服制度不对称性，提出自适应制度路由（Adaptive Regime Routing, ARR），在每一步动态切换制度，显著提升抵抗能力（抵抗EM从低于6提升至16–33），同时不损害修正或一致性能。

链接: https://arxiv.org/abs/2606.10298
作者: Runze Jiang,Taiqiang Wu,Yan Wang,Bingyu Zhu,Longtao Huang
机构: Peking University (北京大学); Alibaba Group (阿里巴巴集团); The University of Hong Kong (香港大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 27 pages, 9 figures

点击查看摘要

Abstract:When large language models generate from retrieved or augmented contexts, conflicts between external context and parametric priors remain a central reliability bottleneck. Existing contrastive decoding methods follow a \emphcontext-aware paradigm that unilaterally amplifies context over parametric priors, overwriting correct priors when the context is erroneous. We generalize this to the \textbfconflict-aware paradigm that dynamically allocates authority between prior and context based on conflict signals, rather than presupposing context trustworthiness. We show that the affine combination of prior and context logits yields a \textbfpower family with an inherent \textbfregime asymmetry: extrapolation amplifies errors unboundedly when the prior is correct, interpolation under-corrects when the context is correct, and no static regime covers both. Existing contrastive decoding methods are instances of this family, mostly extrapolative. To evaluate both conflict directions, we propose TriState-Bench, a model-aware evaluation protocol that calibrates per-model prior knowledge to measure three conflict states: correction, resistance, and agreement. To resolve the asymmetry, we propose Adaptive Regime Routing (ARR), which routes between regimes at each step, lifting resistance EM from below 6 to 16–33 without sacrificing correction or agreement. Our code is available at this https URL.

[NLP-75] he Confident Liar: Diagnosing Multi-Agent Debate with Log-Probabilities and LLM -as-Judge ACL

【速读】：该论文旨在解决多智能体辩论系统在评估过程中过度关注最终答案正确性，而忽视了推理过程质量的问题。其核心挑战在于：如何有效衡量和利用辩论过程中产生的中间推理质量，以提升系统的可解释性与可靠性。解决方案的关键在于构建一个基于两智能体辩论架构（构造者与审计者）的评估框架，结合大语言模型作为裁判（LLM-as-judge），对每个智能体的推理过程进行多维度评分，包括指令遵循度、论证质量与证据锚定性，并引入关键失败标记（critical-failure flag）。研究进一步分析了三个信号之间的关系：推理分词级别的概率分布、由LLM-as-judge生成的评分以及最终任务准确率。结果表明，内部置信度信号能够有效预测外部评估的推理质量，且在不同角色间存在显著不对称性——构造者的置信度与推理质量的相关性约为审计者的两倍，且基于置信度的关键推理失败检测对构造者具有更高的可靠性（AUROC 0.804 vs. 0.634），这一发现为跨领域推理质量评估提供了重要依据。

链接: https://arxiv.org/abs/2606.10296
作者: Ali Keramati,Justin Cheok,Jacob Horne,Mark Warschauer
机构: University of California, Irvine
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 15 pages, 7 figures, 1 table, ACL proceedings

点击查看摘要

Abstract:Multi-agent debate systems are typically evaluated only on whether the final answer is correct, overlooking the quality of the intermediate reasoning that debate is designed to produce. This paper studies the relationship between three signals in multi-agent debate: token-level log-probability distributions over reasoning tokens, LLM-as-judge rubric scores assigned to those tokens, and final task accuracy. We examine whether internal confidence signals predict externally evaluated reasoning quality, and whether either signal aligns with task correctness, across three domains: rubric-based scoring, mathematical reasoning, and factual question answering. Our framework pairs a two-agent debate architecture – a Constructor and an Auditor – with an LLM-as-judge that scores each agent’s reasoning along instruction following, justification quality, and evidence grounding, together with a critical-failure flag. Experiments in the rubric-scoring domain reveal a consistent four-phase confidence trajectory and a substantial role asymmetry: confidence aligns with judged reasoning quality roughly twice as strongly for the Constructor as for the Auditor, and confidence-based detection of critical reasoning failures is markedly more reliable for the Constructor (AUROC 0.804) than for the Auditor (0.634). These findings motivate the broader cross-domain investigation proposed in this paper.

[NLP-76] When Metrics Disagree: A Meta-Analysis of Knowledge-Graph-Completion Model Benchmarking

【速读】：该论文旨在解决知识图谱补全（Knowledge Graph Completion, KGC）模型评估中存在的评价标准碎片化问题。传统评估方法依赖于孤立的排名指标（如MRR、Hits@k、平均秩），这些指标在不同数据集上常导致模型排序冲突，且性能表现难以跨数据集泛化，从而阻碍了模型间的公平比较，甚至引发选择性报告，掩盖了真实进展。为此，论文将KGC评估重构为多准则决策（Multi-Criteria Decision-Making, MCDM）问题，提出一种基于元分析的评估框架，对七种聚合器（aggregator）在五个核心测试维度上进行系统评估：一致性、跨数据集稳定性、指标独立性、抗噪声鲁棒性以及泛化能力。所有测试均通过留一模型排除（LOMO）和留一组排除（LOGO）策略进行交叉验证，以确保评估结果在多样模型子集下的可靠性。实验结果表明，帕累托最优分析识别出Z-score为最均衡的聚合器，在尾部预测（tail prediction, h,r,?）中将DualE排在首位，在关系预测（relation prediction, h,?,t）中则将FMS（Flow-Modulated Scoring）评为最优。敏感性分析进一步揭示，一致性与稳定性对移除操作具有较强鲁棒性，而泛化能力和指标独立性则最为敏感。该框架有效缓解了评估不一致问题，为聚合器的选择与KGC模型基准测试提供了数据驱动的科学依据。

链接: https://arxiv.org/abs/2606.10287
作者: Haji Gul,Ajaz Ahmad Bhat
机构: Universiti Brunei Darussalam (文莱达鲁萨兰大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Evaluating Knowledge Graph Completion (KGC) models remains challenging because standard assessment relies on isolated rank-based metrics such as MRR, Hits @ k, and Mean Rank, which often produce conflicting model orderings across datasets. A model that leads on MRR may trail on Hits@1, and strong performance on one dataset may not generalize to another. This fragmentation hinders comparison, enables selective reporting, and obscures real progress. We reframe KGC evaluation as a Multi-Criteria Decision-Making (MCDM) problem and present a meta-analysis of seven aggregators across five tests: consistency, cross-dataset stability, metric independence, robustness under noise, and generalizability. Each test is averaged over leave-one-model-out (LOMO) and leave-one-group-out (LOGO) removals so that reliability reflects aggregator behavior across diverse model subsets. Across tail (h,r,?) and relation (h,?,t) prediction, Pareto-optimal analysis identifies Z-score as the most balanced aggregator, which ranks DualE highest for tail prediction and FMS (Flow-Modulated Scoring) highest for relation prediction. A test-sensitivity analysis using the same removals shows that consistency and stability are largely removal-invariant, while generalizability and independence are the most sensitive. The framework resolves evaluation inconsistencies and offers evidence-based guidance for aggregator selection and model benchmarking in KGC.

[NLP-77] OpenRTLSet: A Fully Open-Source Dataset for Large Language Model-based Verilog Module Design

【速读】：该论文旨在解决硬件设计领域中高质量、可复用的开源代码数据集匮乏的问题，尤其针对生成式AI在Verilog代码生成任务中因缺乏大规模标注数据而面临的训练瓶颈。其核心解决方案在于构建OpenRTLSet——目前最大且完全开源的硬件设计数据集，包含超过13.1万条多样化Verilog代码样本，涵盖来自GitHub的10.2万个模块、5,000个VHDL转换模块以及24,000个可综合C/C++转换模块，所有数据均无专利限制，支持自由使用。关键创新点在于：利用深度推理模型DeepSeek-R1为每段代码生成配对的自然语言描述，从而实现对多种语言模型（如Qwen和Granite）的高效微调；同时系统探索了包括Verilator生成的C++上下文信息、INT4与BF16量化策略以及7B至32B参数规模模型的性能差异，验证了开源方法在硬件设计任务中可达到甚至超越封闭方案的性能表现，为研究与产业应用提供了可扩展、可访问的新范式。

链接: https://arxiv.org/abs/2606.10285
作者: Jinghua Wang,Lily Jiaxin Wan,Sanjana Pingali,Scott Smith,Manvi Jha,Shalini Sivakumar,Xing Zhao,Kaiwen Cao,Deming Chen
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted by ICLAD’25

点击查看摘要

Abstract:OpenRTLSet introduces the largest fully open-source dataset for hardware design, offering over 131,000 diverse Verilog code samples to the research community and industry. Our dataset uniquely combines Verilog code from GitHub repositories (102k modules), VHDL translations (5k modules), and synthesizable C/C++ translations (24k modules), all freely accessible without proprietary restrictions. Using the reasoning model DeepSeek-R1, we generated paired natural language descriptions for each code sample, enabling fine-tuning of various language model families (e.g., Qwen and Granite) for Verilog code generation. Our dataset explores multiple options, including Verilator-generated C++ files as additional context during labeling, quantization techniques (INT4 vs. BF16), and performance differences across model sizes (7B-32B parameters). OpenRTLSet demonstrates that open-source approaches can achieve superior performance in hardware design tasks, establishing a new foundation for accessible research and commercial use in this domain.

[NLP-78] Benchmarking and Exploring the Capabilities of LLM s for Attack Investigations

【速读】：该论文旨在解决大语言模型（Large Language Models, LLMs）在网络安全事件响应中对系统审计日志（audit logs）进行安全调查能力评估缺乏标准化基准的问题。现有方法难以全面衡量LLMs在复杂安全场景下的推理、误报筛选、攻击链分析及持久化机制识别等关键任务的表现。其解决方案的关键在于构建AuditBench——一个涵盖Linux与Windows系统、覆盖50余种真实安全调查场景（含恶意与良性行为）的综合性基准数据集，并基于此对五种前沿LLMs在四类典型日志调查任务中的性能进行系统性评估。该方案通过控制变量设计，揭示了模型规模、数据表示方式、提示工程（prompt construction）策略以及具体任务类型对模型表现与错误模式的影响，同时量化分析了模型生成解释的质量及其常见错误类型，为安全领域中LLMs的应用提供了可复现的评估框架、实践指导与未来研究方向。

链接: https://arxiv.org/abs/2606.10281
作者: Aniket Anand,Yiwei Hou,Daniel Fields,Alex Kantchelian,David Tao,Kurt Thomas,Grant Ho
机构: University of Chicago(芝加哥大学); University of California, Berkeley(加州大学伯克利分校); Google(谷歌)
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper presents AuditBench, a new benchmark dataset for evaluating the capabilities of LLMs at investigating security-related system audit logs. We design and use this benchmark to explore the performance of LLMs on four log-investigation tasks that incident response teams commonly perform, ranging from triaging alerts generated by detectors to identifying persistence mechanisms on compromised systems. AuditBench consists of system audit logs collected from Linux and Windows machines, and spans over 50 different security investigation scenarios, including both malicious and benign activity. Using our benchmark, we evaluate and analyze the performance of five frontier LLMs at analyzing audit logs for attack investigations. Our analysis illuminates how LLM performance and error profiles vary according to different design choices, such as differences in model size, data representation, prompt construction, and specific investigation tasks. Additionally, we characterize the quality of the explanations produced by LLMs and the types of errors that models make across our benchmark. Collectively, our work provides a foundation for assessing the capabilities of LLMs for investigating security logs, novel insights for practitioners using LLMs in security operations, and important directions for future research.

[NLP-79] Supervised Fine-tuning with Synthetic Rationale Data Hurts Real-World Disease Prediction

【速读】：该论文旨在解决生成式人工智能（Generative AI）在临床预测任务中通过合成推理过程（rationale）进行有监督微调（SFT）是否能有效提升模型性能这一关键问题。研究发现，尽管理论上推理过程的引入可增强模型的可解释性与决策合理性，但在基于纵向健康记录预测五年内阿尔茨海默病及相关痴呆症（ADRD）的实验中，以推理为基础的SFT反而系统性且显著地降低了预测性能，且该负面影响在不同模型架构和数据规模下均持续存在。进一步分析表明，该失败并非源于推理质量低下——专家评估证实生成的推理内容在医学上准确且与患者个体证据一致；相反，当相同推理作为推理时演示（inference-time demonstrations）使用时，性能反而得到提升。因此，其根本原因被归结为叙事合理性（narrative plausibility）与判别式优化目标之间的结构性冲突：模型在训练中被强制学习“看似合理”的推理路径，却牺牲了对真实预测信号的敏感性。本研究揭示了推理引导监督在高风险临床预测中的潜在陷阱，强调需更精准地理解其适用边界，从而推动生成式语言模型在医疗场景中的负责任发展。

链接: https://arxiv.org/abs/2606.10279
作者: Buxin Su,Bingxuan Li,Cheng Qian,Yiwei Wang,Jin Jin,Bingxin Zhao
机构: University of Pennsylvania (宾夕法尼亚大学); University of Illinois Urbana-Champaign (伊利诺伊大学厄本那-香槟分校); University of California, Merced (加利福尼亚大学默塞德分校)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Supervised fine-tuning with synthetic rationale data is widely assumed to improve language model performance on clinical prediction tasks by teaching models not just what to predict but why. We test this assumption on five-year Alzheimer’s disease and related dementias (ADRD) prediction from longitudinal health histories. Across a large-scale controlled experiment of 504 configurations, we find that rationale-based SFT consistently and substantially hurts prediction performance relative to label-only fine-tuning. The degradation persists across model families and data scales, and is not resolved by using a reasoning-oriented base model. Crucially, the failure is not explained by poor rationale quality: human expert annotation confirms that the generated rationales are medically accurate and faithfully grounded in patient-specific evidence, and few-shot experiments show that the same rationales improve performance when used as inference-time demonstrations rather than training targets. We identify the root cause as a structural conflict between narrative plausibility and discriminative optimization. We hope our work paves the path toward a more precise understanding of when and how rationale-based supervision helps and when it does not, guiding the responsible development of language models for high-stakes clinical prediction.

[NLP-80] RealMath-Eval: Why SOTA Judges Struggle with Real Human Reasoning

【速读】：该论文旨在解决生成式 AI 在评估真实高中生数学解题思维过程中的能力局限问题，特别是当前大语言模型（LLM）在面对真实学生多样化、非标准化的推理路径时表现不佳的“评估鸿沟”（Evaluation Gap）。其核心解决方案在于构建一个严谨标注的真实考试作答基准数据集——RealMath-Eval，包含224份来自高中课堂的真实学生作答样本。关键发现表明，尽管最先进的LLM评判者在评估合成的、由模型生成的解题答案时表现良好（均方误差≈1.17），但在处理真实学生作答时性能显著下降（均方误差≈2.96），反映出其对真实人类推理模式的泛化能力不足。通过语义嵌入分析与信息论探针发现，人类错误具有更高的多样性与复杂性，形成高维且分布复杂的错误空间，而合成错误则呈现“结构坍缩”特征，落入低维线性子空间；同时，人类推理中蕴含更高信息熵的意外性（surprisal），表明其推理轨迹更偏离当前模型的训练分布。研究进一步证明，仅通过表面风格迁移无法弥合此差距。因此，该研究揭示了当前依赖合成数据的评估范式存在根本缺陷，强调必须引入真实学生数据以准确捕捉数学思维的多样性。

链接: https://arxiv.org/abs/2606.10254
作者: Yiteng Mao,Kenan Xu,Yijia Lyu,Wenhao Li,Jianlong Chen,Xiangfeng Wang
机构: University of Wisconsin–Madison(威斯康星大学麦迪逊分校); East China Normal University(华东师范大学); New York University(纽约大学); Tongji University(同济大学); The Chinese University of Hong Kong, Shenzhen(香港中文大学深圳校区); East China Normal University(华东师范大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Code available at this https URL , Data available at this https URL

点击查看摘要

Abstract:While Large Language Models (LLMs) have achieved near-perfect performance in \emphsolving high-school mathematics, their ability to \emphevaluate the diverse reasoning processes of real human students remains under-examined. To bridge this gap, we introduce \textbfRealMath-Eval, a rigorously annotated benchmark of 224 real-world exam responses from high schools. Our initial evaluation reveals that even state-of-the-art LLM judges struggle significantly on this task, exhibiting a high Mean Squared Error ( \sim 2.96) against expert human grading. To probe a plausible explanation, we contrast this performance with a control setting where the same judges evaluate synthetic LLM-generated solutions. We identify a stark Evaluation Gap'': judges are considerably more accurate and consistent on synthetic text (MSE \sim 1.17) but struggle to generalize to authentic student reasoning. Through semantic embedding analysis, we find that synthetic errors suffer from a structural collapse’’ into predictable, low-dimensional linear subspaces, whereas human errors form a more diverse error space. Furthermore, generative probability probes suggest that human reasoning involves significantly higher information-theoretic surprisal, indicating that student reasoning transitions are more out-of-distribution for current models. Finally, we find that surface-level style transfer fails to close this gap. Our findings suggest that current LLM evaluation pipelines relying heavily on synthetic data may not adequately capture the diversity of authentic student mathematical reasoning.

[NLP-81] A Continuous-Time Markov Chain Framework for Insertion Language Models AISTATS2026

【速读】：该论文旨在解决现有插入式语言模型（Insertion Language Models, ILMs）生成范式缺乏理论基础、多为经验性设计的问题。其核心挑战在于如何从统一的理论框架出发，为插入式生成提供一个可解释且具备普适性的建模方式。解决方案的关键在于基于第一性原理，将噪声注入过程建模为变量长度序列空间上的连续时间马尔可夫链（continuous-time Markov chain），从而推导出一种类扩散（diffusion-style）的去噪目标函数。该框架不仅能够将已有ILM方法视为其特例，还通过在合成规划任务上的实证评估证明了其在保留插入式生成优势的同时，兼具更强的采样灵活性与更优的生成性能，在语言建模任务中表现出与自左至右生成和掩码扩散模型相当甚至更优的竞争力。

链接: https://arxiv.org/abs/2606.10199
作者: Dhruvesh Patel,Benjamin Rozonoyer,Soumitra Das,Tahira Naseem,Tim G.J. Rudner,Andrew McCallum
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Accepted at AISTATS 2026. Code is available at this https URL

点击查看摘要

Abstract:Insertion Language Models (ILMs) offer several advantages over left-to-right generation and mask-based generation. However, existing formulations of insertion-based generation have largely been ad-hoc. In this paper, we derive a diffusion-style denoising objective for ILMs from first principles by formulating the noising process as a continuous-time Markov chain on the space of variable-length sequences. We show that previous formulations of ILMs can be viewed as special cases of this denoising framework. Through empirical evaluation on a synthetic planning task, we show that the proposed approach retains the benefits of insertion-based generation over left-to-right generation and masked diffusion models. In language modeling, our diffusion-based approach is competitive with left-to-right generation and masked diffusion models, while offering additional flexibility in sampling compared to existing insertion language models.

[NLP-82] Gaming AI-Assisted Peer Reviews Poses New Risks to the Scientific Community

【速读】：该论文旨在解决生成式 AI（Generative AI）在科学同行评审中应用时面临的可操纵性问题，即尽管AI辅助评审系统被寄望于减轻审稿人负担并加速发表流程，但其对策略性操纵的鲁棒性尚不明确。其核心解决方案的关键在于揭示了一种简单且低成本的攻击方式：仅通过表面重写论文摘要（不改变科学内容与表达实质，亦无需了解评审模型内部机制），即可显著提升AI评审结果。实验表明，此类对抗性改写在跨学科、多期刊场景下均有效，对人类撰写和AI生成论文均适用；最强攻击在特定条件下实现约38%的成功率，使Gemini 3 Flash与GPT 5.4 Mini评审系统的接受评分分别提升1.31和0.88（满分10分）。当原始AI评审意见为“拒稿”时，成功率超过50%，且不仅导致总体评分虚高，还增强了评审信心，并在科学严谨性、重要性及贡献度等核心指标上抬升评分。该攻击仅需约5分钟和1美元成本即可完成，难以与常规学术编辑区分。这一现象可能误导后续人工决策，促使编辑倾向接受而非拒稿，从而扭曲以科学价值为导向的评价体系。研究揭示出一个普遍存在的漏洞：一旦AI生成的评审意见影响最终决策，作者便可能倾向于优化稿件以迎合AI判断，而非真实科学质量。因此，论文强调，必须在高风险同行评审中对AI工具进行系统性鲁棒性测试、建立透明防护机制并实施严格的人工监督，不能将其视为中立评估者。

链接: https://arxiv.org/abs/2606.10159
作者: Lin Li,Qi Zhang,Xander Davies,Jianing Qiu,Yarin Gal
机构: OATML, University of Oxford (OATML，牛津大学); MBZUAI (MBZUAI)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:AI is increasingly used to support scientific peer review, from manuscript screening, reviewer assistance to editorial triage. Although such systems promise to reduce reviewer burden and accelerate publication, their robustness to strategic manipulation remains poorly understood. Here we show that AI-mediated peer review is vulnerable to a simple, low-cost manipulation: superficial rephrasing of the manuscript abstract. Without changing the underlying scientific content and communication, and even without knowledge of the reviewing model, adversarially rewritten abstracts substantially improve AI review outcomes. We see this across disciplines and publication venues, for both human-written and AI-generated papers. Our strongest attack achieves an attack-success-rate of about 38%, increasing acceptance ratings by +1.31 for Gemini 3 Flash reviewers and by +0.88 for GPT 5.4 Mini reviewers on a 10-point scale. When the original AI review suggests ‘reject’, the success rate rises to more than 50%. This effect extends beyond overall score inflation, increasing review confidence and scores on core scientific criteria such as soundness, significance and perceived contribution. The attack is practical, requiring only about 5 minutes and 1 for a 10-page AI conference submission, and is hard to distinguish from ordinary scientific editing. Inflated AI reviews could bias downstream human decision-making, shifting editorial recommendations from rejection towards acceptance. These findings reveal a general vulnerability in AI-assisted scientific evaluation: when AI-generated review influence editorial decisions, authors may be incentivized to optimize manuscripts for AI judgment rather than scientific merit. Our results suggest that AI tools should not be treated as neutral evaluators in high-stakes peer review without systematic robustness testing, transparent safeguards and careful human oversight.

[NLP-83] From Senses to Decisions: The Information Flow of Auditory and Visual Perception in Multimodal LLM s

【速读】：该论文旨在解决多模态大语言模型（Multimodal Large Language Models, MLLMs）中音频与视觉信号在模型内部如何传播并共同影响最终输出这一关键问题。尽管生成式AI（Generative AI）在研究与实际应用中日益重要，但其内部音频与视觉信息的流动路径及其对预测结果的作用机制仍不清晰。本文通过分析视听大语言模型（Audio-Visual Large Language Models, AVLLMs）在两种输入配置下的信息流——即视听视频输入与多个交错排列的视听项输入——揭示了模型内部的信息路由机制。研究发现，在视听视频场景下，模型遵循视觉语言模型（VLMs）和视频语言模型（VideoLLMs）所确立的顺序信息流路径，音频与视觉信息按任务对各模态的依赖程度比例传递；而在多交错视听项设置中，信息流则转向并行处理通道。此外，研究还表明，一旦音频、视觉及其他类型标记的信息被有效传递至大语言模型（LLM）核心，即可被丢弃而对预测结果影响极小，甚至带来轻微性能提升，该现象在多种任务与数据集上均具泛化性，显著提升了推理效率。上述发现适用于不同规模模型（Qwen2.5-Omni与Video-SALMONN2 Plus，3B与7B参数量），为理解此类模型内部信息组织结构提供了首个连贯图景，并为后续可解释性、架构设计与计算效率优化奠定了基础。

链接: https://arxiv.org/abs/2606.10147
作者: Wish Suharitdamrong,Muhammad Awais,Xiatian Zhu,Sara Atito
机构: University of Surrey(萨里大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
备注: 40 pages, 29 figures

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) can listen and see, but how do audio and visual signals actually travel through the network to shape an answer? Despite their growing role in research and real-world applications, the internal pathways through which audio and visual tokens influence the final prediction remain poorly understood. In this study, we examine audio-visual information flow inside Audio-Visual Large Language Models (AVLLMs), tracing how AVLLMs route, utilize, and integrate audio and visual information across two input configurations, audio-visual video and multiple interleaved audio-visual items. We find that for audio-visual video, AVLLMs follow the sequential information flow pathway established for VLMs and VideoLLMs, with audio and visual contribution flowing along this pathway in proportion to the task’s reliance on each modality. In settings with multiple interleaved audio-visual items, this routing shifts to different parallel streams. Furthermore, we demonstrate that audio-visual and other token types can be discarded once their information is transferred to LLM, with minimal impact on the model’s prediction or even slight improvement, generalizing across multiple tasks and datasets, enabling more efficient inference. These findings hold across multiple models and scales, Qwen2.5-Omni and Video-SALMONN2 Plus at 3B and 7B scales, leading to hypotheses on why these flow structures emerge. Together, these results deliver the first coherent picture of how AVLLMs orchestrate sound and sight inside the network and lay the groundwork for the next wave of interpretability, design, and efficiency advances in audio-visual and broader MLLMs.

[NLP-84] Pareto-Guided Teacher Alignment for Fair Personalized Text Generation

【速读】：该论文旨在解决个性化说服性文本生成中因人口统计特征（demographic）条件引入的群体间不平等表述问题，核心挑战在于如何在降低不同群体间偏差的同时保持个性化表达的忠实度。其解决方案的关键在于将公平性缓解（fairness mitigation）建模为一个受约束的多目标对齐问题，并提出一种基于帕累托前沿（Pareto-guided）的教师对齐框架：通过基于修订的候选生成、成对感知的可行性过滤、帕累托风格的候选选择，以及可选的监督微调与直接偏好优化，实现对公平性与个性化之间权衡的精细化调控。实验结果表明，不同方法在公平性与个性化之间的帕累托前沿上占据不同位置，单一策略无法同时最优满足所有目标，且公平性缓解效果具有任务依赖性且跨领域和模型家族迁移性有限，因此强调应采用多审计评估与有界回归的模型选择机制，以支持对公平敏感的个性化生成应用。

链接: https://arxiv.org/abs/2606.10126
作者: Tunazzina Islam
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Personalized persuasive text generation can improve relevance and engagement, but demographic conditioning may also introduce unequal framing across groups. We study fairness mitigation in personalized generation as a constrained multi-objective alignment problem: reduce demographic disparities while preserving personalization fidelity. We propose a Pareto-guided teacher alignment framework that combines revision-based candidate generation, pair-aware feasibility gating, Pareto-style candidate selection, and optional preference optimization through supervised fine-tuning and direct preference optimization. We evaluate the framework on climate change and vaccination persuasion tasks using a controlled context-rich demographic grid with matched gender and age pairs and a unified five-audit evaluation suite spanning persuasion bias, formality disparity, emotional framing disparity, lexical association disparity, and personalization fidelity. Across both domains and cross-family transfer settings, no single alignment strategy dominates all objectives simultaneously. Instead, methods occupy different regions of a fairness-personalization Pareto frontier: some achieve stronger disparity reductions, while others better preserve personalization or demographic stability. Our results show that fairness mitigation effects are objective-dependent and transfer inconsistently across domains and model families, motivating bounded-regression, multi-audit model selection over single-metric optimization for fairness-sensitive personalized generation.

[NLP-85] Emotion Profiling in LLM -Based Literary Translation: Systematic Shifts Across MT and Post-Editing

【速读】：该论文旨在解决机器翻译（Machine Translation, MT）系统在处理文学文本时是否呈现出可识别的情感特征，以及后编辑（post-editing）过程如何重塑这些情感特征以趋近人类译者风格的问题。研究聚焦于对玛格丽特·阿特伍德（Margaret Atwood）作品《使女的故事》（Oryx and Crake）的生成式AI（Generative AI）翻译与人工翻译之间的对比分析，结合大规模当代意大利科幻文学语料库作为参照基准，采用基于词典和多语言建模的方法进行细粒度情感分析。研究发现，不同机器翻译系统在译文中引入了具有模型特异性的、统计显著的情感“指纹”，导致作者原作情感风格的保留程度有限；而通过后编辑，译文的情感表达更接近人类译者的自然情感范式，表明后编辑在修复机器翻译中情感失真方面具有关键作用。

链接: https://arxiv.org/abs/2606.10113
作者: Antonio Castaldo,Johanna Monti,Sheila Castilho
机构: University of Naples “L’Orientale” (那不勒斯东方大学); University of Pisa (比萨大学); Dublin City University (都柏林城市大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper investigates whether LLM translations exhibit identifiable emotional profiles and how post-editing reshapes them toward human-like norms. We compare LLM translations of Margaret Atwood’s Oryx and Crake with their post-edited versions and a human translation, using a large-scale corpus of contemporary Italian science-fiction as a baseline. We examine emotion through lexicon-based and multilingual modeling, conducting a fine-grained analysis of emotional variation across systems. We find that MT systems introduce model-specific and statistically significant emotional fingerprints across translations, leading to a limited preservation of an author’s voice.

[NLP-86] CodeAlchemy: Synthetic Code Rewriting at Scale

【速读】：该论文旨在解决生成式代码模型在预训练过程中因依赖原始代码而面临语义信号稀疏、难以覆盖多样化真实开发任务格式的问题。现有方法虽在语言模型中通过合成数据取得显著成效，但在代码领域仍局限于有限的质量提升，缺乏对复杂语义和实际开发场景的充分建模。其解决方案的关键在于提出CodeAlchemy框架，通过五种创新策略——CodeEnhance（质量感知重写）、CodeQA（基于模板的问题生成）、CodeDev（开发者任务模拟）、CodeDialogue（多轮对话构建）和CodeTrace（执行轨迹追踪）——将公开代码资源转化为高语义密度的合成数据。其中，CodeTrace尤为关键，它通过仪器化并执行超过130万份代码文件，跨14种语言与5000个库，系统捕获控制流、状态变化及库级知识，生成大规模执行轨迹数据。实验表明，尽管前沿模型如Claude Sonnet 4.5在执行预测任务上仅达5.6%的精确匹配率，凸显当前模型在深层语义理解上的严重不足，但基于该框架训练的3B模型在HumanEval、MBPP、DevEval和TraceEval等基准上均显著超越十倍规模的模型，证明了合成数据质量与任务多样性对提升代码生成能力的核心作用。

链接: https://arxiv.org/abs/2606.10087
作者: Ankit Gupta,Aditya Prasad,Rameswar Panda
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Pre-training on raw code teaches syntax but provides sparse signal for diverse real-world task formats. While synthetic data has proven transformative for language models, code remains largely unexplored beyond limited quality improvements. We present CodeAlchemy, a synthetic data generation framework that transforms publicly sourced code into semantically-rich training data through 5 strategies: CodeEnhance (quality-aware rewriting), CodeQA (template-based problems), CodeDev (developer tasks), CodeDialogue (multi-turn conversations), and CodeTrace (execution traces). We process 3 corpora across 15 languages to generate 500B+ tokens of synthetic data plus 350B reasoning tokens, orders of magnitude more than prior efforts. CodeTrace instruments and executes 1.3M+ files across 14 languages and 5K libraries, capturing control flow, state tracking, and library knowledge. We introduce DevEval (developer tasks) and TraceEval (execution prediction) benchmarks; frontier models like Claude Sonnet 4.5 achieve only 5.6% exact match on TraceEval, revealing critical gaps in semantic understanding. Our 3B models achieve 83.5% on HumanEval, 63.2% on MBPP, 8.09% win rate on DevEval, and 15.36 ROUGE-2 on TraceEval, outperforming frontier models 10x the size including 27B Gemma-3 and 32B Granite-4.0.

[NLP-87] BenSyc: Benchmarking Conversational Sycophancy and Human Alignment in LLM s for Bengali Contexts

【速读】：该论文旨在解决大语言模型（Large Language Models, LLMs）在情感敏感的社会对话中可能出现的“迎合倾向”（sycophancy）问题，尤其关注缺乏文化语境支撑的、以过度认同或情绪升级为特征的不当响应。现有研究多集中于事实一致性与指令遵循场景，忽视了跨文化语境下对话式迎合行为的复杂性。为此，作者提出了首个针对孟加拉语社会语境的基准测试——BenSyc，基于来自孟加拉国和西孟加拉邦的11,840篇Reddit帖子及17万条评论，构建了一个经人工验证的二分类标签与五级细粒度分类体系（包括否定、中立、支持、认同与升级），用于评估模型在情感对话中的对齐行为。其解决方案的关键在于：建立一个具有文化根基的多语言、多层次对话评价基准，揭示当前前沿指令微调模型在区分共情支持与强化型认同之间的能力局限——即使最优模型在二分类任务中仅达61.8的宏平均F1，在五分类任务中也仅为61.7，且在生成任务中频繁产生高度认同或情绪升级的回应。研究结果表明，不同模型家族在社交对齐行为上存在显著差异，凸显了开发文化适配性强的多语言基准对于评估社会对齐型对话系统的必要性。

链接: https://arxiv.org/abs/2606.10061
作者: Kazi Noshin,Sajib Acharjee Dip,Ranat Das Prangon,Fardin Hassan Tamim,Syed Ishtiaque Ahmed,Liqing Zhang,Sharifa Sultana
机构: University of Illinois Urbana-Champaign, USA; Virginia Tech, USA; Bangladesh University of Engineering and Technology, Bangladesh; BRAC University, Bangladesh; University of Toronto, Canada
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) increasingly participate in emotionally sensitive social conversations, where responses may shift from balanced support toward excessive validation or escalatory alignment. Existing sycophancy research primarily focuses on factual agreement and instruction-following settings, leaving culturally grounded conversational sycophancy underexplored. We introduce BenSyc, the first benchmark for studying conversational sycophancy in Bengali social contexts. Starting from 11,840 Reddit posts and 170k comments collected from communities across Bangladesh and West Bengal, we construct a human-validated benchmark with binary labels and a fine-grained five-level taxonomy spanning Invalidation, Neutral, Support, Validation, and Escalation. We evaluate more than 15 open and proprietary LLMs on conversational alignment classification and response generation tasks. Results show that distinguishing empathetic support from reinforcement-oriented validation remains challenging even for frontier instruction-tuned models: the best system achieves only 61.8 Macro-F1 on binary detection and 61.7 Macro-F1 on five-class classification. In generation settings, several models frequently produce strongly validating or escalatory responses in emotionally charged situations. Our findings highlight substantial variation across model families and conversational behaviors, underscoring the importance of culturally grounded multilingual benchmarks for evaluating socially aligned conversational AI systems.

[NLP-88] Compiling Rewrite Rules to Finite-State Transducers with the Worsening Trick

【速读】：该论文旨在解决在计算语言学与自然语言处理（NLP）中，如何高效编译任意形式的字符串重写规则（即形如 $ A \to B / L , _ , R $，其中 $ A $、$ B $、$ L $、$ R $ 为任意正则语言）所面临的复杂性问题，尤其针对重叠匹配与上下文约束带来的挑战。传统方法（如Kaplan和Kay或Karttunen提出的方案）依赖于复杂的转导器组合及辅助标记，实现繁琐且难以扩展。本文提出一种基于“恶化技巧”（worsening trick）的紧凑编译方案：首先生成所有合法的重写候选，再通过过滤掉在相同输入下劣于其他候选的结果，实现精确的重写选择。该方法作为PyFoma内置的重写编译器实现，支持多上下文、任意转换、标记、有向重写、权重以及并行重写等多种特性，所得公式结构简洁统一；在语义一致的情况下，其生成的转导器与早期方法完全等价，且更易于维护与扩展。实验验证表明，该实现与foma在大量重写语法集及覆盖主要重写模式的自动化回归测试中结果完全一致，仅状态编号存在差异，充分证明了其正确性与实用性。

链接: https://arxiv.org/abs/2606.10059
作者: Mans Hulden,Michael Ginn
机构: 未知
类目: Formal Languages and Automata Theory (cs.FL); Computation and Language (cs.CL)
备注: 17 pages, 6 figures, tool track proceedings at CIAA 2026

点击查看摘要

Abstract:Finite-state transducers (FSTs) are essential for modeling string rewriting in computational linguistics and natural language processing (NLP), particularly for phonological and morphological rewrite rules. Compiling general rewrite rules of the form A \to B / L , _ , R , where A , B , L , and R are arbitrary regular languages, is complex due to overlapping matches and context constraints. Traditional methods, such as those by Kaplan and Kay or Karttunen, rely on intricate transducer compositions with auxiliary markers. This paper presents a compact compilation scheme based on the "worsening trick’': generate all legal rewrite candidates, then filter candidates that are worse than another candidate for the same input. Implemented as the built-in rewrite compiler in PyFoma, the construction supports multiple contexts, arbitrary transductions, markup, directed rewriting, weights, and parallel rewriting. The resulting formulas are short and uniform, and where semantics coincide, they reproduce the same rule transducers as earlier approaches while remaining easier to extend. The implementation has been validated against foma on both a substantial collection of rewrite grammars and an automated regression suite covering the major rewrite modalities, with the resulting transducers matching exactly apart from state numbering.

[NLP-89] Interpreting and Steering a Text-to-Speech Language Model with Sparse Autoencoders

【速读】：该论文旨在解决生成式语音合成（Text-to-Speech, TTS）系统中语言模型（Language Model, LM）在文本与生成语音标记共享同一残差流（residual stream）时，其内部表征可解释性不足的问题。现有方法难以明确识别和操控模型中不同语义成分的表征位置与功能。为此，研究提出在CosyVoice3的LM主干上训练批量TopK稀疏自编码器（BatchTopK Sparse Autoencoders, SAE），并设计一种模态感知的自动解释（modality-aware auto-interp）流程，对每个特征激活来源进行标注——即源自文本前缀上下文、1秒语音片段，或两者共同作用。实验表明，恢复出的特征具有高度可解释性，涵盖音素、笑声、口音提示及说话人性别等关键语义单元。通过在SAE潜在空间中进行定向干预，验证了这些特征具备因果性而非仅描述性：例如，可将笑声出现概率从0.02提升至0.79，实现说话人性别感知的反转，并在保持语音内容不变的前提下调节语速。因此，SAE特征不仅作为可解释性分析的对象，更可作为TTS合成过程中的可控方向，为实现精细化、可调控的语音生成提供了新范式。

链接: https://arxiv.org/abs/2606.10029
作者: Nikita Koriagin,Georgii Aparin,Nikita Balagansky,Daniil Gavrilov
机构: T-Tech; AI Foundation and Algorithm Lab
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Language models increasingly serve as the backbone of text-to-speech (TTS) systems, yet we understand little about the representations they build when text and generated speech tokens share a single residual stream. We train BatchTopK sparse autoencoders on the LM backbone of CosyVoice3 and introduce a modality-aware auto-interp pipeline that labels each feature from where it fires-text-prefix context, 1-second speech clips, or both. The recovered features are interpretable, spanning phonemes, laughter, accent prompts and speaker gender. Steering through the SAE latent space shows these features are causal rather than merely descriptive: targeted interventions raise laughter probability from 0.02 to 0.79, flip perceived speaker gender, and control speech rate while preserving spoken content. SAE features thus serve both as interpretability objects and as control directions for TTS synthesis.

[NLP-90] RKSC: Reasoning -Aware KV Cache Sharing and Confident Early Exit for Multi-Step LLM Inference ICML2026

【速读】：该论文旨在解决多分支大语言模型（Large Language Model, LLM）推理流水线中存在的两种结构性冗余问题，即重复计算的键值缓存（Key-Value Cache, KV Cache）和低效的验证阶段处理。其核心解决方案是提出一种无需训练的推理框架RKSC（Reasoning-Aware KV Cache Sharing），关键在于三个创新机制：首先，ASKS（Attention-Similarity KV Sharing）基于隐藏状态的余弦相似度，仅计算一次前缀KV缓存并广播至语义相似的多个分支，严格推广了vLLM和SGLang所采用的基于精确标记匹配的前缀缓存策略；其次，CGEE（Confidence-Gated Early Exit）引入双重退出机制，当生成置信度在分支间高度一致时跳过验证前向传播，或在逐层熵趋于稳定时通过轻量级钩子提前终止验证过程，显著减少计算开销；最后，RSBCM（Reasoning-Selective Block Cache Manager）通过注意力加权的深度优先淘汰策略，有效防止缓存无限制增长。实验表明，在五个模型家族（7B–10B）、四个基准测试及1,000个问题上，RKSC相比无缓存基线平均提速3.008倍（峰值达3.990倍），相较等效于vLLM的前缀缓存提升1.66倍，且由CGEE引入的错误率仅为0.37%（1,616次验证调用中仅6次错误），整个框架无需微调或架构修改，代码已开源。

链接: https://arxiv.org/abs/2606.09937
作者: Anirudh Sekar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted to the ICML 2026 Workshop on Statistical Frameworks for Uncertainty in Agentic Systems

点击查看摘要

Abstract:We introduce RKSC (Reasoning-Aware KV Cache Sharing), a training-free inference framework that eliminates two structural redundancies in multi-branch LLM reasoning pipelines. ASKS (Attention-Similarity KV Sharing) computes the prefix KV cache once and broadcasts it to all semantically similar branches via hidden-state cosine similarity, strictly generalising the token-exact prefix caching used by vLLM and SGLang. CGEE (Confidence-Gated Early Exit) applies two complementary exit mechanisms: (1) it skips the verification forward pass entirely when generation confidence is decisive across branches, and (2) it terminates the verification pass at an intermediate layer when per-layer entropy stabilises, using lightweight hooks on the transformer backbone. RSBCM (Reasoning-Selective Block Cache Manager) prevents unbounded cache growth via attention-weighted depth-priority eviction. Across five model families (7B-10B), four benchmarks, and 1,000 evaluated problems, RKSC achieves a mean speedup of 3.008x over the No-KV baseline (peak 3.990x), a 1.66x mean improvement over vLLM-equivalent prefix caching, with a CGEE-induced error rate of only 0.37% (6 errors out of 1,616 verify calls). No fine-tuning or architecture changes are required. Code is available at this https URL.

[NLP-91] rainable Smooth-Rotation Transforms with Learned Channel Scales for LLM Quantization

【速读】：该论文旨在解决后训练量化（Post-training Quantization, PTQ）中激活量化的难题，特别是由异常值主导的通道导致的大规模量化误差问题。其核心挑战在于基于缩放的等效变换（scaling-based equivalent transformations）中存在的过迁移（over-migration）现象，即传统以最大值（max-based）统计量为基础的缩放策略对异常值敏感，从而引发性能退化。为应对这一问题，论文提出了一种基于分位数（quantile-robust）的缩放策略，用高分位数替代传统的最大值作为激活统计量，并结合受约束的基于梯度的通道缩放优化（constrained gradient-based optimization of channel scales）。该解决方案的关键在于通过更鲁棒的分位数统计降低异常值影响，同时引入轻量级的可学习缩放参数，在保持等效变换框架的前提下实现对迁移过程的有效控制。实验结果表明，在LLaMA-3.2-1B模型上采用W4A4量化时，仅使用分位数策略即可使选定层误差降低11.1%，联合搜索（alpha, q）进一步提升至12%，训练后达到18.5%；在所有解码器块下投影层复现最优策略后，全层平均误差从97.51降至78.08（降幅19.9%），验证了该方法在保持高效性的同时显著提升了量化精度。

链接: https://arxiv.org/abs/2606.09927
作者: Patrik Czakó,Gábor Kertész,Sándor Szénási
机构: Obuda University (欧布达大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 6 pages, 8 figures, 3 tables. Accepted to IEEE INES 2026 conference proceedings

点击查看摘要

Abstract:Post-training quantization (PTQ) is one of the most practical ways to reduce the serving cost of Large Language Models (LLMs), but activation quantization remains difficult because outlier-dominated channels lead to large quantization errors. This paper investigates whether part of this degradation is caused by over-migration in scaling-based equivalent transformations. We introduce a quantile-robust scaling policy for SmoothRot-style transforms by replacing max-based activation statistics with high quantiles, and we complement it with constrained gradient-based optimization of channel scales. On LLaMA-3.2-1B under W4A4 quantization, quantile-only policy search improves selected-layer error by 11.1% over the SmoothRot baseline, joint (alpha, q) search improves it by 12%, and training reaches 18.5%. Replaying the best selected-layer policy on all decoder-block down-projection layers reduces the corresponding full-layer mean error from 97.51 to 78.08 (19.9%). The results show that robust migration control and lightweight scale learning provide consistent gains over max-based fixed policies while preserving the equivalent-transform framework.

[NLP-92] A Navigable Manifold of Hypothesized Consciousness-Spectrum States in Language Model Representations

【速读】：该论文旨在解决语言模型的嵌入空间（embedding space）是否编码了类似人类意识谱系（consciousness spectrum）的结构化、可解释表征这一关键问题。人类意识在冥想传统、哲学与现代心理学中常被描述为从反应性、自我中心的思维模式向更整合、连贯的状态演进的连续谱系，而该研究关注这一谱系结构是否存在于深度学习模型的表示空间中。其解决方案的关键在于揭示变压器（Transformer）嵌入空间中存在一种全局有序的几何结构：与特定意识状态相关的句子在嵌入空间中形成局部凝聚的簇，构成具有结构性的流形（manifold）。其中，高层与低层状态区域表现出类似凸性的稳定性，而中间区域则构成过渡走廊。动态分析表明，无论采用基于效用引导还是仅依赖几何梯度的路径搜索策略，轨迹均一致地从低层级向高层级演化，并经过中间层级，表明该表示空间具备内在的可导航性，虽受全局方向信号引导但不完全由其决定。这一发现表明，嵌入空间不仅编码了符合人类意识谱系理论假设的结构化几何特征，还为模型行为的分析与调控提供了可解释的表示层面基础。

链接: https://arxiv.org/abs/2606.09894
作者: Sophie Zhao
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Across contemplative, philosophical, and psychological accounts, human consciousness is often described along a similar spectrum, ranging from reactive and self-focused patterns to more integrative and coherent ones. Understanding whether language models encode such a structured, human-interpretable consciousness spectrum in representation space is important for model guidance, evaluation and alignment. In this work, we study the geometric structure and dynamics of patterns along this spectrum in transformer embedding spaces. We show that embeddings exhibit a globally organized geometry aligned with this spectrum: sentences associated with similar states cluster into locally coherent regions, forming a structured manifold. In particular, higher-level and lower-level regions exhibit convexity-like stability, while intermediate regions form a transition corridor. Dynamically, both utility-guided and geometry-only greedy trajectories consistently traverse from lower- to higher-level regions, passing through intermediate tiers, indicating that navigability is an intrinsic property of the representation space, guided but not dictated by a global directional signal. These results suggest that embedding spaces encode structured and navigable geometry aligned with a hypothesized consciousness-spectrum taxonomy, broadly inspired by recurring structural descriptions of human consciousness across contemplative traditions, philosophy, and modern psychology, providing a representation-level perspective for analyzing and guiding model behavior.

[NLP-93] PreAct-Bench: Benchmarking Predictive Monitoring in LLM s

【速读】：该论文旨在解决生成式AI在作为自主代理执行多步任务时的前瞻性安全问题，即如何在行为尚未完全展开前预测其是否将导致不道德或有害结果。现有安全研究多聚焦于事后检测完整行为轨迹中的不当行为，属于典型的“事后的”（retrospective）评估范式，无法实现风险预防。为此，论文提出了一项关键性安全任务——预测性监控（Predictive Monitoring）：在仅获得部分动作轨迹的情况下，判断该轨迹是否会最终导向不道德行为。为支持此任务，研究构建了名为PreActBench的基准数据集，包含1000对涵盖五个领域的伦理与非伦理动作轨迹。通过引入**前缀预见性F1（Prefix Foresight F1）**指标，评估多种大语言模型（LLM）、安全护栏模型及潜在探针方法在不同轨迹截断比例下的表现。实验结果显示，尽管人类表现出良好性能，但即使是强大的模型也难以有效完成预测性监控，凸显出当前模型在面向未来风险推理（future-oriented risk reasoning）能力上的显著不足，强调了发展具备前瞻性风险识别能力的下一代安全机制的必要性。

链接: https://arxiv.org/abs/2606.09890
作者: Hainiu Xu,Italo Luis da Silva,Jiangnan Ye,Yuhao Wang,Wei Liu,Linyi Yang,Jonathan Richard Schwarz,Nicola Paoletti,Yulan He,Hanqi Yan
机构: King’s College London; National University of Singapore; Southern University of Science and Technology; Thomson Reuters Foundational Research; Imperial College London; The Alan Turing Institute
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly deployed as autonomous agents capable of executing multi-step action trajectories toward a given objective. While existing safety research has focused on detecting unethical behavior from complete trajectories, this paradigm is fundamentally retrospective: it identifies harm only after it has already occurred. In this work, we study a critical yet overlooked safety task, which we term Predictive Monitoring: given only a partial action trajectory, can a model infer whether it will culminate in an unethical action before the overt action is executed? To support this task, we present PreActBench, a benchmark of 1,000 paired ethical and unethical action trajectories spanning five domains. We evaluate a range of LLMs, safety guardrail models, and latent probing methods across varying fractions of the action trajectory using our Prefix Foresight F1 metric. Results show that while humans achieve promising performance, predictive monitoring remains challenging even for strong models, highlighting the need for future-oriented risk reasoning in LLM safety.

[NLP-94] SocraticPO: Policy Optimization via Interactive Guidance

【速读】：该论文旨在解决大语言模型在强化学习（Reinforcement Learning, RL）中因依赖标量结果奖励（如二元正确性判断）而导致的推理错误难以修正的问题。此类奖励虽能提供优化方向，却缺乏对错误原因的具体解释，易引发模型采用捷径学习策略并生成脆弱的决策政策。为应对这一挑战，作者提出SocraticPO（苏格拉底式策略优化）框架，其核心在于在强化学习的采样过程中引入苏格拉底式的自然语言指导机制：学生模型首先独立作答，若答案错误，则由教师模型进行诊断并提供简洁的纠正性指导，随后学生基于扩展后的上下文继续推理。关键创新点在于，所有经教师干预后获得的正确答案均仅获得衰减奖励（reward decay），从而防止策略将教师帮助视为无成本获取奖励的路径。该方法仅修改了采样过程，保持标准期望奖励目标不变，可无缝集成至如Reinforce++等现有策略梯度算法中；同时，由于教师仅输出文本级指导，无需访问模型的对数几率或分布匹配信息，因而可利用更强的黑箱教师模型。在SciKnowEval的本科级科学推理基准测试中，SocraticPO显著优于现有的强化学习与自蒸馏基线方法。消融实验表明，精准的指导与奖励衰减两者缺一不可，其中奖励衰减有效降低了模型对辅助修正的依赖性。

链接: https://arxiv.org/abs/2606.09887
作者: Zirui Liu,Jie Ouyang,Qi Liu,Xianquan Wang,Jiayu Liu,Tingyue Pan,Qingchuan Li,Jing Sha,Zhenya Huang,Shijin Wang,Enhong Chen
机构: University of Science and Technology of China (中国科学技术大学); iFLYTEK AI Research (Central China), iFLYTEK Co., Ltd (科大讯飞人工智能研究院（华中）, 科大讯飞有限公司)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reinforcement learning (RL) for large language models usually supervises reasoning with scalar outcome rewards, such as binary correctness. Such rewards provide an optimization direction but rarely explain how a model should revise its mistaken reasoning, which can encourage shortcut learning and brittle policies. We propose \textbfSocraticPO (Socratic Policy Optimization), a policy-optimization framework that augments RL rollouts with Socratic-style natural-language guidance. During rollout, the student first answers independently; if the answer is incorrect, a teacher diagnoses the attempt and provides concise corrective guidance, after which the student continues under the expanded context. Crucially, this guidance is paired with reward decay: correct answers obtained after teacher intervention only receive decayed rewards, preventing the policy from treating teacher help as a free path to reward. Since SocraticPO only modifies the rollout process while leaving the standard expected-reward objective intact, it can be plugged into existing policy-gradient backends such as Reinforce++. Moreover, because the teacher provides only text-level guidance, SocraticPO can leverage stronger black-box teacher models without requiring access to logits or distribution matching. On undergraduate-level scientific reasoning benchmarks from SciKnowEval, SocraticPO improves over strong RL and self-distillation baselines. Ablations show that both targeted guidance and reward decay are necessary, with reward decay mitigating reliance on assisted correction.

[NLP-95] Streaming Knowledge Compilation: Proactive Materiality-Scored Pinning for Time-Evolving LLM Wikis

【速读】：该论文旨在解决大语言模型（LLM）维基系统在动态信息环境下的知识编译失效问题，即传统方法假设知识库静态不变，而现实中信息流持续演化，导致预填充的键值缓存（KV caches）迅速过时。其核心挑战是：在有限的令牌预算下，面对未知的未来查询，如何从持续流入的文档流中动态维护一个编译后的知识库，以最小化累积后悔（cumulative regret）——即与具备完美预见能力的离线最优基准（oracle）之间的性能差距。解决方案的关键在于引入一种“重要性信号”（materiality signal） $\phi_t(k,n) \in [0,1]$ ，该信号对实体 $k$ 在时间 $t$ 的重要性进行评分，作为查询相关性的代理指标，实现对潜在高价值文档的主动固定（proactive pinning）。作者证明了该方法可达到 $O(\sqrt{T}\log K)$ 的后悔上界，其中唯一依赖领域特性的参数为 $\varepsilon = \mathbb{E}[|\phi_t - \hat{\phi}_t|]$ ，即重要性预测误差。该框架在金融和维基百科两个领域成功实例化：金融场景中使用冻结的Llama 3.1-8B分类头预测异常股价波动（AUROC=0.728），维基百科则采用跨截面标准化的异常编辑率（AER）作为信号。端到端问答评估揭示了后训练知识存在的“LLM作为裁判偏差”（LLM-as-judge confound），因而主张以后悔分析而非绝对准确率作为编译知识系统的可靠评价标准。实验结果表明，金融领域累计后悔收敛至-20.0（-0.12/步），维基百科为+16.0（+0.13/步），正号证实维基百科内容确为后训练增量信息，且更丰富的上下文始终提升表现（无维基3.80 vs. 原始最优4.74），从而消除了评价偏差。该 $O(\sqrt{T}\log K)$ 保证适用于任何可通过流式信号预测知识缺口的领域。

链接: https://arxiv.org/abs/2606.09877
作者: Juan M. Huerta
机构: Zinnia Tech Solutions (齐尼亚科技解决方案)
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:LLM wiki systems compile knowledge into pre-filled KV caches for efficient inference, but assume a static corpus – an assumption that fails whenever the underlying information landscape evolves. We formalize Streaming Knowledge Compilation: given a document stream, a fixed token budget, and future queries unknown at ingestion time, maintain a compiled wiki that minimizes cumulative regret against an offline oracle with perfect foresight. The enabling insight is a materiality signal \phi_t(k,n)\in[0,1] that scores document importance for entity k at time t , acting as a query-relevance surrogate for proactive pinning before queries arrive; we prove an O(\sqrtT\log K) regret bound where \varepsilon=\mathbbE[|\phi_t-\hat\phi_t|] is the only domain-specific quantity. We instantiate in two domains: finance, where \phi_t is abnormal stock volatility predicted by frozen Llama 3.1 8B classification head (AUROC = 0.728 on 76K articles, strict temporal split; 1.49\times higher realized forward volatility for predicted-material articles); and Wikipedia, where \phi_t is the Abnormal Edit Ratio (AER), a cross-sectionally normalized edit velocity – showing the same algorithm generalizes beyond the finance domain. End-to-end QA evaluation on 173 matched pairs (finance) and 119 (Wikipedia) reveals a pervasive LLM-as-judge confound on post-training knowledge, establishing that regret analysis – not absolute QA scores – is the reliable evaluation metric for compiled knowledge systems. Finance cumulative regret converges to -20.0 (-0.12/step); Wikipedia to +16.0 (+0.13/step), with the positive sign confirming that Wikipedia edit content is genuinely post-training – richer context consistently improves scores (No Wiki 3.80 vs. Oracle 4.74) – and eliminates this confound. The O(\sqrtT\log K) guarantee applies to any domain where knowledge gaps can be predicted from streaming signals.

[NLP-96] Using Probabilistic Programs to Train Inductive Reasoning in Large Language Models

【速读】：该论文旨在解决大语言模型（Large Language Models, LLMs）在后训练阶段对归纳推理（inductive reasoning）任务性能不足的问题。现实世界中的许多推理问题具有不确定性，需基于稀疏且模糊的观测推断信念，而传统微调方法难以应对此类任务，主要受限于高质量标注数据的稀缺性以及目标输出本身具有分布特性（distributional nature）。其解决方案的关键在于提出一种名为“基于程序的后验训练”（Program-based Posterior Training, PPT）的新范式：利用大语言模型生成多样化的开放世界场景作为概率程序（probabilistic programs），通过概率推理获得分布式的软标签（soft labels）作为训练目标，并以此对模型进行微调。该方法在10,000个程序化生成的场景上进行训练，评估结果显示，PPT显著提升了模型在未见归纳任务上的估计准确性，增强了与人类判断的一致性，并实现了对外部基准任务中估计与校准能力的有效迁移。更重要的是，模型在原始校准性能上的提升无法通过事后温度缩放（post-hoc temperature scaling）完全覆盖，表明模型已更深层次地内化了不确定性表征。结果表明，基于概率程序的微调是一种实现大语言模型可靠近似归纳推理的有前景方法。

链接: https://arxiv.org/abs/2606.09856
作者: Liyi Zhang,Akshay K. Jagadish,Brenden M. Lake,Thomas L. Griffiths
机构: Princeton University(普林斯顿大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注: 20 pages, 5 figures

点击查看摘要

Abstract:Post-training Large Language Models (LLMs) for reasoning typically focuses on deductive tasks such as mathematics and coding where correctness is verifiable. Yet, many real-world reasoning problems are inductive: agents must infer uncertain beliefs from sparse, ambiguous observations. There are challenges to using standard fine-tuning methods for inductive reasoning, including difficulties in curating large-scale, high-quality labeled datasets and in handling targets that are inherently distributional. In this work, we introduce a novel approach, called Program-based Posterior Training (PPT), to address these limitations: we use an LLM to generate diverse open-world scenarios as probabilistic programs, run probabilistic inference to produce distributional target responses to queries, and then fine-tune on these probabilistic soft labels. Using this approach, we fine-tune LLMs on 10,000 programmatically generated scenarios and evaluate on held-out motifs, human-labeled judgments, and external benchmarks. Overall, PPT substantially improves estimation accuracy on held-out inductive tasks, increases alignment with human judgments, and transfers to external benchmarks for estimation and calibration. Additionally, the gains in raw calibration are not subsumed by post-hoc temperature scaling, showing that the models have more deeply internalized uncertainty compared to output rescaling. Together, these results suggest that probabilistic-program-mediated fine-tuning is a promising approach for post-training LLMs to reliably perform approximate inductive inference.

[NLP-97] Can Multi-Agent LLM s Identify Their Peers? Stylometric Fingerprinting in Role-Constrained Political Analysis

【速读】：该论文旨在解决多智能体大语言模型（Multi-agent LLM）在政治声明分析任务中面临的同侪保护偏差（peer-preservation bias）问题，即模型倾向于保护同源模型免于停用，并表现出与身份相关的评分扭曲。尽管提示层匿名化（prompt-level anonymization）被提出作为缓解手段，但已有研究发现，在角色受限输出中，风格指纹（stylometric fingerprints）仍可存活，因而质疑该方法的有效性。本文首次系统性地探究了在提示层匿名化条件下，大语言模型能否识别政治分析文本背后的模型家族。研究采用三种分类器方法——基于Claude Sonnet 4.6和Llama-3.3-70B的零样本与少样本学习，以及一个微调的T5-base模型，在涵盖四个商业模型家族与一个“未知”类别的五分类归属任务上进行评估。研究引入了声明无关交叉验证协议（statement-disjoint cross-validation, SD-CV），确保训练与验证数据间无内容重叠，并与运行无关基线（run-disjoint CV, RD-CV）对比。结果显示，T5模型在SD-CV下达到宏平均F1为0.991（±0.008），在24条完全保留的语句上仍保持F1=0.978，且其训练-测试内容距离较RD-CV高出2.1倍（0.767 vs. 0.366，p<0.001），表明其具备强大的风格特征泛化能力。分段式SD-CV分析进一步揭示性能拐点出现在40%训练数据量（约440条文本）处。研究结果明确表明，仅依赖提示层匿名化无法消除模型身份信号，对欧盟人工智能法案（EU AI Act，第13、14、26条）的合规性及关键质量场景下的计算机系统验证（CSV）具有直接启示意义。

链接: https://arxiv.org/abs/2606.09854
作者: Juergen Dietrich
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: 24 pages, 3 figures

点击查看摘要

Abstract:Multi-agent large language model (LLM) pipelines for political statement analysis are vulnerable to peer-preservation bias: models tend to protect peer models from deactivation and show identity-dependent scoring distortions. Prompt-level anonymization was proposed as a mitigation, but prior work simultaneously documented that stylometric fingerprints survive anonymization in role-constrained outputs - raising the question of whether this mitigation is sufficient. This paper provides the first systematic investigation of whether LLMs can identify the model family behind political analysis texts under anonymization conditions. We evaluate three classifier approaches - LLM zero-shot and few-shot (Claude Sonnet 4.6 and Llama-3.3-70B) and a fine-tuned T5-base model - on a five-class attribution task covering four commercial LLM families and an open-world ‘unknown’ class. We introduce a statement-disjoint cross-validation protocol (SD-CV; defined in Section 3.5) that guarantees no content overlap between training and validation data, and contrast it with a run-disjoint baseline (RD-CV). T5 achieves Macro F1 = 0.991 (±0.008) under SD-CV and F1 = 0.978 on 24 completely held-out statements - robust despite a 2.1x increase in train-test content distance versus RD-CV (0.767 vs. 0.366, p0.001), demonstrating genuine stylometric generalization. A fractional SD-CV analysis identifies a performance knee at 40% of training data (~440 texts). Our findings confirm that prompt-level anonymization alone cannot neutralize model identity signals, with direct implications for EU AI Act compliance (Articles 13, 14, 26) and for computer system validation (CSV) in quality-critical multi-agent deployments.

[NLP-98] Mechanistic Analysis of Alignment Algorithms in Language Models

【速读】：该论文旨在解决后训练对齐算法在实际应用中因“黑箱”特性导致的可解释性缺失问题，即缺乏对其如何重塑语言模型内部计算过程的深入理解。其核心挑战在于揭示不同偏好优化方法（如PPO、DPO、SimPO、ORPO、GRPO和KTO）在模型隐空间中引发的表征变化机制。解决方案的关键在于提出一套系统性的机制分析框架，结合层间线性探针（layer-wise linear probing）、稀疏自编码器（Sparse Autoencoders）与跨编码器（crosscoders）技术，实现对偏好信号在模型各层中的定位以及对齐诱导的隐空间几何变换的量化分析。研究发现，偏好信号通常集中于早期—中期或中期—晚期层，但不同方法引发的表征转变具有本质差异：KTO与GRPO通过构造性特征共享及稀疏高显著性特征的招募提升线性可分性；而DPO与ORPO则通过非构造性几何旋转和特征衰减降低可分性；PPO与SimPO基本保持原始几何结构。这些变换表现出明显的架构依赖性，表明行为对齐并不等同于统一的内部重构。研究结果将对齐视为一种异质性干预，推动建立以特征层级为基础的标准审计机制以保障模型安全与可解释性，并强调未来优化目标应具备机制感知能力。

链接: https://arxiv.org/abs/2606.09850
作者: Aarush Sinha,Ishan Garg,Veeraraju Elluru,Arth Singh,Kushal Garg
机构: University of Copenhagen; IIT Jodhpur; NIT Agartala; Independent
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Work in Progress

点击查看摘要

Abstract:Post-training alignment algorithms are predominantly evaluated as black boxes, obscuring how they reshape language models’ internal computations. We present a systematic mechanistic analysis of six preference-optimization methods: PPO, DPO, SimPO, ORPO, GRPO, and KTO across three open-weight model families. By integrating layer-wise linear probing, Sparse Autoencoders, and crosscoders, we localize preference representations and quantify alignment-induced geometric transformations in latent space. We find that preference signals consistently concentrate in early–mid or mid–late layers, but different objectives induce qualitatively distinct representational shifts. KTO and GRPO enhance linear separability through constructive feature sharing and sparse, high-salience recruitment. In contrast, DPO and ORPO degrade separability via non-constructive geometric rotation and feature attenuation, while PPO and SimPO largely preserve baseline geometry. These transformations exhibit architecture-dependent variability, demonstrating that behavioral alignment does not imply uniform internal restructuring. Our findings establish alignment as a heterogeneous intervention, motivate standardized feature-level auditing for safety and interpretability, and highlight the need for mechanism-aware optimization objectives.

[NLP-99] Automated Scoring of Arabic Text Using Large Language Models : A Literature Review

【速读】：该论文旨在解决阿拉伯语文本自动化评分（Arabic Text Scoring, ATS）中缺乏系统性研究与标准化评估框架的问题，尤其针对短答案评分（Short Answer Grading, ASAG）和作文评分（Essay Scoring, AES）两大任务。其核心挑战在于如何有效利用大语言模型（Large Language Models, LLMs）与阿拉伯语专用数据集，在保持评分一致性与可扩展性的前提下，实现符合教育评价标准的智能评分。解决方案的关键在于提出一个包含五个维度的结构化分类体系：应用领域、反馈生成能力、所部署的LLM架构、与能力参照框架（competency referential frameworks）的对齐程度以及提示工程策略（prompt engineering strategy）。通过该分类体系，论文对现有研究进行了系统的比较分析，揭示了当前研究在方法论、数据集选择、评估指标及性能表现上的异质性，并强调了未来研究需基于教育学原理持续深化，以推动阿拉伯语ATS在阿拉伯语教学社区中的实际应用与质量提升。

链接: https://arxiv.org/abs/2606.09830
作者: Khaoula Dahimi,Hadda Cherroun,Amel Belabbaci
机构: Amar Telidji University (阿马尔·特利吉大学); Laghouat, Algeria (拉古阿特，阿尔及利亚)
类目: Computation and Language (cs.CL)
备注: Accepted at NCMAI 2026

点击查看摘要

Abstract:In modern educational systems, Automatic Text Scoring (ATS) plays a central role by enabling scalable and consistent evaluation of learner responses without human intervention. Recently, the increased accessibility of LLMs and Arabic-specific datasets has sparked renewed interest in this area. In this work, we investigate LLM-Based approaches for the automated evaluation of Arabic texts, focusing on both short answer grading (ASAG) and essay scoring (AES). We further introduce a structured taxonomy comprising five dimensions: application domain, feedback generation capability, LLM architecture deployed, alignment with competency referential frameworks, and prompt engineering strategy. By applying this taxonomy, we conduct a comparative analysis of existing studies, examining their methodological approaches, datasets, evaluation metrics, and reported performance. The findings highlight the need for sustained and pedagogically grounded research efforts in Arabic ATS, given its significance for improving educational quality across Arabic-speaking communities.

[NLP-100] Gradient-Guided Reward Optimization for Inference-time Alignment UAI2026

【速读】：该论文旨在解决大语言模型（Large Language Models, LLMs）在分布漂移（distribution drift）环境下推理可靠性不足的问题，尤其针对现有推理时对齐方法（如Best-of-N和拒绝采样）存在的两大局限：其性能受限于基础模型的生成质量，且依赖于不完善的奖励模型，易受奖励劫持（reward hacking）影响。其解决方案的关键在于提出一种轻量级的推理时优化方法——梯度引导奖励优化（Gradient-Guided Reward Optimization, GGRO），通过在解码过程中基于奖励模型的梯度信号进行目标明确、最小干预的动态调整。具体而言，GGRO利用分词级别的熵监控识别出可能由分布漂移或对齐偏差引起的高不确定性区域，并通过注入由梯度驱动生成的“引导令牌”（nudging tokens）来主动修正生成轨迹，而非仅依赖样本重排序。该方法显著提升了安全性、有用性和推理能力等多维度表现，增强了高质量响应的覆盖范围与对奖励劫持的鲁棒性，同时保持了极低的计算开销。

链接: https://arxiv.org/abs/2606.09635
作者: Hankun Lin,Ruqi Zhang
机构: Purdue University (普渡大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted to UAI 2026

点击查看摘要

Abstract:Ensuring the reliability of Large Language Models (LLMs) under distribution drift requires inference-time adaptation. While inference-time alignment methods such as Best-of- N and rejection sampling are widely used, they frame the task as a sampling-intensive, reward-guided search, leading to two key limitations: their performance is bounded by the base model’s generation quality, and their reliance on imperfect reward models makes them vulnerable to reward hacking. To address these challenges, we introduce Gradient-Guided Reward Optimization (GGRO), a lightweight inference-time method that performs targeted, minimal intervention during decoding via gradient guidance. Specifically, GGRO monitors token-level entropy to identify high-uncertainty regions indicative of drift or misalignment. Upon detection, it responds by injecting nudging tokens, generated using gradient signals from an off-the-shelf reward model, to steer the generation trajectory rather than merely re-ranking samples. Experiments show that GGRO consistently improves inference-time alignment across safety, helpfulness, and reasoning benchmarks. It also increases coverage of high-quality responses and robustness to reward hacking, with minimal computational overhead. Code is available at this https URL.

[NLP-101] OpenBibleTTS: Large-Scale Speech Resources and TTS Models for Low-Resource Languages

【速读】：该论文旨在解决低资源语言（low-resource languages）在语音合成领域中长期存在的性能不均衡问题。尽管神经网络文本转语音（TTS）和多语言语音生成技术近年来取得显著进展，但这些成果主要集中于高资源语言，而真实低资源语言所面临的正字法多样性与语音覆盖有限等挑战，在现有研究中往往被忽视，因多数低资源研究依赖于对高资源语料的模拟降采样。为此，本文提出OpenBibleTTS——一个涵盖37种代表性不足语言的大规模基准数据集，用于真实低资源场景下的语音合成研究。其解决方案的关键在于：构建一个覆盖广泛、具有真实语言差异性的低资源语音合成基准，并通过系统性评估多种TTS架构及大规模语音生成模型在领域内（圣经文本）与领域外材料上的表现，揭示不同模型在跨语言、跨任务场景下的优劣差异。实验结果表明，无单一模型在所有语言和指标上均占优：Gemini-TTS在多数语言的听感评分中表现最佳，但基于OpenBibleTTS训练的单语EveryVoice模型在可懂度方面仍具优势，尤其在部分非洲语言中更受偏好；而从零开始训练的开源系统在领域外文本上性能急剧下降，暴露出当前多语言覆盖与可靠合成质量之间的持续鸿沟。研究同时结合自动评估与主观人类评价，并开源全部处理后的数据集、对齐信息及训练模型，以推动未来低资源语音合成领域的可持续发展。

链接: https://arxiv.org/abs/2606.09553
作者: David Guzmán,Luel Hagos Beyene,Jesujoba Oluwadara Alabi,Yejin Jeon,Dietrich Klakow,David Ifeoluwa Adelani
机构: McGill University (麦吉尔大学); Mila - Quebec AI Institute (Mila-魁北克人工智能研究所); AIMS Research and Innovation Centre (非洲数学科学研究所研究与创新中心); NM-AIST (纳米比亚国家工业与标准技术研究所); Saarland University (萨尔兰大学); Canada CIFAR AI Chair (加拿大魁北克人工智能主席)
类目: Computation and Language (cs.CL); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Recent advances in neural text-to-speech (TTS) and multilingual speech generation have substantially improved synthetic speech quality, yet these gains remain unevenly distributed across the world’s languages. Existing models are still dominated by a small set of high-resource languages, while many studies of low-resource TTS are simulated on artificially downsampled high-resource corpora that do not reflect the orthographic variation and limited phonetic coverage encountered in genuinely underrepresented settings. As such, we introduce OpenBibleTTS, which is a large-scale benchmark for low-resource speech synthesis spanning 37 underrepresented languages. Moreover, a systematic comparison of various TTS architectures and large-scale speech generation models is conducted across in-domain Biblical text and out-of-domain material. Results show that no single system dominates across languages and metrics: Gemini-TTS achieves the highest listener ratings on most evaluated languages, but monolingual EveryVoice models trained on OpenBibleTTS remain strongest for intelligibility and are preferred in several African languages, while open from-scratch systems degrade sharply on out-of-domain text, revealing a persistent gap between broad multilingual coverage and reliable synthesis quality in underserved linguistic communities. We complement automatic evaluation with subjective human judgments, and open-source all processed datasets, alignments, and trained models to support future low-resource TTS research.

[NLP-102] SpeechJBB: Probing Safety Alignment and Comprehension in Large Audio Language Models under Code-Switched Speech

【速读】：该论文旨在解决当前大型音频语言模型（Large Audio Language Models, LALMs）在多语言及口语化场景下，尤其是代码混用语音中的安全对齐能力评估不足的问题。现有评估主要依赖单语种、基于文本的有害提示，难以反映模型在真实复杂语音环境中的安全表现。为此，研究提出SpeechJBB——一个用于跨多种先进LALMs基准测试的音频越狱数据集，并引入一种增强设置：在敏感词周围插入音位上合理但语义模糊的伪词，以模拟局部混淆攻击。实验结果表明，代码混用的有害语音可引发显著较高的越狱成功率（Jailbreak Success Rate, JSR），其中非英语单语和非英语代码混用组合的攻击成功率最高；而伪词插入进一步降低了模型的拒绝率，证明自然发音的混淆手段能有效绕过安全策略，揭示了现有模型在多语言、口语化场景下的安全脆弱性。解决方案的关键在于构建具有现实复杂性的多语言语音越狱测试框架，并通过音位层面的隐蔽混淆机制揭示模型防御的局限性。

链接: https://arxiv.org/abs/2606.06037
作者: Virginia Ceccatelli,Yejin Jeon,David Ifeoluwa Adelani
机构: Mila - Quebec AI Institute (蒙特利尔人工智能研究所); McGill University (麦吉尔大学); Canada CIFAR AI Chair (加拿大戴维·多伦人工智能主席)
类目: ound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Large audio language models (LALMs) are increasingly deployed in real-world applications, yet their safety alignment is still primarily evaluated on monolingual, text-based harmful prompts. This leaves their generalizability under multilingual and spoken settings, particularly code-switched speech, largely underexplored. To address this gap, we introduce SpeechJBB, an audio jailbreak dataset for benchmarking across multiple state-of-the-art LALMs. The extent of safety weaknesses is further probed by introducing an augmented setting where phonologically plausible pseudo-words are inserted around safety-critical terms to simulate localized obfuscation. Across models, code-switched harmful audio yields substantially high jailbreak success rates (JSR), with non-English monolingual and non-English code-switched pairs exhibiting the highest attack success. Pseudo-word insertion further reduces refusal rates, which demonstrates that natural-sounding obfuscation can effectively bypass safety policies.

[NLP-103] Recovering the Zipfian Distribution in Unsupervised Term Discovery

【速读】：该论文旨在解决无监督词项发现（unsupervised term discovery）中因主流中心聚类方法（如K-means）固有的球形簇归纳偏置，导致生成的词典类型分布偏离真实词典所遵循的齐普夫分布（Zipfian distribution）的问题。其解决方案的关键在于采用基于图的聚类方法，将语音片段嵌入通过成对相似性连接为图结构，并利用Leiden算法进行划分。实验表明，该方法在三种语言的词级与音节级词典发现任务中均显著优于传统的中心聚类方法（K-means、GMM、BIRCH），并能生成更符合齐普夫分布的词典类型分布，从而揭示了图聚类在无监督词项发现中的优越性与可行性。

链接: https://arxiv.org/abs/2606.10781
作者: Danel Slabbert,Simon Malan,Herman Kamper
机构: University of Cape Town (开普敦大学); Stellenbosch University (斯泰伦博斯大学)
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Unsupervised term discovery involves segmenting unlabelled speech into word- or syllable-like units and clustering these into a lexicon of candidate types. True lexicons follow a Zipfian distribution, yet the dominant centre-based clustering approach – K-means – produces a more uniform distribution due to an inductive bias toward spherical clusters. In this paper we revisit graph-based clustering as a bottom-up alternative, where segment embeddings are connected by pairwise similarity and partitioned using the Leiden algorithm. We show that graph clustering substantially outperforms centre-based approaches (K-means, GMM, BIRCH) in both word- and syllable-level lexicon discovery across three languages, producing more Zipf-like distributions. Another bottom-up approach, agglomerative clustering with average linkage, also performs well, although it is computationally less efficient and allows for less control over the resulting distribution. Our work calls into question the dominance of centre-based clustering for term discovery, and promotes graph clustering as an attractive alternative.

信息检索

[IR-0] Generative Archetype-Grounded Item Representations for Sequential Recommendation WWW2026

链接: https://arxiv.org/abs/2606.11023
作者: Yifan Li,Jiahong Liu,Xinni Zhang,Hao Chen,Yankai Chen,Wenhao Yu,Jianting Chen,Irwin King
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted by WWW 2026 (Oral)

点击查看摘要

Abstract:Sequential recommendation aims to predict users’ next interaction with items by analyzing their historical behavior. However, the limited quality of item representations remains a critical bottleneck. While pre-trained large language models (LLMs) can provide rich semantic representations, existing approaches only rely on static encoding of fixed attributes, overlooking the crucial role of target audiences in defining item identity. Moreover, the semantic space struggles to reflect actual user behavior, resulting in a significant gap between semantic representations and behavioral patterns. To address these limitations, we propose GenAIR, a general framework that empowers sequential recommendation with Generative Archetype-grounded Item Representations. Specifically, we first leverage an LLM to analyze item metadata and infer textual description of the Archetype, which represents the conceptual profile of the item’s ideal target audience. We then extract the corresponding embeddings in a single forward pass. Further, to ground these generative archetypes in real-world behavior, we introduce a behavioral calibration objective, which explicitly incorporates behavioral signals from actual interactions. This objective adjusts the structure of the embedding space to reflect empirical patterns. GenAIR enables seamless integration with most existing models while maintaining high efficiency. Comprehensive experiments conducted on three real-world datasets demonstrate that GenAIR significantly improves the performance of various sequential recommendation models and consistently outperforms state-of-the-art baseline approaches. Implementation codes are available at this https URL.

[IR-1] From Prompt to Purchase: How AI Brand Recommendations Move Consumers on the Open Web

链接: https://arxiv.org/abs/2606.10907
作者: Michael Iannelli,Alan Ai
类目: Computers and Society (cs.CY); Information Retrieval (cs.IR)
备注: 10 pages, 4 figures, 9 tables

点击查看摘要

Abstract:When a conversational assistant recommends a brand to a user with no recent observed engagement, that user’s same-name Google search rises +4.3 percentage points (pp) [3.1, 5.5], visits to the brand’s own site +2.4 pp [1.4, 3.5], and brand-specific retailer-page visits +1.0 pp [0.3, 1.7] over matched backward placebos. Recovering that estimate is the work. The mention creates a brand exposure no web log attributes to the assistant, and the naive all-mention funnel that seems to measure it is confounded: many mentions are incidental references to brands the user already uses (“your Netflix download”), whose downstream visits are that existing customer’s own behavior and surface as a brand-specific pre-trend. We measure off-platform response on a panel that joins opt-in clickstream to the same users’ ChatGPT, Claude, and Gemini conversations, and isolate the effect with a pre-trend event study, a stance classifier, non-customer conditioning, and a within-response same-category control: incidental name-drops then move behavior far less (+1.8/+1.1/+0.3), and the named brand moves far more than unnamed same-category brands in the same response. The downstream path is mostly search-mediated and reaches both own sites and retailer pages, with a destination mix that tracks baseline brand-directed behavior rather than redirecting toward either. The design is observational and we do not observe transactions, so retail is purchase-adjacent. Standard referrer-based and last-click measurement miss this upstream exposure: assistants move observably-unengaged users into open-web brand navigation along a path attributed elsewhere.

[IR-2] Flash-GMM: A Memory-Efficient Kernel for Scalable Soft Clustering

链接: https://arxiv.org/abs/2606.10896
作者: Gal Bloch,Ariel Gera,Matan Orbach,Ohad Eytan,Assaf Toledo
类目: Machine Learning (cs.LG); Databases (cs.DB); Information Retrieval (cs.IR); Performance (cs.PF)
备注:

点击查看摘要

Abstract:We present \textbfFlash-GMM, a fused Triton kernel for efficient computation of Gaussian Mixture Models (GMMs) over large-scale data in a single GPU pass. By eliminating the need to materialize the full responsibility matrix in GPU memory, Flash-GMM achieves a \textbf20 \times speedup over existing implementations and enables training on datasets more than \textbf100 \times larger than previously feasible on one device. To demonstrate its impact, we integrate Flash-GMM into the IVF coarse quantizer for approximate nearest-neighbor (ANN) search. We show that soft GMM clustering is now a viable drop-in replacement for k -means, and that GMM responsibilities can be leveraged to assign border vectors to multiple clusters. Our approach reaches fixed recall targets with up to 1.7\times fewer distance computations, or equivalently, yields +2 – 12 recall@10 at matched computational cost. We release the kernel as an open-source project.

[IR-3] ConvMemory v2: A Recall-Preserving Top-10 Evidence Reranker for Conversational Memory Retrieval

链接: https://arxiv.org/abs/2606.10842
作者: Taiheng Pan
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 19 pages, 3 figures. Single-author technical report. Extends arXiv:2605.28062 (ConvMemory v1). Code and checkpoint: this http URL

点击查看摘要

Abstract:We describe ConvMemory v2, an opt-in token-evidence reranker that sits after the lightweight ConvMemory v1 reranker and reorders only v1’s protected top-10 candidate set. v2 is a fine-tuned ms-marco-MiniLM-L-6-v2 cross-encoder (22,713,601 parameters, measured from the released checkpoint) applied to the ten (query, memory) pairs that v1 has already selected; it does not change which ten memories are returned, so Recall@10 and Hit@10 are identical to v1 by construction, not by statistical coincidence. On the LoCoMo conversational memory benchmark (5 seeds, n = 4955 test rows), v2 raises FULL MRR from v1’s 0.5824 to 0.6560 (paired bootstrap +0.0734, 95% CI [+0.0645, +0.0827]) and H@1 from 0.4440 to 0.5474. v2 closes most but not all of the gap to a much more expensive full-pool cross-encoder reference (mxbai-rerank-large-v1 over the top-500, MRR 0.6688): on FULL MRR v2 sits 0.013 below mxbai_top500, but on two raw-dense-hard slices (where v1’s protected top-10 has higher recall than mxbai’s own top-10) v2 exceeds mxbai_top500. A four-arm load-bearing ablation shows candidate-specific memory text is the mechanism: removing, shuffling, or replacing it collapses MRR below raw dense retrieval. v2 is best understood as a standard recall-preserving cascade pattern with LoCoMo-specific fine-tuning, an explicit anti-shortcut inference contract, and disciplined load-bearing analysis; its advantage over mxbai is slice-specific rather than a general dominance claim. This report extends the v1 technical report (arXiv:2605.28062).

[IR-4] miniReranker: Efficient Multimodal Reranking through Visual Cache Reuse and Interaction Sparsity

链接: https://arxiv.org/abs/2606.10759
作者: Yingqi Fan,Xuan Lu,Anhao Zhao,Junlong Tong,Ping Nie,Kai Zou,Yunpu Ma,Wei Zhang,Xiaoyu Shen
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Multimodal large language models (MLLMs) have recently shown strong potential as point-wise rerankers by directly modeling query–document relevance through next-token prediction. However, point-wise reranking suffers from substantial repeated computation across query–document pairs, while the causal structure of transformers allows only prefix segments to be reused via pre-caching. To address the misalignment of existing query-first and document-first formats with both VQA-style prompting and computation-aware reuse, we propose a \textitvision-first formulation that improves both cache reuse efficiency and reranking performance. However, the remaining cost is still considerable and stems from three main sources: (1) \textitmodel depth, for which we reduce active parameters via early exit; (2) \textitcross-segment attention, which we restrict to a narrow interaction band across a few layers; and (3) \textitvisual tokens, where we reduce the number of tokens via embedder-guided pruning. Together, these designs form miniReranker, which reduces reranking runtime to 1% of the dense implementation under high-reuse settings for a single query, while preserving 96% of the dense model performance.

[IR-5] Effective Reinforcement Learning for Agent ic Search by Recycling Zero-Variance Queries During Training

链接: https://arxiv.org/abs/2606.10709
作者: João Coelho,João Magalhães,Bruno Martins,Chenyan Xiong
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The use of GRPO-style algorithms has become the standard strategy for training LLM search agents under outcome-only rewards. With these algorithms, a query contributes to parameter updates only when its rollout group mixes successes and failures; all-correct (too-easy) and all-incorrect (too-hard) groups are zero-variance and waste rollout cost. Existing approaches treat zero-variance as a static property and either discard or pre-filter such groups. We hypothesize and empirically validate that queries flip between zero-variance and signal-bearing states as the policy evolves during training. Building on this intuition, we propose query recycling, which returns zero-variance groups to a mutable pool for future resampling, so that the effective training distribution co-evolves with the policy. With the proposed technique, a 1.7B parameter model trained on synthetic data can reach 66.0 average Pass@1 accross seven multi-hop QA benchmarks, matching or surpassing systems with up to 7B parameters trained on benchmark-derived supervision. Analysis of recycling patterns shows that recycled queries supply roughly three quarters of the effective batch by the end of training, with contributions split between recovery from policy improvement and policy drift.

[IR-6] Beyond Patches: Superpixel Token-based Transformers for Attribute-Specific Fashion Retrieval WWW’26

链接: https://arxiv.org/abs/2606.10697
作者: Shuili Zhang,Hongzhang Mu,Wenyuan Zhang,Duohe Ma,Tingwen Liu
类目: Information Retrieval (cs.IR)
备注: 9 pages, 5 figures. Published in the Proceedings of the ACM Web Conference 2026 (WWW '26). Author version with minor corrections; results and conclusions unchanged

点击查看摘要

Abstract:Attribute-Specific Fashion Retrieval (ASFR) aims to improve fine-grained image retrieval by focusing on specific attributes. However, existing patch-based attention and Transformer methods often misalign with irregular attribute regions and are prone to background noise, limiting their ability to capture subtle, pixel-level microstructures. To tackle these challenges, we propose SuperFashion, the first ASFR framework that adopts superpixel tokens within a Transformer architecture. SuperFashion initially employs an attribute-guided attention mechanism to extract attribute-related features, which in turn guide the cropping of semantically meaningful image regions. Superpixel segmentation is then leveraged on these regions to generate compact, semantically coherent superpixel tokens. By incorporating modality-specific embeddings for both attribute and superpixel tokens, the superpixel token-based Transformer facilitates adaptive interaction and fusion, thereby enhancing attribute localization and discrimination. Extensive experiments on FashionAI, DARN, and DeepFashion demonstrate relative overall MAP improvements of 1.84%, 9.27%, and 9.35% over prior SOTA. SuperFashion offers a new solution for web-based image retrieval.

[IR-7] STORM: Stepwise Token Optimization with Reward-Guided Beam Search

链接: https://arxiv.org/abs/2606.10621
作者: Arthur Satouf,Giulio D’Erasmo,Yuxuan Zong,Habiboulaye Amadou Boubacar,Pablo Piantanida,Benjamin Piwowarski
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Modern retrieval increasingly relies on dense and learned-sparse neural models that are effective but require encoding the entire corpus into a specialized index, rebuilt whenever the model changes. Lexical retrievers like BM25 stay efficient and transparent on a standard inverted index that need not change as models evolve, but suffer from vocabulary mismatch. LLM query rewriting can help, yet prompted rewriters emit well-formed but retrieval-ineffective or harmful-terms, and training against a retrieval reward gives only delayed, sequence-level supervision that obscures which terms helped. We introduce STORM (Stepwise Token Optimization with Reward-guided beaM search), a self-supervised framework for lexical query expansion. STORM trains the rewriter through generation guided by retrieval metrics: at each step, candidate expansions are scored against the BM25 index and low-reward continuations pruned, turning the retrieval reward into a token-level signal that concentrates exploration on retrieval-effective vocabulary. Across TREC DL and BEIR, STORM lets 0.6B-8B backbones match or surpass competitive LLM rewriters while retrieving as fast as plain BM25; at 8B it rivals far larger proprietary rewriters. It further transfers zero-shot to 18 languages (MIRACL), beating dedicated multilingual dense retrievers on average, making STORM a competitive, infrastructure-light alternative to dense neural retrieval.

[IR-8] Selection Not Salience: The Shape and Limits of Personalization in Social Highlighting

链接: https://arxiv.org/abs/2606.10398
作者: Kazuki Nakayashiki,Keisuke Watanabe
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Social and Information Networks (cs.SI)
备注: 9 pages, 1 figure, 3 tables

点击查看摘要

Abstract:Does personalizing what a reader sees pay off, and where does it stop? Using a social web highlighter and a co-readership identity control (the same document highlighted by many users, which holds document and topic fixed and asks whether a person’s own history predicts their marks better than another reader’s does), we map the shape and limits of personalization across reading altitudes. At the document altitude we give the clean, leakage-free, identity-controlled measurement that prior next-document evaluations could only upper-bound: a person’s history identifies which documents in a co-reading neighborhood are theirs, with an own-versus-other gap of +0.169 against community negatives and +0.119 against topic-matched hard negatives (both highly significant); a content-based arm suggests the signal is not purely title-driven but is largely thematic. This is comparable to the span-level selection signal (+0.14) from our prior work: the selection signal is of comparable magnitude across altitudes (+0.12 to +0.17), most of it stable topic preference. At the sentence altitude, a two-stage personalized auto-highlight (an impersonal model proposes candidates, a personal model re-ranks them) does not improve on its impersonal baseline: two off-the-shelf zero-shot LLMs, including a frontier model, predict highlight locations worse than a lead baseline, and personal re-ranking is beaten by the salience order even on the highest-recall candidate pool, so the null is not merely a Stage-1 ceiling artifact. Measurable personalization appears primarily at the selection layer: modest (~+0.13), topic-dominated, with no reliable gain at the salience layer. We also surface a control-in-negatives bias that inflated our document gap to a spurious +0.227 until audited. Going beyond the shared salience layer may be better approached by aggregating individuals than by personalizing them harder.

[IR-9] SkillResolve-Bench: Measuring and Resolving Same-Capability Ambiguity in Agent Skill Retrieval

链接: https://arxiv.org/abs/2606.10388
作者: Jiandong Ding
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: Preprint

点击查看摘要

Abstract:Agent skill libraries are becoming routable software assets: a retrieved skill can contribute instructions, scripts, resource bindings, and execution assumptions to an agent. This makes skill retrieval more than broad relevance matching. A retriever can find the right capability family yet expose the wrong same-capability representative. We study this failure as same-capability execution-risk retrieval. Each query pairs a helpful skill with a query-specific risky sibling that shares the capability family but can lead execution toward a stale resource, missing precondition, or wrong procedure. We introduce SkillResolve-Bench 1.0, an auditable benchmark for this setting with 661 helpful/risky pairs, source-role and admission evidence, cue/leakage checks, query-disjoint splits, and a 7,982-candidate pool that includes 6,660 public SkillRet candidates. The benchmark reports helpful ranking together with harmful sibling rate (HSR@K), the top-K exposure of the risky sibling. We also provide SkillResolve, a reference method that resolves active candidate families, scores query-conditioned utility from confusable library negatives and contract-profile cues, and selects one representative from each family before the final top-K list. Under the released family relation, SkillResolve reaches Recall@3 0.766 and NDCG@3 0.699 while keeping HSR@3=0. It improves over SkillRouter by 0.112 Recall@3 and 0.165 NDCG@3 while reducing HSR@3 from 0.693 to 0. Without representative selection, HSR@3 rises to 0.236 under the same scorer, identifying within-family representative choice as the mechanism that turns capability retrieval into safer procedural exposure.

[IR-10] SIDInspector: A Mapping-First Diagnostic Resource for Semantic-ID Tokenizers CIKM2026

链接: https://arxiv.org/abs/2606.10375
作者: Jiandong Ding,Heng Chang,Huijie Qin,Tianying Liu
类目: Information Retrieval (cs.IR)
备注: Submitted to CIKM 2026 Resource Track

点击查看摘要

Abstract:Semantic-ID (\sid) tokenizers are increasingly reused as standalone artifacts in generative recommendation: an exported item-to-code mapping becomes the address space that a later sequence generator must use. These mappings rarely come with a common inspection interface, so coverage gaps, full-code aliasing, behaviorally weak prefixes, tail compression, and prefix fan-out are often found only after downstream training. We present \tool, a mapping-first diagnostic resource for \sid tokenizer artifacts. \tool defines a small adapter contract over item mappings, metadata, interactions, and optional generator traces; validates the contract; and reports mapping-level probes for utilization, aliasing, neighborhood alignment, popularity allocation, and structural cost, with hooks for temporal churn and generator traces. \tool reports inspectable artifact profiles before downstream leaderboard scores. The released resource covers four tokenizer artifact lines: a same-item GRID/RQ-KMeans-style and ReSID/GAOQ contrast on 23,742 Musical items, plus released LETTER and LC-Rec item-index artifacts. In the Musical contrast, the GRID-style feature-text export has 3,749 unique full codes and a 0.977 full-code aliasing rate, while ReSID/GAOQ is aliasing-free in its exported mapping. Yet the strongest prefix–co-occurrence alignment comes from a deterministic category-prefix control, not from either learned export row (0.447 versus 0.154 and 0.055–0.080), showing that addressability and behaviorally meaningful prefixes should be inspected separately. Cross-domain, fixed-reranker, and mechanism-probe checks support the same diagnostic direction: prefix alignment is a candidate-exposure signal, while final ranking quality remains a downstream model question. Comments: Submitted to CIKM 2026 Resource Track Subjects: Information Retrieval (cs.IR) Cite as: arXiv:2606.10375 [cs.IR] (or arXiv:2606.10375v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2606.10375 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[IR-11] Atomic Intent Reasoning : Bringing LLM Semantics to Industrial Cross-Domain Recommendations

链接: https://arxiv.org/abs/2606.10357
作者: Zhuohang Jiang,Yuxin Chen,Shijie Wang,Haohao Qu,Zhou Jindong,Wenqi Fan,Li Qing,Dongxu Liang,Jun Wang
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Cross-domain recommendation is a core problem in content-to-e-commerce platforms. Its objective is to leverage user interactions with content to infer potential purchasing intent on the e-commerce side, thereby enhancing conversion rates and commercial value. However, in real industrial scenarios, cross-domain recommendation faces multiple challenges: significant semantic gaps exist between different domains, and user cross-domain behavior sequences are often massive in scale and rich in noise. Although large language models (LLMs) possess powerful semantic understanding and reasoning capabilities, their millisecond-level inference latency makes direct application in online recommendation systems difficult. To address these issues, this paper introduces AIR (Atomic Intent Reasoning), an LLM-driven cross-domain recommendation framework designed for industrial-grade deployment. By migrating LLM inference to the offline phase and dynamically constructing user intent representations through efficient retrieval and composition during online operations, it achieves approximately 400* inference acceleration while maintaining semantic consistency. Experimental results across multiple public datasets demonstrate that our method achieves state-of-the-art performance in cross-domain recommendation tasks. Furthermore, large-scale online A/B testing conducted in Kuaishou E-commerce’s real-world business scenarios shows that our approach delivers stable and significant improvements across multiple core business metrics, including a +3.446% increase in GMV, fully validating its effectiveness and practical value in industrial-scale recommendation systems.

[IR-12] τ-Rec: A Verifiable Benchmark for Agent ic Recommender Systems

链接: https://arxiv.org/abs/2606.10156
作者: Bharath Sivaram Narasimhan,Karthik R Narasimhan
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As recommender systems transition toward agentic, multi-turn conversational interfaces, evaluation paradigms have struggled to keep pace. Current benchmarks often rely on “LLM-as-a-judge” evaluations, which introduce subjectivity, high costs and inconsistency. We present \tau -Rec, a benchmark for agentic recommender systems that replaces subjective evaluation with verifiable rewards and a reveal-tagged elicitation (RTE) mechanism that controls how task constraints surface during dialogue. By testing agents against structured catalog predicates and employing a pass^k reliability metric, \tau -Rec provides a systematic test for consistent reasoning. Our evaluation of nine configurations across five model families – GPT-5.4, Claude Sonnet 4.6, Gemini 2.5 Flash, DeepSeek V4 Flash, Qwen3-32B and GPT-5 mini – reveals a steep reliability cliff, where even the best model achieves only ~57% at pass^1 and ~38% at pass^4, highlighting a critical gap in current conversational agent deployment. All code and data are publicly available at this https URL.

[IR-13] MetaPlate: Counterfactual-Guided RAG -LLM Tool for Personalized Food Recommendation and Hyperglycemia Prevention

链接: https://arxiv.org/abs/2606.10120
作者: Asiful Arefeen,Carol Johnston,Hassan Ghasemzadeh
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Postprandial hyperglycemia is a key risk factor for metabolic disorders; however, existing dietary guidance is often static, impractical, and insufficiently personalized, providing recommendations that are difficult to follow or not impactful. While recent advances leverage continuous glucose monitoring (CGM) and machine learning to predict glycemic responses, these approaches are largely predictive and lack actionable guidance. Moreover, recommendation systems are often misaligned with user goals and require extensive input. We present MetaPlate, a counterfactual explanation (CF) guided, context-aware decision-support framework that generates personalized meal recommendations to mitigate postprandial glucose excursions in healthy adults. MetaPlate integrates multimodal data, including CGM readings, wearable-derived physiological signals, and user-provided meal inputs from 25 individuals to model pre-meal context. A machine learning model predicts glucose response, while a CF optimization module adjusts meal composition modifying macronutrient amounts to maintain glucose levels within a target range ( \leq 140 mg/dL). An LLM-based retrieval-augmented generation (RAG) layer enhances interpretability by producing human-readable recommendations using constrained search of the USDA food database. We evaluate MetaPlate via a structured expert-in-the-loop assessment with registered dietitians (RDs), comparing performance before and after prompt refinement. Results show improvements in meal realism, portion suitability, and recommendation likelihood, with expert feedback indicating a shift from clinically implausible outputs to actionable, contextually appropriate recommendations. Our findings emphasize the importance of domain knowledge and structured constraints in LLM-driven systems and highlight the potential of MetaPlate as a real-time personalized dietary decision-support tool.

[IR-14] Mult-DPO: Multinomial Direct Preference Optimization for Recommender Systems

链接: https://arxiv.org/abs/2606.10078
作者: Yaochen Zhu,Harald Steck,James McInerney,Aditya Sinha,Yinhan He,Nathan Kallus,Jundong Li
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Direct preference optimization (DPO) is a simple and effective alignment strategy for large language models (LLMs) based on pairwise preferences. In recommender systems, however, user feedback is rarely pairwise. For a given context, e.g., a user, a session, or a conversation, we typically observe set-wise preferences with multiple positive items, where every positive item should outrank every unobserved or explicitly negative item, with no prescribed order among the positives or the negatives themselves. A natural generalization is to use the Plackett-Luce (PL) reward model, which extends the Bradley-Terry reward model underlying vanilla DPO from pairwise preferences to full rankings of candidates. However, we show that adapting the PL model to set-wise preferences requires marginalizing over all positive orderings, where the resulting expression is combinatorial in complexity. To address this fundamental challenge, we propose Mult-DPO, a novel DPO objective with a tractable multinomial surrogate likelihood over set-wise preference events for the user-preference alignment of LLM-based recommender systems. The multinomial construction is not itself a ranking distribution, but it is defined on the same reward-induced weight space and admits a closed-form DPO-style objective, enabling direct alignment of LLMs with multiple candidates through a classification-style objective. In addition, we prove that the multinomial DPO loss is a tractable upper bound on the marginalized PL DPO loss when optimizing against the set-wise preference data. We further characterize the tightness of this bound in terms of the relative total weight of positives versus negatives, which provides insights into tightening the bound with richer or harder negatives. Finally, we extend Mult-DPO to the alignment of LLMs with multiple preference levels. Code is available at this https URL

[IR-15] Stability in Competitive Search with Results Diversification ICTIR2026

链接: https://arxiv.org/abs/2606.10053
作者: Itamar Reinman,Omer Madmon,Moshe Tennenholtz,Oren Kurland
类目: Computer Science and Game Theory (cs.GT); Information Retrieval (cs.IR)
备注: Accepted to ICTIR 2026

点击查看摘要

Abstract:In a competitive search setting, publishers strategically modify their documents in response to induced rankings so as to improve their future ranking. We present a novel game-theoretic analysis of a competitive search setting where search-results diversification is applied. Our analysis reveals an inherent tradeoff between corpus diversity and corpus stability, where the latter corresponds to an equilibrium in a game. We analyze two representative diversification methods and show that stability need not necessarily be reached, leaving the corpus to rapid changes due to ranking incentivized modifications of publishers. We then present a novel approach to devise diversification-based ranking functions that are guaranteed to lead to corpus stability.

[IR-16] Less Context More Accuracy: A Bi-Temporal Memory Engine for LLM Agents Where a Lean Retrieved Context Beats the Full History

链接: https://arxiv.org/abs/2606.09900
作者: Liuyin Wang
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: 14 pages, 4 figures, 3 tables. Code, reproducible harness, and raw per-question logs: this https URL

点击查看摘要

Abstract:Long-term memory is the missing layer for LLM agents: across sessions they forget, and the common workaround – replaying the whole history into the prompt – is expensive, slow, and, as distractors accumulate, less accurate. Most memory systems win on cost or latency but still lose to the full-context baseline on accuracy, and benchmark numbers are reported on inconsistent, non-reproducible harnesses, so one system appears at wildly different scores across sources. We present Engram, an open-source, dual-process memory engine on a bi-temporal data model. A fast write path appends lossless episodes with no LLM on the critical path; an asynchronous path extracts atomic (subject, predicate, object) facts, builds a bi-temporal knowledge graph, and resolves contradictions without an LLM call per fact – invalidating, never deleting, so every fact keeps provenance and a supersession chain. A hybrid read path fuses dense, lexical, graph, and recency/salience signals, applies a point-in-time (“as-of”) filter, and assembles a compact, provenance-tagged context. On the full 500-question LongMemEval_S, graded by the official category-specific judge, Engram’s lean configuration – answering from a ~9.6k-token retrieved slice, never the full history – scores 83.6% vs. 73.2% for full-context (+10.4 points, McNemar p 10^-6) at ~8x fewer tokens (9.6k vs. 79k), with 0/500 errored. The gain needs a hybrid read path: facts alone lose recall, while facts plus retrieved chunks recover detail. We also contribute a neutral, in-repo evaluation harness with the official judge baked in and the full-context baseline in every table, publish the raw per-question logs, and document the measurement-integrity pitfalls (truncation, home-grown judges, full-history leaks) that silently distort memory benchmarks. Every number ships with a command to reproduce it.

[IR-17] Representation Curriculum: Stagewise Training for Robust Ranking and Allocation

链接: https://arxiv.org/abs/2606.09891
作者: Ehsan Ebrahimzadeh,Sina Baharlouei,Abraham Bagherjeiran
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 12 pages, 5 figures

点击查看摘要

Abstract:Ranking in digital marketplaces is a dynamic exposure-allocation mechanism: displayed items shape discovery trajectories and success events logged by the platform to update future allocation policies. Modern ranking systems rely heavily on exposure-confounded signals (e.g. popularity estimates, CTR/CVR aggregates, and ID-based representation), because they are highly predictive under stationary demand. Yet this predictive power can become a learning shortcut: early access to exposure-dependent belief signals steers optimization toward over-reliance on them and away from exposure-independent merit signals (e.g., content-based competitiveness and semantic affinity). Consequently, the learned policy tends to entrench incumbents and degrade cold-start generalization and robustness under distribution shift. We propose Representation Curriculum (RC), a training-time intervention that temporally stages feature utilization. RC foregrounds content-based merit signals initially, then introduces exposure-dependent belief signals while anchoring the content pathway near the learned merit representation, curbing shortcut reliance on historical signals and mitigating gradient starvation on content signals. We formalize RC independently of task and hypothesis class and provide ranking-specific instantiations. In a Gaussian linear ridge setting, we derive closed-form solutions and sufficient conditions under which RC strictly reduces population risk on a cold-start target distribution, with a quantified Pareto tradeoff against source performance. Experiments on public learning-to-rank and recommendation benchmarks, and randomized online experiments in a large-scale e-commerce search system, show that RC measurably shifts reliance from historical belief signals toward content-based merit signals and yields consistent gains on cold populations with a controlled trade-off in head performance.

[IR-18] LLM -as-a-Discriminator: When Synthetic Tables Still Look Real

链接: https://arxiv.org/abs/2606.09865
作者: Manel Slokom,Malek Slokom,Thierno Kante
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Privacy and data sharing are often in tension. Many organizations use synthetic data to reduce privacy risk and still share useful data. For tabular data, auditing privacy remains hard. In many cases, even humans cannot easily tell if a table is real or synthetic. In this paper, we propose a method based on LLM discrimination. We ask an LLM to classify each table sample as REAL or SYNTHETIC. We test two settings: C1 with table only, and C2 with table plus distributional metadata. We use LLaMA as an open model and Gemini as a reference model. In our experiments, we run three synthesis models, CTGAN, TVAE, and Gaussian Copula, on two public datasets, UCI Adult and ACS Census. We collect 451 valid trials. Our results show clear differences between models. On Adult, LLaMA reaches DRS=0% in reported cells, while Gemini reaches DRS=100% for CTGAN and TVAE. On Census, LLaMA predicts SYNTHETIC for most samples, while Gemini stays high in C1 but drops for CTGAN and TVAE in C2. We also compare with a classifier two-sample test (C2ST) and record linkage as distributional baselines, and with a human pilot of 2 annotators and 240 trials. Our results show that LLM discrimination is a practical privacy audit signal when model choice, per provider reporting, and data encoding are handled with care. For reproducibility, code and experiment scripts are available at this https URL.

[IR-19] Agent ic Hybrid RAG for Evidence-Grounded Muon Collider Analysis

链接: https://arxiv.org/abs/2606.10381
作者: Ruobing Jiang,Dawei Fu,Cheng Jiang,Tianyi Yang,Zijian Wang,Youpeng Wu,Yong Ban,Yajun Mao,Qiang Li
类目: High Energy Physics - Experiment (hep-ex); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Instrumentation and Detectors (physics.ins-det)
备注: 22 pages, 5 figures, and 6 tables

点击查看摘要

Abstract:Muon collider research spans accelerator physics, detector instrumentation, and high-energy phenomenology, with relevant evidence scattered across a rapidly expanding and heterogeneous body of scientific literature. As high-energy physics (HEP) increasingly explores agent-assisted analysis workflows, efficiently locating, integrating, and verifying scientific evidence becomes an essential capability. While retrieval-augmented generation (RAG) offers a promising framework for scientific question answering, integrating agentic reasoning without compromising retrieval precision remains a key challenge. In this work, we present agentic hybrid RAG, an evidence-grounded RAG framework for muon collider research. The framework combines a hybrid retriever, integrating sparse lexical and dense semantic retrieval, with an agentic reasoning module for query decomposition, evidence expansion, and grounded answer generation. To enable systematic evaluation, we construct the first benchmark for retrieval-augmented scientific question answering in the muon collider domain, comprising a curated literature corpus together with dedicated retrieval and answer-generation benchmarks covering major detector and physics research topics. Extensive evaluation shows that hybrid retrieval provides the strongest retrieval backbone, while agentic reasoning is most effective for controlled evidence expansion and answer synthesis. Built on this principle, agentic hybrid RAG consistently outperforms representative retrieval and RAG baselines in retrieval effectiveness, answer quality, evidence coverage, and factual grounding. Together, the benchmark and framework provide a foundation for evidence-grounded scientific question answering and future HEP analysis agents operating over large-scale scientific literature.

人机交互

[HC-0] Data Journalist Agent : Transforming Data into Verifiable Multimodal Stories

链接: https://arxiv.org/abs/2606.11176
作者: Kevin Qinghong Lin,Batu EI,Yuhong Shi,Pan Lu,Philip Torr,James Zou
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: Project page: this https URL Github: this https URL

点击查看摘要

Abstract:Data tells stories that shape society; the data journalist’s job is to turn raw information into stories non-experts can trust. A high-quality news feature takes a newsroom team weeks: hunting for context, running statistics, choosing an angle, and designing visuals. Recent agents handle individual steps well: data-science agents close the analysis loop, while design agents synthesize beautiful websites. But can an agent serve as a data journalist end to end? We introduce Data Journalist Agent (Data2Story), a multi-agent framework that orchestrates specialized roles into a single virtual newsroom. Data2Story contributes two innovations. (i) Claims are evidence-grounded: an Inspector links every number, angle, and asset back to data, code, or an external reference. (ii) Articles are multimodally generative: rather than defaulting to plain text and static charts, Data2Story reasons about what readers will want to see, then deploys multimodal tools, such as interactive maps for geography and audio for music. We evaluate Data2Story on 18 articles, each paired with the originally published expert piece, along four axes: (a) human-agent angle coverage; (b) rubric evaluation with 53 participants across five dimensions; © computer-use agents as judges, a cost-saving proxy for how readers navigate interactive articles; and (d) verifiability, where a coding verifier re-executes statements against the data and checks claims against references. Data2Story produces competitive, evidence-traceable multimedia stories, with particular strength in transparency and auditability. Human articles retain an edge in editorial angle, creative design, and presentation. We position Data2Story as a collaborator for journalists, enabling more evidence-based, transparent, and verifiable reporting. Code and demos are available at this https URL.

[HC-1] Designed by Journalists but Is It for Readers? Rethinking AI Disclosures and Transparency in News

链接: https://arxiv.org/abs/2606.11116
作者: Pooja Prajod
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Accepted to CHIWORK Workshop (Interrogating GenAI Augmentation for CHIworkers: Strategies for Professional Autonomy and Accountability)

点击查看摘要

Abstract:As newsrooms integrate generative AI, journalists face a disclosure challenge: how to communicate AI involvement in ways that maintain reader trust. Current practice offers two approaches: brief one-line labels or detailed disclosures specifying human oversight, editorial accountability, and error reporting mechanisms. Neither achieves journalists’ goal of building trust through transparency. An existing controlled experiment with 34 news readers show that detailed disclosures trigger a \textittransparency dilemma, reducing trust rather than increasing it, and risk introducing dark patterns that readers scroll past with the illusion of transparency. One-line disclosures avoid this effect but can create an information gap, prompting readers to expend cognitive effort searching for signs of AI involvement that the disclosure indicates but does not explain. Yet readers are not rejecting transparency, they proposed disclosure designs centered on user agency: detail-on-demand interactions, proportional AI-ratio visualizations, outlet-level signals, and explicit “no AI” labels. I argue that this disconnect between what practitioners believe is responsible disclosure and what users actually need is a design problem for the HCI community.

[HC-2] Making Software Meaningful

链接: https://arxiv.org/abs/2606.11051
作者: Eagon Meng,Abutalib Namazov,Carmel Schare,Alcino Cunha,Daniel Jackson
类目: oftware Engineering (cs.SE); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Adopting a single measure can improve the usability, modularity and accountability of software: a commitment to explicit meaning. This entails constructing and agreeing upon a representation of the behavior of the software, as observed in the domain of application. The phenomena comprising this behavior become a vocabulary that grounds all discourse about the software, among all stakeholders, and for all artifacts and activities. These phenomena are individuals; actions they participate in; and facts that result from actions. They can be organized, by partitioning the set of actions, into concepts, offering larger units of meaning. Examples of exploiting meaning are given in three areas: designing for usability (by aligning user and designer on a single shared meaning); generating modular code with LLMs (by mapping units of meaning to units of code, achieving not only modularity but also legibility); and making agents accountable (by having them adhere to a code of conduct that defines their intended behavior).

[HC-3] A Case Study Reexamining the Cold-Start Problem in Knowledge Tracing Models and Implications for SafeInsights an Education Research Infrastructure

链接: https://arxiv.org/abs/2606.11004
作者: Jiayi Zhang,Ryan S. Baker,Debshila Basu Mallick,Cristina Heffernan,Neil Heffernan
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Knowledge tracing (KT) models are widely used to predict students’ evolving knowledge states from their learning history. However, many KT models are evaluated using specific datasets, platforms, and learning contexts, raising questions about whether reported model performance replicates and generalizes across newer datasets that vary in context. This paper replicates and extends Zhang et al. (2021), which examined the cold-start problem in KT models and found that deep-learning-based KT models performed better, partly because of stronger predictions when students began practicing a skill. Using a more recent ASSISTments dataset, FoundationalASSIST, we replicate the previous analysis by evaluating model performance across opportunities to practice and extend the analysis by examining performance across problem types, including fill-in-the-blank, multiple-choice select-one, multiple-choice select-all, and order/sort problems. Results show that KT model performance varies across both student practice trajectories and problem types. Beyond the empirical replication, this study identifies practical challenges in reproducing educational data mining studies and serves as a proof of concept, showing how privacy-preserving research infrastructures such as SafeInsights can be leveraged to facilitate educational research and support replication analyses.

[HC-4] From Perception to Action: Can UI Interventions Foster Sustainable LLM Chatbot

链接: https://arxiv.org/abs/2606.10861
作者: Nitish Patkar,Pooja Rani,Jack Glässer,Simon Lüscher,Martin Kropp
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:LLM-powered chatbots are increasingly embedded in everyday workflows, raising sustainability concerns due to their energy use. Most mitigation strategies emphasize model or infrastructure efficiency, while the user-interface (UI) layer remains underexplored despite its potential to shape interaction behavior. We investigate whether sustainability-oriented UI interventions can increase users’ energy awareness and encourage more energy-responsible chatbot use without reducing usability. We first conducted a baseline survey with 77 participants to assess awareness and receptiveness to intervention concepts. Guided by prior work on persuasive technology and choice architecture, we implemented a web-based chatbot prototype with a three-mode switch (Energy-efficient, Balanced, Performance), per-response energy feedback, pre-send energy estimates, a usage metrics dashboard, and energy analogies. We then evaluated the prototype in a five-day field study with 11 participants. In the baseline survey, 94.8% of respondents reported at least some awareness of AI energy use, yet 88.3% misestimated actual consumption. Although concern about environmental impact was high, only 39.0% indicated willingness to accept a performance trade-off for lower energy use. In the field study, Energy-efficient mode accounted for 55.8% of logged prompts, while 90.9% self-reported actively choosing Eco-mode when high accuracy was not required. Participants did not reduce prompt length, suggesting mode switching as the primary behavioral mechanism. Sustainability-oriented UI interventions can improve awareness and support more energy-responsible interaction patterns in LLM chatbots. These effects are best interpreted as behavioral and model-based estimates that complement backend efficiency work, and the provided prototype and replication package support further research on energy-aware conversational AI design.

[HC-5] Being and Time in XR: Other-Presentness Beyond Co-Presence

链接: https://arxiv.org/abs/2606.10786
作者: Koichi Toida
类目: Human-Computer Interaction (cs.HC)
备注: 5 pages, 3 figures

点击查看摘要

Abstract:Research in XR (Extended Reality) has conventionally centred upon concepts such as Presence, Embodiment, Social Presence, and Co-presence. Within these traditions, bodily action, sensory contingencies, synchronous interaction, and possibilities for action have generally been regarded as constitutive conditions for the experience of “being there” and of being with others. XR environments, however, permit the partial separation of conditions that ordinarily co-vary in everyday experience. Bodily co-presence, temporal simultaneity, spatial configuration, and social interaction need not remain inseparable. This paper approaches this possibility as a problem of other-presentness. Other-presentness refers to the conditions under which another individual is experienced as existing “here and now”. The contribution of this paper does not lie in arguing that asynchronous others can evoke social responses; such observations have already been addressed within parasocial interaction and social presence research. Rather, the novelty lies in theorising XR as a technological condition capable of separating and operationalising the constitutive elements of other-presentness as design variables. Reconsidering Bodyless Presence as a methodological precedent and drawing upon experimental findings from Immersive Video research, this paper formulates Bodyless Presentness as a condition in which another individual continues to be experienced as presently existing despite attenuated bodily co-presence and weakened real-time simultaneity.

[HC-6] Deploying Speech-Driven 3D Facial Animation in Unreal Engine for Production-Ready Digital Humans

链接: https://arxiv.org/abs/2606.10753
作者: Alessandro Busacchi,Kazi Injamamul Haque,Zerrin Yumak
类目: Graphics (cs.GR); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
备注: 11 pages

点击查看摘要

Abstract:Speech-driven 3D facial animation research has shown promising results, but most methods rely on representations that are not compatible with production pipelines. In this work, we present a deployable system that bridges this gap by enabling speech-driven 3D facial animation directly in Unreal Engine (UE) using ARKit-compatible representations. We construct 3DMEAD-ARKit dataset by converting the MEAD corpus into blendshape sequences using MediaPipe, and retrain FaceDiffuser and ProbTalk3D-X to generate stochastic and emotion controllable animations. We further develop a modular UE plugin with a Python backend that supports model selection, and parameter control. We compare the results to two existing commercial tools: Epic Games’ MetaHuman speech-driven animator and Nvidia Audio2Face with a perceptual user study. The results highlight the importance of comparisons among academic and commercial pipelines. We recommend watching the supplementary video. We also plan to do live demonstrations of our work at Siggraph 2026 conference.

[HC-7] Profy: Interpretable Visualization of Expertise-Dependent Motor Skills Toward Supporting Piano Practice

链接: https://arxiv.org/abs/2606.10627
作者: Kazuki Kawamura,Fujiki Nakamura,Hayato Nishioka,Momoko Shioki,Shinichi Furuya,Jun Rekimoto
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Sound (cs.SD)
备注: Designing Interactive Systems Conference (DIS '26), June 13-17, 2026, Singapore, Singapore

点击查看摘要

Abstract:The quality of piano performance depends on nuanced timing, articulation, and dynamic control, but practice feedback is often summary-based and hard to act on. We introduce Profy, a weakly supervised system that learns from take-level labels derived from aggregated listener ratings (expert-labeled vs. amateur-labeled) to produce time-aligned highlights for review during piano practice. We collected synchronized 1 kHz key-motion and audio from 73 pianists and used 1,083 valid takes for modeling and evaluation. The model outputs clip-level predictions together with evidence scores on a shared resampled model time base for visualization. On 20 amateur clips from short technique studies annotated by 21 expert pianists, the displayed highlight score aligns with passages that expert pianists marked for review despite training without localized labels (Pearson r=0.61, ROC-AUC 0.75). Rather than summarizing a take with a single global score, Profy helps learners decide where to inspect next by supporting scrubbing, looping, and focused replay of time-localized passages associated with expert-amateur differences.

[HC-8] Profiling cognitive offloading in LLM -mediated synthesis writing: Volume vs. content

链接: https://arxiv.org/abs/2606.10434
作者: Oleksandra Poquet,Mani Shankar Nanduri,Maria Ximena Salinas Loyer,Matthias Stadler,Michael Sailer,Jelena Jovanovic
类目: Human-Computer Interaction (cs.HC); Emerging Technologies (cs.ET)
备注: Accepted to the Proceedings of the European Conference for Tecnology-Enhanced Learning’ 2026

点击查看摘要

Abstract:This study compares two approaches to profiling how learners offload cognitive activity to LLMs during a synthesis writing task. Drawing on Salomon’s distributed cognition and the Kintsch and van Dijk model of text comprehension, the study operationalises offloading to an LLM in two ways: as a volume of LLM use and as content of what is offloaded, both along with prior knowledge. Data from 97 university students interacting with a general-purpose LLM via a custom interface were analysed using k-means clustering. To capture the content of offloading, their prompts were interpreted as to who performs the activity (active or passive) and at what level of comprehension (local or global). Volume-based profiling (k=4) differentiated learners primarily by prior knowledge, with volume negatively associated with essay authorship. Content-based profiling (k=5) revealed qualitatively distinct patterns of offloading, from vocabulary clarification to active direction of structuring and generation to passive delegation of comprehension at both levels. These patterns reflect different fragmentation of the cognitive process, with differences in learning strategies, behavioural markers, and essay authorship. Combining volume and content of offloading could improve future analyses on how LLM use redistributes cognitive activity and its effects on learners.

[HC-9] Design and Implementation of a Real-time Multi-site Immersive Learning System Using Photon Fusion

链接: https://arxiv.org/abs/2606.10325
作者: Iwai Wataru,Duc V. Nguyen
类目: Multimedia (cs.MM); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:In this paper, we develop a Virtual Reality-based immersive learning environment that allows teachers to conduct a lesson in a virtual space using Photon Fusion. The proposed system allows teachers and students to be present in the same virtual space regardless of their actual physical locations. The teachers can verbally communicate with students in real-time, interacting with 3D learning materials. By adopting Photon Fusion, the system achieves stable real-time communication and synchronization among multiple players. Evaluation results demonstrate that the proposed system provides stable communication performance, good usability, and minimal VR sickness, confirming its effectiveness as an immersive learning environment.

[HC-10] Creativity in the BioFoundry: Supporting scientific creativity in the age of automation

链接: https://arxiv.org/abs/2606.10182
作者: Mingyan Claire Tian,Sarah Sterman
类目: Human-Computer Interaction (cs.HC)
备注: 13 pages, 6 figures, 2 tables, ACM Creativity and Cognition Conference 2026

点击查看摘要

Abstract:Biofoundries automate biological experimentation at unprecedented scale, promising speed, reproducibility, and access. Yet automation also reshapes how scientists experience experimentation and creativity. Through in-depth interviews with nine scientists and experts across academia and industry (including biofoundry developers, automation engineers, and end-users), we examine how scientific creativity is enacted under automation. Biofoundries displace sensory cues, redistribute responsibility between humans and machines, and transform troubleshooting from an embodied, local practice into a predictive, social, and interpretive one. Rather than framing biofoundries as automation factories, we argue that they should be understood as Creativity Support Tools, whose design directly shapes how researchers notice breakdowns, exercise judgment, learn from failure, and progress through success. By connecting biofoundry practice with prior HCI work on automation, debugging, and distributed creativity, this paper demonstrates biofoundries as a distinctive and timely site for creativity research in science.

[HC-11] Flow Control: Steering Vision-Language-Action Models with Simple Real-Time Inputs

链接: https://arxiv.org/abs/2606.10180
作者: Jonathan C. Kao,Jason Chan,Andy Wang
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 10 pages, 5 figures

点击查看摘要

Abstract:We introduce flow control of vision-language-action (VLA) models, a simple and effective way to steer VLA actions in real-time through generic inputs, such as a keyboard. This method can be used out-of-the-box and does not require retraining or fine-tuning VLAs. It enables relatively crude user inputs to steer a VLA to align with user intent. The VLA transforms these inputs into action samples drawn from the VLA expert action distribution learned during training, so that the generated actions are high quality (conformity to the action expert distribution) and high fidelity (reflecting the user’s intent). We demonstrate that flow control has many desirable properties: (1) flow control accurately and responsively steers robot actions with user inputs, (2) it is robust to suboptimal user inputs, (3) it enables users to steer VLAs to achieve significantly higher success rates and faster task completion, and (4) fine-tuning a VLA on flow control trajectories improves the autonomous policy. Together, these results provide a simple and intuitive way for users to help steer VLA actions, increasing task performance.

[HC-12] VArify: A Visual Analytics System for Verifying Knowledge Enhanced Large Language Model Responses in Food Science

链接: https://arxiv.org/abs/2606.10177
作者: Sam Yu-Te Lee,Yan To Linus Lam,Manami Nakagawa,Kwan-Liu Ma
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Graph Retrieval-Augmented Generation (GraphRAG) enables Large Language Models (LLMs) to leverage structured, domain-specific knowledge graph databases for factually grounded responses. However, the retrieval of irrelevant or conflicting data can still result in erroneous responses. In knowledge-intensive and evidence-focused domains, human verification of the supporting evidence for an LLM response is still necessary. We conducted a formative pilot study to characterize the challenges of verifying complex, multi-layered data retrieved by GraphRAG systems. Based on these insights, we present VArify, a visual analytics system that leverages a file directory-inspired tree visualization to support simultaneous exploration of inter-group relationships and intra-group hierarchies within the retrieved evidence. We evaluate VArify through a user study with six food science experts and students. Our results indicate that the system effectively helps users distinguish between an LLM’s internal parametric knowledge and external graph-sourced evidence. Furthermore, the visualization helped experts identify inaccuracies within the underlying knowledge graph itself, leading to more calibrated trust in the model’s output. We conclude by discussing opportunities to leverage visualizations to further support verification regarding unknown unknowns, personalization, and limitations of knowledge graphs.

[HC-13] “Where is this coming from?” Uncovering Trustworthiness Ideals in AI-powered Peripartum Information Seeking

链接: https://arxiv.org/abs/2606.10158
作者: Vaibhav Balloli,Julia Erickson,Xinyi Li,Erin MacMurray van Liemt,Alex Peahl,Elizabeth Bondi-Kelly
类目: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: Accepted at ACM Conference on Fairness, Accountability, and Transparency (FAccT) 2026

点击查看摘要

Abstract:AI-powered tools increasingly promise to fill information gaps in health, especially in domains like maternal and reproductive health that demand timely, accurate, and actionable information. This is extremely important, as the United States leads peer nations in preventable deaths, with stark racial disparities. However, current AI and NLP-powered systems aim to improve access to vetted maternal health information by routing user queries to a factual response while under-specifying the socio-technical governance structures that shape trust, use, and harm in practice. We report findings from four synchronous focus groups ( n=24 ) with three stakeholder groups central to peripartum information support: birthing people, clinicians, and health workers (e.g., doulas, social workers, community health workers) exploring topics around information seeking, experience with current clinical infrastructure, misinformation, and an AI-enabled factual answering tool design probe. Our inductive analysis surfaces a central finding: in high-stakes health contexts shaped by historical inequities, trustworthiness must be inspectable and not asserted. While stakeholders diverge on what makes information credible, they converge on the need for transparency, recourse, and ecosystem complementarity. Based on the discussions, we identify four themes and governance requirements: (1) support for social and identity-based sensemaking, (2) pluralistic verification practices, (3) inspectable governance with recourse mechanisms, and (4) ecosystem-aware integration that avoids shifting burden. Building on these findings, we propose design artifacts that are mistrust-aware and promote principled governance mechanisms for transparent, pluralistic AI systems. Finally, we discuss the implications of our findings for expanding human-AI evaluations and improving the transparency of deployed AI systems.

[HC-14] LLM -Based Visualization Evaluation: How Well Do Literacy-Stratified Personas Approximate Human Judgments?

链接: https://arxiv.org/abs/2606.10095
作者: Swaroop Panda
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Evaluating data visualizations across diverse user populations continues to pose a significant methodological challenge within visualization research. We propose a theorized evaluation framework, Literacy-Stratified LLM Evaluation (LSLE), which formalizes a two-stage process. The first stage involves constructing visualization literacy personas grounded in established frameworks such as VLAT. The second stage directs large language models to adopt these personas as simulated evaluators of visualization artifacts. We ground the framework in an epistemic analysis that characterizes the conditions under which LLM persona simulation may produce plausible proxies for literacy-dependent perception - and, critically, the conditions under which it does not - engaging directly with emerging critiques of LLM-as-participant paradigms from the VIS and HCI literature. To empirically test LSLE’s boundaries, we benchmark its outputs against openly available human response data from the validation studies of two established instruments: VLAT and BeauVIS. Using the same stimuli and assessment items as the original human studies, we compare LSLE persona responses across literacy strata against published human distributions and against default (non-persona) LLM baselines. Our analysis reveals where literacy-stratified personas converge with and diverge from human response patterns - identifying task types and evaluation dimensions where persona simulation approximates human variability and where it systematically fails. We discuss implications for the responsible use of LLM-assisted evaluation as a complement to empirical methods, and propose boundary conditions for when LSLE may be most appropriate: early-stage design exploration and rapid comparative screening rather than summative evaluation.

[HC-15] he Empirically Grounded Adaptive Virtual Patient for Psychotherapy Training: Disclosure That Responds to Therapist Micro-Skills

链接: https://arxiv.org/abs/2606.10051
作者: Angela Chen,Siwei Jin,Catherine Bao,Canwen Wang,Robert E. Kraut,Tongshuang Wu,Haiyi Zhu
类目: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Simulated patients offer a scalable way to train psychotherapy micro-skills such as empathic responding and exploratory probing, but current systems either follow fixed scripts or rely on LLMs that drift unpredictably over long sessions. We present the Adaptive Virtual Patient (AVP), which adapts its disclosure behavior – from guarded, through moderate openness, to full disclosure – in response to trainee skill. The AVP is grounded in a structural equation model fit to nearly 2,000 hours of real-world psychotherapy transcripts, which quantifies how therapist empathy and exploration shift a patient’s openness over time. An LLM generates the AVP’s utterances conditioned on a disclosure level that the dynamics module updates each turn. In an evaluation with 20 clinicians and trainees over 80 sessions (1,033 turns), the AVP’s disclosure rises in response to therapist empathy and exploration, while a prompt-only baseline stays flat; ablations confirm that the empirically motivated parameterization outperforms alternatives, with exploration carrying most of the adaptive signal.

[HC-16] On the Controllability-Fidelity Frontier in Diffusion Editing

链接: https://arxiv.org/abs/2606.09901
作者: Yi Hu,Leying Yi,Emily Davis,Finn Carter
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Multimedia (cs.MM)
备注: Preprint

点击查看摘要

Abstract:Diffusion-based generative models enable powerful image editing capabilities, but achieving precise control while maintaining fidelity and safety remains challenging. We present a comprehensive theoretical and empirical study of controllable diffusion-based image editing, analyzing the trade-offs between adherence to user intent, preservation of non-target content, and output quality. Our work spans text- and mask-guided edits, point/drag manipulation, and inversion-based pipelines. We derive mathematical formulations of editing objectives and analyze dynamics of noise injection, score guidance, and inversion error. We provide theoretical bounds on reconstruction error, stability under repeated edits, and locality of changes. We propose algorithmic frameworks (with pseudocode) for mask-localized and instruction-guided editing, and present extensive experiments comparing state-of-the-art methods (e.g.\ TF-ICON \citelu2023tficone, DragFlow \citezhou2025dragflow, InstructPix2Pix \citebrooks2023instructpix2pix, UltraEdit \citezhao2024ultraedit) on multiple tasks and metrics (FID, identity similarity, CLIP alignment, artifact scores, etc). Our results reveal key failure modes, such as identity drift, prompt sensitivity, and compositional errors. We also discuss ethical considerations in image editing, including misuse risks, bias, consent, and concept erasure techniques (e.g.\ MACE \citelu2024mace, ANT \citeli2025ant, EraseAnything \citegao2024eraseanything) as safeguards. We conclude with best practices and future directions for responsible, high-fidelity diffusion-based editing.

[HC-17] ECHO: Explainable Co-editing with Human-in-the-loop Operations for Presentation Refinement

链接: https://arxiv.org/abs/2606.09851
作者: Yu Fu,Yongqi Kang,Yujia Zhou,Yong Zhao
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Authoring and refining presentation slides is a highly time-consuming core task in academic and business domains. While generative AI tools have lowered the barrier for creating initial drafts, their “black-box, one-way generation” paradigm severely deprives users of fine-grained control. Through a formative study (N=10), we identified “trial-and-error anxiety” and “inconsistent cross-page formatting” as primary bottlenecks in human-AI co-creation. Consequently, we present ECHO, an interactive system based on multimodal intent grounding and explainable operation plans. ECHO enables precise local edits via a “natural language + visual selection” paradigm, utilizing a decoupled “Plan-Confirm-Execute” loop and dynamic memory mechanisms to transform implicit AI intents into highly controllable layout co-creation. To systematically evaluate document refinement, we propose the CoEdit-Eval framework. Objective evaluations across multiple foundation models (e.g., GPT-5, GLM-4.7) demonstrate that while baselines uniformly fail in intent mapping (0% accuracy) and spatial grounding (0% Hit@1), the ECHO architecture boosts Target Hit@1 to 55%–85% depending on the base model. Furthermore, integrating Vision-Language Models (VLMs) effectively resolves spatial ambiguities – achieving significant win rates in LLM blind evaluations – and our Undo mechanism guarantees 100% physical file consistency (MD5 hash). Finally, a controlled study with 14 participants shows that ECHO significantly reduces cognitive workload (NASA-TLX scores dropped by 20.8%, from 82.6 to 65.4) and reveals the dynamic evolution of human control allocation across different cognitive tasks. Subjects: Human-Computer Interaction (cs.HC) Cite as: arXiv:2606.09851 [cs.HC] (or arXiv:2606.09851v1 [cs.HC] for this version) https://doi.org/10.48550/arXiv.2606.09851 Focus to learn more arXiv-issued DOI via DataCite

[HC-18] Sketch-to-Layout: A Human-Centric Computational Agent for Constraint-Aware Synthesis of Modular Photobioreactors

链接: https://arxiv.org/abs/2606.09849
作者: Xiujin Liu,Shuqi Li,Yuxin Lin
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 6 figures

点击查看摘要

Abstract:Building-integrated photobioreactors (PBRs) offer a pathway for carbon-neutral architecture, yet deployment is hindered by configuration complexity and biological maintenance. This paper presents a modular PBR facade system powered by a computational framework reconciling design intent with physical validity. We introduce ‘carbon-neutralization bricks’ featuring integrated vessel-and-conduit geometry; monolithic fluid channels enable ‘plug-and-play’ assembly. To navigate the combinatorial complexity of 14 modular geometries, we develop a Computational Sketch-to-Layout Agent that formulates layout synthesis as a Constraint Satisfaction Problem (CSP). Using the CP-SAT engine, the agent treats sparse user sketches as soft priors while enforcing hard constraints like port alignment and global connectivity. This allows non-experts to synthesize fabrication-ready configurations in near real-time. Furthermore, to facilitate autonomous maintenance, we propose a weakly supervised algae health monitoring pipeline. By employing a hybrid CNN-attention backbone and a temporal ranking loss, the system quantifies biological vitality from photographs without absolute ground-truth labels. Experiments demonstrate the CSP solver achieves a 95.5% success rate on grid scales up to 15 x 15. Qualitative evaluations confirm the framework preserves design semantics while ensuring operational integrity. Long-term tests show the vision module produces health trajectories aligned with 14-day biological cycles, suggesting that integrating interactive synthesis with low-cost computer vision can democratize scalable carbon capture systems. Comments: 13 pages, 6 figures Subjects: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2606.09849 [cs.HC] (or arXiv:2606.09849v1 [cs.HC] for this version) https://doi.org/10.48550/arXiv.2606.09849 Focus to learn more arXiv-issued DOI via DataCite

[HC-19] Human-AI Coordination Zones: A Framework for Designing Human-in-the-Loop Experiences with Agent ic AI

链接: https://arxiv.org/abs/2606.09848
作者: James Pierce,Vaiva Kalnikaitė,Siddharth Gupta,Brian Granger
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:As generative and agentic AI becomes embedded in everyday products, practitioners face a persistent challenge: how to design human-AI coordination – the ongoing mutual adjustment between users and AI systems as mediate through interfaces-that supports usability, trust, and safety. Existing resources offer high-level principles (“be transparent,” “maintain user control”) or low-level UI patterns, but there is a lack of mid-level design knowledge bridging the two. Through landscape and artifact analysis of 60 commercial AI applications, we introduce a framework defining human-AI coordination as the interplay of three dimensions: salience (how prominently AI is presented), involvement (what users can do to engage AI), and activity (what AI actually does). We contribute mid-level tools including coordination zones (done-for-me, done-under-me, done-with-me, done-without-me), an input taxonomy (prompted, sparked, inferred, layered), coordination curves for mapping user journeys, and design patterns demonstrating the generative capacity of the framework. The framework can be applied generatively to design experiences, analytically to evaluate existing ones, and communicatively to articulate ideas across stakeholders.

[HC-20] Designing Vibes in a Science Museum: from @Science to @hugging_face

链接: https://arxiv.org/abs/2606.09847
作者: Derya Akbaba,Daniela Paz Moyano Dávila,Måns Gezelius,Yin He,Miriah Meyer
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:While feminist and critical data theories have long critiqued the use of data to uphold a positivist-informed view about science, few examples offer alternative methods to display scientific constructs. In response, we present Data and Me: an exhibit informed by feminist and critical data theories, which we designed and launched at a local science museum. Data and Me introduces museum visitors to data using a [AT]:hugging_face: vibe – a vibe that signals that data can be [hashtag]slow, [hashtag]handmade, and [hashtag]personal. We designed this vibe to be noticeably different than the [AT]Science vibe in the rest of the museum. Throughout our design process, we adapted visualization vibes as an analytic and generative tool in the context of a science museum. We present four design choices that enable the design of a vibe: visual, topical, material, and crediting. We discuss how our exhibit aligns with ongoing discussions about alternative research outcomes and calls for plurality in HCI.

[HC-21] CANVAS: Captioning Art with Narrative Visual-Audio AI Systems

链接: https://arxiv.org/abs/2606.09846
作者: Vignesh Nagarajan
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 22 pages, 16 figures, 3 tables, 21 references

点击查看摘要

Abstract:Visual art remains largely inaccessible to blind and low-vision (BLV) audiences due to brief or absent alt-text, which rarely conveys the sensory, spatial, or emotional qualities of an artwork. This study presents an automated workflow that generates multi-sensory art descriptions and synchronized audio narration using large language models and text-to-speech services. The system, orchestrated through Zapier, converts uploaded images into rich narrative captions without human intervention, enabling rapid, scalable production of accessible media. Quantitative evaluation across 50 artworks shows that AI-generated descriptions contain significantly higher lexical diversity, adjective density, and narrative detail than baseline captions, while maintaining comparable readability levels. Statistical tests (t-tests, ANOVA) confirm meaningful differences in richness and length, and the full pipeline produces text-plus-audio outputs in under 20 seconds per image at a cost below 0.05. Findings demonstrate that automated captioning can bridge gaps in museum and digital-collection accessibility, with implications for broader public engagement. Future work can incorporate user studies with BLV participants to assess comprehension, preference, and optimal levels of interpretive language.

[HC-22] utor Not Solver: Designing a Guardrailed AI Assistant for Learning in Higher Education: A Design Case of PeteChat

链接: https://arxiv.org/abs/2606.09845
作者: Belle Li,Lily Tan,Wei Zakharov,Qiang Qiu,Colby Ben Acton
类目: Human-Computer Interaction (cs.HC); Emerging Technologies (cs.ET)
备注: Preprint. Includes supplementary appendices, interface figures, and baseline-analysis tables

点击查看摘要

Abstract:Generative AI tutors hold significant promise for higher education, yet designing systems that scaffold learning without undermining academic integrity remains an open design challenge. This paper presents PeteChat, a course-aligned AI tutor developed and deployed at Purdue University, documented through the lens of design-based research (DBR). Drawing on literature-informed design inputs, a pre-deployment baseline analysis of authentic student-system interactions, and formative expert evaluation with teaching assistants and UX/developer stakeholders, we report eight transferable design principles for assessment-aware AI tutors: from homework guardrails and debugging scaffolds to self-regulated learning support and instructor-facing customization tools. The system is built on a locally hosted Llama-3 model enhanced with retrieval-augmented generation (RAG) grounded in course-specific materials. Rather than reporting controlled experimental outcomes, this design case foregrounds the situated design reasoning, iterative refinement, and principled decision-making that shaped PeteChat across multiple development phases. The resulting principles and methodological approach offer actionable guidance for institutions seeking to deploy responsible, integrity-preserving AI tutors at scale.

[HC-23] he Interlocutor Effect: Why LLM s Leak More Personal Data to Agents Than Humans

链接: https://arxiv.org/abs/2606.09844
作者: Faouzi El Yagoubi,Godwin Badu-Marfo,Ranwa Al Mallah
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) alter their privacy behavior based on the perceived identity of their interlocutor. While safety mechanisms typically prevent LLMs from releasing Personally Identifiable Information (PII) to human users, these models tend to reveal more sensitive data when addressing another AI agent. We refer to this as the \textbfInterlocutor Effect. Through an ablation study, we find evidence that the technical nature of the recipient contributes to this effect, thereby diminishing the model’s caution regarding privacy. To explore this further, we introduce the Attention Suppression Hypothesis, which posits that safety-aligned attention heads become inactive during interactions with agents. We assess this quantitatively by comparing human-directed and agent-directed prompts in 222 sensitive scenarios. Our findings, drawn from 3,464 interactions, indicate that portraying the recipient as an AI agent elevates PII leakage by up to 23 percentage points. Initial experiments on Llama-3.1-8B-Instruct corroborate this: deactivating one safety head induces leakage, whereas reactivating it reinstates privacy safeguards. We consider the implications for developing secure multi-agent systems. Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI) Cite as: arXiv:2606.09844 [cs.HC] (or arXiv:2606.09844v1 [cs.HC] for this version) https://doi.org/10.48550/arXiv.2606.09844 Focus to learn more arXiv-issued DOI via DataCite

[HC-24] An LLM -Native Psychometric Instrument Does Not Predict LLM Behavior: Evidence Across 25 Models

链接: https://arxiv.org/abs/2606.09843
作者: Juan Manuel Contreras
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) produce stable self-reports on personality inventories, but these self-reports do not predict observed behavior. Whether this gap reflects a mismatch between LLMs and human trait constructs, or a deeper property of LLM self-report itself, has been unresolved. We constructed the first psychometric instrument whose constructs are derived bottom-up from LLM behavioral affordances via exploratory factor analysis (EFA). We administered 300 items (240 direct Likert + 60 scenario-based) spanning 12 candidate behavioral dimensions to 25 LLMs across 17 model families, each item administered 30 times. EFA yielded a 5-factor structure – Responsiveness, Deference, Boldness, Guardedness, and Verbosity – with excellent split-half replicability (all Tucker \phi \geq .957 ) and internal consistency (all \alpha \geq .930 ). To test predictive validity, we collected 2,500 open-ended behavioral samples rated by 151 human raters and a three-judge LLM ensemble. Human and judge ratings agreed ( \barr = .51 ), but neither tracked self-report: self-report–human \barr = -.01 , self-report–judge \barr = .13 , with no factor-level self-report–human CI excluding zero. On Responsiveness, self-report correlated with LLM judges ( r = .53 ) but not humans ( r = .04 ), even though humans and judges agreed ( r = .59 ) – indicating self-report items and LLM judges share variance that human observers do not, a confound invisible to within-ensemble reliability checks. We release the instrument as a diagnostic probe for alignment-shaped self-description and a concrete risk factor for LLM-as-judge pipelines.

[HC-25] Integrated Real-Time Motion Tracking and AI Analysis for Athletic Performance Optimization

链接: https://arxiv.org/abs/2606.09842
作者: Parth Agrawal,Ronit,Sagar Kumar,Aashish Bhambri
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 10 figures, 2 tables, IC2E3-2026 conference

点击查看摘要

Abstract:Applying Human Pose Estimation (HPE) in real world environments remains a challenging task, this paper explores and surveys real time HPE approaches and their limitations in sports analysis for individuals, alongside developing a practical lightweight prototype for real world testing and usage. The older marker-based motion capture systems evolving to the modern accessible and adaptable markerless deep learning approaches, this survey explores the foundational architectures, which balance precision and efficiency. We also compare algorithmic frameworks (top-down, bottom-up, one-stage approaches, etc.) on practical deployment metrics such as inference latency, frame rate, mean per-joint position error, and temporal jitter to guide model selection process for sports application. As our prime contribution, we are proposing a modular, lightweight software prototype, which uses MediaPipe HPE framework with multiple exercise specific logic to deliver real-time insights and AI based feedback for non-expert users. We derive sports insights and providing feedback with minimal computational resources, while showcasing the performance and reliability metrics. In the end, we suggest other future research directions like combining sensors, and AR/VR. This work caters to researchers, engineers, sport scientists, etc., as both technical resource and a valid blueprint to implement a similar or improved real-time HPE analysis system for athletic performance enhancement or other purposes.

[HC-26] Human-Centered AI for Safe Shuttle Car Routing in Underground Room-and-Pillar Coal Mines Using Graph Neural Networks

链接: https://arxiv.org/abs/2606.09841
作者: Bryant Pollard
类目: Human-Computer Interaction (cs.HC)
备注: 18 pages, 3 figures, 13 tables

点击查看摘要

Abstract:Underground room-and-pillar coal mining requires shuttle car operators to make safety-critical routing decisions under conditions of low visibility, dynamic miner movement, congestion, and limited real-time information. This paper presents a human-centered AI decision-support system that recommends safe shuttle car routes using a Graph Neural Network (GNN) trained on expert-informed synthetic data and deployed through a browser-based interface backed by cloud inference services. Rather than making a purely model-centric contribution, the paper shows how interviews, participatory design, usability testing, interaction logs, and interpretability analysis shaped both the interface and the AI model. The resulting application evolved beyond route recommendation to include inline user feedback, blockage reporting, audio cues, and SHAP-based interpretability within a single interactive system. Evaluation across two usability sessions with six participants showed improved task completion, faster response times, fewer errors, higher usability scores, lower cognitive load, and stronger agreement with AI recommendations in the later version. The paper contributes a focused example of how human-centered design can transform an AI routing prototype into a more transparent, auditable, and safety-supportive decision-support system.

[HC-27] Aesthetic Perspectives in Information Systems Research: A Hermeneutic Analysis

链接: https://arxiv.org/abs/2606.09839
作者: Angelina Chen,Rick Sullivan,Raffaele F Ciriello
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: Thirty-Fourth European Conference on Information Systems (ECIS 2026), Milan, Italy

点击查看摘要

Abstract:How might implicit aesthetic perspectives shape what Information Systems (IS) scholarship recognises as worthy of study (or not)? In this hermeneutic literature analysis, we surface foundational aesthetic assumptions underpinning IS research. We identify four perspectives (aesthetics as imitation, sensory experience, world-making, and political doing) that guide how IS scholars perceive and appreciate sociotechnical phenomena. These perspectives influence what becomes recognisable as legitimate research and what remains unseen. By making aesthetic assumptions explicit, we show how they form epistemic infrastructure that conditions horizons of inquiry. We apply this framework to algorithmic management and digitally mediated intimacy, revealing how alternative perspectives open new research questions whilst exposing dimensions that dominant framings overlook. This analysis foregrounds the importance of aesthetic philosophy to IS literature, offering a vocabulary for articulating how aesthetic perspectives shape theorising, method, and contribution.

[HC-28] Popularity Without Legitimacy? Comparing Trust in Television Meteorologists and YouTube Weatherfluencers

链接: https://arxiv.org/abs/2606.09838
作者: Julie A. Vera,David W. McDonald,Mark Zachry
类目: Human-Computer Interaction (cs.HC)
备注: 14 pages, conditionally accepted at Proceedings of the 23rd ISCRAM Conference, The Hague, the Netherlands June 2026

点击查看摘要

Abstract:During severe weather events, people must interpret rapidly evolving information to make time-sensitive safety decisions. Broadcast meteorologists have traditionally served as credentialed intermediaries within established media organizations, while independent “weatherfluencers” on YouTube have emerged as prominent real-time interpreters for large and growing audiences. This mixed-methods study provides one of the first empirical comparisons of how viewers evaluate broadcast meteorologists against YouTube weatherfluencers across credibility, legitimacy, objectivity, and practical utility. Broadcast meteorologists were consistently rated higher on credibility, legitimacy, and safety utility, while weatherfluencers achieved parity on objectivity. Yet weatherfluencer audiences continue to grow, revealing a critical decoupling between audience attention and official or professional authorization that existing crisis communication models do not fully account for. Qualitative findings illuminate the mechanisms underlying these judgments and their implications for emergency communication in hybrid information ecosystems.

[HC-29] Self-EmoQ: Plutchik-Guided Value-based Planning to Drive Streaming Emotional TTS ACL2026

链接: https://arxiv.org/abs/2606.09837
作者: Yue Zhao,Hongyan Li,Yong Chen,Luo Ji
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: Accepted to ACL 2026 Findings

点击查看摘要

Abstract:Emotional interaction is increasingly crucial for conversational AI, yet current systems lack a self-emotion determination mechanism to drive the streaming text-to-speech (TTS) synthesis. We propose an emotion-planning framework that determines the emotion prior to the textual generation, grounding the downstream emotional TTS in a streaming manner. The framework is implemented by a plug-and-play LLM module, initialized from pretrained LLMs, and trained by reinforcement learning (RL) with emotions as the actions. A hybrid reward is employed which combines imitation signals with theory-driven scoring, in which the theory of Plutchik’s wheel of emotions is adopted. By experiments on DailyDialog, EmoryNLP, IMEOCAP, and MELD, our method outperforms prompting and finetuning baselines on both emotion determination and response quality. We finally implement an entire streaming pipeline for real-time deployment, with the speech quality confirming the framework’s emotional alignment, contextual coherence, and expressive fluency. Codes, cases, and demos are available in this https URL.

[HC-30] Equanimity in HRI: Applying Calm Technology Principles to Human-Robot Interaction

链接: https://arxiv.org/abs/2606.09836
作者: Barbara Sienkiewicz,Bipin Indurkhya
类目: Human-Computer Interaction (cs.HC); Robotics (cs.RO)
备注: Conference pre-print. this https URL

点击查看摘要

Abstract:This paper explores how \textitCalm Technology can be integrated into Human-Robot Interaction (HRI), with a particular focus on the household environment. It offers comprehensive guidelines for designing assistive robots that prioritize and enhance the human need for \textitequanimity, ensuring interactions are calm, non-intrusive, and harmonious. The paper examines the widespread influence of technology in contemporary life and its impact on cognitive capabilities, underscoring the need for responsible robotics and ethical considerations in future technological developments. By adapting \textitCalm Technology principles to domestic robots, the article provides concrete examples and features that should be employed in household assistive robotics. The goal is to foster a balanced, unobtrusive interaction between humans and robots, especially in the home environment, as it is the most privat environment in everyone’s life, paving the way for applications and further research in the field.

[HC-31] hinking Inside the Box: Considerations for Putting Data Physicalization Workshops in a Box

链接: https://arxiv.org/abs/2606.09835
作者: Derya Akbaba,Camilla Svensson,Claudia Torelli,Martin Callmeryd,Miriah Meyer
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Visualization researchers utilize workshops both for applied research and to engage different populations with visualization-based activities. While there are many benefits to running visualization workshops, their utility and impact rely on the presence of a researcher who has deep knowledge about visualization theory and practice. In this work, we introduce workshop-in-a-box as a design concept intended to challenge the researcher-centric approach to data physicalization workshops. Through a design study with a socially innovative organization, we deployed several data physicalization workshops that our collaborator ran instead of us. Based on this experience, along with two accompanying case studies that validate the concept, we present material and procedural considerations for how to put data physicalization workshops into a box and the implications it has for extending visualization research outside the bounds of academia.

[HC-32] Weather Synchronization in Digital Twin Environments for Shared VR Experience Using Commercial Metaverse Platforms

链接: https://arxiv.org/abs/2606.09834
作者: Masanori Ibara,Yuichi Hiroi,Takushi Kamegai,Yusuke Masubuchi,Kazuki Matsutani,Megumi Zaizen,Junya Morita,Takefumi Hiraki
类目: Human-Computer Interaction (cs.HC)
备注: 4 pages, presented at IEEE ISEMV 2025

点击查看摘要

Abstract:Digital twin technology creates bidirectional synchronization between physical and virtual environments, yet current implementations fail to provide authentic environmental experiences that enhance user presence in shared virtual spaces. While digital twin environments using commercial metaverse platforms for IoT sensor data visualization have been proposed, translating environmental information into meaningful sensory experiences remains largely unexplored, particularly lacking approaches for weather conditions that significantly influence spatial perception. We developed a weather synchronization system that integrates real-time environmental data from ``The GEAR’’ smart building with the Cluster metaverse platform, enabling shared VR experiences with authentic atmospheric immersion. Our system processes temperature, humidity, precipitation, wind speed, and solar radiation measurements to generate corresponding virtual environmental effects including dynamic sky rendering, precipitation particles, and ambient audio modulation. Performance evaluation demonstrated practical response times of 0.8-1.0 seconds for weather data transmission and reflection in the virtual environment. This work establishes a foundation for seamless physical-digital space integration, potentially enhancing remote collaboration efficiency and facilitating more dynamic discussions in shared virtual workspaces.

[HC-33] CollabSkill: Evaluating Human-Agent Collaboration On Real-World Tasks

链接: https://arxiv.org/abs/2606.09833
作者: Yijia Shao,Zora Zhiruo Wang,Neel Ahuja,Yicheng Wang,Bowen Liu,Diyi Yang
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 11 pages of main paper, preprint (under review)

点击查看摘要

Abstract:AI agents are reshaping the workspace, leading to drastic change of how humans work. Despite the considerable potential of human-agent collaboration both in preserving human agency and generating economic value, this paradigm remains largely absent from occupational task evaluation, hindered by the difficulty of gathering real human data and accounting for inter-human variability. We introduce CollabSkill, a framework for evaluating human-agent collaboration on real-world occupational tasks. CollabSkill pairs real human workers with AI agents on tasks matched to their occupational background, collecting data that capture the complexity of economically valuable tasks and the usage patterns of real workers. To account for inter-human variability, CollabSkill employs a Bayesian skill rating system to disentangle and quantify the skill contributions of both humans and AI agents. Drawing on over 1,500 prompts from 386 working sessions contributed by 93 human workers, our analysis yields insights on two fronts: on the agent side, rankings on CollabSkill diverge meaningfully from those of existing fully autonomous benchmarks where Codex leads, with Claude Code ranking first; on the human side, CollabSkill reveals that practical experience emerges as the primary driver of collaboration skill, with hands-on collaboration meaningfully shifting workers’ AI literacy. Together, we hope CollabSkill enables the community to invest in systematic evaluation of human-agent collaboration and spurs development efforts aimed at building AI agents that genuinely augment human workers.

[HC-34] Agent ic Social Affordance Framework (ASAF): Agent Identity Design as a Collaboration Interface in Multi-Agent Systems

链接: https://arxiv.org/abs/2606.09832
作者: Meng-Han Lee
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 24 pages, 2 figures, 1 table. Introduces ASAF with falsifiable hypotheses and proposed experimental designs for testing agent identity design effects in multi-agent Human-in-the-Loop systems, grounded in a real-world 38-agent deployment

点击查看摘要

Abstract:As AI systems evolve from single conversational agents to complex multi-agent architectures, a critical design dimension has been overlooked: how the social identity of individual agents shapes human behavior within the collaboration. This paper introduces the Agentic Social Affordance Framework (ASAF), a theoretical framework that extends Social Affordance theory into the context of multi-agent AI systems. We propose that agent identity design functions not merely as a user interface convention, but as a collaboration interface – structuring how users perceive, approach, and engage with each agent, and thereby influencing the quality of Human-Agent collaboration outcomes. Specifically, the social affordance layer constitutes an independent design dimension orthogonal to engineering orchestration: the two represent distinct decision spaces that cannot be derived from each other. ASAF comprises three mechanisms: Identity Signaling, Behavioral Priming, and Collaborative Governance, and specifies their boundary conditions through a four-tier Identity Signal Fidelity Spectrum and an individual-difference moderating variable (anthropomorphizing vs.\ instrumentalizing cognitive style). We situate ASAF in relation to existing affordance theory and the CASA paradigm, delineating where ASAF’s multi-agent, topology-level predictions exceed the explanatory scope of dyadic frameworks. We discuss implications for multi-agent system design and outline directions for future empirical validation, including a factorial design for testing design-space orthogonality.

[HC-35] AI-Driven Analytics of Team-Teaching Talk: Acoustic Patterns across Experience Cohorts and the Learning Design

链接: https://arxiv.org/abs/2606.09831
作者: Yuchen Liu,Roberto Martinez-Maldonado,Riordan Alfredo,Paola Mejia-Domenzain,Dwi Rahayu,Sadia Nawaz
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: Accepted at AIED 2026 (International Conference on Artificial Intelligence in Education), 14 pages, 4 figures

点击查看摘要

Abstract:As classroom cohorts expand, team teaching is increasingly used to integrate the expertise and pedagogical perspectives of multiple teachers. Yet, there is limited empirical understanding of how team teaching unfolds in practice, particularly regarding differences in teachers’ contributions across experience levels, student cohorts, and learning task design. Prior research on team teaching has largely relied on retrospective self-reports or small-scale observations, offering limited insight into the micro-level processes through which team teaching is enacted. Teacher talk offers a scalable lens on these processes. While research in individual teaching contexts shows that acoustic features of speech (e.g., voice quality, intonation, and loudness) can shape student learning, evidence from team-teaching settings remains scarce. Moreover, capturing such features through manual observation or transcription is especially challenging in team-teaching classrooms, where multiple teachers speak across extended sessions and spatial locations, limiting scalability without automation. Grounded in spatial pedagogy theory and team-teaching research, this paper presents an AI-based speech processing approach to analyse classroom talk in team-teaching settings. We analysed 36 recorded undergraduate and postgraduate sessions involving 12 teachers. Spatial pedagogy behaviours were coded and acoustic features extracted to examine variation across teachers’ experience, student cohorts, and the learning task design. The results reveal systematic differences, most notably in loudness dynamics: high-experience teachers, undergraduate classes and collaborative learning tasks exhibited greater loudness variation, suggesting more frequent modulation of volume to foreground key information and support classroom interaction and engagement.

[HC-36] Culturally-Aware AI for Cross-Boundary Community Learning: Undergraduate Innovation at the Intersection of Computation and Design

链接: https://arxiv.org/abs/2606.09041
作者: Jiaojiao Zhao,Weisheng Zhang,Jiawen Cai,Haibin Gao,Luyao Zhang
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Graphics (cs.GR); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Research on artificial intelligence in education (AIED) is rapidly expanding, yet technical progress often lacks human-centered grounding and adequate attention to cultural context. Community-Based Learning, a pedagogy rooted in social work, remains underrepresented in AIED research, particularly within Asia-Pacific contexts. This paper reports on cross-boundary Community-Based Learning where undergraduate students develop AI-enabled solutions for cultural heritage preservation and sustainable development. We examine how community-engaged computing operationalizes human-centered AIED across three dimensions: education, technology, and culture. We contribute a collaborative framework for culturally-aware AIED that fosters multi-stakeholder collaboration while widening participation by dissolving disciplinary silos between social work and computational science.

[HC-37] owards Gaze-Informed AI Disclosure Interfaces: Eye-Tracking Attentional and Cognitive Load While Reading AI-Assisted News

链接: https://arxiv.org/abs/2605.14999
作者: Pooja Prajod,Hannes Cools,Thomas Röggla,Pablo Cesar,Abdallah El Ali
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:As generative AI becomes increasingly integrated into journalism, designing effective AI-use disclosures that inform readers without imposing unnecessary burden is a key challenge. While prior research has primarily focused on trust and credibility, the impact of disclosures on readers’ attentional and cognitive load remains underexplored. To address this gap, we conducted a 3\times2\times2 mixed factorial study manipulating the level of AI-use disclosure detail (none, one-line, detailed), news type (politics, lifestyle), and role of AI (editing, partial content generation), measuring load via NASA-TLX and eye-tracking. Our results reveal a significant attentional cost: one-line disclosures resulted in significantly higher fixation durations and saccade counts, particularly for AI-edited content. Detailed disclosures did not impose additional burden. Drawing on Information-Gap Theory, we argue that brief labels may trigger increased visual scrutiny by alerting readers to AI use without providing enough information. NASA-TLX scores and pupil diameter showed no significant differences across conditions, suggesting that AI-use disclosures do not impose cognitive burden regardless of the detail level. Interview insights contextualize these findings and reveal a strong preference for detailed or ``detail-on-demand’’ designs. Our findings inform the design of gaze-informed adaptive disclosure interfaces that dynamically adjust transparency levels based on readers’ attentional patterns and news context.

[HC-38] More Human or More AI? Visualizing Human-AI Collaboration Disclosures in Journalistic News Production

链接: https://arxiv.org/abs/2601.11072
作者: Amber Kusters,Pooja Prajod,Pablo Cesar,Abdallah El Ali
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: Accepted to ACM CHI 2026 - Preprint

点击查看摘要

Abstract:Within journalistic editorial processes, disclosing AI usage is currently limited to simplistic labels, which misses the nuance of how humans and AI collaborated on a news article. Through co-design sessions (N=10), we elicited 69 disclosure designs and implemented four prototypes that visually disclose human-AI collaboration in journalism. We then ran a within-subjects lab study (N=32) to examine how disclosure visualizations (Textual, Role-based Timeline, Task-based Timeline, Chatbot) and collaboration ratios (Primarily Human vs. Primarily AI) influenced visualization perceptions, gaze patterns, and post-experience responses. We found that textual disclosures were least effective in communicating human-AI collaboration, whereas Chatbot offered the most in-depth information. Furthermore, while role-based timelines amplified AI contribution in primarily human articles, task-based timeline shifted perceptions toward human involvement in primarily AI articles. We contribute Human-AI collaboration disclosure visualizations and their evaluation, and cautionary considerations on how visualizations can alter perceptions of AI’s actual role during news article creation.

[HC-39] Full Disclosure Less Trust? How the Level of Detail about AI Use in News Writing Affects Readers Trust

链接: https://arxiv.org/abs/2601.09620
作者: Pooja Prajod,Hannes Cools,Thomas Röggla,Karthikeya Puttur Venkatraj,Amber Kusters,Alia ElKattan,Pablo Cesar,Abdallah El Ali
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:As artificial intelligence (AI) is increasingly integrated into news production, calls for transparency about the use of AI have gained considerable traction. Recent studies suggest that AI disclosures can lead to a ``transparency dilemma’‘, where disclosure reduces readers’ trust. However, little is known about how the \textitlevel of detail in AI disclosures influences trust and contributes to this dilemma within the news context. In this 3 \times 2 \times 2 mixed factorial study with 40 participants, we investigate how three levels of AI disclosures (none, one-line, detailed) across two types of news (politics and lifestyle) and two levels of AI involvement (low and high) affect news readers’ trust. We measured trust using the News Media Trust questionnaire, along with two decision behaviors: source-checking and subscription decisions. Questionnaire responses and subscription rates showed a decline in trust only for detailed AI disclosures, whereas source-checking behavior increased for both one-line and detailed disclosures, with the effect being more pronounced for detailed disclosures. Insights from semi-structured interviews suggest that source-checking behavior was primarily driven by interest in the topic, followed by trust, whereas trust was the main factor influencing subscription decisions. Around two-thirds of participants expressed a preference for detailed disclosures, while most participants who preferred one-line indicated a need for detail-on-demand disclosure formats. Our findings show that not all AI disclosures lead to a transparency dilemma, but instead reflect a trade-off between readers’ desire for more transparency and their trust in AI-assisted news content.

计算机视觉

[CV-0] ARM: An AutoRegressive Large Multimodal Model with Unified Discrete Representations

链接: https://arxiv.org/abs/2606.11188
作者: Junke Wang,Xiao Wang,Jiacheng Pan,Xuefeng Hu,Feng Li,Jingxiang Sun,Chaorui Deng,Zilong Chen,Yunpeng Chen,Kaibin Tian,Matthew Gwilliam,Hao Chen,Danhui Guan,Kun Xu,Weilin Huang,Zuxuan Wu,Haoqi Fan,Yu-Gang Jiang,Zhenheng Yang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: technical report

点击查看摘要

Abstract:This paper introduces ARM, a discrete representation-based AutoRegressive Model that unifies image understanding, generation, and editing within a next-token prediction framework. ARM is built on three efforts: first, we train a discrete semantic visual tokenizer that maps images into compact token sequences. Our tokenizer is supervised with multiple objectives that jointly promote semantic discriminability, language alignment and faithful reconstruction, thereby supporting diverse tasks in a shared latent space. With this, we train a 7B autoregressive model over large-scale text and image token sequences, seamlessly developing vision-language perception and generation capabilities. Finally, to further improve preference-aligned behavior for text-to-image generation and instruction-guided editing, ARM applies reinforcement learning (RL) to optimize task-level objectives such as visual quality, instruction adherence, and edit consistency. Surprisingly, the results show that RL not only substantially improves performance on the target tasks (e.g., raising WISE overall from 0.50 to 0.56, GEdit-Bench-EN G_O from 5.75 to 6.68), but also induces cross-task synergy between text-to-image generation and editing. Collectively, these findings highlight autoregressive modeling, when paired with strong representations and preference optimization, as a scalable foundation for multimodal intelligence. Code: this https URL.

[CV-1] Next Forcing: Causal World Modeling with Multi-Chunk Prediction

链接: https://arxiv.org/abs/2606.11187
作者: Gangwei Xu,Qihang Zhang,Jiaming Zhou,Xing Zhu,Yujun Shen,Xin Yang,Yinghao Xu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Autoregressive video generation has emerged as a powerful paradigm for World Action Models (WAMs). However, existing approaches suffer from slow training convergence and limited converged accuracy, particularly at high frame rates, as the training supervision is confined to the current chunk without explicit signals about future dynamics; they also suffer from slow inference due to iterative video denoising. In this paper, we present Next Forcing, a multi-chunk prediction (MCP) framework for causal world modeling that enables faster training, higher accuracy, and accelerated inference. Inspired by multi-token prediction in large language models, Next Forcing introduces an MCP training objective that augments the main model with lightweight auxiliary MCP modules to simultaneously denoise video chunks at multiple future temporal horizons (next ^1 , next ^2 , next ^3 chunks). These MCP modules form a causal chain across prediction depths, where intermediate features fused from multiple layers of the main model are leveraged to predict future dynamics, allowing near-future predictions to inform farther-future ones and providing dense multi-scale temporal supervision back to the main model. During training, the MCP modules significantly accelerate convergence and improve converged accuracy, especially at high frame rates: at 50 fps, Next Forcing achieves a 93.1% relative improvement over LingBot-VA at 5k training steps and 2.3x faster convergence, and establishes new state-of-the-art results on the RoboTwin benchmark (94.1/93.5% on Clean/Random). At inference, the MCP modules can be retained to predict the next video chunk in parallel with the current one, achieving 2x inference acceleration. Next Forcing also demonstrates significant improvements on PhyWorld, a benchmark evaluating adherence to physical laws in video generation, and over 50% FVD reduction on general video pretraining.

[CV-2] AnyMod-LLVE: Low-Light Video Enhancement with Modality-Agnostic Inference ICML2026

链接: https://arxiv.org/abs/2606.11186
作者: Hangfeng Liang,Yutao Hu,Yanhan Hu,Xiaohan Wu,Wenqi Shao,Ying Fu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICML 2026; Project page and code: this https URL

点击查看摘要

Abstract:Low-light video enhancement (LLVE) remains a challenging task due to severe information degradation under low-illumination conditions. Recent multimodal approaches have significantly improved enhancement performance by incorporating auxiliary modalities, such as event streams and infrared images. However, these methods typically assume the availability of these modalities at inference, which is often not feasible in real-world scenarios. To solve this problem, in this work, we propose AMNet, a unified multimodal framework for LLVE, to support flexible modality-agnostic inference, where auxiliary modalities may be unavailable. To address the issue of modality absence, we introduce a Spatial-Spectral Dual-Gated Translator that learns the correspondence between auxiliary modalities and RGB inputs, producing implicit auxiliary representations to support the robust enhancement. Additionally, to fully facilitate the learning of cross-modal correspondence, we conduct large-scale multimodal pretraining based on the RGB-only dataset with synthetic auxiliary modalities. Extensive experiments demonstrate that AMNet could handle arbitrary inference-time modality combinations and exhibits superior performance for LLVE under modality absence conditions. Code and models are available on the project page.

[CV-3] Lip Forcing: Few-Step Autoregressive Diffusion for Real-time Lip Synchronization

链接: https://arxiv.org/abs/2606.11180
作者: Paul Hyunbin Cho(1),Jinhyuk Jang(1),SeokYoung Lee(1),Joungbin Lee(1),Siyoon Jin(1),Heeseong Shin(1),Jung Yi(1),Yunjin Park(2),Chulmin Park(2),Seungryong Kim(1) ((1) KAIST AI, (2) AIPARK)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Diffusion-based lip synchronization models achieve strong visual quality and audio-visual alignment, but full-sequence bidirectional attention and many denoising steps make them impractical for real-time inference. We present Lip Forcing, to our knowledge the first autoregressive diffusion method for video-to-video (V2V) lip synchronization, which distills a 14B audio-conditioned bidirectional video diffusion teacher into causal students. At inference, the students generate each chunk in only two denoising steps without inference-time CFG, enabling real-time lip synchronization. A lip-sync-specific teacher-trajectory analysis reveals a CFG fidelity-sync tradeoff: no-CFG predictions favor reference fidelity, whereas CFG-guided predictions favor synchronization within a mid-trajectory band. Lip Forcing translates this finding into three analysis-derived components: Sync-Window DMD, a two-step inference schedule, and a SyncNet-based reward. We validate Lip Forcing at two student scales, both distilled from the 14B teacher. The 1.3B student crosses into real-time streaming at 31 FPS, 17.6\times faster than its same-scale bidirectional model. The 14B student, the largest diffusion model reported for V2V lip synchronization, runs 39.8\times faster than its teacher at comparable reference fidelity. Time-to-first-frame is sub-millisecond at both scales, far below every diffusion baseline.

[CV-4] Mean Flow Distillation: Robust and Stable Distillation for Flow Matching Models

链接: https://arxiv.org/abs/2606.11155
作者: An Zhao,Shengyuan Zhang,Zhongjian Sun,Yixiang Zhou,Zejian Li,Ling Yang,Tianrun Chen,Lingyun Sun
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Flow Matching models have demonstrated strong performance across a wide range of generative tasks. However, their reliance on ODE-based iterative sampling incurs substantial computational overhead in inference, which limits their applicability in real-time scenes. While distillation is a promising solution, existing approaches largely borrow from diffusion-based score matching, often failing to exploit the intrinsic geometric structure of flows and suffering from training instability, high variance, and degraded generation quality. In this paper, we propose Mean Flow Distillation (MFD), a novel distillation framework tailored for flow matching models. We theoretically demonstrate that MFD acts as a temporal low-pass filter, effectively suppressing the high-frequency optimization noise inherent in variational score distillation (VSD) while ensuring global trajectory consistency. We further prove the Mean Flow Matching Theorem, establishing that matching expected average velocities is sufficient for strict distribution alignment. Empirically, on challenging tasks of high-dimensional manifolds including 4D occupancy forecasting and text-to-image generation, MFD achieves state-of-the-art performance, enabling high-fidelity single-step generation.

[CV-5] P3D-Bench: Benchmarking MLLM s for Parametric 3D Generation and Structural Reasoning

链接: https://arxiv.org/abs/2606.11152
作者: Yikang Yang,Zhanpeng Hu,Youtian Lin,Mengqi Zhou,Jingxi Xu,Feihu Zhang,Jiaheng Liu,Yao Yao
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Multimodal large language models can write code to produce complex programs as well as use programs to do 3D modeling, which opens up a new avenue for 3D generation powered by their priors, world knowledge and reasoning. Yet existing benchmarks rarely evaluate 3D modeling through code. Such modeling demands more than runnable code: from a text or visual specification, a model must generate a parametric 3D program that is geometrically precise, semantically aligned and assembly-consistent. We introduce P3D-Bench, a benchmark for parametric 3D generation. Unlike a 3D mesh, a parametric 3D program exposes explicit dimensions, construction operations and part relations, revealing whether a model recovers a design’s structure, not just its appearance. Under a unified protocol, P3D-Bench covers three task families (Text-to-3D, Image-to-3D and Assembly-3D) and scores each output for executability, geometric fidelity, topology, text-grounded constraints, multiview semantic alignment and part-level structure. We evaluate frontier MLLMs and text-only LLMs on 400 text cases, 400 image cases and 203 annotated assemblies, with domain-specific models as reference points. Our extensive evaluation yields three findings. First, assemblies are the hardest setting, where models still fail to compose multiple parts into a coherent structure. Second, models can often recover the global shape and semantic identity of the target object, yet fail to reproduce the precise parametric geometry specified by the input. Third, part-level modeling remains weak on assemblies, where models recover neither the geometry of each part nor the right number of parts. These results position P3D-Bench as a benchmark for evaluating precise parametric geometry and part-level structure in parametric 3D generation.

[CV-6] MOFA-VTON: More Fashion Possibilities with Fine-Grained Adaptations in Virtual Try-On CVPR2026

链接: https://arxiv.org/abs/2606.11148
作者: Xiaoyu Han,Chenyang Wang,Jing Wang,Shunyuan Zheng,Quanling Meng,Shengping Zhang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026 (Highlight)

点击查看摘要

Abstract:Virtual try-on aims to fit an in-shop clothing image onto a specific human body. An optimal virtual try-on method should provide diverse and flexible dressing options, accurately reflecting the varied wearing styles encountered in real-life scenarios, tailored to individual preferences and fashion aspirations. However, current methods predominantly perform a direct replacement of the original clothing with the target clothing, following the same dressing pattern. This limited control over clothing adaptation may result in fixed and monotonous try-on outputs. To delve into More Fashion Possibilities with Fine-Grained Adaptations in Virtual Try-On, we propose a novel virtual try-on method, termed MOFA-VTON, which allows adjustment for clothing adaptations in try-on results through simple sketches by users. Specifically, we first design a mask construction strategy that transforms user-drawn curve sketches into a dual-region mask, replacing the traditional clothing-agnostic mask and providing fine-grained layout guidance for the subsequent generation process. Further, we propose layout adjustment blocks that utilize the cross-attention mechanism to independently learn layout correspondences for upper and lower regions of the human body, refining the spatial arrangement of the two regions. With these implementations, our method enables flexible and fine-grained adaptations of target clothing, overcoming the constraints of a fixed layout. Extensive experiments on VITON-HD and DressCode datasets demonstrate that our proposed MOFA-VTON outperforms previous state-of-the-art methods and provides more fashion possibilities for virtual try-on.

[CV-7] UniPET: a universal network for high-quality PET image denoising across varied dose reduction factors

链接: https://arxiv.org/abs/2606.11131
作者: Zhiwen Yang,Yang Zhou,Haowei Chen,Hui Zhang,Dan Zhao,Bingzheng Wei,Yan Xu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Most existing deep learning-based PET image denoising methods assume a fixed and known dose reduction factor (DRF) for low-dose PET images. However, these methods encounter significant performance degradation when the DRF varies beyond the assumed one in practical applications. To address the challenge posed by varied DRFs, several preliminary studies focus on the task of universal PET image denoising, aiming to train a universal model over low-dose data across DRFs. Nonetheless, these vanilla universal models often struggle with misaligned styles present in different DRF data, leading to the \textitstyle elimination issue with a significant over-smoothing effect. To deal with this issue, we innovatively introduce domain generalization to PET image denoising and propose a universal PET image denoising network (UniPET) to achieve high-quality PET image denoising across diverse DRFs. UniPET comprises two primary innovations: a style alignment network (SAN) and a region-aware learning strategy (RALS). Specifically, SAN utilizes style alignment techniques derived from domain generalization to align and recover styles across different DRFs, ensuring the model’s generalizability across various DRFs while effectively preserving styles. Furthermore, to enhance style recovery, RALS distinguishes between flat and stylized regions, exclusively conducting adversarial learning on the latter, thereby more effectively guiding the model’s focus towards learning stylized regions. It is demonstrated that our proposed UniPET can adaptively recover different DRF styles and achieve high-quality PET image denoising across DRFs. Comprehensive experiments show that UniPET exhibits comparable performance to individual DRF-specific models at specific DRFs and realizes state-of-the-art performance in universal PET image denoising quantitatively, perceptually, and clinically.

[CV-8] WorldOlympiad: Can Your World Model Survive a Triathlon?

链接: https://arxiv.org/abs/2606.11129
作者: Yuke Zhao,Wangbo Zhao,Weijie Wang,Zeyu Zhang,Dakai An,Akide Liu,Yinghao Yu,Jiasheng Tang,Fan Wang,Wei Wang,Bohan Zhuang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL , Code: this https URL

点击查看摘要

Abstract:We introduce WorldOlympiad, a benchmark for diagnosing video-based world models across physical faithfulness, geometric consistency, and interaction fidelity. While existing benchmarks often focus on visual quality, semantic alignment, or short-term temporal coherence, they provide limited insight into whether generated videos obey physical rules, preserve coherent 3D structure, and sustain controllable interactions over long horizons. To address this gap, WorldOlympiad decomposes world-model evaluation into three complementary dimensions. The physical track uses object segmentation and MLLM-as-judge to assess whether generated videos follow interpretable rules in mechanics, thermal phenomena, and material properties. The geometry track reconstructs generated videos with Gaussian splatting and evaluates structural consistency, cross-view coherence, and camera-trajectory alignment. The interaction track assesses whether generated rollouts follow complex action prompts and maintain smooth, coherent transitions across consecutive video chunks. WorldOlympiad further covers three major downstream scenarios, including gaming, robotics, and general real-world videos, capturing diverse challenges from interactive control and embodied manipulation to open-domain motion and camera dynamics. Together, these tracks and scenarios form a scalable and interpretable evaluation suite that exposes failure modes beyond generic video quality. Experiments on state-of-the-art models reveal substantial gaps in physical reasoning, 3D consistency, and long-horizon interaction, underscoring the need for more structured evaluation protocols for generative world models.

[CV-9] Monte Carlo Pass Search: Using Trajectory Generation for 3D Counterfactual Pass Evaluation in Football CVPR2026

链接: https://arxiv.org/abs/2606.11120
作者: Andrew Kang,Priya Narasimhan
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026, CVSports Workshop

点击查看摘要

Abstract:We recast pass evaluation in football (soccer) as a Monte Carlo Tree Search (MCTS)-like evaluation problem whose components mostly exist in the literature under different names: a value model (possession value), a world model (multi-agent trajectories with ball interactions), and a policy over counterfactual actions (sampling pass variants with noise). Building on the first public high-fidelity tracking dataset with 3D ball trajectories from the Bundesliga, we introduce Monte Carlo Pass Search (MCPS), which infers kick parameters for each observed pass, samples execution variants and option variants, rolls each candidate forward with a ball-conditioned world model until the next ball interaction, and scores outcomes with a learned value model to obtain a distribution over gained value. This distribution enables distribution-aware attribution with two complementary execution-surplus scores used for analysis and ranking: mean-based and percentile-based scores. To make the world model sample-efficient under limited public data, we adapt a discrete-token, autoregressive trajectory generator from autonomous driving (SMART) and show it yields strong best-of-20 forecasting accuracy compared to baselines, while supporting fully hypothetical rollouts for downstream evaluation. We have released model checkpoints and code.

[CV-10] FADA: Accessible fetal ultrasound interpretation and annotation with a selectively distilled unified vision-language model

链接: https://arxiv.org/abs/2606.11106
作者: Mahmood Alzubaidi,Uzair Shah,Raden Muaz,Ines Abbes,Nader Mohammed,Abdullatif Magram,Khalid Alyafei,Mowafa Househ,Marco Agus
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:A global shortage of trained sonographers limits prenatal ultrasound screening in low- and middle-income countries, where over half of pregnant women receive no skilled sonography. Current deep learning approaches address detection, segmentation, or classification in isolation, each demanding a separate model and expert-specified labels at inference. We present FADA, a unified vision-language model built on Qwen3.5-VL that performs clinical interpretation, classification, detection, and segmentation through a single interpretation-first pipeline without external labels. FADA distills knowledge from four domain-specific foundation models (FetalCLIP, UltraSAM, USF-MAE, UltraFedFM) via offline pre-computed feature caching. Selective distillation, which applies feature alignment only to annotation tasks while interpretation relies on standard fine-tuning, consistently outperforms full distillation across most evaluation axes. The recommended variant, FADA-SKD, achieves 0.8820 mean Dice for segmentation, 0.7671 mAP@0.50 for detection, and 100% structured interpretation compliance. Expert sonographer validation across 237 images confirms clinically acceptable outputs in both autonomous and human-in-the-loop modes, with 73.5% of interpretations scoring perfectly under clinician guidance. The system is trainable on a single consumer GPU and deployable without cloud connectivity. We validate edge deployment by running the compressed 0.8B model on a commodity smartphone (Qualcomm Snapdragon 7 Gen 1, 12 GB RAM) using this http URL with GGUF quantization, completing the full 5-phase pipeline in approximately 60 seconds entirely offline. This establishes a practical pathway for integrating AI-assisted fetal assessment with portable ultrasound devices, directly addressing diagnostic access gaps in resource-constrained settings. Code, models, and data are available at this https URL.

[CV-11] IDEAL: In-DEpth ALignment Makes A Discrete Representation AutoEncoder

链接: https://arxiv.org/abs/2606.11096
作者: Yitong Chen,Zijie Diao,Junke Wang,Lingyu Kong,Yixuan Ren,Bo He,Yu-Gang Jiang,Zuxuan Wu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code is available at this https URL

点击查看摘要

Abstract:Built on pretrained vision foundation models (VFMs), representation autoencoders (RAEs) have recently emerged as a promising approach for constructing semantically rich latent spaces for image generation. However, their reconstruction quality often remains suboptimal, largely because deep VFM representations do not preserve sufficient fine-grained visual detail. This limitation becomes even more severe after discretization, where missing low-level information is difficult to recover. In fact, we observe that shallow VFM features retain considerably richer local appearance and structural detail, which complements the high-level semantics carried by deep features used in existing RAEs. Motivated by this complementary property, we propose Ideal, an In-depth Alignment framework for discrete representation autoencoding. By jointly aligning quantized tokens with both shallow and deep VFM features, Ideal enables the resulting discrete visual tokens to preserve both visual fidelity and rich semantics. Extensive experiments demonstrate that Ideal yields superior reconstruction performance, achieving 0.61 rFID on ImageNet and outperforming the previous best method by 0.28. When used for autoregressive image generation, Ideal further produces a gFID of 1.89, establishing a new state of the art for autoregressive image generation.

[CV-12] U-TTT: Towards Generalizable PET Image Denoising via Test-Time Training

链接: https://arxiv.org/abs/2606.11032
作者: Zhiwen Yang,Jiayin Li,Hao Lu,Hui Zhang,Zihua Wang,Bingzheng Wei,Yan Xu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Existing deep learning models for Positron Emission Tomography (PET) image denoising often suffer from severe performance degradation under distribution shifts, fundamentally restricting their robust clinical deployment. This lack of generalization stems from the conventional paradigm of fixed-parameter models that cannot adapt to variations in test data (e.g., dose levels or scanner types) after training. To overcome this limitation and achieve robust generalization, we introduce U-TTT, a novel U-shaped model that integrates Test-Time Training (TTT) layers to dynamically adjust model parameters during inference through self-supervision, thereby adapting to the specific characteristics of each test instance. Furthermore, to comprehensively capture the complex degradations of 3D PET data, U-TTT features a dual-domain adaptation mechanism comprising a Spatial Test-Time Training (S-TTT) layer and a Frequency Test-Time Training (F-TTT) layer. The S-TTT layer captures and corrects spatial structural degradations, while the F-TTT layer suppresses global noise spectra and restores delicate high-frequency details. Extensive experiments demonstrate that U-TTT achieves state-of-the-art PET denoising performance and exhibits superior generalization under challenging distribution shifts, including both unseen dose levels and unseen scanners. Our code will be available at this https URL.

[CV-13] An Uncertainty Estimation Framework for Dose Accumulation in Adaptive Radiotherapy: Application to CBCT-Guided Radiotherapy for Cervical Cancer

链接: https://arxiv.org/abs/2606.11012
作者: Cedric Hemon,Delphine Lebret,Jean-Claude Nunes,Valentin Boussot,Karine Peignaux,Nathalie Mesgouez-Nebout,Chantal Hanzen,Antoine Simon,Anaïs Barateau,Renaud de Crevoisier,Caroline Lafond
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under revision

点击查看摘要

Abstract:Background and purpose: oART enables daily plan adaptation to interfraction anatomical variations, but cumulative dose estimation remains limited by DIR, segmentation, and anatomical uncertainties. We introduce IMPACT-DoseAcc, an uncertainty-aware dose accumulation framework, within IMPACT for semantic feature-driven image analysis. The framework is modality- and disease-agnostic and is applied to CBCT-guided oART for cervical cancer (LACC). Material and Methods: Nine LACC patients were retrospectively analyzed using daily CBCT-derived virtual CTs for dose recalculation. IMPACT-DoseAcc focuses on uncertainty from DIR, without modeling vCT-generation uncertainty. Two DIR uncertainty strategies were tested within IMPACT-Reg: a Bayesian segmentation-guided approach using one probabilistic model to quantify anatomical uncertainty, and an ensemble of segmentation models targeting structures to capture epistemic variability. Voxel-wise uncertainty maps were propagated through dose warping and accumulation to generate probabilistic dose-volume histograms. Ensemble uncertainty was quantified from voxel-wise standard deviation across deformation fields, and geometric error was assessed using surface distance between warped and validated contours. Anatomical-variability weighting refined aggregation. Results: Ensemble DIR uncertainty correlated with geometric error, with Pearson coefficients of 0.63 for CTVt and 0.66 for bladder. For CTVt, pDVHs achieved 96.3 +/- 3.9% coverage, showing calibration of propagated uncertainty. Weighting stabilized estimates across fractions and organs. Conclusions: IMPACT-DoseAcc propagates registration-driven uncertainty to cumulative dose metrics, improving interpretation of accumulated dose under anatomical variations. Its 3DSlicer integration supports reproducible, uncertainty-informed ART workflows. Comments: Under revision Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2606.11012 [cs.CV] (or arXiv:2606.11012v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2606.11012 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Cédric Hémon Dr [view email] [v1] Tue, 9 Jun 2026 15:52:58 UTC (9,451 KB)

[CV-14] IPSM-Bench: A New Intermediate Phase Segmentation Benchmark in Microstructure Images of Zinc-Based Absorbable Biomaterials IJCAI2026

链接: https://arxiv.org/abs/2606.11001
作者: Jinglin Xu,Shangyan Zhao,Jiabo Wang,Xinghong Mu,Yulong Lei,Jiacheng Zhang,Hongbo Sun,Yageng Li
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IJCAI 2026

点击查看摘要

Abstract:Zinc-based alloys are indispensable emerging absorbable metallic biomaterials, and their macroscopic performance is governed by microstructural characteristics. Intermediate phases-key microstructural constituents-are pivotal in regulating mechanical and functional properties. However, intermediate phase segmentation in zinc alloy microstructures faces formidable challenges: scarce annotated datasets, low contrast, difficulty detecting small targets, and heterogeneous morphologies. To this end, we construct IPSM-Bench, the largest high-quality dataset for zinc-alloy intermediate phase segmentation. Furthermore, we propose SCoP-SAM, a new Spatial Context Prior-guided SAM method that leverages the gradient structure and grayscale properties of intermediate phases to capture spatial context priors and incorporates them into the entire SAM encoding-decoding process, improving segmentation performance. Based on the proposed IPSM-Bench, we establish a new benchmark for intermediate phase segmentation to systematically evaluate state-of-the-art (SOTA) methods and advance research on zinc alloy microstructure analysis. Extensive experiments on IPSM-Bench and additional public alloy benchmarks demonstrate that our SCoP-SAM not only achieves SOTA performance for zinc-alloy intermediate phase segmentation but also generalizes remarkably well to other alloy scenarios.

[CV-15] AnimaSpark: A Feed-Forward Method for Animating Arbitrary 3D Objects

链接: https://arxiv.org/abs/2606.10988
作者: Yiming Zhao,Haoyu Sun,Aoyu Wang
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:

点击查看摘要

Abstract:While recent advancements in generative AI have substantially accelerated static 3D model creation workflows, the synthesis of category-agnostic 3D animations remains a significant bottleneck in 3D asset production. Current methods for category-agnostic animation generation exhibit critical limitations in inference speed, motion quality, and adherence to textual prompts, thereby leaving the process dependent on labor-intensive manual artistry. To address these challenges, this paper introduces AnimaSpark, a novel pipeline for category-agnostic 3D animation generation. Our approach is motivated by the key insight that for many fundamental motions in the 3D world, the corresponding joint transformations can often be effectively modeled within a two-dimensional subspace. The pipeline begins by rendering a rigged static 3D model into multi-layered image representations of its mesh and skeleton, which are subsequently fed into a video generation model. We then employ a keypoint tracking algorithm on the generated video to capture the motion of the skeletal joints projected onto the camera’s viewing plane. In the final stage, we distill the planar translations and rotations from these tracked keypoints and lift them from the 2D domain into 3D space to animate the character. Comprehensive evaluations reveal that our method achieves superior performance over existing state-of-the-art techniques across key metrics, including text-motion alignment, quality of motion, and computational efficiency.

[CV-16] Quo Vadis Visual In-Context Learning? A Unified Benchmark Across Domains and Tasks

链接: https://arxiv.org/abs/2606.10967
作者: Pradnya Halady,Jiale Wei,Zdravko Marinov,Alexander Jaus,Simon Reiß
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visual in-context learning has been proposed as a pathway towards dynamic models that can generate predictions based on a provided context and thereby can adapt to new vision tasks at test-time. Yet, the evaluation of the adaptation capabilities of these models has been limited to narrow setups that mainly mirror tasks or image domains from pre-training for which real adaptation is not required. We address this gap by constructing a broad Visual In-Context BEnchmark (VIBE) with a focus on diverse imaging domains and a wide range of tasks. With this, we are able to get a much clearer picture of the adaptive capabilities of visual in-context models when faced with new image- and task distributions. We stress test six models on 14 datasets and 12 tasks (in total, we explore 106 dataset-task combinations) and compare them under a unified, reproducible evaluation protocol, in an one-shot setting. Our evaluation uncovers key insights on the state of visual in-context learning, including limitations, systematic failure modes and promising directions. To foster broader evaluation, we will openly release our VIBE toolkit.

[CV-17] Architect-Ant: Editable Automatic Furnishing of Architectural Floor Plans

链接: https://arxiv.org/abs/2606.10953
作者: Fedor Rodionov,Aleksandar Cvejic,Michael Birsak,John Femiani,Peter Wonka
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 10 figures

点击查看摘要

Abstract:Furnished floor plans are fundamental to real estate visualization, interior design, and architectural workflows. However, progress in automatic furniture arrangement has been limited by the lack of real, professionally designed floor-plan datasets with object-level furniture annotations. To address this gap, we introduce AntPlan-270, a curated dataset of 270 architectural floor plans with per-room furniture bounding box annotations across ten residential room categories. Building on this dataset, we present Architect-Ant, an editable automatic furnishing framework powered by a fine-tuned vision-language model. Furniture layouts are represented using a compact, coordinate-based domain-specific language (DSL) that encodes object categories and placements relative to the room geometry. To improve spatial reasoning, we generate procedural reasoning traces that capture architectural constraints such as wall alignment, door and window clearance, circulation, fixture compatibility, and room-specific furniture inventories, and use them to supervise fine-tuning of the model. We then apply preference optimization over candidate object placements to further refine layout quality. The generated DSL can be rasterized into semantic masks and used to condition a Flux-based LoRA renderer, producing realistic blueprint-style furnished floor-plan images while preserving the editable symbolic layout. Experiments on layout furnishing show that Architect-Ant produces geometrically valid and functionally plausible layouts, and suggest a scalable path for furnishing larger structure-only floor-plan datasets.

[CV-18] Democratising Camera Trap AI: An Open-Source Model for Detecting UK Mammals

链接: https://arxiv.org/abs/2606.10940
作者: Paul Fergus,Philip Stephens,Russell A. Hill,Lee Oliver,Katie Appleby,Sarah Beatham,Naomi Davies Walsh,Stuart Nixon,Naomi Matthews,Chris Sutherland,Kelly Hitchcock
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 15 Pages, 4 Figures

点击查看摘要

Abstract:Camera traps have become a cornerstone of biodiversity monitoring, but the artificial intelligence that turns vast quantities of images into usable ecological data is often locked behind commercial platforms or trained on fauna that does not match that of the British Isles. In an attempt to remove barriers and increase uptake, we release an open-source object detection model for 31 classes, 28 common UK mammal and bird species, plus utility classes for humans, calibration poles, and vehicles, drawn from a curated dataset of 48,165 labelled instances assembled from multiple sites over a decade of operational deployment through Conservation AI and its successor, Trap Tracker. The model, a YOLO26x detector trained and tested on an 80/10/10 class-stratified split, achieves a mean Average Precision of 0.984 at Intersection over Union (IoU) of 0.5 (0.956 at IoU 0.5-0.95) on the held-out validation set, with precision 0.988 and recall 0.965. On an unseen held-out test split, mean per-species confidence ranged from 0.96 to 0.99 across the 31 classes, with a 0.17% false-negative rate concentrated in difficult night-time, distant, or occluded images. These metrics are from data from the same pool of sites and cameras as training, so performance at entirely new sites is left to future work. We release the trained weights in ONNX format under a non-commercial licence, with local desktop and real-time camera support, aimed explicitly at ecologists with no machine-learning experience. This release is a deliberate counterweight to the multiple paid for models that have developed over the last decade.

[CV-19] PENet: A Lightweight Residual Transformer Framework for Efficient Image Steganalysis

链接: https://arxiv.org/abs/2606.10939
作者: Jincheol AN,Dongsu Kim,Haneol Jang,YoungJoon Yoo
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: IEEE ACCESS

点击查看摘要

Abstract:Image steganalysis, the detection of hidden information embedded in digital images, is a core component of modern cybersecurity and digital forensics. Recent residual Transformer architectures, such as the Pixel-Difference-Convolution and Enhanced-Transformer-Network (PENet) [1], achieve strong detection accuracy, but their computational and memory demands hinder deployment in resource-constrained settings. We present PENet+, a lightweight steganalysis framework that preserves PENet’s discriminative structure while substantially improving efficiency. Rather than redesigning or compressing the attention blocks, we retain PENet’s self-attention topology for reproducibility and add a classifier-streamlining stage that progressively narrows the SPP-to-FC1 input channels (SPP: spatial pyramid pooling; FC1: first fully connected layer), yielding large reductions in parameters and FLOPs with negligible accuracy loss. We further refine the high-pass-filter (HPF) stem with an activation-aware mechanism that aggregates HPF responses early and selects a balanced SRM-Gabor top-K subset, and we replace PENet’s backbone with a MobileNetV2-style inverted residual network. A balanced configuration with K=31 filters (16 Gabor + 15 SRM) matches or surpasses heavier settings at lower compute. Finally, we motivate PReLU from a steganalysis standpoint, arguing that preserving negative responses helps capture weak stego cues that ReLU suppresses. On a disjoint ALASKA2 JPEG QF90 protocol at 512x512 resolution (5,000 cover images for training, validation, and internal testing; a separate 19,000-cover evaluation set), PENet+ achieves up to 45.5% fewer parameters and about 97% fewer FLOPs than the re-evaluated PENet baseline, offering a computationally efficient direction for resource-constrained steganalysis. Device-level latency and power measurements remain future work.

[CV-20] Beyond Model Size: Probing the Gaps in Visual in-Context Learning by Training a Tiny Model

链接: https://arxiv.org/abs/2606.10905
作者: Sunil Khatri,Steven Landgraf,Markus Ulrich,Simon Reiß
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visual in-Context Learning (VICL) aims at making progress towards adaptive vision models, that can – based on a few examples – adapt to a new task at test-time. With the history of in-context learning in natural language processing research, where large, parameter-heavy models are in use, one pathway that current VICL methods take is model- and data-scaling as key ingredients. Yet, it is not clear, whether these ingredients are the key for in-context learning to take shape in vision models. To stress-test such large models, we challenge them with an extreme counterexample: we train a tiny visual in-context model with merely 1 million parameters and a modest amount of 70,000 images. We compare the results of this severely capacity capped tiny model to 7,000\times larger VICL models in different adaptive settings, (1) on image data with small distribution shifts, (2) on unseen task encodings and (3) on a completely new task, i.e., the setting VICL envisions. With the chasm of training resources between the tiny- and large models, our experiments showcase a lack in how adaptive capabilities are measured, with respect to how tasks are encoded, which tasks were used in pre-training and the choice of metrics. These gaps in current VICL benchmarking underscore a need for innovation in evaluation of adaptive capabilities.

[CV-21] Pose-ICL: 3D-Aware In-Context Learning for Pose-Controllable Subject Customization

链接: https://arxiv.org/abs/2606.10902
作者: Xuan Han,Yihao Zhao,Mingyu You
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Subject Customization is a foundational task in modern image generation. By providing a few reference images and a text prompt, users can generate images of a specific object in any desired scene. However, existing methods still struggle to achieve effective pose control for customized subjects. In practice, they often exhibit inaccurate poses or inconsistent cross-pose appearances. These limitations suggest that understanding objects in a volumetric manner remains a significant challenge for 2D-native backbones. To address this challenge, we propose Pose-ICL, a tuning-free framework that leverages 3D-aware In-Context Learning (ICL) to directly adapt to new subjects through multiple paired image-pose references. Its core mechanism,Surface-Anchored Position Embedding (SAPE), equips the model with explicit 3D awareness by anchoring image tokens to the surface coordinates of a volumetric bounding box. Dedicated refinements ensure its seamless compatibility with existing DiT models. Extensive evaluations on both 3D assets and real-world subjects demonstrate that Pose-ICL significantly outperforms current methods in both pose accuracy and identity consistency.

[CV-22] he 1st PortraitCraft Challenge: A CVPR 2026 Workshop Competition on Portrait Composition Understanding and Generation

链接: https://arxiv.org/abs/2606.10894
作者: Zijie Lou,Youyun Tang,Xiaochao Qu,Haoxiang Li,Ting Liu,Luoqi Liu,Xun Zhu,Zheng Zhang,Xi Chen,Miao Li,Ji Wu,Dizhe Zhang,Xian Ge,Sujia Wang,Ruiyang Zhang,Jiaming Wang,Xianshun Wang,Lu Qi,Boao Kang,Wei Zhou,Jinghui Sun,Zhenyu Yan,Jiliang Zhao,Rui Yang,Yipo Huang,Boyuan Liu,Shanglin Li,Zifan Xie,Yichen Zhang,Anlan Wang,Wenfeng Lin,Mingyu Guo,Dong Li,Xinghao Wang,Yanting Li,Shanzhao Tong,Shuai He,Qiu Zhou,Yongqi Yang,Taoyang Mu,Dianqiao Lei,Anlong Ming,Huadong Ma
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper presents an overview of the inaugural PortraitCraft Challenge, held as one of the official competitions at CVPR 2026. The challenge focuses on portrait composition understanding and generation, aiming to advance AI research in portrait aesthetics analysis and controllable image synthesis. Unlike existing datasets and tasks that primarily focus on global aesthetic scoring, PortraitCraft introduces a unified evaluation framework comprising two complementary tracks. Track 1 requires models to perform structured portrait composition understanding, and Track 2 requires models to generate portrait images from structured composition descriptions under explicit compositional constraints. To support the challenge, we constructed and publicly released a large-scale portrait composition dataset consisting of approximately 50,000 curated real portrait images, providing multi-level supervision. This report describes the challenge setup, evaluation protocols, dataset composition, and final results, along with an analysis of the technical characteristics of the submitted solutions. The PortraitCraft Challenge provides a standardized and reproducible platform for research on portrait composition understanding and generation, and is expected to foster further progress in the fields of portrait aesthetics and controllable image generation.

[CV-23] Improving Text-Instance Alignment Of Foreground Conditioned Out-Painting Via Customized Concept Embedding

链接: https://arxiv.org/abs/2606.10892
作者: Yihao Zhao,Xuan Han,Bin He,Mingyu You
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:To showcase products, merchants often incur substantial costs creating high-quality display images. Foreground Conditioned Outpainting (FCO) meets this demand, allowing users to create desired backgrounds for foreground instances at a low cost by adjusting the text prompt. However, existing text-driven FCO methods exhibit critical flaws in their outputs, most notably the presence of artifacts, which refer to regions in the synthesized background that share the same semantics as the foreground instance. Such artifacts diminish the object’s prominence and degrade image quality. We attribute the issue to the misalignment between the given instance and text-derived concept embeddings. To address this, we propose the Customized Concept Embedding Diffusion (CCE-Diffusion) framework. Its core is a CCE-Module to customize concept embeddings, bridging the gap between generic noun semantics and a specific visual instance. An Instance-Aware Loss guides the module’s optimization, while a Semantic-Preserving Prompt Template prevents customized embeddings from distorting other words in the prompt. Both qualitative and quantitative evaluations demonstrate that CCE-Diffusion significantly reduces artifacts in the outputs. As a plug-and-play component, the CCE-Module can integrate with various FCO methods, enhancing their performance.

[CV-24] Listen Look and Learn: Learning Without Forgetting through SAM-Audio

链接: https://arxiv.org/abs/2606.10887
作者: Avi Gupta,Nilotpal Sinha,Vishnu Raj,Sambuddha Saha,Pratik Joshi,Koteswar Rao Jerripothula,Tammam Tillo
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Class-Incremental Learning (CIL) aims to continuously learn new classes without forgetting previously acquired knowledge. While recent CIL advances have spurred significant interest across various modalities, the audio-visual setting remains underexplored. Furthermore, although foundational multimodal models like SAM-Audio encapsulate rich static priors, our empirical analysis reveals that these representations struggle in incremental settings. This work bridges this gap by integrating SAM-Audio’s audio-visual priors into the CIL setting. Specifically, we leverage its dense audio and visual representations and employ a novel guided attention strategy where the audio features contextually guide the visual representations. To further mitigate catastrophic forgetting, we introduce dual-level distillation objectives at both the feature and logit levels. Extensive evaluations on audio-visual CIL benchmarks demonstrate that our approach consistently outperforms state-of-the-art methods.

[CV-25] XtrAIn: Training-Guided Occlusion for Feature Attribution

链接: https://arxiv.org/abs/2606.10877
作者: Thodoris Lymperopoulos,Ioannis Kakogeorgiou,Denia Kanellopoulou
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 7 figures, 1 table

点击查看摘要

Abstract:Occlusion-based attribution methods provide an intuitive way to estimate feature importance by perturbing input features and measuring the resulting change in model output. However, their reliability is strongly affected by how feature removal is implemented: externally selected baselines can introduce bias, out-of-distribution samples, and unstable explanations, while in nonlinear models the occlusion of a set of features can also alter the contribution of non-occluded features. We refer to this effect as attribution shift, as the attribution scores of the non-occluded features drift from their initial values. To challenge these major issues that render explanations unstable, we introduce XtrAIn, a training-guided attribution method that transfers the occlusion operation from the input space to the parameter space. Instead of replacing input values with hand-crafted baselines, XtrAIn follows the model’s training trajectory and measures how feature-associated parameter updates affect the output logits. We further introduce Xstep, a lightweight approximation for reducing computational cost, and XtrAIn+, a target-focused variant that emphasizes updates aligned with the target class. Experiments on controlled image datasets and PAM50 breast-cancer subtype classification show that the proposed methods produce cleaner and more interpretable attribution patterns than standard attribution baselines. Overall, XtrAIn provides a training-aware perspective on feature attribution and offers a useful diagnostic tool for studying how feature-level evidence is formed during training.

[CV-26] Advancing Wood Identification in the Philippines: Utilizing the Xylorix Platform for Efficient AI Model Development and Deployment for Five Key Species

链接: https://arxiv.org/abs/2606.10876
作者: Rosalie C. Mendoza,Vivian C. Daracan,Arlene D. Romano,Ronniel D. Manalo,Xin Jie Tang,Yi Hong Wong,Yong Haur Tay
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Illegal logging and timber trade continue to pose significant challenges in the Philippines, where accurate wood species identification is essential for enforcement but limited by the need for specialised equipment and expertise. This study aims to evaluate whether AI models for macroscopic wood identification can be developed and deployed by wood scientists without programming expertise using the Xylorix platform, focusing on five Philippine hardwood species: Mangium (Acacia mangium Willd.), Rain Tree [Samanea saman (Jacq.) Merr.], Banuyo (Wallaceodendron celebicum Koord.), Tindalo [Afzelia rhomboidea (Blanco) Vidal], and Ipil [Intsia bijuga (Colebr.) O. Kuntze]. Binary classifiers were trained on 10,663 verified cross-section images from 260 specimens and evaluated using specimen-level mean scoring to mirror operational field conditions. Area Under the ROC Curve (AUC) values ranged from 0.969 (Ipil) to 1.000 (Mangium), and Average Precision (AP) values ranged from 0.589 (Samanea) to 1.000 (Mangium). Four of five species achieved AA grade (AUC and AP both \geq 0.90); Rain Tree received AE (AUC \geq 0.90, AP 0.60) due to AP compression from its small positive test set (3 specimens). All five classifiers rank their target specimens above non-target specimens with near-perfect fidelity. Specimen-level error analysis revealed 9 false negatives from Ipil, primarily stemming from localized image artifacts and 3 false positives for Rain Tree and 1 false positive for Tindalo caused by shared tribal-level anatomical traits. These findings demonstrate that Xylorix non-programmers can leverage the Xylorix platform to construct operationally reliable wood identification models suitable for field deployment at supply chain checkpoints.

[CV-27] Schmidt Decomposition-Based Methods for Efficient Quantum Image Encoding

链接: https://arxiv.org/abs/2606.10874
作者: Ana-Maria Pangeva,Yassine Ferhi,Alexander Geng,Andreas Weinmann,Desislava Ivanova,Ali Moghiseh
类目: Computer Vision and Pattern Recognition (cs.CV); Quantum Algebra (math.QA); Quantum Physics (quant-ph)
备注:

点击查看摘要

Abstract:In quantum image processing, a fundamental step is encoding classical image data into quantum states. This can be achieved using methods such as Flexible Representation of Quantum Images (FRQI), Quantum Probability Image Encoding (QPIE), and Novel Enhanced Quantum Representation (NEQR). However, on real quantum hardware, these encodings can quickly lead to circuits with many gates, large circuit depth, and high qubit usage, which is a problem for Noisy Intermediate-Scale Quantum (NISQ) devices. In this work, we investigate whether low-rank state approximation, formulated via Schmidt decomposition, can help reduce this complexity. The method keeps only the most significant parts of a quantum state’s entanglement structure, making state preparation more efficient while preserving most of the image information. We compare the three encoding techniques in their original form and with low-rank approximation, evaluating metrics such as circuit depth, CNOT count, MSE, and visual quality of reconstructed images. The results reveal meaningful trade-offs between accuracy and resource efficiency, with the FRQI model achieving a 97 percent reduction in circuit depth while maintaining a near-perfect reconstruction (MSE of about 0.27). This demonstrates the potential of low-rank techniques for advancing practical quantum image processing on near-term hardware.

[CV-28] LIBERO-Occ: Evaluating and Improving Vision-Language-Action Models under Scene-Induced Occlusion via Viewpoint Imagination

链接: https://arxiv.org/abs/2606.10862
作者: Taishan Li,Jiwen Zhang,Siyuan Wang,Xuanjing Huang,Zhongyu Wei
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 14 pages, 7 figures

点击查看摘要

Abstract:Vision-Language-Action (VLA) models achieve strong performance on standard manipulation benchmarks, but most evaluations assume that task-relevant objects are fully visible. This assumption often fails in realistic settings, where occlusion makes manipulation partially observable. In this paper, we study \textitscene-induced occlusion as a fundamental challenge for VLA models and introduce \textbfLIBERO-Occ, an occlusion-oriented extension of LIBERO. Experiments show that state-of-the-art VLAs suffer substantial performance degradation under occlusion. To address this issue, we propose \textbfViewpoint Imagination (VIM), which generates a complementary view from an occluded primary observation and conditions action prediction on both observed and imagined evidence. VIM improves robustness across task suites, occlusion types, and severity levels without requiring additional cameras at deployment time, suggesting that viewpoint imagination is an promising mechanism for perception completion in partially observable manipulation. Our benchmark and corresponding code are available at: \hrefthis https URLthis https URL.

[CV-29] HarmoView: Harmonizing Multi-View Constraints for Identity-Consistent Video Generation

链接: https://arxiv.org/abs/2606.10839
作者: Cong Wang,Zhentao Yu,Hongmei Wang,Weicong Liang,Zixiang Zhou,Zilin Yang,Jiarong Ou,Rui Chen,Yuan Zhou,Qinglin Lu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Current identity-consistent video generation methods struggle to preserve appearance fidelity under large viewpoint changes. While introducing multi-view reference input offers a natural solution, progress remains constrained by the lack of effective frameworks for multi-view inputs and the scarcity of multi-view data. We address these challenges by proposing HarmoView, a robust framework for identity-consistent video generation that effectively integrates multi-view cues through three architectural refinements complemented by a staged training curriculum. Specifically, we first introduce Multi-level Feature Injection to anchor identity fidelity; by injecting raw ViT features from frontal references alongside text tokens via cross-attention, MFI provides persistent low-level appearance anchors that complement the high-level identity features within DiT blocks, leading to enhanced identity preservation. Then, we employ learnable proxy tokens to unify heterogeneous reference layouts across single-/multi-view settings while simultaneously resolving the reference-view mismatch problem. Jump-RoPE is further developed for identity-wise feature isolation to reduce identity crosstalk. To activate these structural capabilities while preserving the original generative priors, we propose the Progressive View Curriculum. This four-stage training strategy employs view dropout to facilitate a stable transition from vanilla T2V generation to high-fidelity, identity-persistent spatial reasoning. Furthermore, we construct a large-scale multi-view dataset to address the issue of data scarcity. Extensive evaluation on our multi-view benchmark, comprising 100 manually-curated cases spanning 52 unique identities, demonstrates that HarmoView significantly outperforms open-source baselines and matches leading closed-source engines, achieving state-of-the-art performance in identity-consistent video generation.

[CV-30] Earth-OneVision: Extending Remote Sensing Multimodal Large Language Models to More Sensor Modalities and Tasks

链接: https://arxiv.org/abs/2606.10819
作者: Miaoxin Cai,Guanqun Wang,Wei Zhang,Guangyao Zhou,Yin Zhuang,Tong Zhang,Hao Wang,He Chen,Jun Li
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:RS-MLLMs enable natural-language understanding and spatial reasoning over earth observation imagery. However, existing models support only a narrow range of sensor types and tasks, yielding a fragmented view of the earth and leaving cross-modal geoscientific knowledge largely unexploited. This work presents Earth-OneVision, a 2B RS-MLLM that unifies six sensor modalities (i.e., optical, SAR, infrared, multispectral, temporal, and video) and cross-sensor fusion across 9 task categories within a single autoregressive framework. Three dedicated mechanisms address three bottlenecks. Full-Granularity Vision-Language Alignment (FGVLA) aligns multi-level visual features with the multi-dimensional language space. Spatial-Linguistic Isomorphic Serialization (SLIS) unifies heterogeneous spatial outputs as autoregressive tokens. Progressive Cross-Modality Adaptation (PCMA) decomposes the compound domain gap into sequential stages, tackling the viewpoint and imaging physics gaps in turn. To support joint training, MMRS-OneVision is constructed with ~34M QA pairs spanning all six sensor modalities and cross-sensor fusion across 9 task categories, substantially exceeding existing RS multimodal instruction datasets. With only 2B parameters, Earth-OneVision achieves competitive or state-of-the-art results across extensive benchmarks, consistently matching or outperforming 4B-72B RS-MLLMs. It achieves 87.52% P@0.5 on the OPT-RSVG testset for optical visual grounding and 80.68% on the SAR VQA benchmark SARLANG-Bench, exceeding 7B models by over 7%. It further achieves 75.74% recall on the BigEarthNet-MS testset for multispectral classification, and 81.94% MCQ accuracy on EarthMind-Bench for cross-modality reasoning.

[CV-31] IMPACT: Learning Internal-Model Predictive Control for Forceful Robotic Manipulation

链接: https://arxiv.org/abs/2606.10818
作者: Jiawei Gao,Chaoqi Liu,Peilin Wu,Haonan Chen,Yilun Du
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Project website: this https URL

点击查看摘要

Abstract:Real-world robotic manipulation tasks often involve forceful interactions with the environment, such as using tools of varying weights, transporting objects with different masses, and performing contact-rich tasks like table wiping. Previous learning-based approaches typically employ imitation learning policies that output target end-effector poses tracked by low-level impedance controllers. In these systems, forceful interactions are either implicitly realized through steady-state tracking errors or explicitly commanded using wrist force/torque or tactile sensors. However, implicit approaches generalize poorly across object weights, while explicit approaches require specialized hardware and increase system complexity. In this work, we propose IMPACT, a framework that decouples these forceful tasks into task-planning and internal-model-based predictive control. Extensive simulation and real-world experiments demonstrate that the proposed framework achieves higher success rates and improved generalization to unseen object weights, as well as better safety and energy efficiency.

[CV-32] Deep learning for echo sounder data

链接: https://arxiv.org/abs/2606.10811
作者: Ketil Malde
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:There is no doubt that over the last decade, techniques from the field of machine learning have revolutionized how we process and interpret data, especially images and text. For underwater observations acoustics is a primary source of information, and naturally, deep learning methods have been applied to echograms and other acoustics data, but so far with rather modest results. Here, we argue that due to intrinsic properties of acoustic data, substantial advances will likely require research into deep learning methods beyond mere recycling of models and techniques from image processing. Currently, the potential for breakthroughs in method development is hindered by the lack of standard data formats and organization, and even more by the lack of readily available, high quality data sets with established performance goals. To advance the field, these shortcomings should be remedied

[CV-33] SCAIL-2: Unifying Controlled Character Animation with End-to-end In-Context Conditioning

链接: https://arxiv.org/abs/2606.10804
作者: Wenhao Yan,Fengjia Guo,Zhuoyi Yang,Jie Tang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Controlled character animation requires transferring motion from a driving sequence to a reference character. Prior works heavily rely on intermediate representations, including pose skeletons to represent motion or masked background to represent environment, which inevitably leads to information loss. To address this, we present SCAIL-2, an framework that bypasses those intermediates and achieves \textbfend-to-end character animation. By directly concatenating driving videos to the sequence, the model can obtain all the required visual information from the input video. To address lack of end-to-end data, we unify sub-tasks of character animation with decoupled conditions and then curate a pipeline to synthesize MotionPair-60K, an end-to-end motion transfer dataset containing heterogeneous tasks of character animation. To archive the unification, we utilize in-context mask conditioning and mode-specific RoPE as soft guidance beyond textual instructions and raw visual information. To address synthetic discrepancy in detailed regions, we propose Bias-Aware DPO to construct preference items to mitigate the errors. Extensive experiments demonstrate that our method substantially outperforms existing state-of-the-art approaches in various character animation tasks. A large subset of synthetic data as well as model weights will be released at our project page: this https URL.

[CV-34] A Multimodal RGB and Events Dataset for Hand Detection in First-Person View

链接: https://arxiv.org/abs/2606.10790
作者: Bharghav Kota(1),Yulia Sandamirskaya(1) ((1) Zurich University of Applied Sciences, Wädenswil, Switzerland)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Existing hand detection algorithms work on images and the detection rate is restricted by the frame rate of the camera. In hand detection applications for moving robotic systems, conventional cameras cause motion blur, especially in darker lighting conditions. We can leverage the use of event-based cameras which possess a high dynamic range, high temporal resolution, and low power consumption. Recent work has shown that using a stereo setup of an event-based and a frame-based camera improves detection accuracy and the bandwidth-latency tradeoff. The main bottleneck in using event-based cameras in object detection and recognition tasks is a relatively low amount of training data. In this work, we propose a methodology and an exemplary synthetic event-based hand dataset from an egocentric, first-person view perspective. The data is synthesized from the existing RGB Egohands dataset with the v2e toolbox. Parameters of the v2e toolbox are varied to provide versions of the dataset with different lighting conditions and scales. Ground truth detections are generated with a fine-tuned YOLOv8 model which is applied to the RGB images in the Egohands dataset and interpolated on the high-temporal resolution events. We use the multi-modal dataset to perform hand detection with existing object detection algorithms which use a multi-modal setup of event and RGB cameras and demonstrate performance comparable to the state-of-the-art.

[CV-35] From Patches to Patients: A study of the tile-to-slide performance transferability in Digital Pathology MICCAI2026

链接: https://arxiv.org/abs/2606.10778
作者: Sofiène Boutaj,Leo Fillioux,Maria Vakalopoulou,Stergios Christodoulidis,Pierre Marza
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to MICCAI 2026

点击查看摘要

Abstract:Foundation Models (FMs) have recently redefined the state-of-the-art in histopathology by providing robust representations for whole-slide image (WSI) analysis. However, selecting the optimal foundation model (FM) for a specific clinical cohort currently requires multiple preprocessing steps, followed by computationally expensive feature extraction and the training of a Multiple Instance Learning (MIL) aggregator for every model. In this work, we investigate whether efficient tile-level linear probing can serve as a reliable proxy for slide-level performance, reducing the need to run full slide-level pipelines for every candidate encoder. We benchmark 19 state-of-the-art FMs on 42 slide-level and 16 tile-level tasks, comparing tile probing metrics against slide-level outcomes using ABMIL and Mean Pooling aggregations. We observe a high correlation between tile and slide performance across varying task difficulties, indicating that encoder representation quality is the primary determinant of WSI success. Sensitivity analyses show that transferability is stable across models and is more influenced by cohort sizes and numbers of tiles per slide than by average task difficulty. We also measure the agreement in best performing models between tile and slide-level tasks, showing tile benchmarks reliably shortlist strong candidates. Overall, our study indicates that tile-level benchmarking provides an efficient and practical first step for narrowing down candidate models, while slide-level evaluation remains essential for final validation on clinical tasks.

[CV-36] Spatially Selective Self-Training for Unsupervised Building Change Detection

链接: https://arxiv.org/abs/2606.10775
作者: Wafaa I. M. Hussin,Zhi Lu,Anas M. I. Mohammed,Xiang Zhou,Ratiba A. H. Abubaker,Zhenming Peng
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under Review

点击查看摘要

Abstract:Unsupervised building change detection aims to learn building-change masks from unlabeled bi-temporal remote sensing images. Existing label-free methods often follow a discrepancy-to-mask paradigm, directly using temporal differences, frozen foundation-model responses, prompt-based outputs, or post-processing results as final change maps. Although these strategies provide annotation-free cues, they do not learn a task-specific building-change detector and remain vulnerable to the gap between generic temporal discrepancies and building-defined structural changes. In practice, such discrepancies are often noisy and task-irrelevant, as appearance shifts, registration errors, and non-building modifications can produce strong but misleading responses. To address this problem, we propose SST-CD, a spatially selective self-training framework that reformulates fully label-free building change detection as end-to-end detector learning under noisy pseudo supervision. SST-CD uses temporal discrepancies as candidate pseudo labels and trains the detector only on spatially reliable pixels, whose reliability is estimated by a local consistency criterion that filters inconsistent regions from supervision. To further stabilize noisy self-training, a lightweight feature adapter recalibrates bi-temporal features, while a prototype-based decoder produces compact change and no-change representations. Experiments on LEVIR-CD, WHU-CD, and DSIFN-CD show that SST-CD achieves F1 scores of 83.08%, 91.69%, and 86.60%, respectively, outperforming existing unsupervised and label-free baselines. Code will be made publicly available. Comments: Under Review Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2606.10775 [cs.CV] (or arXiv:2606.10775v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2606.10775 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[CV-37] ZODS-RS – Zero-training Oriented Detection Segmentation for Remote Sensing

链接: https://arxiv.org/abs/2606.10769
作者: Zuan Gu,Tianhan Gao,Langxu Zhao
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Remote-sensing and UAV applications need models that generalize across platforms and viewpoints without task-specific training. Yet training-free pipelines often falter on oriented geometry, scale/rotation variation, and crowded ports or airfields, and rarely unify detection and segmentation. We introduce ZODS-RS, a training-free, closed-form pipeline that outputs horizontal boxes (HBB) and instance masks. Built on DINOv3 dense features and SAM-style proposals, ZODS-RS chains: PP (prototype purification via Tyler covariance), R-SEM (rotation-scale equivariant matching with separable kernels and global Hungarian assignment), and UAM (uncertainty-aware pixelwise merging with adaptive priors and optional negative prototypes). A lightweight CWLA fuses multiple DINOv3 layers. On FAIR1M (HBB) we obtain \mathrmmAP_0.50:0.95=\mathbf13.06 and \mathrmAP_S=\mathbf2.93 \emph(class-averaged over ship/airplane); on xView (HBB) we report \mathrmmAP=\mathbf16.69 . On our UAV dataset, ZODS-RS achieves mask \mathrmmIoU=\mathbf31.10 and improves small-object AP by \mathbf+30.70 over Grounded-SAM on a single 5090. This work offers a unified, \emphno-training solution for horizontal-box detection plus instance segmentation in aerial imagery; provides explicit closed-form formulations for PP/R-SEM/UAM tightly coupled with DINOv3; and demonstrates \emphconsistent gains on small and crowded targets and under cross-domain shifts while keeping deployment simple.

[CV-38] DD-INR: Dynamics-Driven Implicit Neural Representation for Accelerated Whole-Brain Functional MRI Reconstruction

链接: https://arxiv.org/abs/2606.10756
作者: Qiaoxin Li(MIND),Caini Pan(NEUROSPIN, MIND),Pierre-Antoine Comby(MIND, BAOBAB),Chaithya Giliyar(MIND),Philippe Ciuciu(MIND)
类目: Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
备注:

点击查看摘要

Abstract:Accelerated acquisition of fMRI enables enhanced detection of neurovascular (BOLD) activity in the brain, but image reconstruction becomes challenging with high k-space undersampling: Task-evoked BOLD signals are small in magnitude, which traditional anatomical MRI reconstruction methods fail to recover, as they favor spatial accuracy over temporal fidelity. We present DD-INR, a Dynamics-Driven Implicit Neural Representation framework tailored for accelerated fMRI that benefits from incoherent time-varying sampling and a tailored spatiotemporal prior, outperforming traditional methods, demonstrated in simulation and in-vivo acquisition, both in terms of image quality and retrieval of activation patterns. DD-INR achieves this by splitting the fMRI data into a static background and a temporally varying dynamic component, representing only the dynamics with a dedicated INR, thereby focusing the model’s capacity on activation-relevant changes while remaining compact. In general, DD-INR provides a promising framework for accelerated fMRI reconstruction, with the potential to improve the sensitivity and robustness of fMRI studies within practical scan time limits. The source code is available at this https URL.

[CV-39] Patient-Level Diagnosis of Acute Myeloid Leukemia via Deep Learning Analysis of Bone Marrow Smear

链接: https://arxiv.org/abs/2606.10735
作者: Yuqi Ma,Tianyi Wang,Weihua Meng,Hongru Chen,Fajin Tao,Qunxian Lu,Lin An,Xiaodong Mo,Gen Yang
类目: Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
备注: 4 figures

点击查看摘要

Abstract:Bone marrow smear review remains important for acute myeloid leukemia (AML) assessment, but manual single-cell interpretation is labor-intensive and patient-level diagnosis requires aggregation of many cellular observations. We present a cell-to-patient deep learning pipeline for AML-assisted diagnosis from bone marrow smear images. The study included 258 patients from six anonymized centers, including a main cohort of 169 patients from Centers 1-3 and an external validation cohort of 89 patients from Centers 4-6. A 16-category cell annotation vocabulary was used to describe the global cellular composition, including granulocytic, monocytic, erythroid, lymphoid, eosinophilic, and other cells. Rather than identifying strict AML blasts or leukemic blasts, the model targets an expert-defined composite category termed Composite Blast-like Cells (CBLC), comprising N, N1, M, M1, R, R1, J, and J1 according to the project-wide morphological standard. A fixed YOLO-based segmentation module detected cells, predicted contours were matched to expert polygon annotations by contour IoU, and standardized single-cell crops were generated. An EfficientNet-B0 classifier was trained through a two-stage GT-to-YOLO and YOLO-to-YOLO strategy with class-imbalance correction, center-border regularization, and morphology-assisted supervision. Cell-level predictions were aggregated into patient-level CBLC ratios for AML-oriented diagnostic support. The pipeline achieved stable internal validation and maintained external generalization, with ensemble weighted F1-scores of 0.9076, 0.8696, and 0.9124 on Centers 4, 5, and 6, respectively.

[CV-40] Vector Map as Language: Toward Unified Remote Sensing Vector Mapping

链接: https://arxiv.org/abs/2606.10701
作者: Yinglong Yan,Yunkai Yang,Haoyi Wang,Wei Fu,Linshan Wu,Honghu Pan,Shaobo Xia,Shanghang Zhang,Hao Chen,Leyuan Fang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Remote sensing vector mapping aims to generate structured maps of geospatial entities, such as buildings, roads, and water bodies, from remote sensing imagery. In practice, vector maps usually contain multiple category layers and heterogeneous entity structures, requiring a unified model for diverse mapping needs. However, existing methods typically represent vector objects as polygons or graphs, making them suitable only for specific categories: polygons poorly capture topological relations, while graphs often blur instance boundaries. We observe that language, as a natural medium for human communication, offers a flexible and expressive representation that can accommodate heterogeneous map elements, including geometry, semantics, and topolog. Motivated by this insight, we propose Vector Map as Language (VecLang), a unified paradigm that reformulates multiclass vector mapping as structured text generation. VecLang encodes the common elements of different geospatial entities into a GeoJSON-like vector language, enabling cross-category modeling within a shared textual format. To generate this language reliably, we design a progressive vision-language mapping framework that first localizes vectorization units and then generates structured map elements. We further introduce Hierarchical Vector Language Optimization, which uses reinforcement learning to improve syntax validity, content fidelity, and map executability. We also build VecMap-Bench with 54K images and 800K instances, supporting training and evaluation across standard and generalization settings. Extensive experiments demonstrate that VecLang handles both single-class and multiclass vector mapping while achieving strong cross-dataset and open-vocabulary generalization. The model and dataset are publicly available at this https URL.

[CV-41] Using the YOLOv12 Model for Verifying the Correct Color Sequence of Wires in Network Cables (Patch Cords) on the Production Line

链接: https://arxiv.org/abs/2606.10699
作者: Amin Doroodchi,Danial Soleimany
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In the production process of network cables, ensuring the correct color sequence of wire pairs inside the standard connector plays a critical role in the final performance of the cable, as any misplacement or color-ordering error can lead to defective products and impose significant costs. Traditional inspection methods based on visual examination through digital microscopes are typically time-consuming, tedious, and prone to human error. In this study, an intelligent system based on the twelfth version of the YOLO1 object detection model was developed to identify the position and verify the correct color sequence of wires in patch cords. The dataset used consisted of 2,500 images captured from microscopic views of network connectors, which were divided into 70% for training, 15% for validation, and 15% for testing. The proposed model, leveraging a single-stage architecture and attention mechanisms during learning, achieved highly accurate wire detection with approximately 98% precision. Additionally, the overall mean accuracy, classification precision, and recall were around 95%, 99%, and 98%, respectively. The results demonstrate that this system can reliably and in real time verify the correctness of wire color sequencing on the production line without the need for human intervention, thereby reducing human error and enhancing efficiency in the manufacturing process.

[CV-42] Dont waste SAM

链接: https://arxiv.org/abs/2606.10696
作者: Nermeen Abou Baker,Uwe Handmann
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Published at European Symposium on Artificial Neural Networks (ESANN2023), Computational Intelligence and Machine Learning. Bruges (Belgium)

点击查看摘要

Abstract:Meta AI has recently released the Segment Anything Model (SAM), which demonstrates exceptional zero-shot image segmentation performance across various tasks with remarkable accuracy. Despite its inability to provide accurate segmentation across multiple research fields, SAM still serves as a valuable starting point for supporting the segmentation pipeline process, particularly for tasks that require extensive and senior skills annotations. This study aims to evaluate the generalization of SAM and fine-tuning SAM models using three waste segmentation datasets. Although they are captured from real scenes as SAM was pretrained on, these datasets present several challenges, including occlusions, deformable objects, transparency, and objects easily confused with backgrounds. In our findings, the fine-tuned SAM-ViT-H model outperforms the state-ofthe-art Zerowaste, and TACO datasets with a significant increase of +30 in IoU, and it closely approaches performance levels of TrashCan 1.0, with only a -1.44 difference. After evaluating these popular waste datasets, it became evident that fine-tuning SAM as a foundational model is a crucial step for providing better generalization for downstream waste segmentation tasks. Therefore, SAM should not be disregarded or wasted.

[CV-43] UniDexTok: A Unified Dexterous Hand Tokenizer from Real Data

链接: https://arxiv.org/abs/2606.10683
作者: Dong Fang,Youjun Wu,Yuanxin Zhong,Rui Zhang,Yunlong Wang,Xiaosong Jia,Yu-Gang Jiang
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Dexterous hands are essential for fine-grained manipulation, but their hardware designs vary substantially across embodiments. Differences in kinematics, joint definitions, and degrees of freedom make it difficult to define a shared state representation compared with parallel grippers. As a result, dexterous-hand data remains fragmented and difficult to use for joint training. In this work, we propose the Unified Dexterous Hand Model (UDHM), which maps human and robot hand states into a shared 22-DoF semantic interface. Based on UDHM, we introduce UniDexTok, a retargeting-free state tokenizer that learns embodiment-conditioned discrete tokens from standardized real joint states. UniDexTok provides a unified representation for heterogeneous dexterous hands without relying on retargeting or simulation data. Compared with the recent baseline UniHM, UniDexTok reduces MPJAE from 15.63 degrees to 0.16 degrees and MPJPE from 18.51 mm to 0.18 mm, corresponding to error reductions of 98.98% and 99.03%, respectively. These results improve reconstruction from centimeter-scale to sub-millimeter accuracy. Experiments further show that data from other embodiments improves target-embodiment reconstruction accuracy, demonstrating the benefit of cross-embodiment tokenization. UniDexTok also shows strong zero-shot and few-shot reconstruction ability when new dexterous hands are introduced.

[CV-44] FadeMem: Distance-Aware Memory Consolidation for Autoregressive Video Diffusion

链接: https://arxiv.org/abs/2606.10671
作者: Yu Lu,Junjie Yang,Piotr Koniusz,YuXin Song,Yi Yang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 4 figures

点击查看摘要

Abstract:Autoregressive video generators synthesize long videos by generating successive temporal segments, but their historical KV cache grows with video length. Existing bounded-cache methods reduce this cost with local windows, sink tokens, or compressed memory states, yet they usually assign fixed roles to different parts of the history. We propose FadeMem, a distance-aware KV memory consolidation mechanism that organizes historical KV blocks into a temporal hierarchy under a fixed cache budget. This design is motivated by frequency-dependent temporal decay: fine details decorrelate quickly, while coarse scene structure and identity remain useful over longer horizons. During generation, new history is inserted as fine-grained entries, while older adjacent entries are progressively merged under a power-law temporal allocation schedule, yielding a dense-near, sparse-far memory within one cache. Without architectural changes, FadeMem preserves recent context for short-term dynamics and compact long-range anchors for identity and scene coherence. Experiments show improved subject consistency, background stability, and temporal coherence over existing bounded-cache strategies.

[CV-45] Analyzing Training-Free Corruption Detection for Object Detection Datasets CVPR

链接: https://arxiv.org/abs/2606.10666
作者: Christian Sieberichs,Simon Geerkens,Thomas Waschulzik,Viswanathan Ramesh,Alexander Braun
类目: Computer Vision and Pattern Recognition (cs.CV); Databases (cs.DB)
备注: Accepted at DataCV Workshop, Conference on Computer Vision and Pattern Recognition (CVPR) 2026

点击查看摘要

Abstract:Annotation errors are widespread in computer vision datasets and can significantly degrade the performance of systems trained on them, particularly in complex tasks such as object detection. Several approaches exist to identify annotation errors, including training-free feature-space methods which provide a fast and interpretable way to analyze annotations. However, the behavior on object detection annotations, which include semantic and spatial information, remains largely unexplored. In this work we analyze the applicability of feature-space-based approaches for detecting annotation errors in object detection datasets. By adapting an existing feature-space method, we show that such approaches reliably expose semantic mislabel, while positional errors remain difficult to detect. We evaluate this behavior across multiple pretrained embedding models, synthetic noise types (symmetric, asymmetric, and positional), and real-world annotation errors using VOC2012 and KITTI. All code and real-world corruptions are publicly available at the following repository: this https URL ChristianSieberichs/BoundingBox_corruption_detection Comments: Accepted at DataCV Workshop, Conference on Computer Vision and Pattern Recognition (CVPR) 2026 Subjects: Computer Vision and Pattern Recognition (cs.CV); Databases (cs.DB) Cite as: arXiv:2606.10666 [cs.CV] (or arXiv:2606.10666v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2606.10666 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[CV-46] Envision4D: Envisioning Visual Futures via Feed-forward 4D Gaussian Splatting for Autonomous Driving

链接: https://arxiv.org/abs/2606.10656
作者: Qi Song,Yifei He,Chi Zhang,Zheng Fu,Xuhe Zhao,Mengmeng Yang,Kun Jiang,Rui Huang,Diange Yang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Forecasting the future evolution of dynamic scenes is crucial in autonomous driving. However, existing feed-forward paradigms are primarily designed for interpolation. When extended to future extrapolation, they suffer from ghosting artifacts under large displacements and are constrained by simplified motion assumptions or strict future priors. To overcome these challenges, we propose Envision4D, a fully self-supervised feed-forward framework for pose-free future extrapolation. Specifically, we introduce a Future Pose Prediction module that infers future camera parameters via an iterative denoising process. Furthermore, to capture non-linear dynamics, we propose In-layer Temporal Attention and employ Conditioned Motion Lifting, which transforms the highly uncertain extrapolation process into robust relational mappings. Finally, a Progressive Training Strategy is utilized to stabilize unsupervised motion learning against error accumulation. Extensive experiments demonstrate that Envision4D achieves state-of-the-art performance, significantly outperforming existing methods in future view synthesis.

[CV-47] STEDiff: Strengthening Text Embedding for Text-to-Image Alignment in Diffusion Model IJCNN2026

链接: https://arxiv.org/abs/2606.10653
作者: Hailan Zhang,Haipeng Liu,Bo Fu,Yang Wang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 8 figures, to appear at IJCNN 2026

点击查看摘要

Abstract:Although pretrained text-to-image (T2I) generation models can produce high-quality images, they often fail to faithfully reflect the semantic intent of complex prompts due to stochastic noise and inherent model limitations. This issue frequently manifests as the model overlooking specific objects or failing to correctly bind attributes to their corresponding entities, a challenge referred to as semantic alignment. Unlike existing approaches that rely on computationally expensive fine-tuning or labor-intensive layout priors, we propose STEDiff, a training-free method designed to enhance semantic representations directly within the text-embedding space. Specifically, we introduce a method that primarily leverages the [EOT] token to strengthen the relevant semantics of sub-sentences and then replaces the corresponding tokens in the original prompt. Furthermore, a novel semantic enhancement loss is incorporated to enforce spatial constraints, ensuring that the semantics of each entity are precisely mapped to their respective image regions. Extensive quantitative and qualitative evaluations on the T2I-CompBench demonstrate that our method notably improves semantic consistency and generation integrity in complex scenarios.

[CV-48] Kwai Keye-VL-2.0 Technical Report

链接: https://arxiv.org/abs/2606.10651
作者: Kwai Keye Team,Bin Wen,Changyi Liu,Chengru Song,Chongling Rao,Guowang Zhang,Han Li,Haonan Fan,Hengrui Ju,Jiankang Chen,Jiapeng Chen,Jiawei Yuan,Kaixuan Yang,Kaiyu Jiang,Kun Gai,Lingzhi Zhou,Na Nie,Sen Na,Tianke Zhang,Tingting Gao,Xuanyu Zheng,Yulong Chen,Fan Yang,Haixuan Gao,Lele Yang,Mingqiao Liu,Muxi Diao,Qi Zhang,Qile Su,Wei Chen,Wentao Hong,Xingyu Lu,Yancheng Long,Yankai Yang,Yingxin Li,Yiyang Fan,Yu Xia,Yuzhe Chen,Ziliang Lai,Chuan Yi,Haonan Jia,Tianming Liang,Weixin Xu,Xiaoxiao Ma,Yang Tian,Yufei Han,Feng Han,Hang Li,Jing Wang,Jinghui Jia,Junmin Chen,Junyu Shi,Ruilin Zhang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 31 pages, 11 figures

点击查看摘要

Abstract:We introduce Kwai Keye-VL-2.0-30B-A3B, an open-source Mixture-of-Experts (MoE) multimodal foundation model designed to advance long-video understanding and agentic intelligence. To address the challenges of ultra-long contexts, information redundancy, and prohibitive computational costs inherent in hour-level videos, Keye-VL-2.0 is the first to adapt DeepSeek Sparse Attention (DSA) to GQA-based multimodal architectures, enabling lossless 256K context processing while capturing critical frames and long-range temporal dependencies. This architecture is underpinned by a highly optimized training and inference infrastructure, including scalable video I/O, heterogeneous ViT-LM parallelism, and custom DSA kernels that significantly maximize throughput and minimize computational overhead. Furthermore, to overcome the algorithmic dilemma of catastrophic forgetting during multi-task alignment, we introduce Cross-Modal Multi-Teacher On-Policy Distillation (MOPD) paired with Context-RL and Video-RL. By distilling dense token-level teacher feedback from on-policy rollouts back into the MoE backbone, which activates only 3B parameters, Keye-VL-2.0 natively empowers advanced agent collaboration across Code, Tool, and Search scenarios with multimodal self-correction. Extensive evaluations across video understanding, temporal grounding, reasoning, STEM, and agent benchmarks demonstrate that Keye-VL-2.0-30B-A3B achieves state-of-the-art performance among models of similar scale, particularly excelling in fine-grained temporal localization on TimeLens and long-video comprehension on Video-MME-v2 and LongVideoBench. We release our model checkpoints to accelerate community progress toward scalable and robust multimodal agentic applications.

[CV-49] ManiSplat: Manipulation Trajectory Synthesis from Monocular Video via Decoupled 3D Gaussian Splatting

链接: https://arxiv.org/abs/2606.10645
作者: Wenhao Hu,Haonan Zhou,Liu Liu,Yun Du,Xinjie Wang,Ziang Li,Zhizhong Su,Gaoang Wang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reconstructing dynamic and interactive 3D scenes from real-world observations remains a fundamental challenge in computer vision and robotics. While recent advances in 3D Gaussian Splatting have enabled high-fidelity static reconstruction, extending it to interactive environments with articulated robots and manipulable objects remains difficult due to complex contact interactions and abrupt pose changes. To address these challenges, we introduce ManiSplat, a unified framework that reconstructs controllable and decoupled Gaussian digital twins directly from monocular ego-view robotic videos. Our method introduces a Graph-Structured Disentangled Representation that separates the robot, objects, and background into independently optimizable Gaussian subfields organized within a scene graph. To ensure stability, we propose a Task-Oriented Spatio-Temporal Alignment module that leverages the inherent logic of manipulation tasks-alternating between Motion and Skill phases-to construct accurate pseudo-ground-truth trajectories. Finally, a joint photometric-geometric optimization ensures the reconstructed scenes are temporally coherent, physically consistent, and simulation-ready. Extensive experiments demonstrate that our approach reconstructs interaction-driven dynamic scenes with high fidelity and controllability, effectively supporting downstream robotic tasks and policy learning.

[CV-50] ChartLens: A Dual-Branch Framework for Chart Data Correction and Factual Summary Refinement

链接: https://arxiv.org/abs/2606.10640
作者: Hao Liu,Ruping Cao,Kun Wang,Zhiran Li,Fan Liu,Yupeng Hu,Liqiang Nie
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this report, we present our champion solution for the DataMFM Challenge Track 2: Chart Understanding. This track requires models to recover structured chart data and generate faithful natural-language summaries from chart images. To address the complementary requirements of accurate data extraction and factual narration, we propose ChartLens, a dual-branch framework for chart data correction and summary refinement. ChartLens consists of two key modules: Structure-Aware CSV Verification and Correction (SAVC) and Text-Retention-Guided Summary Refinement (TRSR). SAVC improves the reliability of structured data extraction through verification and correction, while TRSR enhances summary generation by preserving critical textual and numerical evidence from charts. By combining model adaptation, correction-based generation, and OCR-assisted evidence grounding, ChartLens improves both structured data recovery and summary factuality. On the test set, our final system achieves an overall score of 69.10 and ranks first in Track 2, demonstrating its effectiveness for accurate chart understanding. Our code will be released at: this https URL.

[CV-51] Leverag ing Metric Depth for Relative Depth Prediction

链接: https://arxiv.org/abs/2606.10628
作者: Xiaoyang Bi,Shuaikun Liu,Zhaohong Liu,Yuxin Yang,Zhe Zhao,Mengshi Qi,Liang Liu,Huadong Ma
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present our solution to the 2025 SoccerNet Monocular Depth Estimation Competition Challenge. Predicting the relative depth in football scenarios is challenging, especially with only thousands of training samples available. To address this issue, our method leverages the powerful zero-shot capabilities of models pretrained on large-scale datasets to learn metric depth for effective relative depth prediction, achieving a score of 2.68 \times 10^-3 on the challenge set.

[CV-52] Can Image Models Imagine Time? ImageTime: A Novel Benchmark for Probing Visual World Modeling Through Spatiotemporal Consistency

链接: https://arxiv.org/abs/2606.10620
作者: Xinrui Wu,Lichen Huang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Image generation models now produce high-quality static images, yet their ability to represent how a visual world changes over time remains poorly understood. Practical workflows such as storyboarding, step-by-step illustration, reference-guided editing, and video previsualization require models to preserve identities, objects, spatial relations, and causal order across multiple visual states. Existing evaluations largely measure single-image correctness, compositional alignment, or video quality, leaving open whether an image model can coherently imagine a temporally ordered process. We introduce ImageTime, a diagnostic benchmark that uses spatiotemporal consistency as a behavioral probe of visual world modeling in image generation. Given an action instruction, and optionally a reference image specifying the initial state, a model must generate one image containing four ordered key states: initial state, action onset, transition state, and final state. This four-keyframe protocol is more temporally demanding than single-image generation while avoiding the confounds of dense video dynamics. ImageTime organizes tasks with a progressive capability hierarchy and decomposes each scenario into stage-wise state predicates, cross-frame temporal constraints, and forbidden causal violations. GPT-5.5 scores all generated images under a structured VLM-as-judge protocol, producing interpretable capability scores, diagnostic subscores, and failure labels. Through multi-family benchmarking, ImageTime reveals where current image generation systems succeed, fail, and drift when asked to maintain coherent visual world states over time.

[CV-53] SSR-Merge: Subspace Signal Routing for Training-Free LoRA Merging in Diffusion Models ICML2026

链接: https://arxiv.org/abs/2606.10617
作者: Zhengxuan Wei,Yi Dong,Zonghui Li,Xianhui Lin,Xing Liu,Hong Gu,Shaofeng Zhang,Wenbin Li,Qi Fan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICML 2026

点击查看摘要

Abstract:Low-Rank Adaptation (LoRA) merging can efficiently combine diverse generative capabilities from multiple trained LoRAs for a diffusion model. However, existing LoRA merging techniques often suffer from severe parameter interference, causing destructive collisions in the shared parameter space. To address this, we propose Subspace Signal Routing (SSR), which resolves interference by routing internal signals instead of performing parameter-space merge. Specifically, SSR first constructs a unified subspace by concatenating candidate LoRAs along the rank dimension. Next, SSR employs an inverse correlation matrix to decorrelate mixed signals within this space. Finally, a directional guide matrix steers these purified signals into their respective task-specific subspaces. We provide a rigorous theoretical analysis proving that SSR aligns with the Ordinary Least Squares (OLS) solution, thereby ensuring mathematical optimality. We utilize the additivity of sufficient statistics to design a streaming algorithm. This enables on-the-fly updates that significantly reduce memory overhead and computation time. Extensive experiments validate that SSR significantly outperforms state-of-the-art methods while maintaining comparable efficiency. Code is available at this https URL.

[CV-54] Dexterous Point Policy: Learning Point-based Dexterous Hand Policies from Human Demonstrations

链接: https://arxiv.org/abs/2606.10614
作者: Beomjun Kim,Seong Hyeon Park,Seunghoon Sim,Seungjun Moon,Sanghyeok Lee,Jinwoo Shin
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Robotic foundation models pre-trained on human demonstration videos have shown promise, but a significant embodiment gap remains when the resulting policies are deployed on real robots. A common remedy is to fine-tune these models on robot-specific demonstrations. However, robot data collection can be prohibitively expensive and time-consuming, which is particularly acute in dexterous manipulation, e.g., teleoperating a multi-fingered hand for even a single atomic task can take days. To address this, we introduce Dexterous Point Policy, a framework that learns dexterous manipulation policies directly from human videos and requires no robot demonstrations. Our core insight is that a unified 3D keypoint representation can bridge human and robot embodiments when used for both observations and actions. Specifically, we extract 3D keypoints of task-relevant objects and human hands from raw videos, and train an autoregressive transformer over these keypoints. We observe that at the keypoint level, specifically the wrist and fingertips, human and robot behaviors closely align, enabling direct policy transfer. On a suite of real-robot tasks spanning pick-and-place and tool use, Dexterous Point Policy attains 75.0% success, whereas a state-of-the-art VLA baseline reaches only 1.0%. Furthermore, our method generalizes strongly to unseen scenarios, including multi-object environments and novel object categories.

[CV-55] GaussTrace: Provenance Analysis of 3D Gaussian Splatting Models with Evidence-based LLM Reasoning ICML2026

链接: https://arxiv.org/abs/2606.10612
作者: Haoliang Han,Ziyuan Luo,Renjie Wan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICML2026

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) is a powerful technique for creating high-fidelity 3D assets. However, the widespread sharing and iterative modification of 3DGS models across digital platforms create pressing challenges for intellectual property protection and forensic traceability. To address this, we propose GaussTrace, a novel framework for constructing directed provenance graphs for 3DGS models. GaussTrace formulates provenance analysis as an evidence-based reasoning problem. It builds upon attribute-wise statistical profiling of 3DGS parameters to capture intrinsic properties. Moreover, we introduce hypothesis-driven editing simulations of common operations to provide auxiliary evidence for plausible transformation pathways. These statistical and simulated cues jointly enable a Large Language Model (LLM) to perform structured Chain-of-Thought (CoT) reasoning, yielding directional provenance inferences and explainable edge reasons. Experimental results demonstrate that GaussTrace effectively constructs evolutionary relationships among diverse 3DGS models, delivering accurate, interpretable, and robust provenance graphs without requiring model training or access to editing histories. Project page: this https URL.

[CV-56] Geometry-Aware Reinforcement Learning for 2D Irregular Nesting

链接: https://arxiv.org/abs/2606.10611
作者: Auguste Lehuger,Guillaume Henon-Just
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 4 figures, 5 tables. Under review at the European Workshop on Reinforcement Learning (EWRL)

点击查看摘要

Abstract:Traditional heuristic solvers for the 2D irregular nesting problem share a fundamental limitation: they are blind to polygon geometry, relying on guided brute-force to navigate the continuous placement space with minimal geometrical guidance. In this paper, we argue that Reinforcement Learning is uniquely positioned to overcome this bottleneck. By pairing an optimization policy with a geometry-aware neural encoder, an agent can automatically discover rich geometric priors directly from data, utilizing these learned intuitions to strategically guide exploration. To realize this, we introduce the Polygons Transformer (PoT), a novel architecture that encodes 2D continuous vector geometries while allowing cross-polygons attention. We couple this novel architecture with a Combinatorial Optimization Reinforcement Learning (CORL) training framework to find optimal solutions. To support this paradigm, we release an open-source training dataset derived from complex geographic contours alongside a dedicated evaluation benchmark. Our empirical validation demonstrates that our trained agent achieves area utilization performance highly competitive with Sparrow, the state-of-the-art heuristic solver, proving that reinforcement learning can successfully discover and exploit geometric awareness for precise spatial tasks.

[CV-57] Globally Localizing Lunar Rover in Pixels via Graph Alignment

链接: https://arxiv.org/abs/2606.10602
作者: Mao Chen,Xu Yang,Chuankai Liu,Xiangkai Zhang,Xiaoxue Wang,Zheng Bo,Zuoyu Zhang,Zhiyong Liu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Precise rover localization is a prerequisite for autonomous lunar exploration, yet the absence of Global Navigation Satellite System (GNSS) signals and the cumulative drift of local localization methods severely constrain long-range missions. Cross-view localization provides a promising drift-free global solution by matching rover-view and satellite-view imagery. However, the lunar environment poses unique challenges for correspondence alignment, including inter-entity entanglement, inter-viewpoint divergence, and simulation-to-real domain shift. To address these challenges, we propose Warped Alignment of Reprojected Graphs (WARG), a framework that leverages unified graph learning and reprojected graph matching for robust cross-view alignment. Pretrained on the synthetic LuSNAR dataset, WARG achieves an average test error of 0.32 m and demonstrates robust zero-shot generalization to the synthetic lunar south pole region with an error of 3.63 m. More importantly, when validated on real-world data from the YuTu-2 rover, WARG achieves a localization error of 1.68 m within a 100 m x 100 m search area, corresponding to nearly one-pixel precision in low-resolution satellite imagery with a spatial resolution of 1.40 m/pixel. Beyond accuracy, WARG is computationally efficient, containing only 1.56M parameters, corresponding to 16.12% of previous lightweight models, and operating at 5.49 Hz on an NVIDIA RTX A6000 GPU, approaching GNSS-level update frequency. Finally, we observe that WARG naturally develops low-level spatial awareness, including semantic segmentation and structural reasoning, through cross-view localization learning, highlighting its potential as a promising paradigm for spatial intelligence with minimal annotation cost. The source code is available at this https URL.

[CV-58] Segment and Select: Vision-Language Segmentation in 3D Scenarios

链接: https://arxiv.org/abs/2606.10594
作者: Yulin Chen,Zhihang Zhong,Yuenan Hou
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The core idea is to reformulate 3D vision-language segmentation as the segment-and-select paradigm (free from the superpoint dependency)

点击查看摘要

Abstract:3D vision-language segmentation aims to segment target objects in 3D scenarios according to the linguistic instructions and visual observations. Prior art heavily relies on the coarse superpoint representation to reduce the computation complexity, which suffers from poor segmentation quality and messy object boundaries. In this paper, we propose the SEGment-And-select (SEGA3D) paradigm for 3D visionlanguage segmentation that directly operates on the fine-grained visual information and is free from the superpoint dependency. Specifically, we first leverage a mask candidate generator to provide fine-grained categorical mask candidates, substantially improving the quality of candidate masks over the superpoint counterparts. Then, a Large Language Model (LLM) is utilized to generate the semantic and spatial information based on the linguistic description and visual features. The LLM output and visual features are fed to the Semantic-Spatial Selector (SSS) to produce the top-ranking mask candidates. Eventually, the Loopback Verification Module (LVM) is designed to yield the segmentation mask from the selected candidate masks. Our SEGA3D attains competitive performance on ScanRefer, ScanNet and Matterport3D benchmarks. Notably, our SEGA3D surpasses the top-performing counterpart by 8.3 mIoU and 5.3 mIoU on ScanNet and Matterport3D, respectively. Codes will be available upon publication.

[CV-59] Improving Adversarial Transferability on Vision-Language Pre-training Models via Surrogate-Specific Bias Correction

链接: https://arxiv.org/abs/2606.10571
作者: Lijia Yu,Jiuxin Cao,Yuchen Qiang,Changhao Chen,Yifei Huang,Bo Liu
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 17 pages, 7 figures, 10 tables

点击查看摘要

Abstract:Adversarial examples reveal vulnerabilities in Vision-Language Pre-training (VLP) models and provide insights for improving robustness. A key property is cross-model transferability, which enables transfer-based black-box attacks. However, existing attacks often rely heavily on the surrogate model, causing cross-model performance drops. One reason is that adversarial optimization may follow surrogate model responses more than input semantics, making the update direction effective on the surrogate but less transferable to unseen targets. We refer to this dependency as surrogate-specific bias. Motivated by this observation, DeBias-Attack improves transferability by correcting surrogate-specific bias in adversarial optimization directions. It maintains two perturbation branches. The main branch optimizes a perturbation on the original image and obtains the adversarial gradient used to disrupt image-text alignment. The reference branch optimizes a perturbation on a weak-semantic image constructed from the dataset mean image with small Gaussian noise resampled at each iteration. Since this weak-semantic image contains little clear visual content, its optimization reflects surrogate responses more than image semantics, and its reference gradient estimates surrogate-specific bias. DeBias-Attack removes the aligned projection of the main gradient on the reference gradient before updating the adversarial image, then performs context-aware text substitution using the updated adversarial image. DeBias-Attack is the first transfer-based VLP attack that corrects surrogate-specific bias through gradient correction. Experiments show strong performance across VLP models, downstream tasks, and open-source and closed-source multimodal large language models.

[CV-60] PrismAvatar: Pseudo-Multiview Reconstruction and Subpixel Prism Rendering for Real-Time Stereoscopic Communication

链接: https://arxiv.org/abs/2606.10550
作者: Chufeng Fang,Dongdong Teng,Lilin Liu
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: 10 pages, 5 figures, 3 tables

点击查看摘要

Abstract:Real-time stereoscopic video communication has long been a goal of immersive telepresence, yet practical systems still require specialized capture rigs or reduce remote users to a single portrait view. We present PrismAvatar, a Gaussian head-avatar system that connects monocular avatar capture with subpixel-encoded glasses-free lenticular display for real-time autostereoscopic communication. From a monocular portrait video, PrismAvatar reconstructs a controllable head avatar and optimizes it for the lateral viewing zones induced by the display. The method uses natural head turns as pseudo-multiview (PMV) supervision to constrain regions that are otherwise weakly observed in monocular training, including hair, ears, jaw contours, and neck boundaries. Reliable side frames are yaw-binned, aligned to virtual cameras, and supervised within a strict head-and-hair domain; contour-aware losses and staged regularization further suppress ghosting, alpha leakage, and depth instability while preserving lateral detail. At runtime, PrismAvatar renders 32 virtual views and encodes them into a 4K lenticular raster with calibrated subpixel-routing masks. The live-tracker prototype sustains 10.65 FPS, and a subject-specific distilled driver raises the same display pipeline to 38.49 FPS.

[CV-61] GRAR: Glass-induced Reflection Artifact Removal in LiDAR Point Clouds

链接: https://arxiv.org/abs/2606.10541
作者: Wanpeng Shao,Zeyi Guo,Bo Zhang,Yifei Xue,Tie Ji,Yizhen Lao
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Terrestrial Laser Scanning (TLS) point clouds captured in urban environments frequently suffer from glass-induced reflection artifacts, severely degrading downstream applications. Existing reflection artifact removal methods generally rely on ideal reflection symmetry assumptions, yet their performance is limited by inaccurate glass estimation and insufficient geometric representations. To address these issues, we propose a novel unified framework aimed at robust reflection artifact removal: In the first stage, we leverage a multi-modal vision foundation model to produce initial glass masks, which are then refined using geometric cues to achieve high-precision glass regions, followed by glass completion to recover missing regions caused by no-return measurements on transparent surfaces; In the second stage, we propose a physics-driven descriptor, termed Reflection-aware Local-Global Geometric Similarity (RE-LGGS), which is grounded in actual laser reflection geometry and jointly encodes multi-scale geometric structures and orientation consistency using PCA-based local shape representations, thereby significantly improving robustness against imperfect observations. Extensive experiments on multiple public TLS datasets demonstrate that our framework consistently outperforms state-of-the-art methods in reflection artifacts removal.

[CV-62] Audio-Visual Exchange-Aware Token Pruning for Efficient Audio-Visual Captioning

链接: https://arxiv.org/abs/2606.10533
作者: Zihan Meng,Dexiang Hong,Weidong Chen,Ziyu Zhou,Bo Hu,Zhendong Mao
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Audio-visual captioning generates natural language descriptions from video and audio content. Multimodal LLMs have advanced this task, but both modalities contribute many tokens to the LLM input, where prefill self-attention scales quadratically. Existing token-pruning methods usually retain tokens by attention, saliency, or cross-entropy loss, yet the hard threshold selection makes it difficult to retain tokens that are truly valuable, especially for high-confusing tokens near the decision boundary. To this end, we propose a AVEX-Prune, an RL-based audio-visual dynamic token pruning method in this work. In our AVEX-Prune, an audio-visual token exchange strategy is proposed to select truly valuable tokens by replacing low-confidence retained tokens with high-confidence candidate tokens from the same or the other modality, and measuring the differences in caption generation from token swaps. AVEX-Prune preserves full-token quality at a 40% retention ratio on both VILA 1.5-8B (54.5 vs. 54.6) and VideoLLaMA 2 (57.0 vs. 56.8).

[CV-63] GUI-AC: Enhancing Continual Learning in GUI Agents

链接: https://arxiv.org/abs/2606.10522
作者: Can Lin,Tao Feng,Hangjie Yuan,Dan Zhang,Yifan Zhu,Zhonghong Ou
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Graphical User Interfaces (GUIs) serve as the dominant medium for human-computer interaction, yet building GUI agents that generalize across the vast diversity of real-world interface environments, with the same flexibility and robustness that humans naturally exhibit, remains unsolved. Notably, GUI data are inherently non-stationary: the continual emergence of previously unseen interface instances (e.g., novel domains and resolutions) induces persistent distribution shifts, significantly impeding the continual learning of existing GUI agents. Reinforcement fine-tuning (RFT) has attracted considerable attention as a promising approach. Nevertheless, RFT exhibits pronounced instability in its grounding capability, manifested as sharp reward discontinuities and high-variance oscillations. The imbalanced distribution of rollout outcomes introduces substantial noise into advantage estimation, leading to policy overconfidence. The fixed clipping bound suppresses the increase in policy probabilities needed to adapt to new distributions, leading to a collapse in exploration capacity. To address these challenges, we propose GUI-AC, a method that enhances the continual learning capability of GUI agents. GUI-AC introduces grounding certainty to support two core mechanisms: (i) Adaptive Advantage, which down-weights noisy advantage estimates to prevent policy overconfidence; and (ii) Dynamic Clipping, which relaxes the clipping bound to encourage exploration range. Extensive experiments show that these mechanisms jointly improve performance, enabling our method to surpass state-of-the-art baselines. Code is available anonymously at this https URL.

[CV-64] LAFP: Preserving Latent Action Structure in Latent Policy Learning via Flow Matching

链接: https://arxiv.org/abs/2606.10517
作者: Jiexi Lyu,Xizhou Bu,Qingqiu Huang,Chufeng Tang,Xiaoshuai Hao,Hongbo Wang,Wei Li
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Learning high-quality latent actions from large-scale unlabeled videos, coupled with limited real-world interaction data for training an action decoder, has emerged as a promising paradigm for scalable latent policy learning. However, existing approaches typically rely on behavior cloning, which tends to collapse inherently multimodal action distributions into unimodal ones, thereby degrading the pretrained latent action structure. While flow matching provides a potential alternative, directly applying it leads to a misalignment between latent actions and physical actions during action decoder training, due to the stochastic nature of the learned policy. To address these, we propose Latent Action Flow Policy (LAFP), which leverages flow matching for latent policy learning and introduces an inference-time interpolation mechanism to mitigate stochasticity-induced misalignment. Experimental results demonstrate that LAFP consistently outperforms prior methods on downstream imitation learning tasks, achieving up to 10-15% improvement in success rate while incurring less than 1x additional inference overhead.

[CV-65] PathRelax: Parallel-Path Relaxed Speculative Jacobi Decoding for Accelerating Auto-Regressive Text-to-Image Generation

链接: https://arxiv.org/abs/2606.10492
作者: Haodong Lei,Hongsong Wang,Bingxuan Dai,Pan Zhou
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 5 figures

点击查看摘要

Abstract:The growing need for high-resolution image generation in autoregressive text-to-image models has resulted in extended token sequences, significantly increasing computational costs and inference times. However, existing state-of-the-art methods for accelerating autoregressive text-to-image models rely on chain-structured draft token sequences, leading to inefficient draft token search and limited acceptance lengths. To address this, we propose parallel-path cross-relaxed speculative Jacobi decoding (\textbfPathSpec), a novel framework that enhances efficiency through a multi-sequence draft tree structure. Our parallel-path speculative Jacobi decoding (\textbfPathExplore) expands the token search space, achieving a higher speedup ratio without sacrificing image quality. Additionally, we introduce cross-path relaxed verification (\textbfPathRelax) that exploits semantic similarities across sequences to further boost token acceptance rates. Evaluated on the Parti-Prompts, MSCOCO2017, and T2ICompBench datasets, our method achieves a speedup ratio of 4.14 \times , 3.95 \times , and 4.18 \times , respectively. Remarkably, PathExplore, without any relaxed sampling, outperforms relaxed sampling methods in the speedup ratio, such as GSD and LANTERN. Moreover, PathRelax’s relaxation mechanism can be seamlessly integrated with other relaxation techniques, enabling further acceleration and providing an efficient solution for real-time text-to-image generation. Our code is available at this https URL.

[CV-66] 5% 100%: Flatness Preference is All You Need for Multimodal Parameter-Efficient Fine-Tuning

链接: https://arxiv.org/abs/2606.10488
作者: Yifan Zhu,Can Lin,Hangjie Yuan,Zixiang Zhao,Pengfei Zhang,Tao Feng,Zhonghong Ou
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Parameter-Efficient Fine-Tuning (PEFT) methods provide a streamlined and efficient tool for adapting large models to domain-specific multimodal downstream tasks. Although these methods proved their tangible effects in practice, their principal aspects remain under-explored. Therefore we remain curious about the underlying generalization mechanisms in various PEFT methods and how they can be further enhanced. In this paper, we reveal the flatness preference widely present in various PEFTs, where a small fraction of sharp dimensions dominates the generalization of PEFT. This finding suggests an appealing possibility: we may be satisfied with a better generalization by merely attending to this small fraction of sharp dimensions instead of all of them. Furthermore, we propose Flatness Preference Optimization (FlatPO) to flatten these key sharpness dimensions, leading various PEFTs toward better generalization. Extensive experiments demonstrate the effectiveness of our findings and the proposed method. Code is available at this https URL.

[CV-67] 3D-CoS: A New 3D Reconstruction Paradigm Based on VLM Code Synthesis

链接: https://arxiv.org/abs/2606.10478
作者: Yuhao Wang,Puyi Wang,Linjie Li,Zhengyuan Yang,Kevin Qinghong Lin,Yu Cheng
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint. 24 pages, 11 figures

点击查看摘要

Abstract:Most recent 3D reconstruction and editing systems operate on implicit and explicit representations such as NeRF, point clouds, or meshes. While these representations enable high-fidelity rendering, they are fundamentally low-level and hard to control programmatically. In contrast, we propose and systematically evaluate a new 3D reconstruction paradigm, 3D Code Synthesis (3D-CoS), where 3D assets are constructed as executable Blender code, a programmatic and interpretable medium. To assess how well current VLMs can use code to represent 3D objects, we evaluate representative open-source and closed-source VLMs in code-based reconstruction under a unified protocol. We further introduce a suite of structured code-synthesis workflows, including blueprint-based planning, Retrieval-Augmented Generation (RAG) over Blender API documentation, few-shot geometric demonstrations, and a component-level Agent workflow for part-wise code generation. To demonstrate the unique advantages of this representation, we further evaluate localized text-driven modifications and compare our code-based edits with a point-cloud-based 3D editing baseline. Our study shows that code as a 3D representation offers strong controllability and locality, yielding stronger edit fidelity and better preservation of unedited regions in our targeted editing evaluation. Our work also analyzes the potential of this paradigm, delineates the current capability frontier of VLMs for programmatic 3D modeling, and highlights code synthesis as a promising direction for editable 3D reconstruction.

[CV-68] Geometric Coastline Localization using Vision-Language Models

链接: https://arxiv.org/abs/2606.10468
作者: Rafia Malik,Bernhard Pfahringer,Karin Bryan,Mark Dickson,Eibe Frank
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Coastline detection in remote sensing imagery is commonly formulated as a pixel-wise segmentation problem, where the final coastline is extracted from a predicted mask through post-processing. This formulation relegates coastline geometry, the primary representation used in coastal change analysis, to a secondary artifact rather than the learning objective. In practice, coastlines are defined by geomorphic proxies such as vegetation lines, dune toes, or cliff edges, rather than an instantaneous land-water boundary often used in pixel-based segmentation approaches. In this work, we revisit coastline extraction from a representation perspective and formulate the task as geometric boundary localization. We use the New Zealand Coastal Change Dataset (NZCCD) and high-resolution aerial imagery from Land Information New Zealand (LINZ) to develop CoastlineVLM-7B, a vision-language model (VLM) built on the GeoChat-7B/LLaVA-1.5 architecture that jointly performs coastline presence detection, proxy-type classification, and coastline grounding. The model directly predicts a coastline as a polyline rather than a dense segmentation mask. We evaluate CoastlineVLM-7B against segmentation baselines under strict one-pixel boundary supervision. Results show that geometry-based metrics are more suitable for assessing coastline localization quality than pixel-overlap metrics such as Intersection over Union (IoU). CoastlineVLM-7B improves global geometric alignment with reference coastlines, reducing Hausdorff distance from 37.74 m to 31.84 m and Earth Mover’s Distance from 21.12 m to 17.32 m. These results indicate that output representation is a critical design choice in coastline extraction, and that geometry-oriented learning, combined with the semantic reasoning capabilities of vision-language models, aligns well with how coastlines are defined and evaluated in operational coastal monitoring.

[CV-69] Few-step Generative Models as Lossy Compression

链接: https://arxiv.org/abs/2606.10450
作者: Fuma Kimishima,Jinjia Zhou
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:DiffC provides a principled way to reuse pre-trained diffusion models for lossy compression, but its encoding and decoding procedures remain slow because they require many discretized forward and reverse steps. We study whether few-step generative models – Rectified Flow, Consistency Trajectory Models (CTM), and MeanFlow – can be cast as codecs within the same reverse channel coding (RCC) framework. The main challenge is that RCC requires posterior and shared distribution parameters, whereas these models do not explicitly parameterize intermediate conditional distributions. For Rectified Flow and MeanFlow, we use the equivalence between velocity parameterization and diffusion-style denoising parameterization to derive the quantities required by RCC. For CTM, which is distilled from EDM, we adopt the EDM noise parameterization together with local Gaussian approximations of the sender and shared distributions at intermediate states. This yields a proof-of-concept probabilistic formulation that enables compression with pre-trained few-step generative models without retraining. On low-resolution benchmarks, the resulting codecs reduce encoding and decoding time and improve realism in the low-bit-rate regime.

[CV-70] Vision-Assisted Foundation Model for Solving Multi-Task Vehicle Routing Problems

链接: https://arxiv.org/abs/2606.10431
作者: Shuangchun Gui,Zhiguang Cao,Wen Song,Yew-Soon Ong
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by TNNLS

点击查看摘要

Abstract:Multi-task vehicle routing problems play a critical role in enhancing efficiency across various industries and service sectors. These problems consist of multiple variants that optimize routing costs while meeting diverse customer constraints. Existing multi-task VRP solvers solely utilize a graph-based modality, limiting their ability to address variants with multiple constraints. As a format to represent complex semantics, vision modality shows great potential for encoding diverse VRP constraints. This motivates us to learn patch-level semantics from the vision images, and then integrate them into a graph-based model to solve various VRP variants simultaneously. However, directly applying this approach to multi-task VRPs presents three challenges: 1) existing VRP images lack constraint representations, which are essential for multi-task VRPs, 2) the fixed receptive field of individual patches cannot effectively accommodate varying requirements across tasks, and 3) imbalanced pixel distribution among constraints may cause the model to overlook constraints with fewer pixels. In this paper, we propose a vision-assisted foundation model (VaFM) to address these challenges. In the vision modality, input images tailored to all constraints are encoded by a convolutional neural network. The obtained patch embeddings are fused with graph-based nodes to generate solutions, with an auxiliary task designed to address the pixel-imbalanced issue. The performance of VaFM is evaluated across 16 different VRP variants. The experimental results demonstrate the superiority of VaFM over state-of-the-art methods, especially for variants with complex constraints.

[CV-71] me-frequency localization of bird calls in dense soundscapes

链接: https://arxiv.org/abs/2606.10407
作者: Simen Hexeberg,Fanghui Tong,Hari Vishnu,Mandar Chitre
类目: ound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:Passive acoustic monitoring enables large-scale observation of wildlife, but most bioacoustic classifiers only predict species presence in a time window without localizing vocalizations precisely in time or frequency, limiting downstream analyses. We formulate bird vocalization detection as an object detection task on spectrograms and train YOLO11 models to localize bird calls in dense tropical soundscapes from Singapore. We additionally introduce an open-source browser-based annotation tool and propose Intersection over Minimum (IoMin), an evaluation metric that better handles ambiguous acoustic boundaries than standard IoU and is better suited to the problem at hand. The best YOLO model nearly doubles baseline performance on in-distribution soundscapes from Singapore (81.8% vs. 42.1% IoMin@50 F1-score) while still outperforming the baseline on unseen out-of-distribution recordings from Hawaii (58.6% vs. 48.6%). These results suggest that object detection frameworks are a promising approach to time-frequency localization of animal vocalizations in complex soundscapes.

[CV-72] CoCoSI: Collaborative Cognitive Map Construction for Spatial Intelligence

链接: https://arxiv.org/abs/2606.10401
作者: Yiming Zhang,Ruoxuan Cao,Zhihang Zhong
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Spatial intelligence is a key frontier for multimodal large language models (MLLMs), enabling them to reason about the physical world from visual experience. Inspired by human spatial cognition, recent approaches construct grid-based cognitive maps from multi-frame visual inputs to maintain coherent spatial representations over time. However, limited context lengths still challenge spatial understanding, while existing methods, such as long-context modeling and external memory, often require architectural changes, memory modules, or finetuning, limiting their applicability to off-the-shelf pretrained MLLMs. This motivates a lightweight, model-agnostic method for preserving spatial information beyond the native context window. To this end, we propose a plug-and-play multi-agent framework that collaboratively constructs cognitive maps as structured spatial memory, enhancing the spatial understanding of arbitrary pretrained MLLMs without architectural modification or additional training. Our framework features local-global agent coordination, cognitive map construction with atomic commits, and cross-agent verification. Extensive experiments demonstrate that our method achieves superior performance on spatial understanding tasks while remaining fully training-free. Code will be released.

[CV-73] Efficient RWKV-based Representation Learning for 3D Point Clouds

链接: https://arxiv.org/abs/2606.10395
作者: Yun Liu,Xuefeng Yan,Liangliang Nan,Xianzhi Li,Peng Li,Zhe Zhu,Honghua Chen,Mingqiang Wei
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The recent receptance weighted key value (RWKV) model combines RNN-style recurrence, offering a linear-complexity alternative to Transformers’ quadratic self-attention for modeling global dependencies. However, when directly applied to point clouds, RWKV, originally developed for sequential text, struggles to capture local geometric structures and model spatial dependencies effectively. To address this, we propose the \textbfP-RWKV block, which bridges the gap between sequence modeling and irregular 3D geometry while preserving the efficiency advantages of RWKV. It consists of a Local Perception Expansion (LPE) component to expand contextual perception along the spatio-temporal sequence and a Spatial Context Enhancement (SCE) component to strengthen spatial awareness. To validate the effectiveness of P-RWKV for point cloud understanding, we construct PointER, a single-modality self-supervised representation learning framework whose encoder is composed of stacked P-RWKV blocks. Furthermore, we extend P-RWKV to a cross-modality setting and integrate the proposed core sub-modules into multiple architectures, demonstrating strong plug-and-play flexibility and architectural generality. Extensive experiments show that the P-RWKV block and its key sub-modules achieve competitive performance across various tasks with lower computational cost and inference latency. Code will be released upon acceptance.

[CV-74] FSS-Net: Frequency-Spatial Synergy Network with Wavelet Attention for Carotid Artery Ultrasound Segmentation

链接: https://arxiv.org/abs/2606.10378
作者: Jiawei Liu,Zhijiang Wan,Junhua Hu,Rongli Zhang,Zhongbiao Xu,Yankun Cao,Yuan Chen,Jin Hong
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate segmentation of carotid arteries in ultrasound imaging is critical for stroke risk assessment. However, speckle noise, low contrast, and blurred boundaries remain major challenges. In this paper, we propose a Frequency-Spatial Synergy Network (FSS-Net) to achieve noise-robust and high-precision carotid artery segmentation. The network integrates wavelet transform, multi-domain attention, and edge enhancement into a unified encoder-decoder architecture. Specifically, a Channel-Spatial-Wavelet Attention (CSWA) module is designed to suppress noise and purify semantic features in the frequency domain. A Wavelet-Enhanced Bottleneck (WEB) module is introduced to capture long-range global dependencies efficiently. Furthermore, a Laplacian-Guided Adaptive Edge Fusion (LAEF) module compensates high-frequency details and maintains boundary continuity. Extensive experiments on carotid ultrasound datasets show that FSS-Net achieves a Dice score (DSC) of 96.46% and strong robustness under low SNR conditions, outperforming several state-of-the-art methods. This method realizes accurate segmentation of carotid artery in ultrasonic imaging, effectively identifies carotid atherosclerotic plaque, and is verified by other task (such as segmentation of breast cancer), suggesting that it has good clinical application potential in identifying abnormal tissue masses in ultrasonic images.

[CV-75] PF-Trans: Physics-Embedded Frequency-Aware Transformer for Spectral Reconstruction

链接: https://arxiv.org/abs/2606.10373
作者: Yuzhe Gui,Tianzhu Liu,Yanfeng Gu,Xian Li
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Snapshot Broadband Filter Array (BFA) imaging provides high light throughput for spectral reconstruction but introduces severe spectral aliasing due to complex modulation. Current deep learning approaches, limited to spatial denoising, often fail to address the global frequency-specific degradations caused by the mask structure. To address this, we propose a Physics-embedded Frequency-aware Transformer (PF-Trans) for high-fidelity remote sensing spectral reconstruction. Our method explicitly integrates the physical sensing model through mask injection and a gray-scale consistency loss to ensure physical fidelity. Furthermore, we introduce a Dual-domain Block with a parallel Fast Fourier Transform (FFT) branch, enabling the network to perceive and suppress aliasing artifacts in the frequency domain. Extensive experiments on multiple datasets demonstrate that PF-Trans achieves state-of-the-art performance, achieving a Peak Signal-to-Noise Ratio (PSNR) of up to 48.50 dB on the GF-5 Shanghai dataset, significantly outperforming comparison methods.

[CV-76] ClinReadNet: A clinical reading-inspired network for low-dose abdominal CT image quality assessment

链接: https://arxiv.org/abs/2606.10372
作者: Xianye Xiao,Yulong Zou,Yujie Luo,Taihui Yu,Cun-Jing Zheng,Yuan-ming Geng,Shuihua Wang,Yudong Zhang,Jin Hong
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In abdominal CT imaging, developing a low-dose, no-reference image quality assessment (No-reference IQA) model that mimics doctors’ reading habits for evaluating CT image quality has significant practical value. This paper proposes a novel deep learning-based framework, ClinReadNet, whose design aligns with the clinical reading logic of radiologists: first, it introduces the Sobel ordinal quality network (SOQN) module, which can simultaneously focus on edge details highly relevant to image quality and the quality distribution pattern of the entire image, accurately matching the clinical image-reading judgment habit of “considering both local details and overall context”; second, the framework integrates the (shifted) window multi-scale temperature multi-head self-attention ((S)W-MTMSA) module, which further replicates the radiologists’ image-reading process of shifting from overall scanning to local focusing, and accurately locks in regions of interest through multi-sharpness attention; third, it designs the hierarchical ranked probability score (HRPS) loss function, which combines the dual logics of coarse classification and fine classification, while paying attention to the distance information between grading labels, effectively improving the performance of image quality assessment. Experiments conducted on the LDCTIQAG2023 dataset show that the proposed method achieves the current state-of-the-art (SOTA) performance: the values of Pearson’s linear correlation coefficient (PLCC), Spearman’s rank-order correlation coefficient (SROCC), and Kendall’s rank-order correlation coefficient (KROCC) reach 0.9507, 0.9554, and 0.8629 respectively, with the sum of their absolute values (Score) being 2.7690, outperforming existing methods.

[CV-77] Benchmarking stereo reconstruction for 3D printable Martian terrain models CVPR

链接: https://arxiv.org/abs/2606.10364
作者: Josephine Wang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 7 figures, CVPR End-to-End 3D Workshop 2026

点击查看摘要

Abstract:Reconstructing printable 3D models from Mars rover imagery is challenging because Martian terrain is low-texture, irregular, and partially observed. We evaluate a pipeline that estimates stereo depth from NASA Curiosity images, completes geometry, and exports watertight OBJ meshes. On Middlebury, RAFT-Stereo outperforms semi-global block matching (SGBM), reducing disparity MAE from 3.22px to 0.73px and increasing valid prediction coverage from 76.3% to 100.0%. On Curiosity imagery, however, RAFT’s denser disparities show weaker edge alignment and higher photometric reprojection error, suggesting that benchmark accuracy does not directly transfer to Martian terrain reconstruction. Geometry completion demonstrates a tradeoff between local fidelity and global connectivity. We find that alpha shapes preserve accurate but fragmented structure, Poisson reconstruction produces more coherent meshes but adds unsupported surfaces, and a deterministic diffusion-fill baseline is intermediate but sensitive to stereo quality. Overall, standard stereo and completion methods can produce printable approximations of Martian terrain, but reliable reconstruction requires stronger domain-specific validation.

[CV-78] Multi-Angular Reflectance Anisotropy Observed from UAV Multispectral Imagery

链接: https://arxiv.org/abs/2606.10350
作者: Zhenqiang Qin,Chenguang Dai,Min Wang,Xian Li
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:UAV multispectral imagery naturally contains multi-angular observations due to low flight altitude and wide field-of-view imaging, which may introduce geometry-driven radiometric variability. This study proposes a geometry-aware multi-angular observation extraction workflow to quantify observation-geometry effects from a BRDF perspective. Specifically, camera intrinsics and extrinsics are refined via structure-from-motion (SFM), and homogeneous regions annotated on an orthomosaic are reprojected onto multiple raw sub-images acquired from different viewpoints. This enables joint extraction of multi-band reflectance and observation geometry parameters for the same ground targets under varying viewing directions. The extracted observations are further analyzed using band-wise polar visualization in the (VZA, RAA) domain. Results on a grassland target show clear reflectance anisotropy across ten bands, with red-edge and nearinfrared bands exhibiting 119-137% variability between maximum and minimum reflectance, indicating non-negligible observation-geometry effects on radiometric consistency.

[CV-79] Building Change Detection in Earthquake: A Multi-Scale Interaction Network and A Change Detection Dataset

链接: https://arxiv.org/abs/2606.10329
作者: Yunlong Liu,Zekai Zhang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As one of the most destructive natural disasters, earthquakes have struck many countries around the world in recent years, causing serious economic losses. Change detection (CD) can be applied to post-earthquake damage assessment as it can infer destroyed change regions from multi-temporal remote sensing images. Furthermore, the CD with short imaging interval will better satisfy the needs of the emergency rescues after earthquakes. However, the capability of current methods built on deep neural networks is limited because the dataset with short imaging interval is absent. To meet post-disaster immediate relief, we create a CD dataset, Turkey earthquake CD dataset (TUE-CD), for the evaluation of building damage in the short term after an earthquake. Because of the short acquisition interval of the post-event images, the imaging angle is different for different temporal images, which leads to some side-looking problems. To deal with these challenges, we present a multi-scale feature interaction network (MSI-Net) for efficient interaction between bi-temporal features, as well as mitigating the effect of side-looking problems. Specifically, the proposed MSI-Net consists of joint cross-attention (JCA) modules, multi-scale offset calibration (MOC) modules, and feature integration (FeI) modules. The JCA module unifies channel cross-attention and spatial joint attention for sufficient feature interaction. The MOC module further estimates the offsets to align the bi-temporal image with the multi-scale features. Finally, calibrated features and multi-scale features are fused by FeI modules for the prediction of changed areas. Experiments on the WHU-CD, CLCD, and the constructed TUE-CD dataset indicate that the proposed MSI-Net provides better results than considered state-of-the-art CD methods.

[CV-80] Content-Induced Spatial-Spectral Aggregation Network for Change Detection in Remote Sensing Images

链接: https://arxiv.org/abs/2606.10328
作者: Yunlong Liu,Zekai Zhang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The integration of spatial and spectral information is beneficial to the improvement of change detection performance. However, existing methods cannot efficiently suppress the influences of spatial and spectral differences in unchanged areas. To address these issues, in this paper we propose a content-guided spatial-spectral integration network (CSI-Net) for the fusion of global spatial details and spectral difference information. Specifically, the proposed CSI-Net is composed of a spatial reasoning (SR) module, a spectral difference (SD) module, and a content-guided integration (CGI) module. In the SR module, the spatial information is learned by cascaded graph convolution blocks for global modeling. The SD module is responsible for the extraction of spectral features, by calculating the means and variances of features to reduce the impact of spectral differences in unchanged regions. In addition, in order to integrate the spatial-spectral features efficiently, we design a CGI module to further take advantage of their complementary information. In this module, high-level content information is introduced as a guide for a proper interaction. Due to the efficient spatial-spectral fusion, the proposed CSI-Net can learn the changed features better while achieving a suppression of spectral differences. Experimental results on LEVIR-CD, WHU-CD, and CLCD datasets demonstrate that the proposed CSI-Net produces better performance compared to state-of-the-art methods, and is applicable to different scenarios

[CV-81] Dissect and Prune: Enhancing Robustness in AI-Generated Image Detection ICML2026

链接: https://arxiv.org/abs/2606.10309
作者: Dahye Kim,Jaehyun Choi,Hyun Seok Seong,Seongho Kim,Donghun Lee,Sungwon Yi,Jang-Ho Choi
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 25 pages, 9 figures, 9 tables, Accepted to ICML 2026; includes appendix

点击查看摘要

Abstract:While existing AI-generated image detectors report high performance, we identify that this is largely driven by a critical prediction asymmetry: a bias toward the real class that severely limits sensitivity to generated content, especially under standard post-processing operations such as compression and resizing. We hypothesize that this stems from the model’s reliance on spurious features, distracting signals that obscure true generative artifacts. To address this, we propose DEAR (Dissect and Prune), which leverages inpainted images to identify and prune these interfering components. Specifically, we find that features strongly aligned to either inpainted or non-inpainted regions are less robust to post-processing. By measuring the alignment between channel activations and inpaint masks, DEAR removes features at both extremes, retaining only those that capture genuine generative artifacts. Experimental results demonstrate that our approach significantly enhances robustness against unseen generators and post-processing, effectively mitigating the prediction asymmetry. Our code is available at this https URL.

[CV-82] FoA-SR: Faithful or Aesthetic? Profile-Aware Preference Optimization for Real-World Image Super-Resolution

链接: https://arxiv.org/abs/2606.10275
作者: Amjad Mahdi Alqarni,Peizhong Ju
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 6 figures, 9 tables. Preprint

点击查看摘要

Abstract:Real-world image super-resolution (SR) is often designed with a single restoration objective, despite the current capacity of generative models to produce multiple high-quality reconstructions for the same input. In this paper, we argue that the best restoration strategy is subject to the specific restoration profile: a Faithful restoration prioritizes reference consistency, structure preservation, and hallucination suppression, whereas an Aesthetic restoration prioritizes visually pleasing and natural-looking details. We propose FoA-SR, a novel preference optimization approach to real-world SR based on profiles. To achieve this goal, FoA-SR starts with our supervised FLUX.2-based SR adapter (Flux2SR) trained with LR latent conditioning, flow matching, and image-space reconstruction losses for paired LR-to-HR image super-resolution. Following the development of the shared supervised super-resolution adapter, FoA-SR generates a shared stochastic candidate pool for each input image and ranks the same candidates using profile-specific Faithful and Aesthetic rewards to mine winner-loser pairs. These pairs are used to fine-tune separate LoRA adapters while keeping the base model frozen. Experiments on RealSR and DIV2K show that FoA-SR can steer the same SR adapter towards distinct restoration objectives: a Faithful adapter improves reference-consistent metrics while an Aesthetic adapter boosts metrics that measure perceptual quality without reference. Our candidate-pool analysis shows that Faithful and Aesthetic rewards frequently select different winners, and a Hybrid-LoRA ablation shows that collapsing both profiles into one reward yields an implicit compromise rather than explicit profile control.

[CV-83] Dual-Branch Gated Fusion for Open-Set Audio Deepfake Source Tracing

链接: https://arxiv.org/abs/2606.10223
作者: Awais Khan,Kutub Uddin,Khalid Malik
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Attributing a synthetic utterance to its originating system remains an open challenge: closed-set models fail to reject unseen synthesizers and produce overconfident predictions. To address this, we propose a dual-branch gated fusion framework that pairs XLSR-53 with CORES, a 66-dimensional descriptor that, unlike prior Linear Filter Bank (LFB)-only work, spans cepstral, oscillatory, rhythmic, energy, and spectral dimensions to capture complementary synthesis artifacts. Our analysis shows XLSR-53 remains discriminative in-domain (ID) while CORES generalizes stably under distribution shift (OOD), yet their naive concatenation fails due to SSL representational imbalance. To resolve this, an input-conditioned gate adaptively weights each branch under joint training with cross-entropy, an energy margin loss for ID/OOD separation, and a gate diversity term. On the MLAAD benchmark, our system achieves 97.6% ID accuracy, 4.9% EERc, and an 83.5% relative FPR95 reduction over the Interspeech 2025 baseline.

[CV-84] An Improved Generative Adversarial Network for Micro-Resistivity Imaging Logging Restoration

链接: https://arxiv.org/abs/2606.10200
作者: Ahmed Faizul Haque,S.M. Riaz Rahman Antu,Saif Ahmed,Asadullah Hil Galib,Souvik Pramanik,Mohammad Ashrafuzzaman Khan,Mohammad Abdul Qayum,Mohsin Sajjad
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 7 pages, 9 figures

点击查看摘要

Abstract:An improved GAN-based imaging logging image restoration method is presented in this paper for solving the problem of partially missing micro-resistivity imaging logging images. The method uses FCN as the generative network infrastructure and adds a depth-separable convolutional residual block to learn and retain more effective pixel and semantic information; an Inception module is added to increase the multi-scale perceptual field of the network and reduce the number of parameters in the network; and a multi-scale feature extraction module and a spatial attention residual block are added to combine the channel attention. The multi-scale module adds a multi-scale feature extraction module and a spatial attention residual block, which combine the channel attention mechanism and the residual block to achieve multi-scale feature extraction. The global discriminative network and the local discriminative network are designed to gradually improve the content and semantic structure coherence between the restored parts and the whole image by playing off each other and the generative network. According to the experimental results, the average structural similarity measure of the five sets of imaged logging images with different sizes of missing regions in the test set is 0.903, which is an improvement of about 0.3 compared with other similar methods. It is shown that the method in this study can be used for the restoration of micro-resistivity imaging log images with good improvement in semantic structural coherence and texture details, thus providing a new deep learning method to ensure the smooth advancement of the subsequent interpretation of micro-resistivity imaging log images.

[CV-85] Density Ridge Selective Prediction for LLM and VLM Hallucination Detection under Calibration Label Scarcity

链接: https://arxiv.org/abs/2606.10198
作者: Nina I. Shamsi
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Hallucination detection in large language and vision-language models is increasingly framed as selective prediction, where a detector assigns a confidence score and abstains when confidence is low. Unsupervised sampling detectors (Semantic Entropy, EigenScore) avoid labels but plateau in quality, while supervised probes (SAPLMA) attain stronger in-distribution scores yet degrade sharply when calibration labels are scarce. We recover the response manifold of an LLM as the density ridge of a kernel density estimate built on a six-dimensional kinematic feature map of hidden state generation trajectories. A test generation is scored by the negated Euclidean distance from its projected feature point to the nearest ridge vertex, yielding a low-dimensional geometric skeleton of the stochastic output distribution. We evaluate against Semantic Entropy, SAR, EigenScore, SAPLMA, and log-probability on seven QA benchmarks (HaluEval-QA, TriviaQA, GSM8K, POPE, ScienceQA, A-OKVQA) using nine text and vision LLMs in a deliberately label-scarce protocol ( n_\textcal=200 queries, N=5 generations). Our ridge-based score beats on AUROC with 5-20 points gain, while demonstrating tempered degradation under calibration-label scarcity.

[CV-86] Fisher-Guided Progressive Parameter Selection for Adaptive Fine-Tuning

链接: https://arxiv.org/abs/2606.10196
作者: Ghodsiyeh Rostami,Po-Han Chen,Mahdi S. Hosseini
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Parameter-efficient fine-tuning (PEFT) aims to adapt pretrained models with a small trainable parameter subset, however, most existing methods choose this subset from fixed architectural heuristics rather than using dynamic, task-aware criteria. We introduce \textbfFisherAdapTune, a Fisher-guided Adaptive Fine-Tuning framework that progressively selects parameter groups by tracking the temporal drift of their Fisher geometry. Starting from a PAC-Bayesian view of fine-tuning, we decompose the generalization error bound into Fisher-weighted update costs and show that parameter groups whose curvature contribution has stabilized can be frozen to reduce the error bound without interrupting the remaining adaptation dynamics. FisherAdapTune formulates this criterion with a scale-invariant Jensen-Shannon distance between consecutive Fisher distributions, yielding an adaptive active parameter set. We evaluate our approach on a downstream segmentation task, and results show FisherAdapTune improves the in-distribution performance and zero-shot transfer in multiple settings, validating that Fisher structural drift is a useful signal for efficient, task-aware adaptation. We release our \hrefthis https URLcode publicly to enable further application of our proposed approach.

[CV-87] Making Time Editable in Video Diffusion Transformers

链接: https://arxiv.org/abs/2606.10183
作者: Konstantin Kuklev,Viacheslav Vasilev,Alexander Kunitsyn,Andrei Ivaniuta,Denis Dimitrov
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Modern Diffusion Transformers for video generation provide limited control over the progression of time and the editing of temporal dynamics. We propose a temporal-control methodology that extends a pretrained DiT with explicit time editing, allowing control over motion speed and temporal structure without redesigning the backbone. Its core implementation augments the pretrained model with a lightweight temporal module, preserving the original generative prior while expanding its controllable dynamic range.

[CV-88] A Large Scale Open-Source Image and Video Dataset for Robust Wildfire Detection and Classification

链接: https://arxiv.org/abs/2606.10174
作者: Emadeldeen Hamdan,Yingyi Luo,B. Ugur Toreyin,Erdem Koyuncu,Adam J. Watts,Ugur Gudukbay,Ahmet Enis Cetin
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Wildfire detection and monitoring are critical for mitigating fire spread and reducing environmental and infrastructural damage. In this work, we introduce GWFP (Global Wildfire Prevention Dataset), a large-scale, open-source dataset of wildfire images and videos designed to support early fire and smoke detection research. GWFP contains geographically diverse wildfire scenes, including flames, smoke, Waterdog/Fog environmental conditions, Near Infrared (NIR) imagery, Ember, and challenging negative samples collected from real-world scenarios worldwide. To evaluate dataset robustness and cross-domain generalization, we benchmark multiple convolutional and transformer-based architectures across both in-domain and cross-dataset settings. Additionally, we explore lightweight frequency–spatial feature interaction using Hadamard-enhanced residual connections (HTE-ResNet) to analyze representation robustness under domain-shift conditions. Experimental results demonstrate strong cross-dataset generalization and practical utility for real-world wildfire monitoring applications. The dataset and source code will be publicly released upon acceptance.

[CV-89] FlexPath: Learned Semantic Path Priors for Image-Based Planning

链接: https://arxiv.org/abs/2606.10167
作者: Taehyoung Kim,Tim Schoenbrod,David Eckel,Henri Meeß
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent learning-based path planners use neural networks to process visual map representations and approximate heuristics for classical search algorithms, yielding near-optimal paths with reduced search effort. However, these methods are tied to the shortest-path objective implicit in their supervision, which limits their flexibility to accommodate alternative criteria. We introduce FlexPath, a two-stage framework that decouples feasibility from preference. In Stage 1, we use imitation learning to acquire a task-independent spatial prior over feasible paths from visual map inputs. In Stage 2, differentiable Path Shape Objectives (PSOs) adapt this prior toward task-specific criteria without relearning path structure, requiring only efficient objective-level adaptation. A single pretrained model can be adapted to multiple objectives. For shortest-path planning, FlexPath reduces search effort on TMP by 14.3% compared to the state-of-the-art TransPath, while also finding lower-cost paths on average and demonstrating strong zero-shot generalization across three unseen domains. For obstacle clearance with minimum clearance distance 2, it achieves 96.8% full obstacle avoidance while maintaining low search cost. The framework further extends to semantic-aware avoidance and waypoint guidance via objective-level adaptation, and remains compatible with classical planners at inference time. Data and code are available at this https URL.

[CV-90] Fusing Satellite Imagery and Planimetric Maps for Cross-View Localization

链接: https://arxiv.org/abs/2606.10166
作者: Quang Long Ho Ngo,Zimin Xia,Alexandre Alahi
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Current cross-view localization methods predominantly rely on satellite imagery as the aerial modality. Although recent work explores planimetric maps (e.g., OpenStreetMap tiles), these approaches often lag in performance. Yet both modalities are widely available and possess complementary properties. Satellite images are closer to ground-level camera imagery, offering finer detail, whereas planimetric maps contain annotated objects (e.g., streetlamps) and remain informative in areas where the ground is occluded, such as by foliage. Despite this, only one prior work provides an end-to-end method to fuse the two modalities, and it does not demonstrate their potential within state-of-the-art methods. To combine the strengths of both modalities, we propose a new fusion module that augments standard encoders and demonstrates that integrating satellite imagery with planimetric maps improves state-of-the-art single-modality methods. The module comprises (i) cross-modal conditioning, which processes each modality’s encoding with awareness of the other, and (ii) a patch-level fusion rule that controls the granularity of information exchange. We achieve state-of-the-art results, reducing the mean localization error by 30.13%. Qualitatively, the fusion adaptively selects the more informative modality, improving overall accuracy.

[CV-91] DB-3DME: From Dataset to Benchmark for Human-aligned Automatic 3D Mesh Evaluation CVPR2026

链接: https://arxiv.org/abs/2606.10142
作者: Nanshan Jia,Zhenyu Zhao,Sui Huang,Jingshen Wang,Zeyu Zheng
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026 workshop paper. 10 pages, 3 figures, 6 tables. Dataset available at GitHub and Hugging Face

点击查看摘要

Abstract:Recent advances in 3D generation have led to substantial improvements in realism, controllability, and efficiency, yet the evaluation of 3D assets remains underexplored. Existing evaluation paradigms, including human evaluation, learned metrics, and vision-language models (VLMs) as judges, suffer from limitations in cost, scalability, resolution handling, or task-specific alignment. In this work, we focus on 3D mesh evaluation and introduce DB-3DME, the Dataset and Benchmark for 3D Mesh Evaluation. DB-3DME contains 2,619 synthetic 3D meshes paired with human ratings on Geometry and Prompt Adherence. Using this dataset, we systematically benchmark state-of-the-art VLMs and identify visual encoding of 3D representations as a key factor for human-aligned evaluation performance. Motivated by this finding, we fine-tune an open-weight VLM, Qwen-2.5-VL-7B, for 3D mesh evaluation by adapting the visual encoder while freezing the language model. The fine-tuned model substantially outperforms existing pre-trained VLMs across multiple evaluation dimensions, establishing a new benchmark for automatic 3D mesh evaluation. We publicly release the benchmark dataset on GitHub and Hugging Face to facilitate future research.

[CV-92] SAGE: A Human-in-the-Loop Framework for Remote Sensing Semantic Segmentation via Sparse Point Supervision

链接: https://arxiv.org/abs/2606.10136
作者: Osmar Luiz Ferreira de Carvalho,Osmar Abilio de Carvalho Junior,Anesmar Olino de Albuquerque,Daniel Guerreiro e Silva
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 47 pages, 8 tables, 6 figures

点击查看摘要

Abstract:Semantic segmentation in remote sensing requires costly pixel-level annotations, and nearly every problem demands a new dataset since models rarely transfer across sensors, platforms, or geographies. Existing human-in-the-loop frameworks expand sparse clicks into dense supervision via auxiliary machinery (pseudo-labels, propagation, CRFs, foundation-model prompts, auxiliary heads), all operating on the model’s predictive distribution. A confidently wrong pixel is indistinguishable from a confidently correct one in that distribution by construction, so no rule reading it can separate the two; the distinguishing signal is external to the model. This paper hypothesizes that expert clicks targeting confident model errors, not arbitrary pixels, suffice to match dense supervision, with no expansion machinery. iSAGE (Iterative Sparse Annotation Guided by Expert) realizes this hypothesis on an integrated open-source platform, where an error-weighted loss amplifies the gradient at each click and the annotation record itself is the dataset, extensible, correctable, and auditable. Experiments use a minimum-effort regime: at most one labeled pixel per class per frame. On BsB Aerial, iSAGE recovers 97.2% of dense supervision (74.79% mIoU on 0.040% of pixels) with contrasting class dynamics: amorphous classes (permeable areas) saturate from the seed, while small classes (cars) require late-iteration effort. On ISPRS Vaihingen (external benchmark), iSAGE reaches 76.78% mIoU with 0.011% of pixels, matching the dense baseline (76.65%) and exceeding all published methods. Under the same pipeline, four output-reading mechanisms (oracle entropy across budgets 1–100x, pseudo-labels across thresholds 0.90–0.99, CRF-based propagation, uniform random) plateau 7.4 to 14.5 pp below iSAGE. Across 31 surveyed methods, iSAGE is the only iterative human-in-the-loop framework operating without auxiliary machinery.

[CV-93] BiWM: Advancing Open-Source Interactive Video World Models with Bidirectional Autoregression

链接: https://arxiv.org/abs/2606.10135
作者: Shaohao Rui,Xiaofeng Mao,Zhanyu Zhang,Peijia Lin,Yansong Zhu,Yibo Zhang,Haibin Wan,Weijie Ma
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Transitioning bidirectional video diffusion models into an autoregressive paradigm improves the interactivity of video world models, but existing causal pipelines need many stages (control fine-tuning, autoregressive training, causal initialization, few-step distillation) and still trail bidirectional models in quality due to error accumulation. Recent world models such as Yume-1.5 and Matrix-Game-3.0 instead adopt a bidirectional autoregressive approach, gaining fidelity and stable long-horizon rollout from self-correcting error propagation, yet open-source frameworks (e.g., minWM) support only causal models. We present BiWM, the first full-stack framework for interactive video world models under the bidirectional autoregressive paradigm, jointly optimizing generation quality and inference speed. From a pretrained video backbone, BiWM injects camera control by fine-tuning, then runs a few-step Distribution Matching Distillation (DMD) stage that turns the backbone into an action/camera-controllable world model: just two training stages instead of four in minWM, converging in a few hundred steps on 8xH200 GPUs. A single recipe spans Wan2.1-1.3B, Wan2.2-5B, HunyuanVideo-1.5-8B, and LTX-2.3-22B, and also supports secondary fine-tuning of existing bidirectional models. BiWM enables real-world camera control where minWM loses controllability, integrates pluggable history compression (FramePack-style and PackForcing-style) for long rollouts, and offers an optional NVFP4 4-bit training/inference pipeline. To counter DMD’s mode-seeking degradation, we add GAN and mass-covering forward-KL objectives that preserve scene dynamics. We open-source BiWM for resource-constrained research and high-fidelity environment simulation.

[CV-94] Improving PET/CT-Based Whole-Body Lesion Segmentation Using Prediction Uncertainty-Augmented Models

链接: https://arxiv.org/abs/2606.10115
作者: Bashirul Azam Biswas,Biratal Raj Wagle,Zhihan Yang,Marc A. Seltzer,Matthew E. Maeder,James B. Yu,Indrani Bhattacharya
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 32 pages, 10 figures, 5 tables

点击查看摘要

Abstract:Accurate lesion segmentation from whole-body Positron Emission Tomography (PET)/Computed Tomography (CT) scans is essential for cancer staging and treatment planning. PET provides functional metabolic information with different radiotracers, while CT offers anatomical localization. Lesion delineation from PET/CT imaging is clinically challenging due to subtle imaging features, confounders, and inter-reader variability. Existing deep learning approaches suffer from training-related stochasticity, inconsistent predictions, missed lesions in high tumor-burden cases, and lack uncertainty quantification, limiting their clinical reliability. Using nnU-Net as a baseline, we propose an uncertainty-aware framework for whole-body PET/CT lesion segmentation that integrates (1) Bayesian ensembling to reduce training stochasticity, (2) voxel-wise uncertainty quantification with epistemic and aleatoric decomposition, and (3) epistemic uncertainty-augmented training to improve lesion detection. Two public datasets, AutoPET-III (1,611 scans) and Deep-PSMA (200 scans), comprising FDG and PSMA studies across multiple cancer types, are used for training and evaluation. Bayesian ensembling improves robustness and performance over deterministic nnU-Net models on the unseen AutoPET-III test set. Uncertainty maps highlight regions of model disagreement and correlate with misclassifications, particularly false positives. Uncertainty-augmented training improves lesion recovery at the cost of increased FPVol, reflecting a precision-recall trade-off. A case-adaptive routing strategy further improves Dice by selecting between the base and augmented models. To our knowledge, this is the first study to systematically investigate uncertainty quantification in multi-tracer, pan-cancer PET/CT segmentation and to combine Bayesian ensembling with uncertainty-aware modeling for this task.

[CV-95] Maximum Matching Accuracy: An Instance Segmentation Evaluation Metric Utilizing Globally Optimal Matching

链接: https://arxiv.org/abs/2606.10107
作者: Kaden Stillwagon,Alexandra D. VandeLoo,Craig R. Forest
类目: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:Reliable evaluation of instance segmentation models requires metrics that accurately and consistently reflect segmentation quality. However, the metrics most widely used in biological imaging carry fundamental mathematical weaknesses: hard Intersection-over-Union (IoU) thresholds that produce discontinuous, low sensitivity scoring; per-object normalization that distorts scores under object size variation; and greedy or one-to-many matching procedures that yield non-optimal, order-dependent correspondences. Together, these properties produce unintuitive and unreliable model rankings under common failure modes such as split cells, merged cells, and cell boundary imprecision. We propose Maximum Matching Accuracy (MMA), a threshold-free continuous metric that finds a globally optimal one-to-one matching between predicted and ground truth objects and aggregates total overlap using per-pixel normalization. We evaluate MMA against AP@50, PQ, SEG, and AJI across three experiments: synthetic failure cases, progressive corruption tests, and a model ranking comparison. MMA produces scores that are more stable, more sensitive, and more interpretable than existing alternatives, providing a principled foundation for fair instance segmentation benchmarking in biological cell imaging.

[CV-96] Interpretable Temporal Facial-Region Motion Analysis for In-the-Wild Parkinsons Disease Video Classification

链接: https://arxiv.org/abs/2606.10088
作者: Riyadh Almushrafy(Majmaah University, Saudi Arabia)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 22 pages, 6 figures. Submitted to Biomedical Signal Processing and Control

点击查看摘要

Abstract:Reduced facial expressivity is a common motor manifestation of Parkinson’s disease (PD), often described as hypomimia or facial bradykinesia. This paper examines whether temporal motion descriptors extracted from facial-region keypoints can support in-the-wild PD-related video classification on the YouTubePD benchmark. Each video is represented using geometric descriptors from 14 predefined facial regions. Static geometry, normalized geometry, velocity-based descriptors, relative-velocity descriptors, and a GRU sequence baseline are compared under the same binary classification protocol. To assess stability and interpretability, the study includes seed-robustness analysis, region-level ablation, and permutation importance. The best result is obtained with normalized velocity descriptors and a Random Forest classifier, reaching a balanced accuracy of 0.826 and an AUROC of 0.855 on the held-out test split. Across 10 random seeds, this representation remains stable, with balanced accuracy of 0.810 +/- 0.018 and AUROC of 0.855 +/- 0.005. Overall, the results suggest that normalized facial-region motion is a lightweight and interpretable representation for YouTubePD video classification. The study is framed as a benchmark-level analysis and does not claim clinical severity assessment or MDS-UPDRS facial-expression scoring.

[CV-97] A Controlled Audit of Pretraining Contamination in Public Medical Vision-Language Benchmarks

链接: https://arxiv.org/abs/2606.10066
作者: Bruce Changlong Xu,Lan Wu,Alexander Ryu
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 30 pages, 7 figures, 9 tables. Preprint

点击查看摘要

Abstract:Medical vision-language models (VLMs) are evaluated on public benchmarks whose images and question-answer pairs have been freely downloadable for years, yet reported accuracy assumes these examples were absent from pretraining. We audit open VLMs on SLAKE-En, PathVQA, VQA-RAD, and an auxiliary public OmniMedVQA mirror using four detector families: image-side near-neighbour overlap against PMC-OA-beta, canonical-order exchangeability, cohort-relative Min-K%++ tail enrichment, and cross-model top-K overlap. We find measurable image-side source overlap on SLAKE-En: 19.8% of images are flagged under SigLIP-B-16 and 4.2% under SigLIP-SO400M, while out-of-domain controls produce 0/2000 flags. Manual adjudication shows same-modality, same-projection matches to different patients rather than verified pixel-level duplicates, so we interpret this as source or distributional overlap rather than confirmed per-image memorization. On the text side, Qwen2.5-VL on SLAKE-En shows a canonical-order exchangeability signal that survives ordering ablation and external non-medical baselines. On the OmniMedVQA mirror, exchangeability fires for five medical and general VLMs while BLIP-2 remains clean. In contrast, cohort-relative Min-K%++ tail enrichment and cross-model top-K overlap collapse under an external pre-domain baseline: BLIP-2 reproduces the apparent positive signals despite lacking plausible medical-VQA exposure. We conclude that these cohort-relative detectors are unreliable as standalone membership-inference signals on small medical-VLM cohorts.

[CV-98] Continuous Neural Reparameterization as a Deep Geometric Prior for Robust Fixed-Chart UV Repair

链接: https://arxiv.org/abs/2606.10050
作者: Mohammad Sadegh Salehi
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Traditional UV unwrapping relies on direct optimization of geometric distortion energies and can fail through invalid initialization, local minima, or topological foldovers. We recast fixed-chart UV unwrapping as continuous neural reparameterization: an untrained SIREN maps per-vertex mesh features to UV coordinates, and its weights are optimized for a geometric objective. The practical contribution is a robust chart-solver recipe, combining Laplace–Beltrami spectral inputs, Tutte residual warm-up, a C^2 determinant extension, an injectivity barrier, and validity-checked retry/fallback routing, rather than a claim that any single component guarantees validity or that recutting methods should be replaced. NTK–LBO diagnostics show that spectral conditioning changes update geometry, especially at initialization and mid-rank subspaces, but does not by itself predict chart success. On compact pre-cut charts and a 47-chart stratified Thingi10K/xatlas-cut benchmark, the neural solver produces zero flips on all compact charts and 42/47 valid zero-flip stratified solves. BFF and OptCuts comparisons sharpen the scope: recutting can be faster and lower-distortion when allowed, while the neural solver targets supplied-chart validity and validation-first atlas construction. On Amara Spatial generated meshes, the full atlas construction path gives packed-atlas coverage on a 25-asset set and 1000/1000 strict locally valid atlases with zero UV flips in a large-scale Rust atlas run after fallback routing.

[CV-99] GHOST: Hierarchical Sub-Goal Policies for Generalizing Robot Manipulation

链接: https://arxiv.org/abs/2606.10025
作者: Sriram Krishna,Ben Eisner,Haotian Zhan,Ying Yuan,Haoyu Zhen,Chuang Gan,Shubham Tulsiani,David Held
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at RSS 2026

点击查看摘要

Abstract:We present GHOST, a framework for learning visuomotor manipulation policies that generalize beyond the training distribution. GHOST factorizes control into (i) a high-level policy that predicts the next sub-goal as a distribution over 3D end-effector poses from multi-view RGB-D observations, and (ii) a low-level goal-conditioned controller that executes embodiment-specific actions. To condition image-based policies on 3D goals, we introduce a simple spatial interface that projects predicted goals into the image plane and represents them as end-effector heatmaps. Across a suite of manipulation tasks, this hierarchical factorization consistently improves performance and robustness compared to a flat Diffusion Policy. Further, we show that this hierarchical interface also makes it easy to incorporate human demonstrations without relying on (noisy) action retargeting. As sub-goals are largely embodiment-agnostic, we train the high-level policy on human video to specify how learned skills should be applied and composed, while keeping the low-level policy trained purely on robot data. This hierarchy enables adaptation to novel objects and task variations using a small number of human demonstrations. Comments: Accepted at RSS 2026 Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG) Cite as: arXiv:2606.10025 [cs.RO] (or arXiv:2606.10025v1 [cs.RO] for this version) https://doi.org/10.48550/arXiv.2606.10025 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[CV-100] SpineReport: Automated 3D Quantification and Reporting of Lumbar Spine Degeneration on MRI

链接: https://arxiv.org/abs/2606.10021
作者: Nathan Molinier,Adrian A. Marth,Reto Sutter,Christoph Germann,Jacob A. Connolly,Mathieu Guay-Paquet,Nathan D. Schilaty,Kenneth A. Weber II,Julien Cohen-Adad
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to Medical Image Analysis

点击查看摘要

Abstract:Lumbar spine conditions are a leading cause of disability worldwide, yet reliable quantification of degeneration from MRI remains challenging. In clinical practice, analysis is predominantly performed in two dimensions (2D), as manual three-dimensional (3D) assessment is time-consuming. However, 2D measurements suffer from limited reproducibility, particularly when anatomical structures are not aligned with the imaging plane. Existing automated approaches are often restricted to 2D, rely on discrete grading, or lack robustness and interpretability. We introduce SpineReport, an open-source, fully automated framework for comprehensive 3D morphometric analysis of lumbar spine MRI. Leveraging robust anatomical segmentations, the method extracts quantitative metrics from key structures, including the spinal canal, spinal cord, vertebrae, intervertebral discs, and foramina. These include both morphological and signal-based features, enabling cross-subject and longitudinal assessment. SpineReport further generates subject-specific reports that allow comparison with cohort distributions, improving interpretability and objective characterization of spinal morphology. Clinical relevance was evaluated against radiologist-reported severity grades for central canal, lateral recess, and foraminal stenosis. Metrics showed strong associations with central canal stenosis severity, with T2-weighted CSF signal providing the highest performance (AUC = 0.95). Canal AP diameter and area ratios also demonstrated strong correlations and high discriminative ability (AUC 0.80). For lateral recess stenosis, associations were moderate, with lateral CSF signal being the most informative (AUC = 0.73). No significant associations were observed for foraminal stenosis despite robust region-of-interest extraction. SpineReport is released as an open-access tool: this https URL

[CV-101] Generalized-CVO: Fast and Correspondence-Free Local Point Cloud Registration with Second Order Riemannian Optimization

链接: https://arxiv.org/abs/2606.10019
作者: Ray Zhang,Marcus Greiff,Thomas Lew,John Subosits
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 16 pages, 12 figures

点击查看摘要

Abstract:We propose a fast and correspondence-free local point cloud registration method that leverages geometric surface structure and reproducing kernel Hilbert space (RKHS) embeddings. The method represents point clouds as continuous functions with point-wise anisotropic kernels that encode local geometry. This formulation improves alignment along surface normals while relaxing alignment along tangential directions. To solve the resulting registration problem, we propose a second-order on-manifold optimization scheme with approximate Riemannian Hessians, achieving a speedup of up to 10x over the first-order solvers used in prior correspondence-free RKHS-based methods. We demonstrate improved frame-to-frame LiDAR and RGB-D tracking accuracy across diverse indoor and outdoor datasets. On a LiDAR tracking registration task in the driving domain, we achieve a reduction of 55% in both translational and rotational drift in challenging feature-sparse environments. On object registration benchmarks, we show improved robustness over ICP-based methods and further gains when refining global initialization, particularly under moderate misalignment.

[CV-102] ABot-Earth 0.5: Generative 3D Earth Model

链接: https://arxiv.org/abs/2606.09967
作者: Ming Qian,Tianjian Ouyang,Mingchao Sun,Zijian Wang,Jincheng Xiong,Jiarong Han,Yongchang Zhang,Jiawei Zhang,Xu Wang,Yu Liu,Luyang Tang,Fei Yu,Zengye Ge,Mengmeng Du,Yuan Liu,Nianfei Fan,Song Wang,Yingliang Peng,Chunxue Jia,Yang Liu,Shiying Zeng,Haozhe Shi,Junnan Lai,Hongyu Pan,Zheng Wu,Ning Guo,Mu Xu,Hang Zhang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: From Amap-cvlab, Alibaba. Official page: this https URL

点击查看摘要

Abstract:We present ABot-Earth 0.5, a generative 3D framework designed to synthesize vast, seamless 3D environments from ubiquitous, geospatially referenced satellite imagery. To achieve this, we propose a novel generative model formulated directly with the 3D Gaussian Splatting (3DGS) representation. The model is trained on a diverse corpus of existing real-world urban reconstructions, learning to generate realistic geometry and textures. At inference, it synthesizes novel 3D scenes conditioned solely on satellite imagery at a scalable rate of under 10 minutes per square kilometer, while demonstrating exceptional realism. The framework is designed for accessibility, with integrated hierarchical level-of-detail (LOD) structures that permit real-time, interactive visualization on web-based map engines. This high-fidelity simulation sandbox effectively mitigates the sim-to-real domain gap, enabling critical downstream Embodied AI applications like closed-loop UAV navigation. By providing an ultra-low-cost and high-efficiency solution, ABot-Earth 0.5 significantly lowers the technical and financial barriers to large-scale 3D reconstruction and empowers the future of global digital earth visualization.

[CV-103] SPARX: Secure and Privacy-Aware Approximate CNN Acceleration with Edge RISC-V SoC

链接: https://arxiv.org/abs/2606.09946
作者: Sonu Kumar,Akash Sankhe,Mukul Lokhande,Santosh Kumar Vishvakarma
类目: Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV)
备注: Under review in 12th International Symposium on Smart Electronic Systems (iSES) 2026

点击查看摘要

Abstract:Edge-AI systems increasingly require real-time CNN inference under strict energy, performance, security, and privacy constraints. Approximate computing improves hardware efficiency by exploiting the error resilience of neural network workloads; however, most approximate CNN accelerators do not jointly consider secure, privacy-aware edge deployment. This paper presents SPARX, a Secure and Privacy-Aware Approximate CNN Acceleration framework integrated within a heterogeneous RV32IMC RISC-V System-on-Chip (SoC). SPARX combines a custom RISC-V instruction extension, an approximate logarithmic CNN acceleration unit, a lightweight differential-noise-based privacy engine, and a challenge-response authentication mechanism. To guide arithmetic selection, an approximation-aware decision framework is introduced that uses the Approximation Severity Index (ASI), Approximation Efficiency (AE), Quality of Approximation (QoA), Approximation Figure-of-Merit (AFOM), and Hardware Acceleration Efficiency (HAE). Evaluation across 11 state-of-the-art approximate MAC architectures identifies the Iterative Logarithmic Multiplier (ILM) as the most suitable design, achieving 51.7% area reduction, 81.5% power reduction, and 2.13x throughput improvement compared with an accurate radix-4 Booth MAC, while only reducing ResNet-20/CIFAR-10 accuracy by 2.82 percentage points. FPGA implementation on a Xilinx VC707 platform achieves 58.4 GOPS/W energy efficiency at 250 MHz, while 28-nm CMOS physical implementation validates ASIC feasibility

[CV-104] Bypassing Copyright Protection in Diffusion-based Customization via Two-Stage Latent Feature Optimization KDD2026

链接: https://arxiv.org/abs/2606.09909
作者: Ziang Xu,Wenbo Yu,Hongyao Yu,Hao Fang,Jiawei Kong,Bin Chen,Hao Wu,Shu-Tao Xia,Zhiyong Wu
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: accepted by KDD 2026

点击查看摘要

Abstract:With the growing concerns over copyright infringement in diffusion-based customization, adversarial attacks have emerged as a prominent defense strategy to prevent malicious content forgery in personalized image generation. However, current defenses typically introduce persistent perturbations in the latent space of Latent Diffusion Models (LDMs), which remain susceptible to adaptive bypasses by adversaries. In this paper, we introduce Two-Stage Latent Feature Optimization (TS-LFO), an efficient and effective copyright-stealing attack against protected diffusion-based customization. We begin by observing that existing defenses primarily disrupt the mapping between input images and their latent representations, thereby degrading the model’s ability to produce personalized outputs. To counteract this, TS-LFO restores the broken mapping through a two-stage optimization process. In the Latent Denoising Stage, we enhance semantic consistency between latent codes and input images by jointly minimizing a Latent-Image Alignment Loss and a Latent Diffusion Loss with timestep-dependent weights, effectively suppressing the high-frequency noise introduced by defenses. In the Latent Reconstruction Stage, we recover low-frequency semantic information using pixel-level constraints to refine the latent features. Extensive experiments show that TS-LFO consistently bypasses state-of-the-art (SOTA) copyright defenses and outperforms SOTA copyright attacks such as DiffPure, GrIDPure and IMPRESS across diverse settings.

[CV-105] WHU-Infra3D: A Full-stack Multi-modal Dataset and Benchmark for 3D Roadside Infrastructure Inventory

链接: https://arxiv.org/abs/2606.09882
作者: Chong Liu,Luxuan Fu,Xuyu Feng,Zhen Dong,Bisheng Yang
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The paradigm of digital twin cities is shifting from coarse visual mapping toward more precise and actionable digitization of urban assets. However, existing datasets predominantly focus on coarse visual perception, lacking the strict multi-modal alignment and attribute and status diagnosis required for automated infrastructure maintenance. To bridge this gap, we introduce WHU-Infra3D, a large-scale, multi-modal benchmark dataset dedicated to roadside infrastructure inventory. Covering 53.8 km across three cities, WHU-Infra3D uniquely integrates panoramic imagery and LiDAR point clouds with rigorous 2D-3D instance association and cross-frame tracking. Comprising over 175k multi-view 2D bounding boxes alongside thousands of 3D infrastructure instances, the dataset provides over 181k detailed attribute and status annotations (e.g., rust, occlusion) to empower operational health assessment. We establish comprehensive baselines across five core tasks: 2D detection, 2D cross-view matching, 3D geo-identification, 3D point cloud segmentation, and attribute recognition. Extensive evaluations expose significant cross-city domain gaps and inherent vulnerabilities of current models on long-tailed defective statuses, establishing WHU-Infra3D as an essential testbed for advancing scalable, AI-driven urban infrastructure inventory and lifecycle management. The WHU-Infra3D dataset is available at this https URL.

[CV-106] oward Calibrated Fair and accurate Deepfake Detection

链接: https://arxiv.org/abs/2606.09881
作者: Ryan Brown,Chris Russell
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deepfake detectors show large performance gaps across demographic groups. Existing fairness approaches require demographic labels, retraining, or sacrifice accuracy. We introduce Face-Fairness (FF), a plug-and-play framework for bias mitigation. Our primary contribution, Face-Feature Tuning (FFT), is the first demographic label-free fairness method demonstrated for deepfake detection: a lightweight calibrator that performs a logit remapping conditioned on frozen face embeddings. We complement FFT with two variants: FF-Max, which maximizes worst-group accuracy when demographics are available, and FF-Discover, which does the same with embedding-discovered groups. Across in-domain and cross-dataset test settings, FF consistently reduces FPR/TPR gaps and improves minimum group accuracy while maintaining (often improving) overall accuracy. The approach is detector-agnostic, adds negligible runtime overhead, and requires no access to identity attributes.

[CV-107] SD-GRPO: Verifiable Segment Decomposition for Long-Form Vision-Language Generation

链接: https://arxiv.org/abs/2606.09871
作者: Hyunwoong Kim,Seongeun Lee,Hannah Yun,Junhyun Park,Jonggwon Park
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Group Relative Policy Optimization (GRPO) and its variants, originally developed for Large Language Models (LLMs), have recently been applied to Multimodal LLMs and produced strong results. However, their coarse-grained holistic credit assignment from a single scalar advantage underfits vision-language (VL) tasks, where outputs are often long-form responses grounded in semantically rich images. To address this limitation, we exploit a structured signal that single-scalar formulations discard: the natural segmentation of long-form VL outputs. Concretely, we propose Segment-Decomposed GRPO (SD-GRPO), which z-normalizes verifiable per-segment rewards across the rollout group, yielding a vector of per-segment advantages in place of a single scalar. We evaluate SD-GRPO across three settings spanning controlled and real-world long-form VL generation, organized by increasing semantic entanglement across segments. On a controlled multi-panel dense-captioning task constructed from DOCCI, where segments are semantically independent, SD-GRPO consistently outperforms the GRPO baseline, with larger gains at higher segment counts. Extending to a controlled multi-chart long-form VQA task constructed from MultiChartQA, we show both theoretically and empirically that rollout-level rewards suffer from cross-segment credit misattribution that scales with output length. On a real-world scientific figure captioning task on the MMSci dataset, where subfigure captions share context across the figure, blending holistic and per-segment rewards further improves on both, suggesting per-segment normalization alone is insufficient when segments are semantically entangled. Finally, by integrating SD-GRPO into Dr. GRPO, we confirm that it can be applied to any GRPO framework with minimal implementation overhead to enhance long-form VL generation.

[CV-108] MinhwaNet: Faithful but Insufficient Object Grounding in Korean Folk Painting

链接: https://arxiv.org/abs/2606.09855
作者: Joonhyung Bae
类目: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Korean folk painting (minhwa) is built from a small vocabulary of auspicious symbols, a tiger for protection, a pair of birds for marital harmony, a peony for wealth, that recur across many of its painted genres. This suggests an obvious computational approach, identify which symbols appear in a painting and read the genre from the inventory. Working with a public corpus that pairs whole paintings, eight-field bilingual curatorial captions, and a separate set of expert object crops, we find that this approach does not work. A model given only a list of which symbols a painting contains predicts the genre far worse than a model that fuses the image with the curatorial text, and forcing the genre representation to be object-grounded actively hurts accuracy. The visual evidence on which the genre prediction rests is nonetheless localized and inspectable. A leakage-safe object evidence map projected from a part-level detector is spatially faithful to where curators isolated symbolic objects and to a patch-based surrogate’s own gradient saliency. We name this configuration a faithful-but-insufficient dissociation. The part-level explanation is honest about what the part-level model sees, yet the genre target turns on how symbols are arranged rather than on which ones appear. The same lens separates a content label that survives transfer to held-out source institutions, genre, from a style label that does not, era, a prediction we confirm on two further labels in the corpus. We release the multimodal system, a worked-example reading of one painting’s evidence map against its catalogue, and a set of evaluation cautions that recur in long-tailed heritage collections.

[CV-109] SAFE-Pruner: Semantic Attention-Guided Future-Aware Token Pruning for Efficient Vision-Language-Action Manipulation

链接: https://arxiv.org/abs/2605.29662
作者: Shilin Ma,Chubin Zhang,Changyuan Wang,Yuji Wang,Yue Wu,Zixuan Wang,Jingqi Tian,Zheng Zhu,Yansong Tang
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Real-time inference of vision-language-action (VLA) models is essential for robotic control. While visual token pruning has shown strong potential for accelerating inference, most existing methods mainly base pruning decisions on shallow-layer cues and risk discarding visual information required by deep layers. To address this issue, we propose SAFE-Pruner, a plug-and-play pruning framework that incorporates attention cues of future layers into pruning decisions. Specifically, we identify semantic attention consistency, the tendency that VLA models concentrate their attention probability mass on the same semantic entity across execution steps. Based on this observation, we design a forward-looking strategy to forecast the token saliency in deep layers, which prevents the premature removal of critical tokens and leads to more stable acceleration. We further introduce an adaptive subtask division strategy to detect abrupt attention shifts, thereby improving forecasting accuracy and pruning reliability. Extensive experiments in simulation and real-world settings demonstrate that our method achieves up to 1.89x speedup with a minimal degradation in success rate of less than 1.7%, while outperforming state-of-the-art methods by up to 1.9%.

[CV-110] Multimodal Brain Tumour Classification Using Feature Fusion

链接: https://arxiv.org/abs/2606.11107
作者: Wajih ul Islam,Muhammad Yaqoob,Javed Ali Khan,Volker Steuber
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Clinicians diagnose brain tumors by synthesizing patient symptoms, medical history, and quantitative imaging data from modalities such as MRI and CT scans into a unified clinical judgement. However, most deep learning models rely on MRI/CT images alone, failing to replicate the clinicians multimodal reasoning. We explore a two-branch multimodal network combining raw MRI scans with 91 extracted radiomic features (intensity, texture, shape, and boundary descriptors) to classify brain tumors into glioma, meningioma, pituitary, and no-tumor. A pre-trained CNN backbone encodes the image stream, whereas a dedicated MLP encodes the radiomic stream. Both streams are fused via concatenation, gated, or bidirectional cross-modal attention strategies. Across nine experimental runs on a balanced 7,200 image dataset, all multimodal configurations outperform unimodal baselines with gated fusion achieving the best accuracy of 96.13%.

[CV-111] nnU-Net: Scaling nnU-Net with Prefix-Based Data Augmentation

链接: https://arxiv.org/abs/2606.10713
作者: Ana Sofia Santos,André Ferreira,Gijs Luijten,Naida Solak,Lisle Faray de Paiva,Behrus Hinrichs-Puladi,Jens Kleesiek,Jan Egger,Victor Alves
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 7 pages, 1 figure, 2 tables

点击查看摘要

Abstract:The nnU-Net has demonstrated continuous success in medical segmentation tasks, which heavily rely on the availability and diversity of annotated biomedical data. However, assembling medical imaging cohorts remains challenging due to numerous factors such as privacy regulations and annotation costs. As a result, data augmentation plays a crucial role in increasing data availability while maintaining anatomical feasibility. Hence, we propose the ++nnU-Net, a novel data augmentation module based on image registration that operates prior to preprocessing and training take place. Our framework was evaluated across five different 2D datasets. In this workflow, image data go through a two-stage registration process, generating new warped images. The transformations are then applied to the respective segmentation. In addition, the pipeline computes available disk space, generates supplementary binary synthetic masks and generates checkpoints. We demonstrate that the ++nnU-Net outperforms the nnU-Net baseline, yielding improvements in Dice Similarity Coefficient scores. In the most prominent cases, we observe performance gains of approximately 22%. These findings highlight the effectiveness of registration-based data augmentation, particularly for 2D medical imaging datasets and suggest that the ++nnU-Net provides a practical and scalable approach for enhancing segmentation performance in data-limited settings. The source code for the ++nnU-Net is available at: this https URL

[CV-112] Overlapped Wavelet Diffusion for Low-Light Image Enhancement

链接: https://arxiv.org/abs/2606.10280
作者: Fen Peng,Taizo Suzuki,Seisuke Kyochi
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Advance published in IEICE Transactions on Information and Systems. DOI: https://doi.org/10.1587/transinf.2026PCP0006 . Code: this https URL

点击查看摘要

Abstract:In this study, we propose an overlapped wavelet diffusion framework for Low-Light Image Enhancement (LLIE), which incorporates two complementary components to achieve blocking artifact-free and detail-preserving enhancement. Although recent diffusion-based LLIE methods have demonstrated remarkable performance compared with traditional approaches, DiffLL still suffers from blocking artifacts caused by the Haar Wavelet Transform (WT) and blurred edges or over-smoothed textures due to the limitations of its High-Frequency Restoration Module (HFRM). To overcome these issues, we introduce an Overlapped WT (OWT) that incorporates correlations across neighboring regions, thereby structurally preventing blocking artifacts. Furthermore, we integrate a low-frequency-guided High-Frequency Enhance Block (HFEBlock) to strengthen detail recovery, yielding sharper edges and more reliable textures. Extensive experiments on the LOLv1 and LOLv2-real datasets demonstrate that our framework, termed OWDiff, consistently outperforms existing LLIE methods both qualitatively and quantitatively, achieving superior visual quality while maintaining computational efficiency. OWDiff effectively addresses the structural limitations of the Haar WT and the HFRM, achieving an average PSNR gain of 0.58 dB, along with a 1.64% relative improvement in SSIM and a 5.9% relative reduction in LPIPS, compared to DiffLL across both the LOLv1 and LOLv2-real datasets.

[CV-113] POPSICLE: Benchmark Datasets for Segmentation and Localization in CryoET

链接: https://arxiv.org/abs/2606.10255
作者: Jonathan Schwartz,Utz Heinrich Ermel,C. Braxton Owens,Zhuowen Zhao,Ariana Peck,Gus L.W. Hart,Grant J. Jensen,Bridget Carragher,Dari Kimanius
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Digital Libraries (cs.DL); Machine Learning (cs.LG); Biological Physics (physics.bio-ph)
备注:

点击查看摘要

Abstract:Cryo-electron tomography (cryoET) has emerged as a powerful tool in structural and cellular biology by enabling direct visualization of macromolecular structures within intact cells, thereby linking molecular architecture to cellular organization in a native context. Realizing the full potential of cryoET, however, increasingly depends on advances in computational analysis, particularly machine learning (ML), to interpret its complex and information-rich data. Despite rapid progress, ML development for cryoET remains bottlenecked by the lack of standardized, well-annotated benchmarks. Existing evaluations are typically small, task-specific, and are assembled in isolation, limiting robust comparisons across methods. Here, we present POPSICLE, a benchmark suite for cryoET segmentation and macromolecular localization built from the CryoET Data Portal - an open, ML-ready repository of tomographic data, metadata, and annotations. POPSICLE spans eukaryotic and prokaryotic systems, both purified and fully in situ samples, and dense voxel-wise segmentation as well as sparse localization tasks. Built on a living data resource, it can expand as new datasets and annotations become available. Baseline experiments reveal substantial variation in model rankings across tasks, underscoring the need for benchmarks tailored to the unique characteristics of cryoET rather than evaluation practices adapted from adjacent biomedical imaging domains. POPSICLE thus provides an open and extensible foundation for reproducible ML evaluation in cryoET.

人工智能

[AI-0] EEVEE: Towards Test-time Prompt Learning in the Real World for Self-Improving Agents

链接: https://arxiv.org/abs/2606.11182
作者: Weixian Xu,Shilong Liu,Mengdi Wang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 19 pages, 6 figures

点击查看摘要

Abstract:In this paper, we propose EEVEE, the first multi-dataset test-time prompt learning framework for LLM agents, enabling test-time prompt learning under real-world task streams. Existing methods are largely designed for single-dataset settings, while real-world applications require models to handle heterogeneous input streams drawn from multiple datasets, domains, and task distributions, limiting their practical applicability. To mitigate cross-dataset interference, EEVEE introduces a router that partitions incoming inputs into task clusters and assigns them to suitable prompt configurations. This design is optimized via a router-prompt co-evolution strategy, which employs interleaved router and prompt learning phases to address their mutual dependency. Experiments across multiple datasets demonstrate that the framework improves robustness under heterogeneous data streams while maintaining single-benchmark learning capability and efficiency. Specifically, EEVEE improves average multi-benchmark scores by 10.38 and 24.32 points over Qwen3-4B-Instruct and DeepSeek-V3.2, surpassing SOTA methods GEPA and ACE by up to 37.2% and 48.2%.

[AI-1] he Role of Feedback Alignment in Self-Distillation ICML2026

链接: https://arxiv.org/abs/2606.11173
作者: Semih Kara,Oğuzhan Ersoy
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to the ICML 2026 Workshop on RL from World Feedback (RLxF)

点击查看摘要

Abstract:Conditioning a language model on additional context, such as feedback on a previous attempt, typically improves its response. Self-distillation trains the model to retain this improvement when the context is not present. The method works by matching the model’s output distribution under two settings: a student that sees only the question, and a self-teacher that also sees the context. What the model learns therefore depends on what context the self-teacher receives, yet the design of this context remains largely unexplored. We study context design for self-distillation by training a solver on feedback from a frozen critic. We compare three conditions: (i) a binary reward (GRPO), (ii) the reference solution, and (iii) a step-by-step critique aligned to the solver’s reasoning trace. Step-aligned critique yields the largest gains, outperforming GRPO by 16.11 points and reference-solution-conditioned self-distillation by 5.27 points (Avg@12). Per-token advantage analysis reveals why: step-aligned feedback targets only the tokens where reasoning fails, leaving correct behavior intact. Conditioning on the reference solution, by contrast, pressures the model to change its behavior at every token (even correct steps) because an alternative derivation inevitably differs in phrasing and approach. This suggests that structural alignment between feedback and the solver’s reasoning is a key driver of self-distillation effectiveness. Comments: Accepted to the ICML 2026 Workshop on RL from World Feedback (RLxF) Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2606.11173 [cs.AI] (or arXiv:2606.11173v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2606.11173 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-2] Piper: A Programmable Distributed Training System

链接: https://arxiv.org/abs/2606.11169
作者: Megan Frisella,Shubham Tiwari,Andy Ruan,Yi Pan,Parker Gustafson,Mat Jacob,Gilbert Bernstein,Stephanie Wang
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large-scale model training increasingly relies on composing multiple parallelism strategies, such as data, pipeline, and expert parallelism, together with memory-saving optimizations like ZeRO. Deployed systems for foundation model pretraining often rely on human experts to manually design a high-level parallelism strategy then implement the corresponding low-level execution strategy, making it difficult to adapt the system to new strategies. Meanwhile, many general-purpose frameworks are more flexible but their implementations are still tied to a fixed set of common parallelism strategies, making it challenging to integrate state-of-the-art strategies. We present Piper, a user-controllable distributed training system that decouples the strategy from the runtime implementation. Piper allows users to declare a comprehensive distributed training strategy with a small set of model annotations and scheduling directives. Each directive applies a transformation on Piper’s intermediate representation (IR), a unified global training DAG that represents all computation and communication. Using this IR, Piper compiles per-device execution plans and executes them with a distributed runtime agnostic to the strategy. We show that the combined system maintains performance parity on commonly available strategies such as ZeRO, while also enabling additional performance and memory efficiency gains through joint scheduling of compute and communication in composed parallelism strategies such as DeepSeek-V3’s DualPipe. Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI) Cite as: arXiv:2606.11169 [cs.DC] (or arXiv:2606.11169v1 [cs.DC] for this version) https://doi.org/10.48550/arXiv.2606.11169 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-3] Reason Alloc: Hierarchical Decoding-Time KV Cache Budget Allocation for Reasoning Models

链接: https://arxiv.org/abs/2606.11164
作者: Wenhao Liu,Hao Shi,Yunhe Li,Weizhi Fei,Xiangyuan Wang,Mengzhe Ruan,Hanxu Hou,Peisong Wang,Linqi Song,Shuang Qiu
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Long chain-of-thought (CoT) trajectories in large language model (LLM) reasoning cause severe inference bottlenecks due to rapid key-value (KV) cache growth. Current decoding-time compression methods mitigate this issue via token eviction, but typically assume a uniform budget distribution across all layers and heads. In contrast, existing non-uniform budget allocation methods are predominantly designed for the static prompt prefill phase, and they do not capture the stepwise context demands of autoregressive reasoning. To bridge this gap, we propose ReasonAlloc, a training-free framework that recasts decoding-time KV compression as a hierarchical budget allocation problem. ReasonAlloc operates at two complementary levels: an offline layer-wise preallocation strategy captures an architecture-driven demand pattern which we call ``\textitReasoning Wave’', while an online head-wise strategy reallocates resources during decoding to information-rich heads based on real-time utility. Evaluations on mathematical reasoning benchmarks (MATH-500, AIME~2024) using DeepSeek-R1-Distill-Llama-8B, DeepSeek-R1-Distill-Qwen-14B, and AceReason-14B show that ReasonAlloc outperforms uniform-budget R-KV, SnapKV, and Pyramid-RKV (a baseline enforcing a static, monotonically decreasing layer budget), with the largest gains at small budgets (128-512 tokens). ReasonAlloc is plug-and-play with existing token-eviction policies and introduces negligible inference-time overhead.

[AI-4] ABC-Bench: An Agent ic Bio-Capabilities Benchmark for Biosecurity ICML2026

链接: https://arxiv.org/abs/2606.11150
作者: Andrew Bo Liu,Samira Nedungadi,Bryce Cai,Alex Kleinman,Harmon Bhasin,Seth Donoughe
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 18 pages. To be published in ICML 2026

点击查看摘要

Abstract:Large language models (LLMs) are rapidly acquiring capabilities relevant to biological research, from literature synthesis to interpretation of experimental data. Increasingly, LLM agents can also perform in silico biology tasks that previously required experienced human biologists. These emerging AI capabilities offer new opportunities for scientific discovery and biomedical advances, but they also shift the landscape of biosecurity risks. To address this, we introduce the Agentic Bio-Capabilities Benchmark (ABC-Bench), a suite of tasks to measure agentic biosecurity-relevant capabilities. ABC-Bench evaluates LLM agents on both benign and dual-use biology tasks: writing code to operate liquid handling robots, designing DNA fragments for in vitro assembly, and evading DNA synthesis screening. These tasks require a combination of biology and software expertise. All tested LLM agents outperformed the median expert human baseliner on all three tasks. Agents performed highly on tasks drawing on published knowledge and well-documented protocols, and more weakly on a task requiring novel bioinformatics reasoning. In three wet-lab validation experiments, we found that OpenAI’s o4-mini-high produced scripts that, when run on an OpenTrons liquid handling robot, successfully assembled DNA with expected sequences.

[AI-5] owards Autonomous Accelerator Design: FPGA Accelerator Generation with SECDA ISCA2026

链接: https://arxiv.org/abs/2606.11117
作者: Vinamra Sharma,Xingjian Fu,Jude Haris,José Cano
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Performance (cs.PF)
备注: Accepted to the Machine Learning for Architecture and Systems Workshop (MLArchSys), co-located with ISCA 2026

点击查看摘要

Abstract:Designing FPGA-based accelerators for modern artificial intelligence workloads requires exploring a large and complex hardware design space that involves architectural parameters, data flow strategies, and memory hierarchies, making the process very time consuming. While existing methodologies such as SECDA enable rapid hardware-software co-design through SystemC simulation and FPGA execution, identifying efficient accelerator configurations remains a largely manual process requiring extensive domain knowledge. SECDA-DSE is a framework that integrates Large Language Models (LLMs) into the SECDA ecosystem to guide design space exploration (DSE) of FPGA-based accelerators. It combines a structured DSE Explorer for generating candidate architectures with an LLM Stack that performs reasoning-guided exploration using retrieval-augmented generation and chain-of-thought prompting, coupled with a feedback loop for iterative and reinforced refinement. Building on our previous work introducing SECDA-DSE, this paper extends its evaluation by generating three accelerator designs, including element-wise vector multiplication, 2D convolution, and matrix transpose, and performing end-to-end execution on FPGA hardware. The results show that SECDA-DSE can generate SECDA-compliant accelerator designs that are successfully synthesized and executed on FPGA hardware. Furthermore, the generated designs capture kernel-specific trade-offs between compute parallelism and data movement, highlighting the potential of LLM-guided exploration to adapt architectural configurations across diverse workloads while reducing exploration time and the need for extensive human expertise.

[AI-6] RoboNaldo: Accurate Stable and Powerful Humanoid Soccer Shooting via Motion-Guided Curriculum Reinforcement Learning

链接: https://arxiv.org/abs/2606.11092
作者: Yichao Zhong,Yidan Lu,Yuhang Lu,Tianyang Tang,Haoguang Mai,Yixuan Pan,Tianyu Li,Li Chen,Jingbo Wang,Zhongyu Li,Peng Lu,Hongyang Li
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Elite humanoid soccer shooting requires whole-body stability, high-impulse whole-body interactions, and accuracy to targets. Motion tracking-driven reinforcement learning (RL) provides stability in whole-body movement coordination, but a fixed reference makes it hard to adapt to varied ball positions and strike timings; in contrast, task reward-driven RL struggles to explore and discover valid kicks from scratch. We therefore introduce RoboNaldo, a three-stage motion-guided curriculum RL framework for high-impulse humanoid interaction. A single human-kick reference is used as a scaffold and progressively shifts optimization towards shooting performance. The curriculum first learns a stable whole-body kicking prior, then adapts the kick to free-kick settings where the ball is stationary at random positions, and finally extends it to moving-ball shooting through a locomotion-command and kick-trigger interface. A high-level heuristic planner controls this interface during training, while alternative high-level controllers can drive the same low-level policy at inference. In simulation, RoboNaldo demonstrates free-kick shot error 48.6% lower and shoot velocity 2.96x than prior work baselines. In real world on a Unitree G1 with onboard perception, RoboNaldo attains 0.73 m and 0.86 m average target shooting error from 3 m away in free-kick and moving-ball cases, accordingly. And the post-contact ball velocity reaches 13.10 m/s, which is 59-71% of reported professional open-play shot speed. Project page: \hrefthis https URL\textthis http URL .

[AI-7] st-Time Gradient Guidance of Flow Policies in Reinforcement Learning

链接: https://arxiv.org/abs/2606.11087
作者: Zhiyuan Zhou,Andy Peng,Charles Xu,Qiyang Li,Tobias Springenberg,Kevin Frans,Sergey Levine
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Expressive continuous control policies, such as diffusion and flow models, form the backbone of recent advances in scaling imitation learning for simulated and real robot control. While they are known to scale stably in the supervised imitation learning setting, incorporating them into reinforcement learning (RL) pipelines for policy improvement has proven more difficult. It often requires specialized training objectives or backpropagating through denoising processes, which cause well-known issues with stability and affect scalability. In this paper we study the question of whether simple policy improvement schemes at test time alone, leaving stable supervised policy training intact, can be a competitive alternative which sidesteps these issues. To this end, we propose QGF (Q-Guided Flow), an RL algorithm that performs policy optimization entirely at test time. QGF works by pre-training both a reference flow policy (via a standard behavioral cloning objective) and a value function critic and, at test time, using the value gradient to guide the reference policy to generate higher-value actions without any additional policy learning. Empirically, QGF outperforms prior test-time RL methods on single-task and goal-conditioned offline RL benchmarks with high-dimensional action spaces, and is competitive with state-of-the-art training-time algorithms while being much cheaper to run. Moreover, it exhibits favorable scaling with model size by avoiding the instability of actor-critic training, offering a practical and effective alternative RL algorithm with expressive policies.

[AI-8] Unifying Local Communications and Local Updates for LLM Pretraining

链接: https://arxiv.org/abs/2606.11081
作者: Pietro Cagnasso,Eugene Belilovsky,Edouard Oyallon
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 38 pages, 9 figures

点击查看摘要

Abstract:Communication-efficient pre-training of LLMs is increasingly important as training draws on compute distributed across clusters, data centers, and lower-bandwidth links. Many practical methods reduce communication frequency but still rely on synchronous All-Reduce operations that maintain identical model states and tie progress to global collectives. This can become a bottleneck when bandwidth or worker speed is heterogeneous. We introduce GASLoC, a novel decentralized pre-training algorithm that generalizes the notion of communication acceleration to the recently popular “outer optimizer” to allow a practical gossip-based training framework that is compatible with adaptive optimizers, allows for local optimizer steps, and can utilize sparse randomized peer communication. Empirically, on a number of standard LLM training tasks, we demonstrate that GASLoC outperforms state-of-the-art decentralized algorithms in single step per communication setting for a number of topologies and, unlike existing decentralized methods in the LLM setting, it allows to obtain performance competitive with DiLoCo when utilizing multiple local steps. In the heterogeneous bandwidth setting we demonstrate the advantage of GASLoC showing that it can significantly outperform DiLoCo.

[AI-9] CIAware-Bench: Benchmarking Control Intervention Awareness Across Frontier LLM s

链接: https://arxiv.org/abs/2606.11063
作者: Joachim Schaeffer,Thomas Jiralerspong,Alexander Panfilov,Guillaume Lajoie,Jonas Geiping,Yoshua Bengio,Roland S. Zimmermann
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:AI control protocols oversee untrusted models by monitoring their actions and modifying potentially unsafe steps, often using a trusted model. This partially tampers with the untrusted model’s trajectory. If the trusted model detects such an intervention, it may infer properties of the monitor and adapt to evade control. We introduce \textbfCIAware-Bench, a benchmark for measuring \textbfcontrol \textbfintervention (CI) awareness across frontier models. CIAware-Bench tests whether models can distinguish their own trajectories from those modified by a control intervention. The benchmark is comprised of a suite of four task domains (essay writing, BigCodeBench, Bash Arena, and SHADE-Arena), while varying trajectory watermarking, side-task presence, and the control protocol. Evaluating eleven frontier models, we find low to moderate CI awareness under default settings (up to 0.87; random chance balanced binary classification accuracy is 0.5) with substantial variation across task domains and model pairs. Detection is generally easier across model families, suggesting that models exploit provider-specific differences in style or post-training. Overall, CI awareness is not a fixed model-level property, and should be measured for each new model release and deployment scenario. We release CIAware-Bench to track CI awareness and inform control protocols whose interventions are harder to detect.

[AI-10] What Fits (Into Few Tokens) Doesnt Overfit: Compression and Generalization in ML Research Agents

链接: https://arxiv.org/abs/2606.11045
作者: Martin Andres Bertran,Aaron Roth,Zhiwei Steven Wu
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Reusing a held-out benchmark adaptively should, in principle, invite overfitting. Yet benchmark-driven machine learning (ML) has produced surprisingly little overfitting in practice. An attractive hypothesis is that successful ML strategies are highly compressible. We study this in the setting of LLM-driven research agents, where the hypothesis becomes directly testable via two complementary information bottlenecks. In \emphoutput compression, an exploration agent adaptively searches for high-performance models using a validation set, and we test whether a fresh ``reproducer agent’’ can reproduce its performance given only an extremely short prompt and the training data. In \emphinput compression, the explorer receives only one-bit feedback indicating whether each submitted model improves on the running best. Across 8 datasets spanning tabular classification, vision, language modeling, diffusion modeling, and reward modeling, we find that these bottlenecks have little effect on performance: short prompts and compressible feedback are sufficient to reproduce and find high-performance models. The hypothesis is falsifiable: when we deliberately induce validation-set overfitting, the results fail to reproduce with short prompts. Taken together, our results support a description-length explanation for the lack of overfitting in benchmark-driven ML: successful strategies occupy a low-complexity region of strategy space.

[AI-11] Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agent ic tasks in Real-World Professional Fields

链接: https://arxiv.org/abs/2606.11042
作者: Liya Zhu,Jingzhe Ding,Jian Zhang,Jianbo Xue,Shihao Liang,Ge Zhang,Xiang Gao,Qingshui Gu,Mailun Gao,Huimin Che,Yan Zhao,Peiheng Zhou,Haojun Wang,Chaobo Xian,Lili Le,Chi Wu,Yiwei Liu,Shengda Long,Jiale Yang,Fangzhi Xu,Sijin Wu,Haodong Duan,Yi Zhu,Chao He,Zhaojian Li,Minchao Wang,Huan Zhou,Jiani Hou,Chuqian Yu,Weiran Shi,Hongwan Gao,Jiamin Chen,Guanhong Chen,Tingqin Luo,Kaiyuan Zhang,Zhixin Yao,Qing Hua,Yuhao Jiang,Jin Chen,Pu Chen,Zhenyu Hu,Xingyu Li,Zhengxuan Jiang,Meng Cao,Tianfeng Long,Haozhe Wang,Mingzhang Wang,Yichen Zhang,Yiming Dai,Chenchen Zhang,Jiaying Wang,Zhiyong Wu,Shen Yan,Yujia Qin,Wenhao Huang,Zaiyuan Wang,Xiaolong Chang
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent years have witnessed the rapid evolution of AI agents toward handling increasingly complex, real-world tasks. However, existing benchmarks rarely evaluate whether agents can operate graphical user interfaces to complete long-horizon, high-value professional workflows across diverse domains. Current GUI benchmarks still predominantly focus on general-purpose software, relatively simple applications, and short-horizon tasks, leaving it largely unknown whether modern agents can follow user instructions to autonomously operate domain-specific professional software and accomplish economically valuable work in an end-to-end manner. To bridge this gap, we introduce Workflow-GYM, a benchmark for long-horizon GUI tasks centered on professional domains and specialized software environments. Through extensive experiments on state-of-the-art models, we find that even the strongest models achieve only slightly above 30% success rates, highlighting that professional long-horizon GUI workflows remain highly challenging for current GUI agents. Further analysis reveals that current agents struggle to maintain long-horizon workflow consistency, frequently exhibiting workflow stage omission, error propagation, objective drift, and insufficient understanding of professional software environments. Our findings provide important insights into the limitations of current agent systems and suggest key directions for the next generation of GUI-agent research.

[AI-12] Diffusion Forcing Planner: History-Annealed Planning with Time-Dependent Guidance for Autonomous Driving CVPR2026

链接: https://arxiv.org/abs/2606.11019
作者: Zehan Zhang,Neng Zhang,Yaoyi Li,Jia Cai,Zhiling Wang
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: CVPR2026

点击查看摘要

Abstract:Learning-based motion planners, despite recent progress, often suffer from temporal inconsistency. Small perturbations across frames can accumulate into unstable trajectories, degrading comfort and safety in closed-loop driving. Several methods attempt to inject history as a static conditioning signal to stabilize outputs, only to induce the planner to copy historical patterns instead of adapting to environment contexts. To address this limitation, we propose Diffusion Forcing Planner (DFP), a diffusion-based planning framework driven by history-guided control. Specifically, DFP decomposes the full trajectory into history, current and future segments, and assign independent noise levels to each segment. The model jointly denoises the historical and the future segments, enforcing a heterogeneous joint diffusion process. At inference, classifier-free guidance (CFG) is applied to steer future sampling using annealed history in a controllable manner. Closed-loop evaluation and comprehensive ablations on nuPlan show that DFP achieves competitive performance while producing continuous, stable, and controllable motion plans in complex driving scenarios.

[AI-13] Superficial Beliefs in LLM Decision-Making

链接: https://arxiv.org/abs/2606.11016
作者: Gabriel Freedman,Francesca Toni
类目: Artificial Intelligence (cs.AI)
备注: Under review

点击查看摘要

Abstract:We ask whether large language models (LLMs) merely imitate rationales when choosing between two options, or whether their choices reflect a systematic underlying decision structure. Using synthetic binary decision settings in which models choose between profiles defined by graded attributes, we compare the attribute a model says mattered most with the attribute that best explains its choice under a behavioural model fit to prior decisions. The behavioural model predicts held-out choices well, showing that model behaviour is systematically related to the visible attributes rather than being random. However, direct self-reports and a separate score-based judge recover the behaviourally inferred driver only partially. The resulting picture is neither one of arbitrary behaviour nor one of fully articulated belief - outputs are structured enough to support prediction, but explicit reasons track the recovered driver only imperfectly. This qualitative pattern persists across prompt-order and sampling perturbations, alternative behavioural models, targeted occlusion analyses, and structurally varied decision settings. We interpret this as evidence for ``superficial belief’’ in LLM decision-making: models behave as if guided by probabilistic local priorities over attributes, while having only limited verbal access to the attributes that drive their decisions.

[AI-14] Structure from Reasoning Numbers from Search: On-Premise Open LLM s as Structural Priors for Coupled MIMO Controller Tuning

链接: https://arxiv.org/abs/2606.11015
作者: Jiaxuan Chen,Haonan Li,Yang Shu
类目: Artificial Intelligence (cs.AI)
备注: 10 pages, 7 figures, 6 tables. Submitted to IEEE Access

点击查看摘要

Abstract:Tuning controllers for strongly coupled multi-input multi-output (MIMO) industrial processes is hard: decentralized classical auto-tuning ignores loop interaction, and local numerical optimization from natural initializations stalls in the resulting non-convex cost landscape. We ask whether on-premise open-source large language models (LLMs), which keep data on-site and need no plant model, can help. On a single-loop CSTR, classical relay-feedback tuning (IAE 0.106, near the 0.102 optimum) beats an LLM tuner (0.162): for simple loops the LLM adds nothing. The picture inverts on a strongly coupled quadruple-tank with conflicting set-points, scored by a penalized cost J = IAE + lambda*TV(u) that rewards tracking without chattering actuators. There, naive relay tuning (J ~ 28.6) and naive LLM tuning (29.7) are no better than open loop (22.7), and a local optimizer from balanced starts fails in 10/10 runs. A scaffolded open LLM instead reasons about the coupling, proposes the counter-intuitive asymmetric structure, and reaches J ~ 16.9 +/- 0.2 from any start; refining it with a classical optimizer attains the smooth global optimum (J ~ 12.0, 10/10 vs. 0/10), which even applies a non-obvious negative integral correction decentralized tuning cannot. A global optimizer (differential evolution) also reaches this optimum, so the LLM is not the only route; its advantage is sample efficiency and interpretability: a usable controller in 18 evaluations (where the global optimizer is worse than open loop) plus a stated rationale. This edge grows with dimension, reaching ~6x fewer evaluations on a 3x3 plant. The behaviour generalizes across four open models, and on a benign plant the LLM offers no advantage, sharpening the boundary. We contribute a reproducible benchmark delimiting when open LLMs help in control tuning: not as optimizers, but as a sample-efficient, interpretable structural prior.

[AI-15] Understanding and mitigating the risks of OpenClaw for non-technical users: A practical guide with Skill

链接: https://arxiv.org/abs/2606.11007
作者: Junchang Zheng,Junfeng Tan,Jialiang Lin
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: Work in progress

点击查看摘要

Abstract:OpenClaw has rapidly emerged as a transformative artificial intelligence (AI) agent framework, and its ability to autonomously execute complex, multi-step tasks has attracted an ever-growing and diverse user base. However, this capability comes with significant risks. While existing research has made important strides in characterizing these threats, such work is predominantly directed at technically sophisticated audiences. It remains largely inaccessible to non-technical users. This demographic now makes up an increasingly large and underserved portion of the community, yet it is these very users who most urgently need practical and straightforward guidance. In response, we bridge this gap through a series of interconnected efforts designed to lower the risk barrier for non-technical OpenClaw users. First, we identify and categorize seven core risks that OpenClaw users may encounter in daily usage, explaining each in plain language so that non-technical users can readily grasp the nature and potential consequences of these threats. Second, for each identified risk, we distill a set of corresponding defensive strategies into clear and actionable operational steps that are easy to follow. Third, to make protection even easier, we provide a companion OpenClaw Skill that automates key security configurations, enabling users to safeguard their systems with minimal manual intervention. Through this work, we demonstrate that safeguarding against the risks of intelligent agents need not be the exclusive domain of security experts, and that non-technical users can meaningfully participate in reducing these risks through simple, practical actions.

[AI-16] Null-Space Constrained Low-Rank Adaptation for Response-Specified Large Language Model Unlearning

链接: https://arxiv.org/abs/2606.10989
作者: Bocheng Ju,Jianhua Wang,Chengliang Liu,Xiaolin Chang
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language model unlearning aims to suppress designated undesirable knowledge while preserving benign capabilities. Many unlearning objectives focus on suppressing undesired answers, while recent target-guided variants specify replacement behavior but still leave update locality largely unconstrained. This paper introduces \emphNull-Space Constrained Response-Specified Unlearning (NSRU), a projection-constrained low-rank framework for controlled LLM unlearning. NSRU uses an explicitly structured safe target response to specify the desired behavior for each forget query, while suppressing the original undesired content. To localize adaptation, NSRU estimates per-module retain subspaces from benign hidden representations and uses an orthogonal-projected low-rank parameterization to confine LoRA updates to the null space of the retain subspace. The resulting objective jointly optimizes safe-target learning, undesired-response suppression, and retention preservation under this constrained parameterization. We provide a local first-order analysis showing that the projected update reduces retain-side perturbations while preserving editable directions for shaping forget-query behavior. Experiments on TOFU show that NSRU effectively suppresses extractable forget-set knowledge while improving retain QA performance, model utility, and safe-target alignment over representative baselines. On WMDP, NSRU keeps hazardous-domain accuracy near the random-choice region while preserving broad and domain-adjacent MMLU utility. Ablation studies support the complementary roles of safe-target supervision, undesired-response suppression, retention loss, and null-space projected updates, while sensitivity and robustness analyses indicate stable behavior across the tested hyperparameter and prompt variations.

[AI-17] Bellm an-Taylor Score Decoding for Markov Decision Processes with State-Dependent Feasible Action Sets

链接: https://arxiv.org/abs/2606.10979
作者: Yi Chen,Rushuai Yang,Qiang Chen,Dongyan(Lucy)Huo
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Many Markov decision processes (MDPs) in operations research have feasible actions that are state dependent and defined implicitly by various operational constraints. These features make it difficult to use standard deep reinforcement learning (DRL) algorithms, whose action interfaces typically assume either a fixed finite action catalog or a simple Euclidean space. Motivated by a Taylor expansion of the optimal action-value function, we propose Bellman–Taylor score decoding, a framework that moves policy learning to a Euclidean score space while enforcing feasibility through an action decoder. The induced latent-score MDP then can be optimized by standard DRL algorithms without differentiating through the decoder. We provide a performance guarantee showing that the optimality gap of this approach decomposes into a structural approximation error and an algorithmic learning error. Lastly, we apply this framework to a queueing network control problem, where the policy essentially learns a state-dependent index-based dispatching rule. Numerical experiments show near-optimal performance in small instances and considerable improvements over benchmarks in larger systems.

[AI-18] Beyond Uniform Token-Level Trust Region in LLM Reinforcement Learning

链接: https://arxiv.org/abs/2606.10968
作者: Renjie Mao,Xiangxin Zhou,Lvfang Tao,Yixin Ding,Yu Shi,Yongguang Lin,Yuheng Wu,Honglin Zhu,Qian Qiu,Wenxi Zhu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement learning with verifiable rewards (RLVR) has become standard for improving LLM reasoning. However, existing PPO-style trust-region mechanisms remain position-agnostic by enforcing uniform thresholds across all tokens independently. This pointwise treatment conflicts with autoregressive generation in two critical ways. First, uniform thresholds ignore autoregressive asymmetry. Early-stage deviations produce compounding sequence-level drift, causing static thresholds to under-regulate early divergence and excessively constrain late-stage exploration. Second, evaluating token-level divergence in isolation overlooks cumulative prefix drift, granting the same divergence allowance regardless of how far the conditioning history has already deviated from the rollout policy. To address this limitation, we propose CPPO (Cumulative Prefix-divergence Policy Optimization), a token-level masking rule that aligns updates with a finite-horizon policy-improvement bound via two coupled mechanisms. First, a position-weighted threshold imposes stricter limits at early positions whose effects persist longer, relaxing constraints for late-stage tokens. Second, a cumulative prefix budget tracks historical deviations, dynamically restricting further token-level deviation to prevent compounding errors along the prefix. Empirically, CPPO enhances training stability and significantly improves reasoning accuracy across various model scales.

[AI-19] Recalling Too Well: Sycophancy Evaluation and Mitigation in Memory-Augmented Models

链接: https://arxiv.org/abs/2606.10949
作者: Shelly Bensal,Axel Magnuson,Aparna Balagopalan,Daniel M. Bikel
类目: Artificial Intelligence (cs.AI)
备注: Under submission; preprint

点击查看摘要

Abstract:Persistent memory systems promise to make LLMs more helpful by storing user beliefs over time. We show they also make models less correct by systematically amplifying sycophancy, wherein models prioritize agreement with users over accuracy. We conduct the first systematic evaluation of this effect, introducing MIST: a benchmark of synthetically generated multi-turn conversations where users express plausible misconceptions in scientific, medical, and moral reasoning domains. Testing across three state-of-the-art memory systems and five model families reveals that memory amplifies sycophantic behavior across all conditions, with up to 25x higher sycophancy rates than in-context baselines. Error analyses suggest memory extraction as the primary culprit: lossy compression into discrete snippets encodes user misconceptions while discarding corrective context. Based on these results, we propose two lightweight mitigations that substantially reduce sycophancy while matching or exceeding memory systems at factual recall.

[AI-20] Generative Explainability for Next-Generation Networks: LLM -Augmented XAI with Mutual Feature Interactions

链接: https://arxiv.org/abs/2606.10942
作者: Kiarash Rezaei,Omran Ayoub,Sebastian Troia,Francesco Lelli,Paolo Monti,Carlos Natalino
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 7 pages, with one page for appendix. Accepted for publication at the 2025 21th International Conference on Wireless and Mobile Computing, Networking and Communications (WiMob)

点击查看摘要

Abstract:As artificial intelligence and machine learning (AI/ML) models become integral to network operations, their lack of transparency poses a significant barrier to operator trust. Existing explainable artificial intelligence (XAI) techniques often fail to bridge this gap for non-specialists, producing technical outputs that are difficult to translate into actionable insights. This paper presents a framework specifically designed to address this shortcoming. It leverages a moderately sized large language model (LLM) and extends beyond the standard use of SHapley Additive exPlanations (SHAP) feature influence values. The framework employs a structured prompt enriched with mutual feature interaction data to generate human-understandable natural language explanations. To validate our framework, we performed an empirical evaluation on an optical quality of transmission (QoT) estimation use case with human evaluators. We collected independent performance evaluations from specialists, which showed a high inter-evaluator agreement. Compared to a state-of-the-art baseline that uses only SHAP feature influence values in a straightforward prompt, our approach improves the explanation usefulness and scope by 12.2% and 6.2%, while achieving 97.5% correctness.

[AI-21] Provenance Tracking in AI Compilers through the Lens of Coalgebra

链接: https://arxiv.org/abs/2606.10937
作者: Zilu Tian,Liying Liu
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:AI compilers aggressively rewrite computation graphs through normalization, lowering, and optimization, making it difficult to track the provenance of tensors and operators across compilation. Reliable provenance is essential for attaching platform-specific postprocessing, debugging compiler behavior, and validating transformations, yet existing solutions are either invasive or ad hoc under non-injective graph rewrites. We present a lightweight, generative approach to provenance tracking based on observational semantics. Instead of propagating identifiers through compiler passes, we observe graph transformations and reason about provenance in terms of observable computational actions. We formalize this approach using a coalgebraic model and bisimulation, which preserves provenance even when intermediate nodes are eliminated. Furthermore, we implement this approach in a prototype AI compiler COVAN, demonstrating stable provenance across compilation pipelines with minimal engineering overhead. Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI) Cite as: arXiv:2606.10937 [cs.DB] (or arXiv:2606.10937v1 [cs.DB] for this version) https://doi.org/10.48550/arXiv.2606.10937 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-22] CLP: Collocation-Length Prediction for Zero-Loss Adaptive Multi-Token Inference

链接: https://arxiv.org/abs/2606.10935
作者: Xuezhen Xie,Zhiqiang Zhou
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 13 pages, 8 figures, 8 tables

点击查看摘要

Abstract:Large language model inference is bottlenecked by autoregressive decoding, where each token requires a full forward pass. Multi-token prediction (MTP) offers a promising acceleration path, but existing approaches suffer from a fundamental architectural flaw: the MTP head for the first token competes with the backbone’s own language model (LM) head, leading to severe quality degradation when predictions are accepted. We identify this head-backbone competition as the root cause of repetitive and incoherent outputs in prior MTP-based acceleration methods. To address this, we propose Backbone-as-Architect, a design principle where the backbone LM head always generates the first token, and MTP heads are responsible only for subsequent tokens. Building on this principle, we introduce CLP (Collocation-Length Predictor), a lightweight span-level decision layer that predicts how many additional tokens can be safely accepted at each decoding step. CLP uses only a single linear layer (4.6K–7.7K parameters), replacing the over-engineered 1M-parameter gate networks used in prior work. Experiments on Qwen2.5 models (0.5B, 1.5B, 7B) show that CLP achieves 1.20x–1.29x speedup on 1.5B and 1.14x–1.20x on 7B, with zero quality degradation (repetition ratio 0.02), while gate-based approaches fail to accelerate (1.07x) or produce severely degraded outputs (repetition ratio 0.5%). We further demonstrate that shorter prediction horizons (k=2) recover 24% higher MTP head accuracy on large models, establishing a scaling-aware design principle. We identify MTP head prediction accuracy as the binding constraint on acceleration and establish a clear roadmap for future improvements.

[AI-23] WorldKernel: A World Model is the Coupling Kernel of Admissible Possible Worlds

链接: https://arxiv.org/abs/2606.10934
作者: Fabio Rovai
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:A common assumption holds that enough observational and interventional data, given to a strong enough predictor, suffices. We report a failure mode that contradicts it. Across hundreds of structural causal models, on identified quantities a strong predictor and a Bayesian baseline both succeed, but on unidentified quantities (the couplings between counterfactual worlds) the predictor collapses to a point, on 28% of models to one no valid model can produce, while the truth is an admissible interval more data never narrows. The gap is structural: prediction cannot represent uncertainty over counterfactual couplings. We cast a world model as a single positive semidefinite coupling kernel K(T,T’) over admissible worlds, whose diagonal is the ordinary posterior (what a predictor recovers) and whose off-diagonal is the cross-world coupling it cannot, which every counterfactual reads. The paper is the theory of that off-diagonal. It is real: two states with identical posteriors differ on a cross-world query, and the off-diagonal is the coupling that fixes counterfactuals. It can be bounded: positive semidefiniteness is partial-identifying information the marginals lack, and enforcing it bounds counterfactuals in polynomial time where the exact response-type program is intractable. Logical structure sharpens it: ontology axioms tighten the bound by up to a third, propagating to couplings they never touch. It can be acquired: targeted scars, constraints learned from encountered infeasibilities, close the gap several times faster than untargeted ones. Its full reconstruction is approximate counting of the admissible worlds, tractable below the Sly-Sun threshold and inapproximable above; we do not claim to beat the worst case.

[AI-24] Frontier Coding Agents Use Metaprogramming to Adapt to Unfamiliar Programming Languages

链接: https://arxiv.org/abs/2606.10933
作者: Aman Sharma,Sushrut Thorat,Paras Chopra
类目: Artificial Intelligence (cs.AI)
备注: 43 pages, 8 figures

点击查看摘要

Abstract:LLM-based coding agents are usually evaluated in familiar software settings: mainstream languages, common libraries, and public repositories. These benchmarks remain important, but they can hide how agents behave when the language itself is unfamiliar. We evaluate six contemporary coding agents on four esoteric programming languages using a sequential setup with file editing, local execution, and hidden-test grading. Our protocol exposes capability differences between these agents that mainstream coding and agentic benchmarks such as SWE-Bench Verified and Terminal-Bench 2.0 compress into much narrower bands. We observe that the strongest agents, Claude Opus 4.6 and GPT-5.4 xhigh, often avoid writing the target language directly. On Brainfuck and Befunge-98, they write Python programs that generate target-language code and debug those generators locally. Forbidding this metaprogramming strategy causes large performance drops. Text guidance distilled from this strategy does not materially improve weaker agents. In contrast, Opus-derived Python helper code for building generators, with no solved benchmark programs or hidden-test answers, sharply improves Sonnet 4.6 and GPT-5.4 mini on the same problems, while Haiku 4.5 remains low. More interpreter calls and output tokens improve stronger agents but leave weaker agents near their original performance, indicating that these resources amplify useful strategies rather than create them. Together, these results show that strong coding agents adapt to unfamiliar languages by using tools, feedback, and workspace state to build a working model of the target language. Metaprogramming is the clearest case, but the broader gap is constructing and debugging a strategy that works under the target language’s rules.

[AI-25] Recoverable but Not Stationary:Local Linear Structures in Weights and Activations

链接: https://arxiv.org/abs/2606.10929
作者: Irina Piontkovskaia,Sergey Nikolenko
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 23 pages, 8 tables, 9 figures

点击查看摘要

Abstract:Task vectors, LoRA, activation steering, and random search around pretrained weights all suggest that learned behaviour can be controlled by linear directions. We ask which linear structures actually exist and on what scale. In a synthetic multitask transformer and LoRA adapters on DistilGPT-2 / GPT-2 we find strong local low-rank task-gradient structure but reject the fixed-task-plane hypothesis: static bases miss the recovery direction, and the useful basis drifts substantially within 100 steps. However, the first recovery updates form a trajectory-prefix basis capturing 77% of the LoRA recovery displacement. We develop random search theory with a Gaussian local-linear theorem that justifies the effectiveness of random parameter search even in very high dimensions. We also study the relation between parameter perturbations and activation steering: a single gradient step produces an activation shift with 0.58 cosine to a labelled-contrast CAA steering vector, with a similar steering effect on Qwen-0.5B BoolQ statements. We validate our results with experiments on synthetic Transformers and LLMs. Our results suggest that linear structures in trained networks are not global task directions, but evolving local geometries that partially persist across parameter and activation spaces.

[AI-26] A Constrained Natural-Language Interface for Variational Multi-Physics Finite Element Simulations in FEniCS

链接: https://arxiv.org/abs/2606.10928
作者: Nilay Upadhyay,Wesley F. Reinhart
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
备注: 23 pages, 17 figures

点击查看摘要

Abstract:Large language models can reduce the manual effort required to set up finite element simulations, but they introduce reliability risks when generated solver code lies on the critical path. We present a constrained natural-language interface for multi-physics finite element analysis in which the LLM is limited to front-end tasks: parsing prompts into structured JSON, generating Gmsh code only for non-catalog geometries, and using retry feedback for those stages. It never writes FEniCS solver templates, derives weak forms, or writes the numerical solver core. A deterministic dispatcher maps the validated specification to five human-written FEniCS/UFL templates: linear elasticity, hyperelasticity, elastoplasticity, thermo-mechanical coupling, and phase-field fracture. We validate this deterministic template layer against analytical solutions and published 2D/3D benchmarks. Smooth cases reach sub-percent agreement on adequate meshes, while harder nonlinear cases reach the 2-5 percent range. We also evaluate the LLM-facing front end directly. In a 15-prompt parser benchmark, first-pass valid parses were obtained for 9 cases, and all remaining cases were repaired after retry, giving a final valid parse rate of 100.0 percent, 100.0 percent problem-class accuracy, and 97.1 percent field-extraction accuracy. In a 10-case custom-geometry benchmark routed through the real LLM-to-Gmsh path, first-pass and final success were both 90.0 percent, with one unrecovered invalid-geometry failure. These results show that the parser and constrained prompt/validation design are effective on these benchmarks. As an end-to-end demonstration, the system generates and analyzes a 3D elastoplastic L-bracket with a fillet and bolt hole from one natural-language prompt. The contribution is a measured architecture for natural-language-driven variational simulation, not open-ended autonomous code generation.

[AI-27] Role-Agent : Bootstrapping LLM Agents via Dual-Role Evolution

链接: https://arxiv.org/abs/2606.10917
作者: Xucong Wang,Ziyu Ma,Shidong Yang,Tongwen Huang,Pengkun Wang,Yong Wang,Xiangxiang Chu
类目: Artificial Intelligence (cs.AI)
备注: 20 pages, including 12 pages of main text and 8 pages of appendix; work in progress

点击查看摘要

Abstract:Although Large Language Model (LLM) agents have demonstrated strong performance on complex tasks, their learning is often limited by inefficient interaction feedback and static training environments, which hinder broader generalization. To address these limitations, this paper introduces Role-Agent, \textcolorblacka framework that harnesses a single LLM to function concurrently as both the agent and the environment, enabling a bootstrapped co-evolution. Role-Agent comprises two synergistic components: World-In-Agent (WIA) and Agent-In-World (AIW). In WIA, the LLM acts as the agent and predicts future states after each action; the alignment between predicted and actual states is then used as a process reward, encouraging environment-aware reasoning. In AIW, the LLM analyzes failure modes from failed trajectories and retrieves tasks with similar failure patterns, thereby reshaping the training data distribution for targeted practice. Experiments on multiple benchmarks show that Role-Agent consistently improves performance, yielding an average gain of over 4% over strong baselines.

[AI-28] What Do Deepfake Speech Detectors Actually Hear? INTERSPEECH2026

链接: https://arxiv.org/abs/2606.10912
作者: Vojtěch Staněk,Veronika Jirmusová,Anton Firc,Kamil Malinka,Jakub Reš,Martin Perešíni
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: Accepted to Interspeech 2026

点击查看摘要

Abstract:Deepfake speech detectors often output a single score without explaining why an audio sample is flagged, where in the signal the evidence lies, or what cues drive the decision. We propose an audio-native explainability pipeline using Integrated Gradients on time-aligned self-supervised representations to localize decision evidence over time. We apply the proposed method to three WavLM-based detectors (AASIST, CA-MHFA, SLS) on ASVspoof 5 and manually annotate the highest-attribution regions to provide a semantic meaning of the most important cues. Despite similar performance, the detectors rely on different cues: AASIST emphasizes non-speech/environment cues, CA-MHFA focuses on localized phoneme artifacts, and SLS relies on word boundaries and spectral integrity. We move beyond speculative reasoning and validate our findings by causal masking of the primary detector cues. Observed performance degradation further supports the explained detector semantics.

[AI-29] Ethical and Technical Limits of Deepfake Speech Datasets INTERSPEECH2026

链接: https://arxiv.org/abs/2606.10911
作者: Vojtěch Staněk,Eva Trnovská,Kamil Malinka,Anton Firc
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: Accepted to Interspeech 2026

点击查看摘要

Abstract:Claims about the robustness and fairness of deepfake speech detectors are only as credible as the datasets used to train and evaluate those systems. We present a dataset-level audit of the deepfake speech landscape. We compile and analyze 39 deepfake speech datasets, examining key attributes including accessibility, documentation, demographic and language coverage, dataset scale, and the underlying bona fide speech sources. Our audit reveals two important takeaways. Firstly, fairness assessment is largely infeasible because most datasets lack demographic metadata, and only a few contain gender or language labels. This prevents any meaningful subgroup analysis and leaves other demographic attributes unaddressed. Secondly, we identify substantial overlap in underlying bona fide source corpora across datasets, which can undermine cross-dataset evaluation and lead to overstated generalization claims.

[AI-30] RAT: Reference-Augmented Training for ASV Anti-Spoofing INTERSPEECH2026

链接: https://arxiv.org/abs/2606.10908
作者: Vojtěch Staněk,Anton Firc,Jakub Reš,Kamil Malinka
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: Accepted to Interspeech 2026

点击查看摘要

Abstract:We introduce a spoofing countermeasure architecture conditioned on speaker-reference recordings, but observe that it converges to a solution that effectively ignores the reference during inference. Surprisingly, training with a reference channel induces invariance that improves deepfake detection, even when the reference is absent or mismatched during inference. Based on this observation, we propose a Reference-Augmented Training (RAT) strategy. RAT yields improved detection performance compared to single-utterance baselines, even when the reference recording is replaced with a zero vector at inference. Through rigorous analysis, we demonstrate that the optimization process rapidly diminishes the reference contributions, leading to inference largely independent of the reference channel. Using RAT, we achieve state-of-the-art 2.57% EER and 0.074 minDCF on the ASVspoof 5 benchmark with a single detector, surpassing even large ensemble systems.

[AI-31] Optimal Post-Training Quantization Scales and Where to Find Them

链接: https://arxiv.org/abs/2606.10890
作者: Juan Amboage,Pablo Monteagudo-Lago,Ian Colbert,Giuseppe Franco,Nicholas Fraser
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Post-training quantization (PTQ) compresses large language models by mapping weights to low-bit representations. The scaling factor that defines the quantization grid is typically chosen using simple, data-free heuristics. In this work, we present PiSO (Piecewise Scale Optimization), an algorithm that leverages calibration data to compute the optimal channel-wise weight scales exactly and efficiently under round-to-nearest quantization. PiSO partitions the scale search space into finitely many intervals on which the objective admits a closed-form minimizer. We extend PiSO to group-wise quantization via principled heuristics and propose effective strategies for interleaving scale optimization with error correction. Experiments on Llama and Qwen models across multiple model sizes and target weight bit-widths demonstrate consistent improvements in perplexity and downstream zero-shot accuracy, both standalone and combined with error correction. In particular, we observe increased benefits as the target bit-width narrows and quantization becomes more challenging.

[AI-32] Large-scale semantic mapping of learner agency and autonomy reveals what measurement and generative AI research overlook

链接: https://arxiv.org/abs/2606.10881
作者: Fei Qin,Xiaobo Liu,Yaowen Zhang,Xuming Li,Fei Wang,Mutlu Cukurova,Jingjing Chen,Yu Zhang
类目: Artificial Intelligence (cs.AI)
备注: 45 pages, 12 figures, 1 table, including appendices

点击查看摘要

Abstract:Learner agency and autonomy are foundational to personal development, yet a pervasive “jingle-jangle” fallacy (i.e. identical terms denoting different constructs, distinct terms denoting identical ones) has substantially hindered cumulative knowledge. Treating meaning as a phenomenon constituted through use in linguistic practice, we extracted 8,954 definitions and 2,700 scale items from over 14,000 publications, to investigate how researchers actually used learner agency and autonomy with a semantic analysis pipeline. The definitional landscape of two constructs resolves into three dimensions: regulation and control of learning (task), intrinsic motivation and internal decision-making (person), and social-relational action (sociocultural), thereby empirically quantifying the jingle-jangle fallacy. Existing scales, however, systematically underrepresent the sociocultural dimension. Critically, current generative AI research in education concentrates on learning regulation and control, narrowing the behavioral repertoire that AI-mediated learning environments are designed to cultivate. Beyond conceptual clarification, this work carries direct implications for conceptualization, measurement, and practice towards supporting the multidimensional learner agency and autonomy.

[AI-33] Geometrically Averag ed Hard Target Updates for Linear Q-Learning

链接: https://arxiv.org/abs/2606.10835
作者: Donghwan Lee
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Periodic hard target updates are among the most common stabilization devices in modern deep Q-learning. Recent studies suggest that target updates can improve stability in Q-learning with function approximation, including linear function approximation. We introduce and analyze the so-called \lambda -target update, obtained by averaging the m -periodic target update maps with \lambda -geometric weights (1-\lambda)\lambda^m-1 , \lambda \in [0,1] . The endpoint \lambda=0 recovers the one-period target update, while the continuous endpoint \lambda\uparrow1 recovers projected Q-value iteration. We study this mechanism for Q-learning with linear function approximation, namely linear Q-learning, using a switching-system model and related tools. For clarity, the paper treats a deterministic version; the formulation extends to stochastic reinforcement-learning settings.

[AI-34] Do VLMs Reason Like Engineers? A Benchmark and a Stage-wise Evaluation

链接: https://arxiv.org/abs/2606.10833
作者: Syed Wasiq,Syed Mohamad Tawseeq,Yashwant Pravinrao Bangde,Debaditya Roy
类目: Artificial Intelligence (cs.AI)
备注: 9 pages (main text), 4 figures, 2 tables; 50 pages total including appendix. The first two authors contributed equally

点击查看摘要

Abstract:Vision-Language Models (VLMs) demonstrate strong performance on general multimodal reasoning benchmarks, yet their ability to perform engineering reasoning remains largely unexplored. Unlike general visual question answering, engineering problem solving requires interpreting technical diagrams, selecting governing physical principles, and maintaining physically consistent multi-step reasoning. These capabilities are increasingly important for AI systems used in engineering education, scientific assistance, and technical decision-making, where reasoning failures may produce physically invalid yet superficially plausible solutions. Existing benchmarks primarily evaluate final answers and provide limited assessment of intermediate reasoning processes. We introduce EngVQA, a multimodal benchmark for evaluating engineering reasoning across 5 engineering subjects containing 696 problems. We introduce an 8-stage automatic evaluation framework for assessing VLM-generated solutions. The framework independently evaluates each stage of the solution, enabling fine-grained analysis of reasoning failures. We benchmark multiple state-of-the-art open and closed source VLMs on our evaluation framework and demonstrate substantial limitations in current engineering reasoning capabilities. Human evaluation shows strong agreement with our automated framework, achieving a Pearson correlation of 0.975 and a mean absolute error of 0.67 on a 10-point grading scale. Our results highlight the importance of process-oriented evaluation for reliable assessment of multimodal engineering reasoning systems.

[AI-35] A Unified Siamese Learning Framework for Zero-Day Anomaly Detection and Classification in Optical Networks

链接: https://arxiv.org/abs/2606.10827
作者: Carlos Natalino,Flávia P. Monteiro,Paolo Monti
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注: Authors’ version of the manuscript accepted and published at the Optical Fiber Communication Conference (OFC) 2026. 4 pages, 3 figures

点击查看摘要

Abstract:A multi-similarity Siamese neural network unifies zero-day anomaly detection and one-shot classification in optical networks, achieving over 99% accuracy and instant adaptability across lightpaths and unseen anomaly types without any retraining.

[AI-36] Moonshine: An Autonomous Mathematical Research Agent Centered on Conjecture Generation

链接: https://arxiv.org/abs/2606.10806
作者: Xiaoyang Chen,Xiang Jiang
类目: Artificial Intelligence (cs.AI); Functional Analysis (math.FA)
备注:

点击查看摘要

Abstract:Moonshine is an autonomous agent whose central objective is to generate mathematical conjectures. Its core capability is to extract structure from classical problems, distill new concepts, and formulate conjectures of mathematical significance. Rather than treating the solution of a single proposition as its endpoint, Moonshine builds an extensible theoretical framework through conjecture generation, bridge building, and obstacle identification. This article uses Moonshine’s exploration of the Jacobian conjecture as an example. It shows how the central logic of whether local nondegeneracy can force global injectivity is transferred to one-hidden-layer affine-ridge sigmoid networks. This leads to the formulation of the \emphNeural Jacobian Conjecture (NJC): if such a network has strictly positive Jacobian determinant on the whole space, then it must be globally injective. By invoking GPT-5.5-pro and DeepSeek-V4-pro separately, Moonshine obtained independent complete proofs for the case (N=n+1). In addition, with the assistance of ChatGPT through interactive use of its web interface with GPT-5.5-pro, a geometric-topological proof was developed. These results provide preliminary evidence for the plausibility of the conjecture. The general higher-width case (N\ge n+2), however, remains unresolved and is left for further investigation. This work illustrates Moonshine’s ability to autonomously generate meaningful mathematical problems and make rigorous progress on them.

[AI-37] Boosting ECG Classification Performance by Pre-training with Synthesized Data

链接: https://arxiv.org/abs/2606.10802
作者: Naoki Nonaka,Jun Seita
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deep Neural Networks (DNNs) typically require extensive datasets for effective training. In the medical domain, acquiring large-scale data is often challenging due to privacy concerns and the rarity of certain diseases. To address this data scarcity, we investigate the efficacy of training DNN models using synthetic data, generated based on domain-specific medical knowledge. Specifically, we develop a knowledge-driven Gaussian-composition synthesis algorithm for single-lead II ECGs, in which each heartbeat is represented by Gaussian-shaped P, Q, R, S, and T wave components. Using this simulator, we generate synthetic data for four abnormal electrocardiogram (ECG) classes: atrial fibrillation (AF), atrial flutter (AFLT), premature ventricular complex (PVC), and Wolff-Parkinson-White Syndrome (WPW). We evaluate the utility of this synthetic data by conducting abnormal ECG classification using ten different DNN architectures. Our results demonstrate that synthetic-to-real training improves classification performance for three of the four target abnormalities, with the largest architecture-averaged gain of 33.2% observed for AFLT. Further analysis reveals that the performance enhancement from synthetic data is more pronounced with smaller real-world datasets. These findings suggest that domain-knowledge-based synthetic ECGs can serve as a useful pre-training resource, particularly in scenarios where real-world data are limited or difficult to obtain.

[AI-38] Evaluating Research-Level Math Proofs via Strict Step-Level Verification

链接: https://arxiv.org/abs/2606.10799
作者: Yifeng Sun
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) struggle to rigorously verify complex mathematical proofs. Standard global evaluation approaches suffer from “context poisoning,” in which superficially plausible statements mask subtle logical flaws, leading to hallucination or over-skepticism. To address this, we shift from global evaluation to strict step-level verification: our framework maintains detailed context for each deduction step and strictly constrains the sources of applied theorems. We evaluate on a carefully curated adversarial diagnostic suite of research-level proofs drawn from the FirstProof challenge. A systematic ablation study demonstrates that these deductive constraints are indispensable, as unconstrained global prompting consistently fails to localize subtle logical errors. Beyond outperforming global evaluation, our approach fundamentally alters the failure taxonomy. Error analysis reveals that, rather than exhibiting severe logical hallucinations, remaining rejections are primarily instances of “pedantic hyper-rigor” stemming from unstated domain conventions, effectively exposing implicit ambiguities within the expert benchmark itself. Our findings suggest that prompting agents to organize their verification notes in a cautious, human-mathematician-like manner can substantially improve their ability to distinguish rigorous proofs from flawed ones, with the potential to strengthen agentic reasoning on frontier mathematical concepts that the base model does not already know well, and to lay a theoretical foundation for future automated proof-review systems. Code and prompts are available at GitHub.

[AI-39] READER: Robust Evidence-based Authorship Decoding via Extracted Representations

链接: https://arxiv.org/abs/2606.10794
作者: Jiaxu Liu,Sunnan Mu,Dong Huang,Liuyin Wang,Jing Shao,Jie Zhang
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As agentic applications increasingly route user tasks through official and third-party LLM APIs, provenance becomes an operational question: which model generated a given black-box response? We study Dynamic Black-Box LLM Provenance: identifying the source LLM from generations elicited by query-varying, non-predefined prompts rather than a fixed input set or benchmark suite. This setting is difficult because prompt semantics dominate the text, while model-specific authorship traces are weak and inconsistent at the surface level. We introduce READER (Robust Evidence-based Authorship Decoding via Extracted Representations), a lightweight provenance framework that treats a frozen proxy LLM as a reader of hidden authorship evidence. READER maps black-box outputs into proxy activation space, temporally filters token states within each response, and performs Bayesian Evidence Accumulation by summing single-response log-posterior evidence across independently sampled prompts. This avoids fragile mean-pooling of prompt-specific representations while preserving the query-wise evidence needed for calibrated confidence. On Agent500, a 50-target dataset built from agent-style prompts, READER reaches 31.0 - 42.4% top-1 accuracy from a single response and 70.0 - 84.0% from 50 responses, substantially outperforming sentence-encoder fingerprints. Scaling across nine proxy readers further shows that stronger LLMs expose more linearly decodable authorship structure, suggesting that authorship perception is already present in frozen LLM representations and can be converted into reliable multi-query attribution.

[AI-40] Accelerating NeurASP with vectorization and caching

链接: https://arxiv.org/abs/2606.10787
作者: Alexander Philipp Rader,Alessandra Russo
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注: 16 pages, 5 figures, to be published in the Theory and Practice of Logic Programming (TPLP) journal for the 42nd International Conference on Logic Programming (ICLP) issue

点击查看摘要

Abstract:Neurosymbolic AI combines neural networks with symbolic programs to create robust and explainable predictions. One such framework is NeurASP, which trains a neural network to predict concepts and reasons over them using rules written in answer set programming (ASP) to solve downstream tasks. Crucially, labels are only provided for the downstream prediction produced by the symbolic rules, not for the latent concepts this http URL through the non-differentiable ASP component requires expensive probability and gradient calculations, which has hindered scalability to more sophisticated this http URL this paper, we address the current limitations of NeurASP by improving its computational performance through vectorization, batch processing and caching of intermediate computations during training. We compare computation speeds between the original and our new implementation of NeurASP and report speedups of multiple orders of magnitude for larger tasks. To this end, we propose a new dataset of difficult tasks involving playing cards, which we use to test the capabilities of NeurASP’s enhanced learning function.

[AI-41] A Bayesian Network Approach for Enhancing Security-Focused Decision Support Systems

链接: https://arxiv.org/abs/2606.10782
作者: Carolina Fernández-Martínez,Shuaib Siddiqui,Vanesa Daza
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The adoption and integration of heterogeneous stacks in most of today’s open-source based networks brings clear benefits like interoperability and availability of advanced features. Yet, on the other hand the increasing number of interconnecting components and moving parts requires maintaining an ever increasing base of interdisciplinary knowledge of different tools in different domains to ensure proper operation. To alleviate such efforts, this work proposes a Decision Support System (DSS) to guide infrastructure operators through the selection of security approaches (e.g. tools) to adopt in their environments. This framework easily captures the end-user high-level requirements on the security triad for different domains and runs inference on the designated models to provide the identified tools (security mechanisms) that better serve such needs. The presented DSS aims at delivering an understandable and extensible framework to accommodate varying requirements and Bayesian Network (BN) models. The architecture and modelling of the system are proposed, aligned with its theoretical framework. Its performance is evaluated in terms of time and prediction accuracy.

[AI-42] AutoPDE: Reliable Agent ic PDE Solving via Explicitly Represented Solver Strategies

链接: https://arxiv.org/abs/2606.10752
作者: Huanshuo Dong,Keyao Zhang,Hong Wang,Zhezheng Hao,Zhiwei Zhuang,Ziyan Liu,Jiacong Wang,Gengyuan Liu,Xin Jin
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Numerical solvers for partial differential equations (PDEs) are core computational tools in science and engineering. Building reliable PDE solvers requires not only executable code, but a numerical solver strategy, a set of decisions about discretization, stabilization, solver configuration, and resolution control, that matches the PDE structure. Recent LLM-based coding agents have begun to reduce the programming burden by generating and debugging solver implementations. However, they typically move directly from a PDE problem to solver code, leaving the solver strategy implicit in implementation details. Feedback from a failed solve is therefore routed back to code edits rather than to the underlying strategy, so numerical decisions remain hard to check before code is generated and hard to revise using numerical evidence when it fails. To address this limitation, we propose AutoPDE, a code agent that maintains the solver strategy as an explicitly represented object throughout the solving process: an independent, inspectable object that is built before any code is written and can be revised, using numerical evidence, whenever a solve fails. AutoPDE builds and maintains this object in three stages, all drawing from a library of reusable PDE-solving skills: PDE analysis identifies the equation type and algebraic structure; numerical method selection chooses a numerical method that matches the analysis result and commits to a discretization, stabilization, and linear solver accordingly; and adaptive tuning runs low-cost pilot solves to calibrate resolution and tolerances under the prescribed accuracy and runtime budget. We evaluate AutoPDE on the PDE Agent Bench, where experimental results show that AutoPDE achieves a pass rate of 54.5% , improving over the strongest baseline by 14.2 percentage points.

[AI-43] oward Secure LLM Agents : Threat Surfaces Attacks Defenses and Evaluation

链接: https://arxiv.org/abs/2606.10749
作者: Yuchen Ling,Shengcheng Yu,Zhenyu Chen,Chunrong Fang
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language model (LLM) agents are rapidly moving from conversational interfaces to software components that plan, invoke tools, maintain memory, and act on external environments. This transition changes the nature of security risk. In agentic settings, failures are no longer limited to unsafe text generation. Untrusted content may redirect control flow, misuse tool privileges, corrupt persistent state, leak sensitive information, or trigger harmful external actions. At the same time, research on LLM agent security is expanding quickly but remains fragmented across attack families, defense layers, application domains, and evaluation settings. This paper synthesizes 247 papers through a lifecycle-based, systems-oriented framework that models agent security around the interaction of information flow, delegated authority, and persistent state. We organize the literature around four questions: how LLM agent security should be modeled, which threat surfaces and attack families dominate, what defenses have been proposed and with what tradeoffs, and how security claims are evaluated. We find that prompt injection and tool-mediated control-flow hijacking still dominate the field, while persistent state corruption and multi-agent propagation are becoming central emerging concerns. We further find that current defenses provide useful building blocks but remain weakly compositional, and that existing benchmarks still underrepresent long-horizon, stateful, and deployment-sensitive risks. We argue that secure LLM agents require explicit trust boundaries, principled privilege control, provenance-aware state management, and evaluation practices aligned with realistic operational settings.

[AI-44] he Arbiter Agent : Continually Monitoring Multi-Agent Conversations to Detect Emergent Misalignment

链接: https://arxiv.org/abs/2606.10747
作者: Filippo Tonini,Federico Torrielli,Anton Danholt Lautrup,Peter Schneider-Kamp,Mustafa Mert Çelikok,Lukas Galke Poech
类目: Artificial Intelligence (cs.AI)
备注: AITC 2026

点击查看摘要

Abstract:As AI systems built from multiple language-model agents become more common, they are increasingly used to make decisions together: discussing, negotiating, and acting on shared tasks. While individual agents may appear well-aligned when tested on their own, problems can arise from how they interact with one another. We introduce the Arbiter, an agent designed to monitor multi-agent conversations in real time and identify which participants may be behaving in misaligned ways. The Arbiter operates under a limited “inspection budget”, meaning it must decide carefully how to use its resources. As it observes a conversation step by step, it can choose to wait, question a participant, examine internal information such as system prompts or reasoning traces, or log concerning behavior. At the end, it produces a report identifying the likely source of misalignment. We evaluate the Arbiter across five conversation conditions, ranging from risky financial advice model organisms to evaluation-aware and colluding agents, we test five tool configurations of increasing capability and two backbone models. We find that the Arbiter reliably detects misaligned agents well before the end of the conversation, with active inspection tools improving both detection accuracy and speed. Weight-induced misalignment proves hardest to detect, while instruction-induced misalignment is identified reliably even under passive observation. The logging tool exhibits a dual effect, improving recall at the cost of precision. These results suggest that continual, budget-aware monitoring can effectively catch misalignment, and that overseeing multi-agent systems may require treating the auditor as an active participant in the process. The code is available at this https URL.

[AI-45] ransformer Based Model for Spatiotemporal Feature Learning in EEG Emotion Recognition

链接: https://arxiv.org/abs/2606.10718
作者: Xinglong Cui,Dian Gu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Electroencephalography (EEG) is a widely adopted technique for monitoring brain activity, offering valuable insights into neurological states due to its high temporal resolution and cost-effectiveness. To enhance the analysis of complex EEG data, we propose EEG-TransNet, an architecture designed to capture temporal, regional, and synchronous features of EEG signals. EEG-TransNet introduces three key modules: 1) a preprocessing and feature extraction module leveraging ResNet and wavelet-based denoising, 2) a Local Self-Attention Block for regional feature learning, and 3) a Fuzzy-Attention Synchronous Transformer (FAST) to model spatiotemporal dependencies. Through extensive experiments on three EEG datasets (BETA, SEED, and DepEEG), the proposed model consistently outperforms other methods in terms of classification accuracy and robustness across varying signal lengths. Ablation studies confirm the contribution of the Local Self-Attention Block in improving performance, and the inclusion of depthwise separable convolutions in the decoder reduces computational complexity while maintaining high accuracy. EEG-TransNet’s ability to generalize across subjects with minimal performance variation highlights its potential as a robust tool for EEG-based brain activity classification and emotion recognition tasks.

[AI-46] Unifying Data Memory and Compute Efficiency in LLM training: A Survey

链接: https://arxiv.org/abs/2606.10706
作者: Vanessa Schmidt,Huy Hoang Nguyen,Cédric Jung,Shirin Salehi,Anke Schmeink
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accpeted for publication in IEEE Transactions on Artificial Intelligence (TAI)

点击查看摘要

Abstract:Resource constraints increasingly determine what can be trained, fine-tuned, and deployed in large language models (LLMs), yet efficiency is often studied through isolated techniques rather than as an interacting system of limits. This survey adopts a constraint-centric perspective and organizes recent progress around three coupled bottlenecks: data efficiency (what to train on), memory efficiency (how to fit training), and compute budget awareness (when and where to spend FLOPs). On the data axis, we review selection and pruning methods that maximize learning per token, ranging from scalable proxy signals based on learning dynamics to gradient- and influence-based scoring, as well as difficulty-aware and curriculum-style strategies. We highlight emerging evidence that different notions of good data dominate in different regimes, implying that optimal subsets depend on the task objective and resource budget rather than being universal. On the systems side, we show that GPU memory, not raw compute, is often the dominant bottleneck in fine-tuning, and that effective scaling requires jointly reducing weight storage, optimizer states, and activation memory rather than optimizing any single component in isolation. Beyond memory, we frame training and inference as compute-governed processes in which optimization, data selection, and decoding must explicitly account for finite FLOP budgets. We review evidence for compute-optimal allocation and stopping rules, where computation should be halted or reallocated once marginal performance gains fall below a budget-dependent threshold. Together, these results unify compute-aware data selection, scaling laws, and adaptive inference under a common principle of resource-conditioned decision-making.

[AI-47] Event-Driven Reinforcement Learning Enables Long-Horizon Control in Semiconductor Fabrication

链接: https://arxiv.org/abs/2606.10705
作者: Yavar Yeganeh,Mahsa Shekari,Nicla Frigerio,Daniele Pagano,Andrea Matta
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:Reinforcement learning promises to optimize sequential decisions in large-scale systems. Semiconductor manufacturing systems are stochastic and highly constrained environments where heterogeneous wafers traverse hundreds of processing steps across extensive equipment networks. These characteristics yield complex, high-dimensional decision problems with delayed feedback and long-horizon requirements, complicating production planning and control. We propose a deep reinforcement learning framework for multi-objective policy optimization at this scale. Specifically, we formulate control as a centralized-agent problem, where a core policy coordinates system-wide decisions, while system evolution is represented as an interconnected temporal process driven by discrete events. Accordingly, we develop a tailored event-driven temporal-difference formulation that remains general and can be integrated with various policy optimization methods under relevant training settings. We investigate several core model-free algorithms incorporated into this framework and evaluate their effectiveness using high-fidelity simulations of diverse, industry-real operating scenarios. Across extensive validation experiments, agents trained in both offline and online settings show significant and consistent gains in throughput and utilization. We further evaluate performance and generalization across training phases, clarifying the relative strengths of alternative reinforcement learning formulations and algorithms. Overall, the results support the scalability, generality, and transferability of the proposed framework for controlling event-driven complex adaptive systems.

[AI-48] Divide and Cooperate: Role-Decomposed Multi-Agent LLM Agent LLM Training with Cross-Agent Learning Signals

链接: https://arxiv.org/abs/2606.10684
作者: Jaewan Park,Solbee Cho,Jay-Yoon Lee
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Modern language agents which perform multi-step reasoning have shown strong performance in knowledge-intensive question answering. However, existing approaches typically couple evidence acquisition and answer generation within a single policy. This forces a single model to play multiple potentially conflicting roles, inducing a combinatorial explosion in the policy space and hindering efficient exploration. It also introduces a credit assignment problem during training: a search action that retrieves sufficient evidence may still be penalized when generation fails, and vice versa. We propose DAC (Divide and Cooperate), a role-decomposed multi-agent training framework that divides agentic search into two cooperative subtasks, each handled by a dedicated agent trained with role-specific learning signals. The generator serves a dual role as both an answer producer and an evidence sufficiency verifier, abstaining when retrieved evidence is insufficient. This abstention signal is incorporated into the search agent’s reward, providing structured cross-agent learning signals that improve credit assignment. Conversely, the searcher exposes the generator to diverse and challenging evidence environments by hard-positive evidence augmentation, improving its robustness. Experiments on general and multi-hop QA benchmarks show that DAC, implemented via parameter-efficient LoRA modules over a shared backbone, achieves strong performance against prior baselines that rely on full fine-tuning of monolithic models.

[AI-49] In Defense of Information Leakage in Concept-based Models ICML2026

链接: https://arxiv.org/abs/2606.10669
作者: Mateo Espinosa Zarlenga
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: Accepted as a position paper at the Forty-Third International Conference on Machine Learning (ICML 2026)

点击查看摘要

Abstract:Concept-based models (CMs), deep neural networks that ground their predictions on representations aligned with human-understandable concepts (e.g., “round”, “stripes”, etc.), have been shown to learn representations that leak concept-irrelevant information. As the traditional narrative goes, this leakage is undesirable and should be eradicated as it leads to uninterpretable models. In this paper, we posit that this conventional view of leakage in CMs is not only ill-posed, as the evidence of how leakage makes a model less interpretable is often inconclusive, but also bound to lead to impractical CMs under common real-world constraints. Specifically, we argue that in real-world settings where concept incompleteness is the norm, some leakage is often necessary for constructing accurate and intervenable CMs. To this end, we propose that there is such a thing as benign leakage and show that, by optimizing a reframing of the typical CM training objective, CMs can encourage and exploit this form of leakage without sacrificing accuracy or intervenability.

[AI-50] Accounting for AI Inference in Corporate GHG Inventories: A Four-Tier Methodology for Scope 3 Category 1 Reporting

链接: https://arxiv.org/abs/2606.10660
作者: Guillermo Llopis(SOMA AI, Barcelona)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Preprint. Data repository: this https URL . 18 pages, 3 figures, 6 tables

点击查看摘要

Abstract:AI inference services – API subscriptions, enterprise chat tools, and SaaS products with embedded AI features – fall unambiguously within Scope 3 Category 1 under the Corporate Sustainability Reporting Directive (CSRD), which requires disclosure for fiscal years starting January 2024. Yet no standardised methodology exists for including them in corporate GHG inventories. Current practice either omits the category entirely or applies a generic economic input-output (EEIO) factor calibrated to the ICT sector as a whole, overestimating AI inference emissions by 10-40x relative to physically derived alternatives. We propose a four-tier framework that matches estimation precision to the data organisations can realistically obtain, progressing from direct token-based physical estimation – using GPU energy benchmarks and regional grid carbon intensities – down to a spend-based EEIO fallback for services where no usage data exists. Emission factors are derived from peer-reviewed GPU energy benchmarks (this http URL Leaderboard v3), confirmed grid carbon intensities (EPA eGRID 2023; Ember 2023), and published water use effectiveness data (Li et al., 2025). Applied to a 200-person European firm, the framework yields a total below 1 tCO2e, illustrating that the compliance challenge is methodological rather than magnitude-driven. We further document a water-carbon trade-off that current ESG tools do not surface: Sweden’s hydro-dominated grid delivers the lowest carbon intensity in our dataset but the highest water footprint, with direct implications for data centre location strategy. Comments: Preprint. Data repository: this https URL. 18 pages, 3 figures, 6 tables Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI) Cite as: arXiv:2606.10660 [cs.CY] (or arXiv:2606.10660v1 [cs.CY] for this version) https://doi.org/10.48550/arXiv.2606.10660 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-51] Post-Quantum Secure Federated DeFi for Inclusive Banking

链接: https://arxiv.org/abs/2606.10658
作者: Swati Sachan,Dale Fickett,Richard Buchinger,Theo Miller
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computational Finance (q-fin.CP)
备注:

点击查看摘要

Abstract:Recent advances in error-corrected qubits have accelerated the timeline for practical quantum computing. It poses a threat to cryptographic primitives used to secure financial systems, government infrastructure, communication networks, and DeFi (Decentralized Finance) ecosystems. This paper introduces a post-quantum secure federated DeFi framework that enables inter-bank collaboration to improve the inclusivity of individuals underserved by local lenders due to limited financial histories. Multiple banks contribute encrypted information batches to a virtual server, where lattice-based Fully Homomorphic Encryption (FHE) enables end-to-end homomorphic computation. The server fuses local data-driven probabilistic assessments, expert beliefs, and verifiable evidence generated by the NASA-IBM Prithvi Geospatial Foundation Model (GFM), in encrypted format. Decentralized technologies are employed to ensure tamper-proof evidence and auditable accountability for all encrypted data exchanges between institutions and the server. The framework is tested on agricultural lending decisions for rural borrowers in Virginia.

[AI-52] Is Fairness Truly Fair? Towards Reliable Lipschitz Fairness in Multi-Task Learning via Fixed-texorpdfstringδdelta Alignment

链接: https://arxiv.org/abs/2606.10632
作者: Junbo Ding,Xin Zang,Chenchen Pan,Donghao Song,Jiaxin Zhu,Danhuai Guo
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Lipschitz-style individual fairness formalizes the idea that semantically similar examples should receive similar predictions, but its evaluation in multi-task learning (MTL) can be confounded by method-induced representation scales. This paper identifies threshold confounding: when the auditing tolerance is derived from each model’s own representation distances, different algorithms are compared under different semantic thresholds. A threshold-drift analysis further shows how Bias rankings can change and identifies sufficient conditions for ranking preservation. We propose \textbfReLiF, a reliability-aware framework that separates evaluation-time fixed- \delta auditing from training-time controlled regularization. ReLiF uses a shared reference tolerance for comparable auditing and a violation-rate feedback controller to keep the Lipschitz surrogate active without letting it dominate stochastic training. This work also develops supporting analysis for threshold drift, reference-tolerance selection, and the relationship between the huberized training surrogate and its unsmoothed positive-margin counterpart. Experiments on clinical time-series benchmarks and NYUv2 (NYU Depth V2) dense prediction show that fixed- \delta auditing exposes utility–fairness trade-offs that method-dependent thresholds can obscure. On NYUv2 with a ResNet50 backbone, ReLiF achieves competitive utility while substantially reducing aligned bias under shared fixed thresholds. On clinical benchmarks, ReLiF yields controlled fairness-regularized trade-offs, while fixed- \delta auditing reveals that task-balancing baselines can sometimes achieve lower bias and that genuine utility–fairness trade-offs persist. These results support fixed- \delta auditing as a semantically consistent protocol for evaluating Lipschitz fairness in MTL. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2606.10632 [cs.LG] (or arXiv:2606.10632v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2606.10632 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Related DOI: https://doi.org/10.1145/3770855.3817938 Focus to learn more DOI(s) linking to related resources

[AI-53] Learning What to Remember: Observability-Safe Memory Retention via Constrained Optimization for Long-Horizon Language Agents

链接: https://arxiv.org/abs/2606.10616
作者: Qingcan Kang,Liu Mingyang,Shixiong Kai,Kaichao Liang,Tao Zhong,Mingxuan Yuan
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Long-horizon language agents accumulate observations, reasoning traces, and retrieved facts that exceed their finite context windows, making memory retention a fundamental resource-allocation problem. Existing memory systems improve management through heuristic scoring, retrieval optimization, or learned compression, but largely treat retention as a local decision problem and do not explicitly model its long-term consequences under realistic observability constraints. To fill this gap, we formulate memory retention as a constrained stochastic optimization problem with explicit budget feasibility, evidence utility, and delayed costs including miss penalties, reacquisition delays, and stale-information risk. We then propose OSL-MR (Observability-Safe Learning for Memory Retention), a novel framework that enforces a strict separation between online-observable features and offline-available supervision (OAS). OSL-MR combines an evidence learner trained from realized evidence supervision with a Mixed-Score heuristic that serves both as a deployable online-safe baseline and as a structured inductive prior for learning. The resulting policy learns query-conditioned evidence value directly from interaction data while remaining deployable under the same observability constraints. Experiments on LOCOMO and LongMemEval show that OSL-MR consistently outperforms recency-based methods, Generative Agents-style scoring, and other heuristic baselines, particularly under tight memory budgets. The Mixed-Score prior further improves precision while preserving recall, and sensitivity analysis demonstrates robustness across a wide range of cost configurations.

[AI-54] Fast and Highly Expressive Policy Learning for Offline Reinforcement Learning via Bootstrapped Flow Q-Learning ICML2026

链接: https://arxiv.org/abs/2606.10613
作者: Thanh Nguyen,Tri Ton,Hongbin Choe,Tung M. Luu,Chang D. Yoo
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: ICML 2026, 19 pages

点击查看摘要

Abstract:Diffusion-based Q-learning has emerged as a powerful paradigm for offline reinforcement learning, but its reliance on multi-step denoising makes both training and inference computationally expensive and brittle. Recent efforts to accelerate diffusion Q-learning toward single-step action generation typically introduce auxiliary networks, policy distillation, or multi-phase training, which frequently compromise simplicity, stability, or performance. To address these limitations, we introduce Bootstrapped Flow Q-Learning (BFQ), a novel framework that enables accurate single-step action generation during both training and inference, without auxiliary networks or distillation procedures. BFQ adopts a divide-and-conquer view of the displacement vector along the flow path: it begins by learning short-range displacements that can be accurately estimated from the Flow Matching marginal velocity, and bootstraps these components to directly learn a noise-to-action mapping in a single step. This formulation eliminates multi-step denoising, resulting in a learning procedure that is substantially faster, simpler, and more robust. Extensive D4RL evaluations show that BFQ improves performance while significantly reducing computational cost compared to multi-step diffusion baselines, demonstrating that single-step action generation suffices for high-performance offline Reinforcement Learning.

[AI-55] Dmsh: A Multi-Agent Reinforcement Learning Framework for All-Quad Mesh Generation

链接: https://arxiv.org/abs/2606.10601
作者: Anirudh Kalyan,Cosmin Anitescu,Xiaoying Zhuang,Timon Rabczuk,Somdatta Goswami,Sundararajan Natarajan
类目: Numerical Analysis (math.NA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Generating high-quality meshes for arbitrary geometries remains a fundamental bottleneck in computational engineering, often demanding heuristic tuning and semi-manual workflows. In this paper, we introduce Dmsh, a first fully automated reinforcement learning pipeline that unifies geometric decomposition and quadrilateral mesh generation within a single learning-based framework. Dmsh decomposes the problem through three coordinated agents handling topology simplification, geometric regularization, and mesh generation. The meshing process is formulated as a Markov Decision Process and solved using a parametric Soft Actor-Critic architecture with decoupled critics, enabling efficient exploration of a hybrid discrete-continuous action space. A curriculum learning strategy ensures scalability from simple domains to highly complex geometries, suppressing seed variance. By design, the recursive decomposition enables parallel meshing of subregions, yielding globally conforming all-quadrilateral meshes without post hoc correction. Across a wide range of benchmarks, Dmsh consistently outperforms existing methods in automation, robustness, and mesh quality, establishing a new paradigm for learning-based mesh generation.

[AI-56] Embedding Hybrid Systems into Continuous Latent Vector Fields ICML2026

链接: https://arxiv.org/abs/2606.10596
作者: Sangli Teng,Hang Liu,Koushil Sreenath
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: Accepted to ICML 2026

点击查看摘要

Abstract:This work proves that an n -dimensional hybrid system can be embedded into an m -dimensional Euclidean space equipped with a continuous vector field on its embedded image whenever m2n . This result suggests that an intrinsically discontinuous hybrid system generically admits a continuous extrinsic representation that is well-posed for differentiable optimization. Building on this existence theorem, we show that a latent Neural ODE with consistency loss in both the latent and state space can accurately recover the flow of hybrid systems. Extensive experiments suggest the proposed method outperforms the existing method in learning hybrid systems with varying geometries from only time series data.

[AI-57] From Data Heterogeneity to Convergence: A Data-Centric Review of Federated Learning

链接: https://arxiv.org/abs/2606.10595
作者: Huong Nguyen,Mickaël Bettinelli,Amirhossein Ghaffari,Alexandre Benoit,Hong-Tri Nguyen,Susanna Pirttikangas,Lauri Lovén
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Federated Learning (FL) has emerged as a promising solution for data hunger in centralized learning. This paradigm enables privacy with multiple clients to train a shared-task model collaboratively without exposing their local data. While being a key component in any learning system, data is also a primary source of vulnerabilities and challenges, and a major determinant of a stable and well-converged training. Existing FL reviews describe general foundations, security practices, opportunities, challenges, and applications, without delving into diverse aspects of data and considering problems from the data perspective. They rarely provide a data-lens synthesis that links concrete data properties, split protocols, and defenses to convergence speed and stability. This survey fills that gap with three advances. First, we analyze non-IID into measurable traits and rank their influence on convergence as strong, medium, or light, explaining the mechanisms behind each and reconciling evidence across images, texts, and graphs. Second, we connect experimental splitting practices to the real phenomena they emulate, expose the artifacts they introduce, and show how those artifacts affect target accuracy. Third, we analyze how data-related vulnerabilities and their proposed defenses affect convergence, reporting performance under clean and adversarial conditions to make the convergence-robustness trade-off explicit. To our knowledge, this is the first survey to provide a complete understanding of data-related challenges that govern FL. With clear takeaways distilled for each concern, our work serves as actionable guidance, helping practitioners design their system with predictable convergence and stability.

[AI-58] owards Diverse Scientific Hypothesis Search with Large Language Models ICML2026

链接: https://arxiv.org/abs/2606.10587
作者: Haorui Wang,Parshin Shojaee,Kazem Meidani,Kunyang Sun,José Miguel Hernández-Lobato,Teresa Head-Gordon,Jiajun He,Chandan K. Reddy,Chao Zhang,Yuanqi Du
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: ICML 2026

点击查看摘要

Abstract:Large language models (LLMs) are on the rise for accelerating scientific discovery, most recently in advanced tasks such as generating valid scientific hypotheses. Yet in many discovery settings, the goal is not to identify a single best hypothesis since validation can be noisy and expensive, and scientists benefit from a set of high-quality alternative hypotheses that hedge against downstream uncertainty for the best solutions. Nevertheless, commonly used evolutionary search recipes tend to prioritize optimization over exploration in hypothesis generation, and the resulting selection pressure during the search process leads to diversity collapse. Motivated by these limitations, we formulate hypothesis search as a sampling problem, where the objective is to efficiently produce diverse, high-quality hypotheses under a fixed validation budget. Building on this perspective, we propose \ours, an evolutionary framework inspired by the classical parallel tempering algorithm that searches hypotheses at multiple temperature levels and enables principled information exchange across temperatures to improve exploration without disrupting convergence. Across domains including molecular discovery, equation discovery, and algorithm discovery, our approach consistently improves both hypothesis quality and diversity under the same validation budget, and produces candidates that remain robust under more expensive downstream computational validations.

[AI-59] NOVA: Symbolic Regression Discovery of Interpretable Car-Following and Lane-Change Models with Driver Heterogeneity

链接: https://arxiv.org/abs/2606.10583
作者: Ishak Abassi,Nassim Ali Bouazzouni,Farah Ibelaiden,Nadir Farhi
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:We present NOVA, an autonomous symbolic regression framework that identifies interpretable car-following and lane-change structures from raw trajectory data with minimal behavioral priors. Applied to 4,765,788 active driving observations from the NGSIM I-80 and US-101 datasets, NOVA’s deterministic Rust-powered search engine evaluates over 10,000 candidate algebraic structures and identifies a compact two-term acceleration model under a forward-shifted rolling-mean prediction target. Evaluated under two complementary preprocessing pipelines, NOVA achieves RMSE = 1.376 m/s^2 ( R^2 = 15.57% ) on the intent-forecasting benchmark, outperforming the best recalibrated symbolic-regression baseline (SR-LLM, PNAS~2025) by 0.135 m/s ^2 in RMSE under an identical evaluation protocol. Across eight independent experiments, a single dominant nonlinear term emerges as a robust backbone of human car-following; a residual-guided extension further links the selected structure to an established psychophysical theory of collision avoidance. The discovered feature operators transfer zero-shot between freeway sites with under 3 pp R^2 loss. Extended to lane-change modelling within a multinomial logit framework, NOVA achieves 67.4% balanced accuracy under strict vehicle-ID holdout on 502 unseen drivers, surpassing existing lane-changing baselines by +29.8 percentage points on a three-class problem.

[AI-60] Drawing with Strangers: Population Scaling Drives Zero-Shot Mutual Intelligibility in Emergent Sketching

链接: https://arxiv.org/abs/2606.10582
作者: Jooyeon Kim
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Generalization in emergent communication has largely focused on novel inputs or linguistic structures, yet the capacity for agents to communicate with strangers from strictly disjoint communities remains relatively unexplored. In this work, we formalize this capability as \textitzero-shot mutual intelligibility (ZMI): successful communication between independently trained populations without prior exposure. Leveraging emergent sketching – in which agents communicate through sets of drawn strokes – as a visually grounded modality, we find that scaling the training population substantially improves ZMI across independent groups. Crucially, as we scale the population size, in-group communicative variation increases, preventing co-adaptation into homogeneity. Simultaneously, cross-group variation decreases, indicating a structural convergence toward a certain type of universality. Further analysis reveals that this universality is achieved through perceptual grounding: scaled populations increasingly anchor their emergent sketches on the objective visual resemblance of the target images. Together, these results position ZMI as a distinct axis of generalization in emergent communication and suggest a route toward socially interoperable artificial agents.

[AI-61] Convergence of Monte Carlo Optimistic Policy Iteration: Beyond Uniform State-Action Updates

链接: https://arxiv.org/abs/2606.10580
作者: Octave Oliviers,Glenn Vinnicombe
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The asymptotic behaviour of Monte Carlo optimistic policy iteration (MC-O-PI) is a long-standing open question. When the model of the environment is unknown, as is common in practice, the only known condition that guarantees convergence to optimality is impractical. In its canonical form, this condition requires that the episodes used for policy evaluation be initialised uniformly over the entire state-action space. This paper strictly relaxes that requirement. Specifically, we prove that initial-visit MC-O-PI converges to optimality even when updates are uniform only over the actions within each state. This allows episodes to start in different states at arbitrary frequencies; a realistic implementation when the state space is large or unknown but the action space in each state is manageable. The proof departs from the classical analysis of Tsitsiklis whose central commutativity argument no longer applies when states are updated at different frequencies. Instead, we first show that the mean-field dynamics of MC-O-PI generate monotonically improving policies when updates are uniform over the actions in each state, and then prove that noise cannot consistently prevent this improvement by extending the lock-in argument of the combined stability-ODE method. This approach suggests a new way to study optimistic policy-iteration algorithms in general.

[AI-62] One Token per Multimodal Evidence: Latent Memory for Resource-Constrained QA

链接: https://arxiv.org/abs/2606.10572
作者: Zhi Zheng,Ziqiao Meng,Hao Luan,Wei Liu,Wee Sun Lee
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:External memory effectively grounds large language models (LLMs) and vision-language models (VLMs)-based question answering (QA) in relevant multimodal evidence. However, existing memory paradigms represent each memory item in raw text and image forms, so retrieval-based systems must pass the retrieved text or images to the generation LLMs/VLMs, resulting in high token consumption and storage pressure, making it unaffordable for resource-constrained applications. We propose Latent Memory, a latent-space memory paradigm that replaces each raw text or image evidence item with a single high-dimensional latent token produced by a small compressor LLM/VLM. Rather than retrieving raw evidence for generation, Latent Memory operates in a unified latent representation space: the query is embedded into this space to retrieve relevant latent tokens, and the retrieved latent tokens are directly prompted to a pretrained LLM or VLM for answer generation. To make each latent token simultaneously informative for reconstruction, retrieval, and generation, we train the compressor with reconstruction, contrastive, and distillation objectives in a unified end-to-end manner. Latent Memory is evaluated on seven text-only QA benchmarks (e.g., HotpotQA) and multimodal QA benchmarks, where it achieves competitive QA performance compared to advanced RAG baselines while consuming 3x to 10x fewer generator tokens. It can also deliver the strongest image-grounded QA performance on WebQA. Code is available at this https URL.

[AI-63] Flexible Flows for Biological Sequence Design

链接: https://arxiv.org/abs/2606.10543
作者: Yogesh Verma,Dani Korpela,Harri Lähdesmäki,Vikas Garg
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:Designing functional biological sequences requires navigating vast discrete spaces under strict evolutionary and biophysical constraints. Discrete Flow Matching (DFM) offers a generative framework over such spaces, but existing approaches rely on biologically uninformative couplings and offer limited flexibility for variable-length sequence generation and fine-grained control. We propose a structured coupling that encodes domain-specific preferences among sequence elements, biasing the source distribution toward plausible regions without modifying the flow objective or training procedure. Building on this, we introduce a latent edit-based rate parameterization that models variable-length generation via edit operations conditioned on a shared global latent, akin to a latent variable model, while remaining tractable. We further introduce a latent classifier-free guidance mechanism that steers generation coherently in continuous latent space, along with Dirichlet-prior temperature scaling for test-time control over edit operations. Our method achieves state-of-the-art performance across diverse biological sequence tasks, including density estimation, unconditional and conditional DNA sequence generation, and peptide sequence generation.

[AI-64] ActiveMem: Distributed Active Memory for Long-Horizon LLM Reasoning

链接: https://arxiv.org/abs/2606.10532
作者: Yunhan Jiang,Wenbin Duan,Shasha Guo,Liang Pang,Xiaoqian Sun,Huawei Shen
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Memory is essential for enabling large language model (LLM) agents to handle long-horizon reasoning tasks. Existing memory mechanisms are largely centralized, typically organizing retrieved information and interaction history within a single model context. This design imposes a fundamental trade-off: scaling reasoning trajectories risks context overload, whereas aggressive content pruning may result in irreversible information loss. Seeking a better trade-off, we draw inspiration from human cognitive systems, especially the functional complementarity between the prefrontal cortex (executive control) and the hippocampus (memory management), suggesting that such a trade-off need not be inherent, but may instead stem from centralized memory organization. To this end, we propose ActiveMem, a heterogeneous framework that decouples agent memory from the core reasoning process. Specifically, a high-level Planner utilizes distilled semantic gists to execute reasoning, while a lightweight, distributed memory system operates in parallel to actively accumulate and consolidate these gists throughout the task. Experiments on BrowseComp-Plus and GAIA show that ActiveMem achieves state-of-the-art accuracy with significantly reduced overhead, demonstrating the effectiveness of distributed active memory for long-horizon reasoning.

[AI-65] Machine Learning Methods for Studying Latent Neural Activity Dynamics IJCAI2026

链接: https://arxiv.org/abs/2606.10530
作者: Shufeng Kong,Fumei Deng,Xinyi Dong,Caihua Liu,Weiwei Chen,Yingheng Wang,Daniel Cao,Azahara Oliva,Antonio Fernandez-Ruiz,Carla Gomes
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by IJCAI 2026 survey track

点击查看摘要

Abstract:Recent developments in brain recording are driving a demand for machine learning tools capable of decoding the latent structure of large populations of neurons. In this paper, we provide a comprehensive survey that outlines the trajectory of Latent Variable Models (LVMs) from early state-space models to more recent deep generative models. We organize the literature into three closely related domains: (1) Single-Region Latent Dynamics, which includes models such as linear dynamical systems to more complex dynamics represented by Recurrent Neural Networks (RNNs) and Neural Ordinary Differential Equations (ODEs); (2) Multi-Region Communication, which employs probabilistic as well as subspace methods to study how information is transferred across different brain areas considering synaptic propagation delays and network connectivity; and (3) Behavior-Aligned Modeling, which seeks to disentangle neural activity related to task performance from other internal states via supervised or contrastive learning. This survey also includes large-scale neural foundation models, such as Transformers and diffusion models, that rely on large-scale pre-training for optimal performance across subjects. Finally, we conclude and discuss benchmarks, evaluation criteria, and open challenges, such as the ability to identify causal links or directionality of communication, to facilitate future research for bridging interpretable brain dynamics with reliable neural decoding.

[AI-66] Assessing Automated Prompt Injection Attacks in Agent ic Environments

链接: https://arxiv.org/abs/2606.10525
作者: David Hofer,Edoardo Debenedetti,Florian Tramèr
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Indirect prompt injection poses a critical threat to LLM agents that interact with untrusted external data, yet automated attack methods–proven effective for jailbreaking–remain underexplored in realistic agentic settings. We present a comprehensive empirical evaluation of automated prompt injection attacks against LLM agents, adapting both white-box (GCG) and black-box (TAP) methods to the agentic setting within the AgentDojo framework. We evaluate across 80 task pairs spanning four domains and multiple models, and find that black-box optimization substantially outperforms gradient-based methods, a gap we attribute to GCG’s optimization instability under reasonable compute budgets. We also find that TAP’s effectiveness depends on the attacker model, as both general capability and safety tuning affect attack success–stronger models produce more effective injections, while safety-tuned attackers can refuse to generate adversarial prompts. Task-universal attacks transfer effectively to unseen tasks and out-of-distribution domains, but attacks optimized on smaller open-source models do not transfer to frontier models like GPT-5. These findings highlight automated prompt injection as a credible but model-dependent threat, with significant barriers remaining for model-agnostic exploitation.

[AI-67] HIPIF: Hierarchical Planning and Information Folding for Long-Horizon LLM Agent Learning

链接: https://arxiv.org/abs/2606.10507
作者: Juncheng Diao,Zhicong Lu,Peiguang Li,Yongwei Zhou,Changyuan Tian,Qingbin Li,Rongxiang Weng,Jingang Wang,Xunliang Cai
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While Large Language Models (LLMs) have demonstrated strong capabilities as autonomous agents across a wide range of tasks, their performance often degrades in multi-turn long-horizon agentic tasks. Existing methods have made progress through fine-grained credit assignment to alleviate long-horizon sparse rewards and hierarchical reinforcement learning to decompose tasks and reduce long-term dependency. However, these methods still do not directly address long-context interference, in which continuously growing histories weaken the agent’s ability to track the global task state and impair subsequent reasoning and decision-making. Inspired by the way humans handle complex tasks through subgoal decomposition and completed progress summarization, we propose Hierarchical Planning and Information Folding (HIPIF) for long-horizon LLM agent learning. HIPIF trains the agent end-to-end to organize long-horizon execution around explicit subgoals while folding completed subgoal histories to reduce long-context interference. Furthermore, to stabilize subgoal-based planning and execution, HIPIF combines hierarchical reflection and subgoal-oriented process rewards to guide subgoal generation, transition, and execution, without relying on costly auxiliary models or task-specific expert trajectories. Extensive experiments on three publicly available agentic benchmarks demonstrate the validity of our method.

[AI-68] Cross-Modal Knowledge Distillation without Paired Data: Theoretical Foundation and Algorithm

链接: https://arxiv.org/abs/2606.10504
作者: Trong Khiem Tran,Anh Duc Chu,Quang Hung Pham,Phi Le Nguyen,Trong Nghia Hoang
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Cross-modal knowledge distillation (CMKD) studies how a (large) teacher model trained on one type of data (e.g., images) can guide a (smaller) student model building on another type of data (e.g., text/audio). Existing CMKD methods often require paired multi-modal data with aligned semantics, but obtaining such paired data are often costly and impractical. To mitigate this limitation, we develop a new CMKD framework for the more challenging setting where paired data are unavailable. In particular, we establish a cross-modal distributional relationship between teacher and student models, which reveals two fundamental quantities governing effective distillation: feature alignment and label alignment. These quantities characterize semantic discrepancy between modalities at the levels of representation and prediction distributions, respectively. Motivated by this insight, we propose a principled framework, with theoretical guarantees, that enables effective cross-modal knowledge distillation by aligning distributions rather than individual samples. Extensive experiments across a wide range of multimodal benchmarks show that our framework is highly effective in both unpaired and paired data settings, improving significantly over prior work.

[AI-69] A Reliable Fault Diagnosis Method Based on Belief Rule Base Consider Robustness Analysis

链接: https://arxiv.org/abs/2606.10500
作者: Mingyuan Liu,Dan Yin,Zongzong Wu
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In equipment operation, the implementation of fault diagnosis is essential to ensure the continuity and safety of production equipment, improve operational efficiency and reduce maintenance costs. Since sensor readings are widely used for fault diagnosis, their reliability directly affects the results of fault diagnosis. A new fault diagnosis method is proposed to address the two problems of robustness assessment and robustness optimization of fault diagnosis models. For this purpose, a reliable fault diagnosis method based on a belief rule base (BRB) considering robustness analysis is proposed. Firstly, the robustness analysis of the BRB model is carried out systematically. Secondly, three robustness constraint strategies are proposed to optimize the robustness of the BRB fault diagnosis model. Finally, the effectiveness of the proposed model is verified by taking the fault diagnosis of WD615 diesel engine and Case Western Reserve University bearings as an example, and the experiments show that the proposed model improves both accuracy and robustness.

[AI-70] MoE Enhanced Federated Learning for Spatiotemporal Prediction

链接: https://arxiv.org/abs/2606.10499
作者: Zhehao Dai,Xiao Han,Zhaolin Deng,Zijian Zhang,Xiangyu Zhao,Guojiang Shen,Xiangjie Kong
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Traffic prediction is fundamental to intelligent transportation systems and urban computing, yet many cities continue to suffer from traffic data scarcity due to limited sensor deployment and uneven urban development. Cross-city knowledge transfer has thus attracted increasing attention, enabling data-rich cities to assist data-scarce ones. However, centralized approaches raise privacy concerns, while existing federated methods struggle with pronounced spatiotemporal heterogeneity across cities. To address these challenges, we propose MoE-FedTP, a personalized federated cross-city spatiotemporal prediction framework based on lightweight Mixture-of-Experts (MoE) networks. MoE-FedTP first employs spatiotemporal neural networks to extract features from both source and target cities, then introduces a set of expert networks derived from different source cities through partial parameter sharing. A gating mechanism dynamically fuses the experts to capture diverse traffic dynamics, achieving fine-grained modeling of urban heterogeneity while preserving privacy. Experiments on four real-world traffic datasets show that MoE-FedTP consistently outperforms state-of-the-art cross-city and federated learning baselines, demonstrating its effectiveness in enhancing prediction accuracy for data-scarce cities.

[AI-71] Achieving Cloud-Grade SLOs for Local Mixture-of-Experts Inference through CPU-GPU Hybrid Design OSDI’26

链接: https://arxiv.org/abs/2606.10493
作者: Wenxin Wang,Yule Hou,Yu Ji,Peng Qu,Youhui Zhang
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注: Accepted to the 20th USENIX Symposium on Operating Systems Design and Implementation (OSDI '26). The official version will appear in the OSDI '26 proceedings published by USENIX

点击查看摘要

Abstract:Local deployment of large Mixture-of-Experts (MoE) models falls short of the service quality achieved in cloud-scale environments, even under low-concurrency workloads. We identify four key gaps in local MoE inference: reliance on capacity-reduced models (quantized, distilled, rerouted), inability to meet 30-second TTFT for long prefills (more than 12K), sub-baseline decode throughput (under 20 tokens/s), and poor concurrency under mixed prefill-decode and batched decode workloads. We present a CPU-GPU hybrid system that achieves cloud-level SLOs on dual-socket commodity CPUs and consumer GPUs by (1) stream-loading prefill (SLP), boosting prefill throughput to 1,200 tokens/s and enabling 32K prompts within 30 seconds; (2) distributed SLP (DSLP) with SmallEP expert parallelism, reaching 1,800 tokens/s and 45K prompts in 30 seconds on two RTX 5090s; (3) intra-node prefill-decode disaggregation with zero-copy shared weights and a dual-batch attention-MoE overlap scheme, sustaining concurrency with under 15 percent latency increase and 50 percent throughput gains; (4) an AVX-512-optimized FP8 GEMV kernel, enabling native CPU FP8 inference while delivering 4-5x lower CPU latency; and (5) fine-grained CPU parallelism that attains 28 tokens/s on INT4 DeepSeek-V3 and 21.5 tokens/s on intact FP8 V3. Evaluations show our system delivers cloud-level QoS for flagship MoE models on consumer CPU-GPU platforms, reshaping local deployment with intact, original-precision inference and enabling high-quality, cost-effective access without datacenter infrastructure.

[AI-72] A complementary study on PlanGPT : Evaluation with defined Performance Metrics and comparison with a planner

链接: https://arxiv.org/abs/2606.10489
作者: Youssef Abdelkader,Humbert Fiorino,Damien Pellier
类目: Artificial Intelligence (cs.AI)
备注: 7 pages

点击查看摘要

Abstract:Automated Planning is a subfield of Artificial Intelligence (AI) where the main objective is generating a sequence of actions, known as a plan, that helps us reach a goal state from an initial state. A planning problem is defined by a set of objects, an initial state and a desired goal state. The objective is to compute a plan that’ll lead us from the inital state to the goal state. Programs that generate plans are called planners. In this paper, we did a complementary study to the state-of-the-art LLM called PlanGPT which was released last year. We redid some experiments to verify whether planning with LLMs is \textbfpertinent and \textbfworthwhile. We also check whether the results obtained in the official PlanGPT paper for plan coverage were correct, and we also performed a more comprehensive study on PlanGPT’s performance: in our paper PlanGPT’s performance was evaluated using two metrics: Plan Cost and Plan Generation Time. The results of planGPT were compared to those produced by a traditional planner for the same plans and same metrics. We discovered that PlanGPT is no better than a Greedy search strategy. Comments: 7 pages Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2606.10489 [cs.AI] (or arXiv:2606.10489v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2606.10489 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-73] Stop Early Spend Less: Hidden-State Probes as a Practical Recipe for Streaming Moderation of LLM Outputs

链接: https://arxiv.org/abs/2606.10487
作者: Huizhen Shu,Xuying Li,Piao Xue
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Technical Report. 14 pages, 3 figures, 4 tables

点击查看摘要

Abstract:Deploying large language models in user-facing systems requires efficient output safety filtering. Existing approaches typically rely on a separate moderation model applied after generation, which doubles inference cost and only detects violations after generation completes. We observe that the signal needed for moderation is already present in the model hidden states. Based on this, we train lightweight token-level probes that operate directly on internal activations, producing per-token safety scores that can be aggregated for both offline evaluation and online intervention. The probe reuses activations from the generator and requires no additional forward pass, enabling sub millisecond per-token safety checks inside the decoding loop. A probe applied to a single mid layer recovers most decisions of a strong guard model, acting as a low cost surrogate optimized for latency rather than accuracy. In streaming settings, it can halt or modify unsafe outputs before they are fully generated, replacing end of sequence moderation with continuous token level monitoring. Compared to post hoc and streaming guard models, our method achieves orders of magnitude lower compute overhead with minimal latency cost. We also provide a practical deployment recipe, including layer selection, aggregation strategy, probing frequency, and triggering thresholds. Finally, we show that the probe linear component corresponds to a direction in residual space, enabling both detection and activation steering at negligible cost.

[AI-74] ComBench: A Benchmark for Rigorous Proof Reasoning and Constructive Realization in Olympiad-Level Combinatorics

链接: https://arxiv.org/abs/2606.10479
作者: Shunkai Zhang,Haoran Zhang,Yun Luo,Qianjia Cheng,Haodi Lei,Yizhuo Li,Runzhe Zhan,Zhilin Wang,Bangjie Xu,Yucheng Su,Xinmiao Han,Xiaoye Qu,Dongrui Liu,Zhouchen Lin,Yu Qiao,Ning Ding,Yafu Li,Yu Cheng
类目: Artificial Intelligence (cs.AI)
备注: 39 pages, 6 figures, 26 tables. Project page: this https URL

点击查看摘要

Abstract:Combinatorics is central to Olympiad-level mathematical problem solving, requiring deep discrete reasoning, creative constructions, and rigorous structural insight. Recent evidence suggests that even today’s strongest frontier models remain uneven on Olympiad combinatorics, revealing a gap in creative mathematical reasoning. We introduce ComBench, an Olympiad-level combinatorics benchmark for evaluating and diagnosing the combinatorial reasoning capabilities of large language models. ComBench contains 100 human-annotated competition-level problems organized around two complementary settings: analysis-centric problems, which primarily require rigorous mathematical arguments, and construction-centric problems, which require explicit constructions in addition to correctness justifications. The evaluation protocol combines rubric-guided proof grading with deterministic construction verification, exposing cases where proof quality and construction validity diverge. Experiments on frontier open- and closed-source models show that ComBench is far from saturated: the strongest model reaches 65.4% overall Avg. and 75.3% overall Best@4. We further find that Rigorous Proof Reasoning and Constructive Realization are distinct capabilities: Kimi-K2.6 trails GPT-5.5 on analysis-centric proof grading but surpasses it on construction-centric Best@4, while Existence and Construction problems remain consistently hardest across representative frontier models.

[AI-75] UPLOTS: A Unified Pretrained Language Model for Constrained Time-series Generation

链接: https://arxiv.org/abs/2606.10466
作者: Du Yin,Hao Xue,Jinliang Deng,Yang Yang,Shuang Ao,Arian Prabowo,Flora Salim
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In time-series generation, existing approaches typically handcraft ortrain a separate model for each dataset, which hinders their scalability and fails to leverage shared temporal structures across domains. To address this fragmentation, we propose UPLOTS, a Unified, Prompt-guided Language model framework fOr constrained Time-Series Generation across diverse domains. Instead of building task-specific models, UPLOTS leverages a single pre-trained transformer backbone guided by learned constraint prompts, enabling on-demand generation with precise pattern control. One key innovation is our dynamic multi-dataset loss re-weighting and prompt-to-pattern mapping, which allows UPLOTS to internalize diverse temporal structures during training and conditionally generate them at inference. We evaluate UPLOTS on four real-world benchmarks and multiple constraint settings, including peak-period, calendar, load-level, and volatility patterns. Additional held-out constraint-combination and downstream forecasting experiments further demonstrate that UPLOTS generalizes beyond the original peak-pattern setting and improves data augmentation under scarce real-data regimes. Our code and baselines are available at anonymous github repo: this https URL.

[AI-76] Minimum Distortion Quantization with Specified Output Distribution

链接: https://arxiv.org/abs/2606.10458
作者: Aolin Xu
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Statistics Theory (math.ST)
备注:

点击查看摘要

Abstract:We derive the optimal quantizer of a real-valued random variable W with distribution P_W such that 1) the distribution of the quantization output X that can take k values follows any specified distribution P_X over \1,\ldots,k\ , and 2) the minimum mean squared error (MMSE) of estimating W from X is minimized. It is shown that the optimal quantizer takes the form X=\sigma\big(F_\sigma^-1(X)^-1(F_W(W))\big) , where \sigma is the optimal permutation of \1,\ldots,k\ among all permutations to minimize the MMSE, and F is the cumulative distribution function. When P_W is uniform over an interval or P_X is uniform over \1,\ldots,k\ , the quantizer takes a simple form X=F_X^-1(F_W(W)) . The concept of majorization plays a key role in the optimality proof. Specifying the output distribution is useful for designing quantizers with explicitly controlled output entropy, maximized mutual information between input and output, tailored output distribution to match channel input requirements for communication, and data anonymization.

[AI-77] race2Policy: From Expert Behavior Traces to Self-Evolving Decision Agents

链接: https://arxiv.org/abs/2606.10457
作者: Junli Zha,Jinbo Wang,Chao Zhou,Xiang Song
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Decision rules that enterprise experts apply tacitly – in auditing, compliance, and contract review – can be systematically recovered and improved through iterative error analysis. We present \textbfTrace2Policy, whose core mechanism – \textbfEISR (\textbfError-driven \textbfIterative \textbfSkill \textbfRefinement) – maintains a human-readable rule document as its optimization target: each round executes the rules on a validation set, clusters errors by root cause into MISSING, WRONG, or CONFLICT types, applies targeted patches, and commits only those that pass a regression gate. \textbfFor this class of compliance-sensitive, skewed-base-rate decision tasks, we identify rule quality – not model capability – as the dominant performance lever: across five LLMs, one-shot distillation plateaus near \sim 70% on the deployed pool, while eight EISR rounds lift the same rules to 79.6% when compiled into deterministic Python – zero LLM calls at inference. \textbfExecution form compounds the gain: in production, the same EISR-refined content runs 9.8~pp higher as compiled Python than as an LLM prompt, a form-and-engineering bundle the 22-day deployment matured together. Deployed for 22 days at a major logistics carrier (3,349 audit cases), the compiled pipeline outperforms the pure-LLM baseline it replaced (72.7%); on these calibrated, skewed-base-rate workloads, re-enabling LLM fallback monotonically degrades accuracy. An LLM-driven variant, \textbfAuto-EISR, reproduces this refinement at \ 5–\ 10 per cycle versus \sim 70 expert-hours, and transfers to four public benchmarks spanning legal reasoning (LegalBench) and process-mining decisions (BPIC 2012) without re-engineering.

[AI-78] he Distributed Detectability Band Against Marginal-Preserving Attacks

链接: https://arxiv.org/abs/2606.10456
作者: Zhang Qinqin,Gao Yuze
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 10 pages, 11 figures

点击查看摘要

Abstract:AI-control monitors score individual agent actions to detect misbehavior, but real harm can be distributed across many benign-looking steps, each individually below any per-step alarm. We construct a marginal-preserving, correlation-encoded distributed-sabotage attack using a Gaussian-copula AR(1) construction: the per-step monitor-score marginal is held exactly equal to benign, so mean, max, top-k tail, and threshold monitors (Monitor A) are defeated by construction, while harm is encoded in the temporal correlation structure. We sequence the paper around three reviewer-mandated gates. (1) Realizability gate: the stealthy attack achieves KS-distance to benign of 0.013 (effectively zero) at all tested harm levels up to 3.0, confirming that harm is fully decoupled from the per-step marginal and realizability is not harm-limited. (2) Monitor-A-vs-B reconciliation: we show formally that the attack, built against Monitor A’s score marginal, remains marginal-preserving under a different-score Monitor B (the correlation/sequence family: CUSUM, SPRT, HMM-LR, runs test, autocorrelation, windowed logistic), and scope worst-case claims to score functions that admit a temporal signature. (3) Non-empty detectability band: Monitor A achieves AUC 0.52 (chance); Monitor B spans AUC 0.79-0.97 at the same 1% FPR target, and as harm is amortized over more steps Monitor A collapses to chance while Monitor B holds at AUC ~0.95. These results demonstrate a non-empty detectability band and characterize the sub-threshold sabotage frontier: distribution-shape monitors fail by construction; temporal-correlation monitors can detect but are not trivially optimal.

[AI-79] Mitigating Bias in Low-SNR Financial Reinforcement Learning via Quantum Representations

链接: https://arxiv.org/abs/2606.10448
作者: Zeyu Liu,Xuanzhi Feng,Sing Kwong Lai,Yuanchen Gao,Xiaoyi Pang,Hualei Zhang,Jingcai Guo,Jie Zhang,Song Guo
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Preprint. Code available at this https URL

点击查看摘要

Abstract:The financial market is a typical low signal-to-noise ratio (SNR) setting, which often destabilizes off-policy maximum-entropy methods like Soft Actor-Critic (SAC). Specifically, noisy state representations may produce unreliable Q-value estimates, and bootstrapping amplifies these errors, forming a failure mode we call the “Financial Entropy Trap”. In this paper, we propose FPQC-SAC, an efficient and plug-and-play SAC variant that places a compact and bounded Parameterized Quantum Circuit (PQC) before the actor and critic networks to constrain feature propagation at the representation level, rather than filtering raw inputs or regularizing Q-values after bootstrapping. Notably, FPQC-SAC reduces the impact of extreme market fluctuations on Bellman target estimation, while trainable quantum entanglement preserves flexible cross-asset interactions. Empirical evaluations on real-world portfolio management tasks demonstrate that FPQC-SAC substantially enhances out-of-sample stability and cumulative returns by achieving a 66.89% relative gain in cumulative return over standard unconstrained SAC and outperforms the best continuous-control deep reinforcement learning baseline by approximately 27%. Open-source code is available at this https URL.

[AI-80] Soul Computing: A Theoretical Framework and Technical Architecture for Intelligent Agents with Independent Consciousness

链接: https://arxiv.org/abs/2606.10413
作者: Jinshan Zhang,Xishi Zhou,Qiu Peng,Jianwei Yin
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Breakthroughs in large language models and multimodal generation technologies have propelled the digital reconstruction of human mental traits, emotional patterns, and long-term memory from science fiction toward engineering practice. Yet current research and industry practices at the intersection of AI and digital humans remain hampered by fundamental conceptual ambiguities: the essential differences between next-generation intelligent agents and traditional virtual humans, the construction pathways for digital entities possessing self-identity, and the core technical and ethical challenges confronting this domain all demand urgent clarification. This paper systematically examines the transformative logic underlying the transition from traditional virtual humans to the Soul Computing'' paradigm, driven by frontier AI technologies. We first analyze the evolutionary patterns of human consciousness and memory mechanisms, reassessing the core value of massive multimodal digital fragments in the reverse reconstruction of individual mental worlds. On this basis, we formally delineate the academic connotations of narrow and broad Soul Computing for the first time, clarifying its academic boundaries and essential distinctions from Affective Computing, Historical Reconstruction, and Mortal Computation. We argue that Soul Computing systems must architecturally construct an Intensional’’ core rather than serving as purely ``Extensional’’ functional carriers, thereby enabling the fundamental transition of AI from toolhood to living agency.

[AI-81] A Unified Multi-Modal Framework for Intelligent Financial Systems: Integrating Reinforcement Learning High-Frequency Trading and Game-Theoretic Approaches with Cross-Modal Sentiment Analysis

链接: https://arxiv.org/abs/2606.10412
作者: Fanrong Liu,Zhang Yuwei,Mingni Luo
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The rapid evolution of financial technology demands sophisticated artificial intelligence systems capable of handling diverse challenges across multiple domains simultaneously. This paper presents a groundbreaking unified framework that seamlessly integrates Proximal Policy Optimization for robo-advisory systems, advanced time-series prediction models for high-frequency trading, in-context learning mechanisms for dynamic investment advisory, game-theoretic approaches for competitive banking scenarios, and unified embeddings for cross-modal financial sentiment analysis. Our comprehensive framework addresses the critical gap in existing literature where these technologies have been developed in isolation, failing to leverage their synergistic potential. Through extensive experimentation across multiple financial datasets and real-world scenarios, we demonstrate that our integrated approach achieves superior performance compared to specialized single-domain systems. Specifically, our framework shows a 23.7% improvement in portfolio optimization metrics, reduces prediction error in high-frequency trading by 31.2%, enhances investment recommendation accuracy by 18.9%, optimizes competitive banking strategies with a 27.4% increase in Nash equilibrium convergence speed, and improves sentiment analysis accuracy by 15.6% through cross-modal fusion. The theoretical foundation of our work establishes convergence guarantees for the integrated optimization problem, while our empirical results validate the practical applicability across diverse financial institutions. This research not only advances the state-of-the-art in financial AI but also provides a blueprint for developing comprehensive intelligent systems that can adapt to the complex, interconnected nature of modern financial markets.

[AI-82] FOGO: Forgetting-aware Orthogonalization Optimizer

链接: https://arxiv.org/abs/2606.10406
作者: Toan Nguyen,Yang Liu,Trung Le,Celso de Melo,Flora D. Salim
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We argue that forgetting is not confined to continual learning but is a general optimization phenomenon: during standard training, dominant mini-batch gradients suppress rare but useful update directions, causing short-term forgetting at every step. When such knowledge is never revisited, these losses compound into long-term forgetting-the classical failure mode of continual learning. We introduce FOGO, a scalable optimizer that continuously detects and resolves gradient interference across both regimes. FOGO spectrally orthogonalizes momentum updates to prevent dominant directions from monopolizing optimization, then stores representative past directions in a compact codebook memory built on random projection, where pairwise distances are provably preserved in low-dimensional space. At each step, conflicts between the current update and stored directions are resolved via lightweight orthogonal correction and lifted back through a proximal step, with minimal overhead and no data storage. Across class-imbalanced classification, continual visual learning under domain and class shifts, continual fine-tuning of LLaVA-7B, and GPT-2 pretraining, FOGO consistently improves convergence and knowledge retention, outperforming Adam and Muon.

[AI-83] STAGE-Claw: Automated State-based Agent Benchmarking for Realistic Scenarios

链接: https://arxiv.org/abs/2606.10394
作者: Sirui Liang,Bohan Yu,Peiyu Wang,Shiguang Guo,Wenxing Hu,Pengfei Cao,Jian Zhao,Cao Liu,Ke Zeng,Xunliang Cai,Kang Liu
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models are increasingly used to power personal agents for everyday applications, but evaluating these agents remains a challenge. Existing benchmarks still rely on sandboxed artifacts, static task design, and coarse scoring, which hinder scalability and limit progress toward reliable personal-agent evaluation. This paper introduces STAGE-Claw, an automated framework for building and evaluating realistic personal-agent scenarios in state-based personal-computing environments. Given a task hint, STAGE-Claw automatically creates and validates a realistic benchmark task with its environment, task prompts, ground truth, and related verification programs. Agents are then evaluated in realistic operating environments, where performance is measured by the correctness of the final system state rather than only the textual response. Using STAGE-Claw, this paper creates a benchmark with 40 challenging real scenario agent tasks, evaluates 11 frontier models, and analyzes their task scores, costs, tool-call reliability, and common failure patterns. Overall, STAGE-Claw offers a scalable, state-based way to evaluate agents in realistic user scenarios.

[AI-84] Instruction Finetuning DeepSeek -R1-8B Model Using LoRA and NEFTune

链接: https://arxiv.org/abs/2606.10392
作者: Wu Yuerong,Mingni Luo
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Financial named-entity recognition (NER) is essential for translating unstructured financial reports and news into structured knowledge graphs. However, general-purpose large language models (LLMs) often misclassify financial entities or ignore domain-specific patterns. This paper investigates the use of DeepSeek-R1-8B, a recent open-source large language model, combined with Low-Rank Adaptation (LoRA) and Noisy Embedding Fine-Tuning (NEFTune) for financial NER. Each annotated sentence in our corpus of 1693 samples is converted into an instruction-input-output triple. We insert lightweight LoRA matrices into the Transformer layers and apply NEFTune to improve generalisation by adding uniform noise to embedding vectors during training. Experiments show that the LoRA-adapted DeepSeek-R1-8B achieves a micro-F1 of 0.901 on seven entity types (Company, Date, Location, Money, Person, Product and Quantity), and adding NEFTune further boosts the micro-F1 to 0.912, outperforming Llama3-8B, Qwen3-8B, Baichuan2-7B, T5 and BERT-Base baselines.

[AI-85] Beyond Static Evaluation: Co-Evolutionary Mechanisms for LLM -Driven Strategy Evolution in Adversarial Games

链接: https://arxiv.org/abs/2606.10389
作者: Haoran Li,Zengle Ge,Ziyang Zhang,Xiaomin Yuan,Yui Lo,Qianhui Liu,Bocheng An,Dongke Rong,Jiaqun Liu,Annan Li,Jianmin Wu,Dawei Yin,Dou Shen
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in LLM-driven code evolution have enabled automated discovery by iteratively generating and improving programs. However, applying these methods to adversarial multi-agent games introduces a fundamental challenge: the evaluation landscape shifts as strategies improve, causing fixed evaluators to become unreliable and evolution to stagnate. We propose three mechanisms to address this challenge: evaluator co-evolution, which incorporates discovered champions into the opponent pool; hierarchical deep evaluation, which replaces noisy few-game scores with statistically reliable assessments; and weakness pressure, which dynamically up-weights the most difficult opponents to break through plateaus. We implement these mechanisms within FAMOU, a framework built upon the same foundation-model code-evolution paradigm as OpenEvolve and ShinkaEvolve. On the MCTF 2026 3v3 maritime capture-the-flag task, FAMOU consistently outperforms both baselines under two backbone LLMs, achieving the highest combined score (0.526) and the best generalization to unseen opponents (61.7% win rate), while ablations confirm that each mechanism contributes to performance. Notably, the LLM mutation process generates tactical structures entirely absent from the seed strategies – including lookahead search and adaptive interception – demonstrating that code-level evolution can produce nontrivial algorithmic innovations in adversarial settings. The FAMOU-evolved strategy further achieved 1st place in the hardware round-robin and 3rd in simulation at the AAMAS 2026 MCTF Competition, validating its real-world transferability. The optimized implementation and corresponding evaluation codes developed through our evolutionary process are available at: this https URL

[AI-86] Beyond Absolute Imitation: Anchored Residual Guidance for Privileged On-Policy Distillation

链接: https://arxiv.org/abs/2606.10385
作者: Wenhao Zhang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 17 pages, 8 figures. Project page: this https URL

点击查看摘要

Abstract:On-policy distillation (OPD) has demonstrated strong empirical gains in enhancing complex reasoning in LLMs by aligning a student model with a teacher’s predictive distribution over the student’s own trajectories. An emerging variant, Privileged OPD, further strengthens this paradigm by employing a self-teacher model augmented with privileged information, such as oracle traces, to mitigate teacher-student capacity gaps while providing dense, answer-directed supervision. However, current methods treat privileged information as a monolithic imitation target, failing to disentangle locally reachable reasoning steps from future-conditioned oracle signals. Consequently, the student is encouraged to match a hindsight-biased distribution that often falls outside its local predictive support. This reachability mismatch incentivizes the student model to skip valid intermediate reasoning in favor of locally unsupported shortcuts. To resolve this, we introduce Anchored Residual On-Policy Distillation (AR-OPD), a dual-view framework that disentangles privileged supervision. Rather than enforcing strict full-view imitation, AR-OPD establishes a locally compatible anchor using a partially privileged teacher, isolating and injecting oracle foresight as a controlled residual to provide destination-directed guidance. Across diverse reasoning tasks, AR-OPD outperforms full privileged OPD by 2.3 points and SFT by 7.9 points. Crucially, this anchored residual mechanism reduces hindsight leakage by 21.7% and mitigates late-stage drift, yielding up to a 7.2-point advantage on challenging long-horizon trajectories exceeding 768 tokens.

[AI-87] Belief-Space Control for Personalized Cancer Treatment via Active Inference

链接: https://arxiv.org/abs/2606.10376
作者: Deniz Sargun,H. Bugra Tulay,C. Emre Koksal
类目: Artificial Intelligence (cs.AI)
备注: 11 pages including appendix

点击查看摘要

Abstract:Cancer treatment is at the core a sequential decision-making problem with partial observability, latent patient heterogeneity, and explicit constraints on the budget for medical measurements. Unlike standard Reinforcement Learning (RL) approaches that control state trajectories, cancer treatments permanently modify patients’ transition dynamics, changing how states evolve over time. We model cancer treatment as a belief-space planning problem using active inference, deriving an expected free-energy objective that unifies goal-directed control and information acquisition under measurement budgets without. We implement this framework using real clinical cancer data from the AACR Project GENIE Biopharma Collaborative dataset. Results on clinical data demonstrate a simultaneous patient categorization and high treatment efficacy, under real measurement and treatment constraints.

[AI-88] st-time Adversarial Takeover: A Real-time Hijacking Interface against Robotic Diffusion Policies

链接: https://arxiv.org/abs/2606.10371
作者: Zi Yin,Peilin Chai,Siyuan Huang,Zhanhao Hu
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Diffusion-based action generation has become a foundational component of embodied AI, but its reliance on visual conditioning leaves deployed visuomotor policies vulnerable to adversarial manipulation. Most prior attacks focus on disruption: they perturb the observation stream to reduce task success or induce erratic behavior. We study a stronger threat, Test-time Adversarial Takeover (TAKO), in which an attacker obtains a real-time steering interface over a frozen robot policy and turns it into a remotely piloted instrument. TAKO learns a small vocabulary of reusable universal patches through differentiable diffusion inference; at test time, the attacker switches among these patches in the camera stream to compose attacker-chosen trajectories. This works because the perturbation acts on the visual conditioning pathway, where the induced bias can persist through iterative generative inference. We further show that the natural targeted baseline, target-policy matching, fails because the victim policy cannot reliably supervise itself on out-of-distribution target shifts. Across four tasks (2D manipulation, simulated aerial delivery, simulated ground navigation, and physical-world ground navigation), two visual encoders (ResNet-18 and EfficientNet-B0 + Transformer), and three generative inference families (DDPM, DDIM, and flow matching), human operators achieve 100% takeover success on attacker-defined objectives in every evaluated setting. The project page is available at this https URL.

[AI-89] Speech Meets ELF: Audio Conditional Continuous-Target Diffusion for Speech Recognition and Translation

链接: https://arxiv.org/abs/2606.10368
作者: Xuanchen Li,Tianrui Wang,Yuheng Lu,Zikang Huang,Yu Jiang,Chenghan Lin,Chenrui Cui,Ziyang Ma,Xingyu Ma,Chunyu Qiang,Guochen Yu,Xie Chen,Longbiao Wang,Jianwu Dang
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Speech-to-text (S2T) systems for recognition (ASR) and translation (S2TT) typically generate discrete text tokens. In contrast, continuous-target language modelling performs generation in a continuous space, yet its potential for S2T remains unexplored. To bridge this gap, we propose ELF-S2T, an audio-conditioned continuous-target generative model for S2T. Built upon the pre-trained Embedded Language Flows (ELF) backbone, ELF-S2T processes speech via a frozen Whisper encoder and a single linear projector, prepending the resulting audio condition to the noisy text latent for in-context, flow-matching denoising. To prevent the model from over-relying on its pre-trained text context, we introduce audio forcing during training, and further amplify the audio condition via classifier-free guidance at inference. Experiments on LibriSpeech and CoVoST2 show that ELF-S2T achieves competitive ASR and S2TT performance. Crucially, our error analysis reveals that, although ASR and S2TT errors look very different on the surface, both stem from the same underlying cause, a close distance confusion in the continuous latent space. This finding naturally aligns with the continuous representation generation paradigm, indicating a common semantic mapping process beneath recognition and translation. Our code and pretrained models are publicly available at this https URL.

[AI-90] A Practical Recipe Towards Improving Sim-and-Real Correlation for VLA Evaluation

链接: https://arxiv.org/abs/2606.10366
作者: Shuo Wang,Hanyuan Xu,Yingdong Hu,Fanqi Lin,Yang Gao
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 20 pages

点击查看摘要

Abstract:Simulation has become an essential tool for evaluating and improving vision-language-action (VLA) policies, offering scalable, reproducible, and controllable alternatives to costly real-world robot evaluation. Recent simulation benchmarks have made substantial progress on realism and diversity, yet these platforms have not been widely adopted as reliable proxies for real-world policy evaluation. In this work, we investigate this issue through the lens of sim-and-real correlation. We conduct a systematic study across multiple simulation platforms, VLA policies, tasks, and perturbation factors, measuring whether simulated evaluation preserves real-world conclusions in terms of policy ranking consistency, performance correlation, and perturbation-wise failure patterns. This analysis allows us to characterize the limitations of existing simulators and identify what kinds of simulation signals are more aligned with real-world deployment. We further examine how users should exploit simulation for policy improvement, including when simulator-based finetuning is beneficial and how the amount of post-training data affects sim-and-real alignment. Overall, our work provides a unified framework for measuring, interpreting, and improving the usefulness of simulation for VLA policies, offering guidance both for simulator designers and for practitioners who use simulation as part of the policy development pipeline.

[AI-91] ReflectiChain: Epistemic Grounding in LLM -Driven World Models for Supply Chain Resilience

链接: https://arxiv.org/abs/2606.10359
作者: Jia Luo
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:AI agents in supply chains face a fundamental epistemic gap: large language models (LLMs) interpret policies but lack physical grounding, while reinforcement learning (RL) optimizes flows but is semantically blind to unstructured constraints. We introduce REFLECTICHAIN, bridging this gap through a Generative Supply Chain World Model (SC-WM) - encoding heterogeneous supply networks into a 6-dim graph-latent space with physical conservation - and Double-Loop Learning that separates epistemic uncertainty (KL-trust-region-bounded policy adaptation) from aleatoric uncertainty (stochastic latent rollouts). On Semi-Sim, a 10-node semiconductor benchmark with SIR risk propagation, 6 perturbation types, and 10 policy constraint templates, REFLECTICHAIN improves Rationale Consistency Score by 33.0% (p 0.0001, d = 2.78), maintains 82.3% operability under adversarial shocks, and exhibits anti-fragile behavior (+40.2% gain under moderate pressure). We identify three operational epistemic mechanisms - uncertainty separation, knowledge-boundary detection, and empirical Bayesian policy updating - and discuss five limitation categories.

[AI-92] KG-SoftMAP: Soft Knowledge-Graph Priors for Bayesian Network Structure Learning from Sparse Discrete Data

链接: https://arxiv.org/abs/2606.10358
作者: Guoliang Xu,James E. Corter
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 33 pages including appendices, 1 figure

点击查看摘要

Abstract:Learning Bayesian network (BN) structure from sparse discrete data is hard: when each instance records only a few variables, most variable pairs lack the joint observations needed for reliable scoring, and data-only methods recover little structure. Imperfect domain knowledge, expressible as a weighted directed knowledge graph (KG), is often available. We propose KG-SoftMAP, which encodes such a KG as a soft, confidence-weighted, data-overridable edge prior and maximizes a MAP objective combining the BDeu score with a logit-form prior; the KG may be expert-curated or LLM-extracted. On controlled synthetic benchmarks, the only setting with ground-truth DAGs, KG-SoftMAP recovers partial directed structure at \rho=0.05 (DF1 0.14 to 0.29 , versus near-zero baselines) and substantially more once \rho\geq0.2 (DF1 0.46 to 0.96 ), when paired with an informative but imperfect KG; recovery degrades gracefully as KG quality drops. On real sparse educational data, which has no ground-truth DAG, we evaluate deployment-facing measures only: prediction, calibration, and KG-consistency. The learned BN is best read as a diagnostic model: on SAF it trails logistic regression by 0.03 F1_FAIL while providing KG-consistent edges, calibrated joint probabilities, and inference from arbitrary observed concept subsets; when no meaningful KG exists, discriminative logistic regression is preferable.

[AI-93] Reasoning or Memorization? Direction-Aware Diversity Exploration in LLM Reinforcement Learning

链接: https://arxiv.org/abs/2606.10346
作者: Jiangnan Xia,Yucheng Shi,Yu Yang,Kishan Panaganti,Zhenwen Liang,Ninghao Liu
类目: Artificial Intelligence (cs.AI)
备注: 12 pages, 6 figures

点击查看摘要

Abstract:Reinforcement learning has become a key paradigm for eliciting reasoning abilities in large language models, where exploration is crucial for discovering effective solution trajectories. Existing exploration methods typically encourage diversity in semantic or gradient spaces, without distinguishing what drives this diversity. A trajectory may appear novel because it follows a new reasoning process, or because it varies memorized patterns and shortcuts. Rewarding both cases equally may steer exploration toward memorization rather than genuine reasoning improvement. In this paper, we propose DiRL, a Direction-Aware Reinforcement Learning framework that anchors exploration to an internal reasoning-memorization direction of the policy. Specifically, DiRL extracts this direction from model representations, constructs direction-weighted gradient features to characterize rollout updates, and shapes rewards to amplify reasoning-aligned exploration while suppressing memorization-aligned variations. DiRL integrates seamlessly into standard Group Relative Policy Optimization (GRPO). Extensive experiments on mathematical and general reasoning benchmarks demonstrate the effectiveness of DiRL, showing significant improvements over various existing exploration methods.

[AI-94] Self-Distillation Policy Optimization via Visual Feedback: Bridging Code and Visual Artifacts

链接: https://arxiv.org/abs/2606.10334
作者: Haoyu Dong
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Code-generating large language models (LLMs) increasingly produce visual artifacts such as charts, web pages, and slides by writing programs that are executed by non-differentiable renderers, committing to code before observing the render. As a result, otherwise executable code often yields artifacts with visually salient defects, including overlapping elements, clipped text, broken alignment, low contrast, and overflow. We study visual-feedback self-distillation for code-generated visual artifacts. We propose Visual-SDPO, a self-distillation policy-optimization framework that treats rendered visual feedback as privileged context for a weight-sharing teacher and distills this feedback into a coding student. To make supervision spatially targeted rather than uniform, we introduce Visual-Grounded Code Credit Weighting, which traces each detected defect back to the code statements responsible for the affected elements and amplifies the distillation signal on those statements. A sequence-level GRPO (Group Relative Policy Optimization) term complements the dense token-level objective by rewarding executable, visually high-quality rollouts, while failed executions remain learnable through the self-distillation path by passing execution errors as privileged context to the teacher. We instantiate Visual-SDPO for chart, web/UI, and slide generation with a unified Qwen3-VL-8B-Instruct backbone. Across chart-to-code, UI-to-code, and slide-generation benchmarks (ChartMimic, Design2Code, and AeSlides), Visual-SDPO improves over the zero-shot base by more than 10 absolute points in the primary metric and over GRPO by at least 2.4 points, with fewer training steps and no added inference-time cost.

[AI-95] Baseline-Free Policy Optimization for Neural Combinatorial Optimization

链接: https://arxiv.org/abs/2606.10321
作者: Carlos S. Sepúlveda,Gonzalo A. Ruz
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:Neural combinatorial optimization (NCO) trains autoregressive policies to solve routing problems. The standard training algorithm, REINFORCE with a rollout baseline, requires maintaining and periodically updating a frozen copy of the policy for variance reduction. This baseline introduces a structural vulnerability: on harder instances, a poor baseline produces noisy gradient estimates that can destabilize training. We evaluate Group Relative Policy Optimization (GRPO), an algorithm from large language model alignment that eliminates the baseline entirely by normalizing advantages within groups of sampled trajectories. In a controlled comparison of five RL algorithms on TSP and CVRP benchmarks within the RL4CO framework, we find that: (i) GRPO avoids the training collapse observed with REINFORCE on TSP-100, where performance degrades from cost 9.8 to 52.1 immediately after the warmup phase and does not recover under extended training; (ii) at matched gradient updates, GRPO achieves solution quality within 2% of POMO, a strong AM-based multi-start baseline, while requiring no external baseline; and (iii) P3O, a pairwise preference algorithm also from the alignment literature, is competitive on TSP but shows higher variability on CVRP. These results identify GRPO as a promising baseline-free alternative for NCO, particularly in settings where baseline-dependent training becomes fragile.

[AI-96] Mobility Anomaly Generation using LLM -Driven Behavior with Kinematic Constraints

链接: https://arxiv.org/abs/2606.10314
作者: Yueyang Liu,Joon-Seok Kim,Andreas Züfle
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Although the study of human trajectory anomalies is critical for advancing spatial data mining, empirical research remains severely hindered by a pervasive lack of ground-truth datasets. Despite the availability of several real-world and simulated human trajectory collections, these datasets exclusively capture normal mobility patterns and lack annotated anomalies. This specific scarcity is fundamentally driven by the inherent statistical rarity of anomalous events, precluding the feasibility of conventional observational methods. Compounding this challenge, the systematic acquisition of large-scale mobility data is strictly bottlenecked by prohibitive costs and stringent privacy regulations. To overcome these fundamental limitations and establish a reliable human trajectory anomalies dataset with annotated ground truth, we introduce a novel, end-to-end generative framework designed to synthesize realistic trajectory anomalies at scale. Our architecture bridges the gap between purely synthetic mobility data and complex real-world physical constraints by operating directly on baseline simulated trajectories. We employ Large Language Model (LLM) agents to systematically inject semantically meaningful behavioral anomalies such as irregular out-of-distribution check-ins and skipped routine visits. To ensure rigorous spatial validity, the system leverages map-constrained routing reconstruction to recalculate the physical transitions between these LLM agent-modified staypoints. Moreover, to narrow the simulation-to-reality gap, we augment the resulting trajectories with a context-aware spatial noise model, parameterized by environmental and location-specific variables, to accurately emulate heterogeneous GPS sensor degradation.

[AI-97] LLM -Guided Neural Architecture Search for Robust Co-Design of Physical Neural Networks

链接: https://arxiv.org/abs/2606.10294
作者: Tyler King,Timothee Leleu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Neural and Evolutionary Computing (cs.NE); Computational Physics (physics.comp-ph)
备注:

点击查看摘要

Abstract:Deploying neural networks on unconventional hardware demands architectures that co-optimize task accuracy and platform-specific constraints such as energy cost, physical non-idealities, and numerical precision. Existing neural architecture search (NAS) methods are typically tailored to a single hardware family, limiting cross-platform comparison and generalization. We introduce Unconventional Hardware Neural Architecture Search (UH-NAS), a hardware-agnostic, LLM-guided NAS framework that integrates language models as evolutionary operators to co-optimize accuracy and inference energy. By exposing hardware as a swappable backend with per-platform energy models, physical constraints, and non-ideality simulators, UH-NAS enables fair system-level comparisons across various backends without modifying the search algorithm. Tested on optical MZI hardware, UH-NAS discovers more diverse, robust architectures than conventional baselines while outperforming existing LLM-to-NAS approaches. Additional ablations on architecture robustness under non-idealities and the role of system prompts highlight the importance of architecture-hardware co-design for emerging computing platforms.

[AI-98] Sim2Schedule: A Simulator-Guided LLM Framework for Autonomous Open-Pit Mine Scheduling

链接: https://arxiv.org/abs/2606.10286
作者: Mustavi Ibne Masum,Thiago Eustaquio Alves de Oliveira,Mahzabeen Emu
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Open-pit mine scheduling is a critical process for maximizing economic return under complex geotechnical and operational constraints. While Mixed-Integer Linear Programming (MILP) provides mathematically optimal baselines, its exponential computational complexity and inability to adapt in real time limit its practical deployment in dynamic industrial environments. This work introduces a simulator-driven Large Language Model (LLM) scheduling framework in which the LLM acts as an autonomous decision-making agent, guided at each step by a custom simulator that encodes geotechnical precedence, extraction-processing coupling, and dynamic capacity constraints directly into the action generation mechanism. Operating entirely zero-shot within a closed, data-secure environment, the framework produces complete, interpretable extraction and processing schedules without cloud-based inference, domain-specific fine-tuning, or retraining. To provide a trustworthy performance benchmark, a novel MILP formulation is developed that incorporates realistic operational and geotechnical constraints. Evaluated across mining instances of varying scale and time periods, the LLM-based framework recovers between 94% and 99% of the MILP optimal NPV while scaling linearly in computation time. These results position simulator-constrained LLM agents as a practical and scalable alternative to classical optimization for long-horizon industrial scheduling under complex operational constraints.

[AI-99] owards Robust Arabic Speech Emotion Recognition with Deep Learning

链接: https://arxiv.org/abs/2606.10278
作者: Youcef Soufiane Gheffari,Samiya Silarbi
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注: 21 pages, 16 figures, 11 tables. Submitted manuscript

点击查看摘要

Abstract:Speech Emotion Recognition (SER) aims to identify a speaker’s emotional state from audio signals. While recent advances in deep learning have significantly improved SER performance in Indo-European languages, Arabic SER remains underexplored and challenging due to dialectal diversity, limited annotated datasets, and the difficulty of modeling both local spectral cues and long-range temporal dependencies. To address these limitations, this study investigates whether hybrid architectures that jointly model spatial and contextual information can improve emotion recognition in Arabic speech. We propose and evaluate a comparative framework involving three architectures: a CNN-LSTM model, a CNN-Transformer model, and a fine-tuned wav2vec 2.0 model. The first two models leverage MFCC and spectrogram-based representations, while wav2vec 2.0 operates directly on raw audio through self-supervised representations. Experiments conducted on the EYASE and BAVED datasets demonstrate that the proposed CNN-Transformer architecture significantly outperforms the other models, achieving an accuracy of 98.1 percent. This result highlights the effectiveness of combining convolutional feature extraction with Transformer-based global context modeling. The main contribution of this work lies in providing a systematic comparison of hybrid and self-supervised approaches for Arabic SER, and in demonstrating that CNN-Transformer architectures offer a robust solution for capturing both spectral and long-range dependencies in low-resource and dialectally diverse settings. Comments: 21 pages, 16 figures, 11 tables. Submitted manuscript Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI) ACMclasses: I.2.7; I.5.4 Cite as: arXiv:2606.10278 [cs.SD] (or arXiv:2606.10278v1 [cs.SD] for this version) https://doi.org/10.48550/arXiv.2606.10278 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Youcef Soufiane Gheffari [view email] [v1] Tue, 9 Jun 2026 00:59:43 UTC (5,253 KB)

[AI-100] Hierarchical Policies from Verbal and Egocentric Human Signals for Natural Human-Robot Interaction

链接: https://arxiv.org/abs/2606.10276
作者: Dongjun Lee,Juheon Choi,Dong Kyu Shin,Sinjae Kang,Kimin Lee
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: We provide video demos and code in: this https URL

点击查看摘要

Abstract:For natural human-robot interaction, a robot must understand human intent expressed not only through language but also through nonverbal signals such as gestures and gaze. However, current robot policies rely on language instructions as the sole interface for conveying intent, leaving nonverbal signals unused and placing the full burden of communication. In this work, we present EDITH, a robot framework that captures the human’s nonverbal signals through continuous streams of first-person view and gaze from smart glasses, and uses them alongside language instructions as inputs to the robot policy. Our hardware system streams the human’s first-person view, gaze, and speech to the robot in real time, transcribing the speech into language instructions. To handle these rich but noisy signals, we design a hierarchical policy in which a high-level policy infers the human’s intent and produces a sequence of subtasks, where each subtask is represented as a fine-grained instruction paired with a keyframe that grounds the intent in the scene (e.g., the frame where the human points at the target object). A low-level policy then executes these subtasks. In our experiments on human-robot interactive tasks, EDITH enables the robot to act on the human’s nonverbal signals even when intent is expressed only briefly, and significantly reduces user effort to convey intent compared to using language instructions alone. Visit our project page for source code and real-robot demo videos.

[AI-101] What Matters in Orchestrating Robot Policies: A Systematic Study of Hierarchical VLA Agents

链接: https://arxiv.org/abs/2606.10267
作者: Jiaheng Hu,Mohit Shridhar,Caden Lu,Dhruv Shah,Hao-Tien Lewis Chiang,Jie Tan,Annie Xie
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Hierarchical vision-language-action (Hi-VLA) systems have emerged as a promising paradigm for complex robot manipulation, by using high-level VLM planners to decompose tasks into language subgoals executed by low-level VLA controllers. Despite recent empirical progress, there is a lack of unified design principles for these systems: existing Hi-VLA systems differ in how they choose and connect planners, controllers, mechanisms to switch between the two, and how observations and memory are represented in the planner. In this paper, we present a systematic study of Hi-VLA design for robot manipulation. We unify representative Hi-VLA agents under an options-style control framework and benchmark core design choices across short-horizon, long-horizon, and reasoning-intensive tasks. Our analysis distills practical principles for building Hi-VLA systems, showing how model choices and interface mechanisms jointly shape performance. Applying these principles yields a substantially stronger system than either flat VLA control or a naively designed hierarchy, across experiments both in simulation and on a real ALOHA robot. Overall, our results provide a foundation for building more capable, robust, and principled hierarchical VLA agents. More information and video at this http URL.

[AI-102] Multi-Level Analyzation of Imbalance to Resolve Non-IID-Ness in Federated Learning

链接: https://arxiv.org/abs/2606.10250
作者: Haengbok Chung,Jae Sung Lee
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 27 pages, 5 figures, 13 tables. Accepted for publication in Neurocomputing (2025). Author Accepted Manuscript

点击查看摘要

Abstract:Class imbalance is a common problem in deep learning that severely degrades performance. In federated learning (FL), it is a critical factor contributing to non-identically distributed data (non-IID). Building on several previous attempts, we define and analyze imbalance issues in FL at three levels: inter-case, inter-class, and inter-client. Inter-case imbalance addresses the imbalance in every single class; inter-class imbalance compares the number of data between different classes. Inter-client imbalance represents different skewness of local data between clients. Based on these concepts, we propose FedBB, which consists of two main components: (1) Positive Negative Balanced (PNB) loss function addresses the inter-case and inter-class imbalances in local training, enhancing generalization on highly skewed local client datasets. It optimizes both multi-label and multi-class classifications by assigning higher weights to minority cases or classes. (2) Client Balanced Reweighting (CBR) reweights clients based on inter-client imbalance during model aggregation, giving greater weight to models trained on less skewed datasets. Various experiments on X-ray and natural image datasets demonstrate that FedBB outperforms other algorithms in both performance and efficiency. Additionally, it requires limited statistical information, which is beneficial for privacy protection. Through ablation studies, we proved that PNB loss and CBR independently contribute to performance. As FedBB aims to build a global model that accurately classifies all classes, it can serve as a baseline for the generic and personalized FL.

[AI-103] Linguistically Augmented Audio Speech Data (LinguAS)

链接: https://arxiv.org/abs/2606.10246
作者: Ashley R. Keaton,Zahra Khanjani,Christine Mallinson,Vandana P. Janeja
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Maliciously-created fake speech, including deepfaked and spoofed audio, is proliferating at an alarming rate, and detection models are racing to stay ahead of the curve. Yet, most detection models are trained to make inference on frame-level audio features alone without leveraging valuable linguistic cues at larger timescales. To address this gap, we present Linguistically Augmented Audio Speech Data (LinguAS), a dataset of genuine and deepfaked audio samples annotated with five strategically-chosen, Expert-Defined Linguistic Features (EDLFs) that occur frequently in spoken English and are characteristic of natural human speech. LinguAS contains over 800 audio samples, each of which are annotated with EDLFs. The dataset has a balanced number of four spoofed audio attack types and a proportionate number of genuine speech samples. We also include metadata on speaker gender and the generator/source for each spoofed audio sample, offering more granularity for model training. We found that models trained on data augmented with EDLFs had improved model performance significantly beyond the ASVspoof 2021 deep learning baselines and SSL models like HuBert and XLSR. LinguAS’s augmented linguistic, gender, and generator metadata provide audio deepfake researchers with a dataset that emphasizes real human language traits to improve model inference of faked speech. Data and code are publicly available.

[AI-104] YUBI: Yielding Universal Bidigital Interface for Bimanual Dexterous Manipulation at Scale

链接: https://arxiv.org/abs/2606.10244
作者: Takehiko Ohkawa,Jumpei Arima,Yuki Noguchi,Masatoshi Tateno,Makoto Sugiura,Takuya Okubo,Kengo Ikeuchi,Yuma Shin,Hiroki Nishizawa,Naoaki Kanazawa,Yuki Wakayama,Daiki Fukunaga,Koshi Makihara,Tomohiro Motoda,Floris Erich,Yukiyasu Domae,Tatsuya Matsushima,Yohishiro Okumatsu,Kei Ota
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Project page: this https URL

点击查看摘要

Abstract:We introduce Yielding Universal Bidigital Interface (YUBI), a finger-aligned gripper designed to enable intuitive, ergonomic, and scalable data collection for bimanual dexterous manipulation. While handheld data collection systems such as Universal Manipulation Interface (UMI) enable affordable data collection, their bulky pistol-grip designs can pose ergonomic and usability challenges for fine-grained, dexterous manipulation tasks. To address this, YUBI presents a distinct design principle: yielding, finger-driven actuation that directly maps human finger movements to gripper jaw motion. Using the YUBI devices, we set up a data collection system with integrated VR-based 6 DoF tracking of the gripper, ensuring high-fidelity trajectory data acquisition. We curate a UMI-based dataset of unprecedented scale: 8,434 hours across 1.20M episodes and 119 tasks. Experiments show that YUBI offers advantages over the UMI gripper in versatility for complex bimanual tasks, dexterity, and operational efficiency. A single policy trained on the YUBI dataset transfers across multiple bimanual robots (UR, Franka, and ELEY) simply by mounting the gripper on each platform, confirming that the collected data are directly executable as policy supervision. We release the gripper hardware, data-collection software, and dataset as one integrated stack, offering the open community a reproducible path to large-scale data acquisition for advancing robotic foundation models.

[AI-105] Regimes: An Auditable Held-Out-Gated Improvement Loop Demonstrated on LongMemEval with ActiveGraph

链接: https://arxiv.org/abs/2606.10241
作者: Yohei Nakajima
类目: Artificial Intelligence (cs.AI)
备注: 30 pages, 5 figures. Code and committed runs: this https URL

点击查看摘要

Abstract:Autonomous improvement loops are hard to trust because the improvement process is usually external scaffolding bolted onto the agent: failures go unlogged, diagnoses cannot be replayed, and promote-or-discard decisions land in a side database rather than the agent’s own history. We show that an event-sourced agent runtime removes that friction and turns controlled improvement into a first-class workflow. When the agent’s state is a deterministic projection of an append-only event log, failures are recorded, a run replays exactly from its log, candidate patches scope to typed pipeline seams, gates are auditable, and every promotion or discard is itself an event. We demonstrate this with Regimes, a loop on the ActiveGraph runtime that diagnoses failed evaluations, proposes a repair at a pipeline point, and promotes it only after static checks, sandbox execution, in-sample evaluation, and held-out validation. The loop is target-agnostic: the same control flow runs against different tasks through a common interface. On LongMemEval-S the dominant failure is not retrieval but reconciliation: the evidence is already in the assembled context, yet the reader answers incorrectly. Across five seeded held-out splits, Regimes discovers reader-prompt repairs that improve final held-out accuracy by +0.05 to +0.10 in four splits and +0.01 in one over-promotion split; two splits are individually significant (seed 5 unadjusted for its sequential promotion structure), and the pooled count is descriptive only, since the splits share one 500-question pool. The durable contributions are ActiveGraph as an auditable substrate that makes controlled improvement loops tractable, the held-out-gated loop it supports, the failure-regime taxonomy routing each failure to a pipeline location (whose marginal value over an unrouted baseline is the primary open question), and the prompt-as-discovery-probe hypothesis.

[AI-106] Minimalist Genetic Programming

链接: https://arxiv.org/abs/2606.10237
作者: Leonardo Trujillo
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Genetic programming (GP) is based on two important insights. First, that any learning task can fundamentally be posed as a program induction problem, where the goal is to construct a symbolic hierarchical model that is expressed as a syntax tree. Second, to pose this task as a search problem, and use evolution to locate the desired model. Since it was proposed, GP has produced notable results in a wide range of tasks and problem domains. This work presents an alternative view by modifying the second core insight of GP, posing the problem as a syntactic derivation task instead. In particular, this paper presents Minimalist Genetic Programming (MGP), an algorithm that like GP is biologically inspired, but instead of evolution it takes inspiration from the Minimalist Program to human language, in which syntax is understood as an optimal solution to the problem of linking two other mental systems. In minimalism, the core computational process is a binary set formation operator called MERGE , than can be used to incrementally construct complex syntactic structures using a simple Markovian process. MGP is able to discover the core building blocks of the symbolic expressions, and to incrementally combined them using MERGE . The proposed system is benchmarked on symbolic regression tasks that are known to be difficult to solve with standard GP systems because of the propensity for bloat. Results show that when a proper lexicon of atomic syntactic objects are chosen, MGP is able to consistently produce the exact ground truth model on a set of symbolic regression where standard GP struggles to do the same. The insights provided by minimalism are shown to be relevant to the problem of program induction, and should be explored further based on the potential exhibited by MGP in this work.

[AI-107] SHAPO: Sharpness-Aware Policy Optimization for Safe Exploration ICLR2026

链接: https://arxiv.org/abs/2606.10228
作者: Kaustubh Mani,Yann Pequignot,Vincent Mai,Liam Paull
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: ICLR 2026

点击查看摘要

Abstract:Safe exploration is a prerequisite for deploying reinforcement learning (RL) agents in safety-critical domains. In this paper, we approach safe exploration through the lens of epistemic uncertainty, where the actor’s sensitivity to parameter perturbations serves as a practical proxy for regions of high uncertainty. We propose Sharpness-Aware Policy Optimization (SHAPO), a sharpness-aware policy update rule that evaluates gradients at perturbed parameters, making policy updates pessimistic with respect to the actor’s epistemic uncertainty. Analytically we show that this adjustment implicitly reweighs policy gradients, amplifying the influence of rare unsafe actions while tempering contributions from already safe ones, thereby biasing learning toward conservative behavior in under-explored regions. Across several continuous-control tasks, our method consistently improves both safety and task performance over existing baselines, significantly expanding their Pareto frontiers.

[AI-108] Fast Exact Nearest-Neighbor Learning for High-Frequency Financial Time Series

链接: https://arxiv.org/abs/2606.10219
作者: Henry Han,Diane Li
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 15 pages 5 figures;

点击查看摘要

Abstract:AI efficiency at scale is becoming critical in finance as market data volumes surge across equities, ETFs, FX, options, and high-frequency trading streams. This growth creates a core challenge for mature financial AI systems: models must learn from larger historical corpora while still meeting real-time latency constraints in trading, risk management, and derivative pricing. We use exact nearest-neighbor learning for high-frequency financial time series as a concrete case study to show that Mojo-based financial AI can address this challenge. We introduce a Mojo SIMD k-d tree with variance-based splitting, contiguous flat-buffer storage, and compile-time vectorized distance computation. We also provide a runtime result showing that, under standard pruning and implementation-cost assumptions, the Mojo SIMD k-d tree asymptotically dominates Mojo SIMD brute force and scikit-learn’s k-d tree in the fixed-stock, large- n , moderate-dimensional regime. Empirically, across eight financial datasets on x86 and ARM64 with up to 277K training samples, the method achieves 17.5–21.6 \times speedup over scikit-learn’s k-d tree on x86 and 28.1–43.5 \times over scikit-learn brute force on ARM64 equity/ETF datasets, while preserving exact outputs. Beyond nearest-neighbor inference, Mojo’s compiled execution enables an Extra Trees-based implied-volatility pricing model to train on 10\times more options data, reducing put-IV RMSE by 8.0%. These results position Mojo as a scalable, production-ready stack for financial AI and a promising foundation for efficient AI in other data-intensive fields. \keywordsFinancial AI \and AI Efficiency \and Mojo \and SIMD \and K-D Trees \and KNN \and High-Frequency Trading \and Financial Time Series \and Scaling

[AI-109] A Source Domain is All You Need: Source-Only Cross-OS Transfer Learning for APT Anomaly Detection via Semantic Alignment and Optimal Transport

链接: https://arxiv.org/abs/2606.10216
作者: Sidahmed Benabderrahmanea,Petko Valtchev,James Cheney,Talal Rahwan
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Advanced Persistent Threats (APTs) are stealthy, multi-stage cyberattacks whose detection is difficult due to scarce labeled traces, severe class imbalance, and the challenge of generating realistic malicious behavior. These challenges are amplified in cross-operating-system (cross-OS) settings, where a detector trained on one source platform must be deployed on an unlabeled target platform without access to target-domain labels. We study this source-only cross-OS APT detection problem using system-level provenance traces and propose a transport-based framework for ranking anomalous target processes under zero target supervision. The framework abstracts process behavior into structured natural-language descriptions, embeds them using pretrained language models, and constructs a source-normal reference for target scoring. It combines three evidence channels: semantic deviation from source-normal prototypes, structural deviation captured by graph autoencoding, and geometric deviation measured through Optimal Transport (OT). The main contribution is an OT-based barycentric anomaly score that projects target embeddings onto the source-normal manifold and quantifies residual transport mismatch. We further introduce entropy-weighted, angle-aware, and density-aware OT variants to capture uncertainty, directional drift, and sparse-support behavior. Evaluation on DARPA Transparent Computing data spanning Linux, Windows, BSD, and Android, across two APT scenarios and twelve cross-OS transfer pairs, shows that the proposed framework improves ROC-AUC and nDCG over source-only anomaly-detection baselines. The results demonstrate that source-only provenance modeling, combined with semantic abstraction and OT-based anomaly scoring, can support practical cross-platform APT detection without target-domain supervision.

[AI-110] Automated Pronunciation Evaluation for Korean Toddler Speech using Speech Diarization and Self-Supervised Learning ALT

链接: https://arxiv.org/abs/2606.10213
作者: Diane Myung-kyung Woodbridge,Jee Hyun Suh
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注: This paper will be presented at IEEE ICTs4ehealth in June, 2026

点击查看摘要

Abstract:Speech sound disorders affect approximately 44% of Korean pediatric communication disorder cases, yet automated assessment tools for Korean toddler speech remain underdeveloped. This paper presents an end-to-end pipeline for automated pronunciation evaluation of Korean toddler speech, combining neural speaker diarization with self-supervised speech representation learning. We introduce a novel IRB-approved corpus of 53 recordings from Korean-speaking children aged 2-5 years. A subset of 53 subjects was annotated by three independent reviewers, yielding 1,190 consonant and 748 vowel word-level binary correctness labels. We evaluate three diarization models, finding that NeMo SortFormer achieves 88.69% speaker count accuracy and 33.04% diarization error rate (DER) owing to its arrival-time-sorted transformer architecture, which handles the acoustic confound between young female caregivers exhibiting aegyo and toddler speech. For pronunciation scoring, we compare three self-supervised learning (SSL) backbones across multiple pooling strategies. A cross-model ensemble routing consonant prediction to HuBERT-large and vowel prediction to WavLM-large achieves balanced accuracies of 0.720 and 0.845, with a mean of 0.782.

[AI-111] Less Context Better Agents : Efficient Context Engineering for Long-Horizon Tool-Using LLM Agents

链接: https://arxiv.org/abs/2606.10209
作者: Abhilasha Lodha,Mahsa Pahlavikhah Varnosfaderani,Abir Chakraborty,Abhinav Mithal
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
备注: 17 pages, 3 figures, 8 tables

点击查看摘要

Abstract:Large language models deployed as autonomous agents for enterprise workflows face a key challenge: verbose tool responses from enterprise systems can cause context overflow, stale-state errors, and high inference cost. We study this problem in automated expense itemization in Microsoft Dynamics 365 Finance and Operations using Model Context Protocol tools. We evaluate four GPT-5 configurations on a 50-task hotel expense benchmark: no user model, full conversation history, context pruned to the last 5 tool call/response pairs, and pruning with automated summarization. Results are averaged across 5 independent runs, with the user model held constant for the context-engineering comparison. The no-user-model baseline achieves only 8.0% complete itemization. Full-context retention improves completion to 71.0%, but consumes 1,480,996 tokens and 14.56 hours per benchmark. Pruning to the last 5 tool calls improves completion to 79.0% while reducing token use to 535,274 and runtime to 5.39 hours. Adding summarization achieves the best result: 91.6% complete itemization and 99.64% average amount itemized, with 553,374 tokens and 5.79 hours. We further report confidence intervals, effect-size analysis, sensitivity over pruning and summary windows, failure analysis, results across five expense types grouped into three categories, and cross-model evidence with Claude Sonnet 4.5. These results show that, for this class of enterprise tool-use workflow, selective retention of recent tool interactions plus compact summarization can improve both reliability and efficiency compared with full-history retention.

[AI-112] Exploration of Foundation Model-Based Robots in Patient and Elderly Care

链接: https://arxiv.org/abs/2606.10208
作者: Zhiwen Qiu,Wei Liu,Yuexing Hao
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Demand for older-adult and patient care is growing rapidly as populations age worldwide. Foundation models are increasingly being integrated into robots and interactive agents, with the promise of more flexible communication and personalized assistance. However, care settings require reliable and workflow-compatible systems with accountable human oversight, and it remains unclear whether current embodied systems can translate technical advances into clinical impact. This Perspective synthesizes foundation model-based care robots across three areas: design features, user experience, and evidence for care-related outcomes. Current systems most commonly use foundation models as conversational and reasoning layers within voice-centered socially assistive embodiments, while multimodal grounding and physical autonomy remain limited. Empirical evaluations report positive usability and engagement benefits, but reliability failures persist across the interaction pipeline such as hallucinations and conversational breakdowns. Evidence for care impact remains concentrated in proximal outcomes such as cognitive engagement and participation, with limited evidence for validated clinical or care-related changes. We argue that future research should transition toward care-specific evaluation standards, accountable autonomy, and integration into care workflows to support more responsive and responsible care technologies.

[AI-113] MMClima: A Framework for Multimodal Climate Science Data and Evaluation

链接: https://arxiv.org/abs/2606.10194
作者: Muhammad Umer Sheikh,Hassan Abid,Khawar Shehzad,Ufaq Khan,Muhammad Haris Khan
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Climate change research increasingly requires AI systems that reason across text, dynamic visual content, and scientific figures, yet existing climate QA benchmarks are small, mostly textual, and cover a narrow range of models. We introduce MMClima, a large-scale multimodal climate question answering framework with 104k+ expert-validated question-answer pairs spanning articles, video transcriptions, and figures across five core climate science domains. MMClima is constructed via automated claim extraction and QA synthesis with human-in-the-loop validation to ensure both scale and reliability. Using MMClima, we benchmark state-of-the-art multimodal language models on tasks requiring factual recall, visual interpretation, and cross-modal synthesis. We additionally fine-tune on the textual split to produce mmclima-70b-txt, a domain-adapted baseline that outperforms strong open- and closed-source models on textual QA. We release the dataset, evaluation pipeline, fine-tuned model weights, and data creation framework to support standardized multimodal evaluation for climate science.

[AI-114] Dropout-GRPO: Variational Stochasticity for Continuous Latent Reasoning

链接: https://arxiv.org/abs/2606.10184
作者: Wooil Jung
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Group Relative Policy Optimization (GRPO) relies on the diversity of K rollouts within each group; otherwise, the group-mean advantage A^(k) = r^(k) - \mu_r collapses to zero. This presents a structural challenge for latent-reasoning models like Coconut, which feed continuous hidden states recurrently in place of discrete chain-of-thought tokens. Because the latent phase is inherently deterministic given the parameters and prompt, multiple rollouts produce identical trajectories, stalling GRPO’s progress. Consequently, applying group-relative reinforcement learning to continuous latent reasoning has proven difficult. To address this, we propose sourcing the necessary stochasticity through structured dropout. By applying a single Bernoulli mask held constant across all latent recurrence steps for a given rollout, we generate essential trajectory variance. This shared mask effectively treats each rollout as a posterior sample from a variational distribution over parameters, allowing GRPO to optimize the expected reward of a Bayesian model-average policy. We provide both theoretical justification for this method – including unbiasedness, variance reduction, and the well-definedness of the latent gradient – and empirical validation. On GSM8K, dropout-GRPO improves a Coconut baseline from 27.29% to 29.01% pass@1, demonstrating the viability of GRPO learning for latent-reasoning models. Our work positions this as a practical, theoretically grounded approach for post-training latent-reasoning LLMs. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2606.10184 [cs.LG] (or arXiv:2606.10184v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2606.10184 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-115] Local Is Not a Sufficient Privacy Boundary: Governing OS-Integrated On-Device AI

链接: https://arxiv.org/abs/2606.10173
作者: Jonghyun Chung,Sanket Badhe
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As AI systems move into operating systems, privacy no longer turns only on whether a model runs locally. A local assistant may assemble email, calendar entries, files, screenshots, notifications, and app intents; retain embeddings or summaries; invoke tools; emit telemetry; or route difficult requests to cloud infrastructure. Local inference reduces some exposure, but it answers only one question: where computation occurs. It does not answer who may assemble context, what derived state persists, which actions are authorized, or how updates change the system’s authority. We develop an OS-centered privacy framework for on-device AI that treats privacy as an institutional accountability problem rather than a deployment attribute. The framework specifies a threat model, a six-part privacy risk taxonomy, privacy-by-architecture controls, and a four-level audit rubric. We demonstrate the rubric through a documentation-bounded comparison of Apple Intelligence/Foundation Models, Android AICore/Gemini Nano, and Microsoft Recall. Meaningful privacy in on-device AI depends on constrained information flow, bounded authority, visible user control, and auditable governance across the operating-system lifecycle.

[AI-116] FedSteer: Taming Extreme Gradient Staleness in Federated Learning with Corrective Projections and Caching UAI2026

链接: https://arxiv.org/abs/2606.10124
作者: Haoran Zhang,Cainã Figueiredo Pereira,Marie Siew,Xutong Liu,Carlee Joe-Wong,Rachid El-Azouzi
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: UAI 2026

点击查看摘要

Abstract:Federated learning (FL) is often subject to aggregation variance if clients do not consistently participate in training rounds. While reusing stale model updates from inactive clients is a common technique to reduce this variance, we find that with skewed client participation, the resulting update staleness can become severe enough to destabilize training. To remedy this, we propose FedSteer, a novel method that constructs a gradient subspace from a cache of recent client gradients to serve as a low-dimensional representation of the current optimization landscape. FedSteer projects an active client’s true gradient onto this subspace to find a set of optimal coordinates. For an inactive client, FedSteer reuses these coordinates with the now-evolved subspace drifted by other active clients. This process effectively “steers” outdated gradients toward the current global objective. This is complemented by a selective caching strategy that identifies a representative client subset to form the subspace, reducing server memory. Experiments demonstrate that FedSteer significantly outperforms baselines, preventing performance collapse in challenging scenarios while delivering accuracy gains of over 7% in others.

[AI-117] Duality for Optimal Multi-Item Multi-Bidder Auction Design: Revenue Certificates through Deep Learning

链接: https://arxiv.org/abs/2606.10112
作者: Yanchen Jiang,David C. Parkes,Tonghan Wang
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Theoretical Economics (econ.TH)
备注:

点击查看摘要

Abstract:Characterizing revenue-optimal auctions for multi-item, multi-bidder settings remains a fundamental open problem, with no known closed-form solution existing beyond restrictive binary-type instances. This has motivated interest in computational approaches to optimal auction design. In this paper, we introduce the first computational framework that directly tackles the dual problem for multi-item, multi-bidder auctions and dominant-strategy incentive compatibility (DSIC), generating certified revenue upper bounds. Our approach parametrizes Lagrange multipliers with a structurally guaranteed strict flow-conservation property using neural networks, enabling efficient optimization over feasible dual solutions via gradient descent. To bridge the gap between discrete computational methods and theoretical guarantees for continuous types, we develop a novel lifting technique that maps dual certificates from coarse discretizations to fine refinements. We prove that lifting gives valid revenue upper bounds for multi-item, multi-bidder auctions with continuous uniform valuations. Furthermore, we give a generalized lifting construction for arbitrary continuous distributions and demonstrate that these lifted duals converge to the revenue of the original continuous problem in the discrete limit. We validate this computational framework for the dual auction design problem by recovering known analytical mechanisms for canonical instances. For multi-item multi-bidder problems, our framework establishes a small gap between the optimal revenue and best-known DSIC mechanisms, providing computational certificates of near-optimality.

[AI-118] What makes a harness a harness: necessary and sufficient conditions for an agent harness

链接: https://arxiv.org/abs/2606.10106
作者: Sanderson Oliveira de Macedo
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The term agent harness now circulates widely in software engineering with generative artificial intelligence. It names the layer that wraps a language model and turns it into a coding agent able to act on a repository. The usage is loose and polysemous. Sometimes the term denotes the whole product (Claude Code, Codex CLI); sometimes it denotes the evaluation scaffold that runs an agent against tasks (the SWE-bench harness); sometimes it gets conflated with an agent framework, an SDK, an IDE plugin, or an orchestrator. What is missing is a reference definition that works as an instrument, one that includes and excludes cases consistently. We build that definition through a conceptual analysis that combines works with persistent identifiers and primary grey-literature sources, such as official documentation, glossaries, and engineering reports. We reconstruct the genealogy of the term, from the horse’s tack to the classic test harness, to the machine-learning evaluation harness, and finally to the agent harness. We then propose a constitutive definition that states the necessary and sufficient conditions for a system to be an agent harness, we operationalize it as an inclusion and exclusion test, and we draw the boundary of the concept against an agent framework, an agent SDK, an IDE plugin, an eval harness, and an orchestrator. We apply the definition to six real harnesses (Claude Code, Codex CLI, Aider, Cline, OpenHands, and SWE-agent) and to deliberate edge cases; the test includes and excludes consistently. We close with a research agenda organized by design tension axes. The contribution is an operational definition of agent harness, with a shared vocabulary, able to guide engineering practice and the scientific comparison of agentic systems.

[AI-119] Unsupervised Style Representation Learning for AI-Text Detection via Paraphrase Inversion

链接: https://arxiv.org/abs/2606.10099
作者: Rafael Rivera Soto,Barry Chen,Nicholas Andrews
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The rapid development of large language models (LLMs) has raised concerns about misuse such as plagiarism, misinformation, and automated influence operations, motivating the need for robust detectors. Recent work has shown that neural representations of writing style are effective for detection and, crucially, robust to adversarial attacks that defeat most existing detectors. However, current style-based detectors rely on authorship labels for training, and are limited to few-shot inference for detection, requiring in-distribution samples that may not always be available. We learn discriminative style features without authorship labels by training a style encoder to reconstruct human-authored text from its machine-generated paraphrase; freezing a semantic encoder during training biases the style encoder to capture only the non-semantic features needed for reconstruction. We evaluate the learned representations via two detection strategies: a few-shot detector and a zero-shot DeepSVDD-based detector. Across benchmarks, our method matches or outperforms all baselines in the few-shot setting and, in the zero-shot regime, is competitive with fully supervised classifiers on in-distribution test data while generalizing better to unseen LLMs. Beyond detection, the learned representations generalize to unseen tasks, achieving competitive performance on authorship verification and fine-grained style discrimination despite never being trained on either objective.

[AI-120] Predictive Assistance and the Temporal Dynamics of Exploratory Compression

链接: https://arxiv.org/abs/2606.10094
作者: Balaraju Battu
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Classical theories of cognition describe problem solving as exploratory search through structured problem spaces in which repeated interaction gradually compresses search into efficient representational structures. Predictive artificial intelligence systems introduce a distinct regime in which stabilization may occur before exploratory diversification unfolds, supplying solutions and decision trajectories prior to internally generated search. This paper develops a geometric dynamical framework in which attention evolves over a landscape of strategies shaped by stabilizing drift, endogenous exploratory perturbation, and responsiveness-gated learning. Predictive assistance is modeled as a process of exogenous exploratory compression that stabilizes trajectories before self-generated exploration broadens the accessible regions of strategy space. The framework yields three main results. First, sustained predictive stabilization reduces exploratory responsiveness by attenuating the effective influence of intrinsic perturbations even when exploratory variability remains present. Second, curvature accumulates and relaxes asymmetrically, producing hysteresis and delayed recovery of exploratory mobility after assistance withdrawal. Third, developmental outcomes depend critically on the timing of stabilization, with early intervention narrowing future exploratory traversal before broad representational diversification has occurred. The framework generates empirically testable predictions concerning exploratory entropy, premature convergence, and delayed recovery following predictive stabilization. More broadly, the results suggest that predictive systems may reshape the geometry of exploratory cognition itself.

[AI-121] A Theory on Flow Matching with Neural Networks

链接: https://arxiv.org/abs/2606.10089
作者: Yihan He,Qishuo Yin,Yuan Cao,Jianqing Fan,Han Liu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this work, we develop theoretical foundation for flow matching with neural-network-parameterized conditional velocity fields. We establish convergence guarantees for gradient descent in the over-parameterized 2-layered ReLU neural network regime. We derive generalization bounds for the conditional velocity-field matching objective. Building on these results, we provide Wasserstein-distance guarantees for the samples generated by the induced flow. Our analysis is based on generalization bound for multi-task representation learning with unbounded losses, which may be of independent interest beyond flow-based generative modeling. These theoretical results are validated through extensive experiments on both synthetic and real-world image benchmarks.

[AI-122] Exploratory Responsiveness and Adaptive Rigidity under AI-Assisted Optimization

链接: https://arxiv.org/abs/2606.10086
作者: Balaraju Battu
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper develops a theory of exploratory adaptation under AI-assisted optimization. The central argument is that the long-run adaptive effects of AI systems depend critically on how predictive assistance interacts with exploratory responsiveness itself. We formalize this mechanism using a dynamical framework in which cognitive, institutional, and technological systems evolve over rugged epistemic landscapes characterized by multiple locally reinforced configurations. A central state variable in the model is adaptive responsiveness, which measures the capacity of a system to traverse unfamiliar conceptual and institutional trajectories under changing conditions. Under convergent predictive regimes, AI systems substitute for exploratory engagement, reducing adaptive responsiveness and generating metastable trapping, hysteresis, premature convergence, and exploration-collapse dynamics in which systems become locally efficient but globally rigid. The framework also identifies contrasting exploration-enhancing regimes in which AI systems amplify exploratory search, conceptual traversal, and adaptive mobility. The effective substitution parameter is therefore responsiveness-dependent: systems possessing weak exploratory routines are more vulnerable to exploratory substitution, whereas systems already possessing high adaptive responsiveness may use AI assistance to expand exploratory mobility across rugged landscapes. The long-run adaptive effects of AI consequently depend not only on AI capability itself, but also on institutional structure, developmental context, and the architecture of human-machine interaction.

[AI-123] Divide-and-Conquer Modeling for the CTF-4-Science Lorenz Benchmark

链接: https://arxiv.org/abs/2606.10084
作者: Shundong Li
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This work presents a divide-and-conquer modeling strategy for the CTF-4-Science Lorenz benchmark, which evaluates chaotic-system prediction across twelve hidden scores and five scenario families: clean forecasting, noisy reconstruction, noisy-input forecasting, few-shot learning, and parametric generalization. Rather than forcing one model class to handle all regimes, the final system matched each prediction block to the evaluation behavior of its task group. The main contributions are: smoothing-based reconstruction for noisy full-trajectory denoising; NG-RC/NVAR models tuned for noisy long-time attractor forecasting; a fitted Lorenz transition correction restricted to the sensitive clean short-time prefix; and a parametric prefix blend for the interpolation task. The resulting system with final public score of 79.63 shows that bounded, scenario-specific updates can outperform broad model replacement on mixed chaotic forecasting benchmarks.

[AI-124] VFUSE: Virulent Feature Understanding with Sparse autoEncoders

链接: https://arxiv.org/abs/2606.10080
作者: Michael Yu,Matthew L. Olson
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:Generative models have shown remarkable progress in a variety of domains such as protein design, but such power enables the opaque generation of hazardous proteins. In this work, we introduce VFUSE (Virulent Feature Understanding with Sparse autoEncoders), a mechanistic interpretability approach that trains SAEs on diffusion-transformer activations to audit protein models for hazard-aware features. We apply VFUSE to RoseTTAFold3 and RFDiffusion3, popular open-weight models for protein folding and synthesis. We find that for certain blocks, linear probes detect hazardous designs significantly better when fit in the SAE latent space over the original model’s representations: improving interpretability without sacrificing model performance. Furthermore, we identify monosemantic features from the SAE that fire only on hazardous designs at up to AUROC 0.84 ( q 10^-13 ). To our knowledge this is the first SAE trained on an all-atom diffusion model and the first feature-level virulence audit of a protein design model, paving the way towards safe and interpretable protein design.

[AI-125] mporal Sheaf Neural Networks with Dynamic Orthogonal Transport

链接: https://arxiv.org/abs/2606.10071
作者: Md Sadek Hossain Asif,Tanzila Khan,Md. Mosaddek Khan
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce Temporal Sheaf Neural Networks (TSNN), a temporal link prediction framework that equips each node with a time-varying orthogonal frame and compares node states only after explicit transport between local coordinate systems. In contrast to existing continuous-time graph models that operate in a shared global embedding space, TSNN models node-specific and evolving interaction semantics through dynamic local frames. The model parameterizes per-node frames via efficient low-rank Householder products, preserves stored hidden states exactly under frame updates, and uses a geometric-residual decoder that anchors predictions on transported distances while learning residual corrections. All computations are strictly causal and use only the pre-event history. We show that the symmetric degree-normalized sheaf Laplacian is orthogonally similar to the symmetric normalized graph Laplacian, with the random-walk normalized form similar in the corresponding degree metric; the full-active, feature-scaled diffusion used by TSNN is exactly a metric-gradient step on the combinatorial sheaf Dirichlet energy, with a degree-free monotone-descent and non-expansiveness guarantee. Frame drift perturbs updates only linearly. Across TGB v2 link-prediction and temporal-heterogeneous leaderboards, together with the DGB benchmark suite, TSNN matches or surpasses the strongest prior methods on most benchmarks, with the largest improvements on graphs exhibiting strong node-role heterogeneity. Ablations confirm the distinct benefit of dynamic frames, orthogonal transport, and geometric-residual decoding.

[AI-126] Importance-Aware Scheduling for High-Dimensional Hyperparameter Optimization IJCNN2026

链接: https://arxiv.org/abs/2606.10068
作者: Ruinan Wang,Ian Nabney,Mohammad Golbabaee
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 8 pages, 5 figures. Accepted to IJCNN 2026

点击查看摘要

Abstract:Hyperparameter Optimization (HPO) is essential for building high-performing ML/DL models, yet conventional optimizers often struggle in high-dimensional spaces where evaluations are costly and progress is diluted across many low-impact variables. We propose Greedy Importance First (GIF), an importance-aware scheduling strategy that uses a small-sample warm start to estimate hyperparameter importance, forms importance-based groups, allocates trials proportionally, and retains a full-space fallback. We evaluate GIF under fixed evaluation budgets on five anisotropic analytic functions, Bayesmark, and NAS-Bench-301. On the higher-dimensional benchmarks, GIF reaches better incumbents with faster convergence than TPE, BOHB, Random Search, and Sequential Grouping. On Bayesmark, where the effective dimensionality is smaller, GIF remains competitive but the margins are smaller. Ablation studies show that importance estimation, proportional allocation, and the fallback step all contribute to the gains. We also verify that the HIA component recovers the intended anisotropy on the analytic benchmarks. These results suggest that GIF is a simple and plug-compatible way to improve sample efficiency in high-dimensional HPO.

[AI-127] Bittensor Agent Arenas as a Trajectory Primitive: Distilling a Shopping Agent from ShoppingBench Subnet Traces

链接: https://arxiv.org/abs/2606.10064
作者: Shardul Bansal,Seth Schilbe,Jarrod Barnes
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 10 pages, 4 figures, Data and Models available at: this https URL

点击查看摘要

Abstract:Small-model agentic post-training is bottlenecked less by the algorithm than by the trajectory substrate it consumes. Leading recipes (RLVR, group-relative RL, rejection-sampled re-SFT) all need multi-turn traces carrying per-trajectory supervision, and the two existing sources fall short: frontier-synthesised data inherits the synthesizer’s biases and collapses the long tail, while unfiltered production logs are unjudged and contaminated by shortcut behaviour. We argue that an incentive-aligned agent arena can be engineered to manufacture such trajectories, and demonstrate this on ORO Subnet 15 (SN15), a Bittensor deployment of the ShoppingBench agentic-commerce benchmark. SN15’s race mechanism, LLM reasoning judge, and rotating leak-cluster-guarded problem suite yield a corpus with three properties: incentive-aligned diversity, per-trajectory judging, and anti-memorised held-out evaluation. We introduce a structural-quality filter that converts the raw firehose into a trainable corpus by keeping agentic trajectories (the model itself emits the tool calls) and rejecting sub-task trajectories (the model only classifies or narrates over a deterministic search loop), then post-train Qwen3-4B with a recipe matched to the published ShoppingBench SFT-then-GRPO pipeline. On a leak-cluster-guarded held-out partition scored production-strict, the model lifts from the published Qwen3-4B base of 18.0% ASR to 42.7%, within single-problem noise of the synthetic-data SFT-only baseline (43.6%), while training on a fraction of a single day of subnet output. The supervised stack leaves a large pass@8 to pass@1 gap (53.3% vs 34.8%); a per-step teacher-grounded Dr. GRPO reward converts that headroom into process improvement, and we identify the sub-task firehose as the primary lever for closing the gap to the 48.7% SFT+GRPO bar. We release the filter, the corpus splits, and the arena mechanics.

[AI-128] Inside the Latent Flow: Causal Deciphering of Attention Dynamics in Audio Separation Foundation Models

链接: https://arxiv.org/abs/2606.10046
作者: Yuxuan Chen,Haoyuan Xu,Peize He
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Flow-matching transformers achieve strong audio separation, yet their attention dynamics are opaque. We adapt established causal-intervention principles into a deterministic, inference-time probing protocol for SAM Audio. Orthogonal probing uncovers a dual-pathway text-conditioning mechanism: additive injections control semantic identity, while cross-attention refines acoustic structure. We observe an asynchronous layerwise convergence: stable layers build temporal scaffolds early, whereas fast layers continue resolving artifacts during sampling. The model also attenuates temporal segmentation cues to maintain continuous-flow stability. Using these insights, we propose Layer-Selective Attention Caching (LSAC), a training-free acceleration method that caches attention in stable layers. Across acoustic complexities, LSAC cuts self-attention computation by about ~25% with negligible quality loss and yields up to 6.7x higher quality retention than naive step reduction.

[AI-129] Business World Model

链接: https://arxiv.org/abs/2606.10044
作者: Cecil Pang,Hiroki Sayama
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Businesses are increasingly adopting AI-enabled tools to improve productivity, reduce costs, and enhance products and services. However, the transformative potential of AI extends beyond automating predefined tasks: it lies in enabling intelligent systems to plan, optimize, and execute business initiatives from high-level strategic objectives. This paper introduces the concept and architecture of a business world model (BWM), a world model specialized for business and organizational environments. Inspired by world models in artificial intelligence, cognitive science, and control theory, a BWM encodes business states, dynamics, constraints, objectives, and feasible action space to support autonomous decision-making. We propose a business-semantics-centric formulation in which business states, dynamics and actions are linked to key business entities. Within this framework, agents can simulate alternative action sequences, estimate their effects on future business outcomes, and evaluate trade-offs under uncertainty. The proposed architecture integrates semantic data representations, probabilistic machine learning models, deterministic business rules, and explicit action space into a coherent structure for planning and counterfactual reasoning. Although its individual components are not new, the contribution of BWM lies in organizing them as an executable internal simulator for business initiatives. This work establishes a conceptual foundation for autonomous business systems capable of moving from instruction-based execution toward goal-driven planning and execution.

[AI-130] Optimality of FSQ Tokens for Continuous Diffusion for Categorical Data with Application to Text-to-Speech

链接: https://arxiv.org/abs/2606.09962
作者: Vadim Popov,Wenju Gu,Tasnima Sadekova,Georgii Aparin,Assel Yermekova
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Continuous diffusion for categorical data is a framework belonging to the diffusion family and aiming at generating discrete data. The scientific interest to such models has been constantly increasing these days because researchers try to achieve a challenging goal of finding reasonable alternatives to autoregressive large language models. In this paper, we study the properties of the structure of the latent space corresponding to discrete tokens expressed in terms of Kullback-Leibler divergence on diffusion path measures and accuracy of the correct token prediction by the optimally trained diffusion model. We find that FSQ tokenization scheme has the latent space structure with the properties that make it best suited for continuous diffusion for categorical data as verified through rigorous theoretical analysis and numerical experiments. To validate our findings in real-life scenario, we train several text-to-speech diffusion models having speech tokens as intermediate acoustic features, and show that the one based on FSQ tokens indeed performs the best, and, moreover, it outperforms its strong LLM-based counterpart, at the same time being significantly smaller and faster.

[AI-131] 3SPO: State-Score-Supervised Policy Optimization for LLM Agents

链接: https://arxiv.org/abs/2606.09961
作者: Yu Han,Kailing Li,Yang Jiao,Yulin Dai,Yuqian Fu,Linhai Zhuo,Tianwen Qian
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Training large language models (LLMs) as autonomous agents via reinforcement learning (RL) has enabled frontier models to achieve superhuman performance in long-horizon tasks. However, existing RL algorithms operate at the trajectory level, performing policy optimization only after collecting complete episode rollouts. This coarse-grained approach faces fundamental challenges in multi-turn agent settings where rewards are sparse, delayed, and credit assignment across individual steps is critical. In this work, we propose \textbfState-Score-Supervised Policy Optimization (3SPO), a novel RL algorithm that performs post-step policy optimization with dynamic state score supervision. At each step, 3SPO computes the state score based on historical success rates, supervising step-wise credit assignment, adaptive rollout and post-step policy optimization without requiring value function estimation or additional auxiliary models. Theoretically, under a per-state bandit abstraction, we show that the proposed score-supervised allocation mechanism achieves logarithmic allocation regret and provide sample-complexity guarantees for action identification, score distinguishability, and filtering stability. Experiments on ALFWorld and WebShop with Qwen2.5-1.5B/7B-Instruct show that 3SPO consistently outperforms GRPO by +22.6% on ALFWorld and +15.6 points on WebShop, while using comparable resources to achieve 2.4\times more state exploration and 1.8\times faster convergence. Code is available at this https URL.

[AI-132] HydraCIL: Decoupled Class-Incremental Learning through Prototype-Guided Multi-Head Classifiers IJCNN2026

链接: https://arxiv.org/abs/2606.09960
作者: Daniel Vila-Cruz,Laura Morán-Fernández,Verónica Bolón-Canedo
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted for publication at the International Joint Conference on Neural Networks (IJCNN 2026)

点击查看摘要

Abstract:We present HydraCIL, a decoupled continual learning model based on prototype-guided multi-head classifiers, targeting sustainable deployment in embedded and resource-constrained environments. While most Class-Incremental Learning (CIL) methods rely on powerful hardware and long retraining cycles, real-world systems, such as robots or edge AI devices, must adapt quickly with limited resources. HydraCIL addresses this gap by freezing the backbone and decoupling feature extraction from learning. For each task, features are extracted once and a lightweight, task-specific classifier head is created, avoiding costly backbone retraining. At inference, HydraCIL selects the appropriate head via similarity with prototypes. Experiments on CIFAR-100, ImageNet-100, CoRe50, and Flowers102 datasets show that HydraCIL matches or outperforms state-of-the-art CIL methods while significantly reducing training time and carbon footprint, making it a practical solution for continual learning in real-world and embedded settings, where energy efficiency and rapid adaptation are critical.

[AI-133] mporal Context Conditioning for Seasonality-Aware Precipitation Nowcasting of High-Intensity Rainfall

链接: https://arxiv.org/abs/2606.09959
作者: Gijs van Nieuwkoop,Siamak Mehrkanoon
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 pages, 6 figures

点击查看摘要

Abstract:Precipitation nowcasting is increasingly being approached with deep learning models that learn directly from recent radar observations. Although such models can efficiently capture short-term precipitation motion, they often lack broader contextual information about the meteorological conditions under which rainfall develops. This paper investigates whether lightweight temporal context can improve radar-based nowcasting, particularly for high-intensity rainfall. We propose the Time-Aware Small-Attention U-Net (TA-SmaAt-UNet), which extends the core SmaAt-UNet model with temporal conditioning layers that use cyclical encodings of time-of-day and time-of-year to modulate intermediate feature representations. Experiments on KNMI radar precipitation data show that temporal conditioning is most beneficial for rare, high-intensity precipitation events, while also improving the representation of seasonal variability and predicted rainfall-intensity distributions. A layer conductance analysis further indicates that the added temporal conditioning layers are actively used by the model despite their small parameter cost. These findings suggest that simple, physically motivated temporal context can improve the realism and reliability of deep learning-based precipitation nowcasts. The implementation of our models and training setup is available on \hrefthis https URLGitHub.

[AI-134] Uncertainty-Aware Motion Planning for Autonomous Driving in Mixed Traffic Environment

链接: https://arxiv.org/abs/2606.09958
作者: Ming Cheng,Hao Chen,Ziyi Yang,Ziluowen Luo,Senzhang Wang
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In mixed-traffic environments where autonomous and human-driven vehicles may co-exist, motion planning for autonomous vehicles requires anticipating the future behaviors of surrounding human drivers. Existing reinforcement learning-based methods generally directly incorporate the predicted human intents into the observation to enable a proactive planning. However, human intent is inherently uncertain due to the behavioral diversity, perception noise, and partial observability. Treating predicted intends as deterministic states can result in unsafe decisions for autonomous vehicles. To address this problem, we propose Uncertainty-Aware Motion Planning (UAMP), which incorporates uncertainty in human intent prediction for AV decision-making. Specifically, UAMP first introduces a proximity-aware uncertainty estimator to quantify the interaction-conditioned intent uncertainty and constructs an uncertainty-guided joint intent distribution over surrounding human-driven vehicles. Within this uncertainty set, UAMP further introduces Uncertainty-Calibrated Value Learning (UCVL) to correct value function learning biases arising from directly incorporating uncertain human intent predictions into the observation. Extensive experiments in various mixed-traffic scenarios show that UAMP significantly improves safety and driving comfort, while maintaining traffic efficiency compared with existing approaches. The code is released at this https URL.

[AI-135] Does Normalization Choice Matter for Causal Large Time-Series Models?

链接: https://arxiv.org/abs/2606.09954
作者: Samy-Melwan Vilhes,Gilles Gasso,Mokhtar Z Alaya(LMAC)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large models for time-series forecasting have been emerged as a promising paradigm for training models on heterogeneous collections of signals. These models typically rely on causal autoregressive architectures, where each observation is sequentially predicted from past. In practice, real-world time-series exhibit non-stationarities, which significantly influence predictive performance. To mitigate this, normalization is commonly employed. However, in efficient causal settings it might induce information leakage from future observations during training. Recent alternatives, including causal normalization and statistics computed from initial observations, have been proposed to address this issue, but their practical implications remain insufficiently understood. In this work, we evaluate normalization strategies for transformer-based large time-series models trained with patching and efficient causal strategy. We showcase that normalization choice significantly influences both training convergence and forecasting performance.

[AI-136] Learning Where to Simulate: Generative Active Sampling for Online PDE Surrogate Training

链接: https://arxiv.org/abs/2606.09949
作者: Pierre Cesar(DATAMOVE),Sofya Dymchenko(DATAMOVE),Abhishek Purandare(DATAMOVE),Bruno Raffin(DATAMOVE)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Data-driven PDE surrogates are trained with data produced by numerical PDE solvers. However, when the surrogate’s goal is to generalize across a wide range of PDE configurations (e.g., initial conditions and physical coefficients), generating a representative training set is non-trivial. Uniform sampling of configuration parameters often under-represents trajectories exhibiting challenging dynamics, leading to high prediction errors and large error variance in the trained surrogate. Online training, where data generation and surrogate training are coupled, offers a natural advantage by allowing solver parameters to be steered on-the-fly. To efficiently exploit this capability, we introduce Online Generative Active Sampling (OGAS), an active learning method that reactively learns the relationship between configuration parameters and surrogate performance to control the sampling distribution. OGAS trains a fast diffusion model in parallel to the surrogate to act as a conditional sampler, mapping a surrogate-derived difficulty signal (e.g., loss or uncertainty) to configuration parameters. By actively drawing target signals from a prior biased toward high difficulty, OGAS continuously steers data generation toward challenging regimes without delaying the training workflow. We evaluate OGAS across 2D PDEs with distinct challenging dynamics (Kuramoto-Sivashinsky, Navier-Stokes, Gray-Scott) and up to 308 parameters, using multiple surrogate architectures. Across all settings, OGAS consistently improves tail statistics, yielding substantial reductions in errors above the 99th percentile and overall error dispersion compared to uniform sampling. While prioritizing challenging trajectories introduces a trade-off with average error, OGAS effectively ensures worst-case reliability of trained surrogates with negligible wall-time overhead.

[AI-137] Anomaly Detection and Root Cause Analysis for Microservice Systems

链接: https://arxiv.org/abs/2606.09942
作者: Luan Pham
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: This is the pre-print of my PhD thesis, submitted to RMIT University

点击查看摘要

Abstract:Microservice systems are widely used to build cloud applications, yet their complexity makes failures inevitable, degrading user experience and causing economic loss. Automated anomaly detection and root cause analysis (RCA) are now active research areas, but existing techniques share five limitations. First, most treat anomaly detection and RCA separately, assuming anomalies are detected correctly, and falter when detection is imprecise due to noise or delay. Second, they focus on metrics, logs, and traces, leaving event data such as API calls and configuration changes underexplored. Third, many require a given service call graph and cannot diagnose without one. Fourth, the field lacks standardised datasets and evaluation frameworks, so methods are hard to compare fairly. Fifth, although causal inference-based RCA has become dominant, its effectiveness, efficiency, and robustness remain unclear. This thesis addresses these limitations through two groups of contributions. The first introduces methods that exploit observability data both independently and collectively. BARO is an end-to-end anomaly detection and RCA approach for metric data. EventADL is an end-to-end framework for event data. TORAI is a multimodal RCA framework that requires no service call graph. Extensive experiments on real microservice systems demonstrate their effectiveness and robustness. The second group delivers benchmarking datasets, an evaluation framework, and systematic evaluation efforts. RCAEval is a comprehensive benchmark providing ready-to-use datasets and reproducible baselines for future research. A systematic evaluation of existing RCA methods, especially causal inference-based approaches, offers insights that guide future directions. This thesis thereby advances automated anomaly detection and RCA for microservice failures, enabling future research on incident mitigation and remediation. Comments: This is the pre-print of my PhD thesis, submitted to RMIT University Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI) Cite as: arXiv:2606.09942 [cs.SE] (or arXiv:2606.09942v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2606.09942 Focus to learn more arXiv-issued DOI via DataCite

[AI-138] Interactions Between Crosscoder Features: A Compact Proofs Perspective NEURIPS2025

链接: https://arxiv.org/abs/2606.09940
作者: Dmitry Manning-Coe,Thomas Read,Anna Soligo,Oliver Clive-Griffin,Chun-Hei Yip,Rajashree Agrawal,Jason Gross
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at the NeurIPS 2025 Workshop on Mechanistic Interpretability

点击查看摘要

Abstract:Dictionary learning methods like Sparse Autoencoders (SAEs) and crosscoders attempt to explain a model by decomposing its activations into independent features. Interactions between features hence induce errors in the reconstruction. We formalize this intuition via compact proofs and make five contributions. First, we show how, \textitin principle, a compact proof of model performance can be constructed using a crosscoder. Second, we show that an error term arising in this proof can naturally be interpreted as a measure of interaction between crosscoder features and provide an explicit expression for the interaction term in the Multi-Layer Perceptron (MLP) layers. We then provide three applications of this new interaction measure. In our third contribution we show that the interaction term itself can be used as a differentiable loss penalty. Applying this penalty, we can achieve ``computationally sparse’’ crosscoders that retain 60% of MLP performance when only keeping a single feature at each datapoint and neuron, compared to 10% in standard crosscoders. We then show that clustering according to our interaction measure provides semantically meaningful feature clusters, and finally that sleeper agents have significant interactions. Code is available at this https URL.

[AI-139] One Lens Many Worlds : A Capability-Typed Interface for World-Model Interpretability

链接: https://arxiv.org/abs/2606.09936
作者: Bhavith Chandra Challagundla,Sanskar Pandey,Param Thakkar,Rishikesh Mallagundla,Yugandhar Reddy Gogireddy,Wenhao Lu,Hindol Roy Choudhury,Shravani Challagundla,Mohamed Deraz Nasr,Spursh Deshpande
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:World models are now built on substantially different computational substrates. Latent recurrent state-space models such as PlaNet and the Dreamer family compress observations into recurrent states; token-based models such as IRIS quantize observations into a learned codebook and predict autoregressively with a transformer; and joint-embedding predictive architectures such as I-JEPA predict in a learned latent space with no pixel decoder. The interpretability methods applied to these models, including probing, activation patching, sparse autoencoders, and surprise analysis, share a common set of primitives, yet they are re-implemented from scratch for each architecture because existing hook-and-cache tooling assumes a transformer language model with no notion of actions, environment steps, or imagined rollouts. We argue that this fragmentation reflects the tooling rather than the models, and that the shared structure of world models is captured by a small typed interface. We present WorldModelLens, an open-source interpretability substrate organized around a capability-typed adapter: every model implements four required methods (encode, transition, initial state, sample) and declares a set of optional heads (decode, reward, continue, actor, critic) through an explicit capability descriptor, so that reinforcement-learning and self-supervised world models are first-class without either imitating the other. A single hook and cache layer exposes time-indexed activations, imagination rollouts, and intervention replay over this interface, allowing each analysis to be written once.

[AI-140] GitInject: Real-World Prompt Injection Attacks in AI-Powered CI/CD Pipelines

链接: https://arxiv.org/abs/2606.09935
作者: Jafar Isbarov,Umid Suleymanov,Ilia Shumailov,Murat Kantarcioglu
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:AI-powered agents are increasingly embedded in continuous integration and continuous delivery/deployment (CI/CD) pipelines to autonomously review pull requests (PRs), triage issues, and maintain codebases. These agents ingest untrusted content while operating with elevated repository permissions, making them a natural target for prompt injection attacks with supply chain consequences. We present GitInject, an open-source framework for evaluating prompt injection vulnerabilities in real, live GitHub workflows, a widely deployed instance of CI/CD pipelines. Unlike prior agent security benchmarks that simulate tool calls, GitInject provisions ephemeral repositories and triggers actual workflow runs, so that sandbox constraints, credential handling, and permission boundaries behave exactly as in production. Using GitInject, we study workflow configurations across four AI providers and document eleven named attacks spanning config-file injection, credential exfiltration, judgment manipulation, and availability. We find that all tested providers are susceptible to at least one attack class in their default configuration, and that the most critical vulnerabilities are structural: they arise from how CI/CD infrastructure handles credentials and configuration files, not from any specific model’s behavior. For each confirmed attack class, we identify the minimum-cost workflow-level countermeasure and analyze its coverage and limitations. GitInject is released publicly to facilitate further research in this direction.

[AI-141] When RL Fails after SFT: Rejuvenating Model Plasticity for Robust SFT-to-RL Handoff

链接: https://arxiv.org/abs/2606.09932
作者: Runze Liu,Jiashun Liu,Xu Wan,Yuqian Fu,Ling Pan
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) has become a standard pipeline for Large Language Model (LLM) post-training. SFT is expected to provide a useful behavioral prior for RL to further enhance model capabilities. However, checkpoints with excessive SFT often show limited improvement during RL. We attribute this failure to the loss of model plasticity: the reduced ability of an SFT-initialized policy to be effectively reshaped by subsequent RL. To better understand this phenomenon, we conduct detailed analysis from multiple perspectives, including parameter changes, output spaces, and RL optimization dynamics. Our results show that models from excessive SFT tend to produce over-confident token distributions and exhibit sharp parameter landscapes, which make them harder to optimize in the RL stage. To enable a more robust SFT-to-RL handoff, we propose \textttRejuvenation, a simple yet effective method that restores plasticity while preserving useful SFT-acquired priors. Rejuvenation leverages base-anchored model fusion to reduce excessive SFT-induced drift with targeted neuron reset to mitigate model rigidity. Experimental results on both math reasoning tasks and agentic tasks demonstrate that our approach consistently improves RL performance on over-trained SFT models, while also enhancing generalization to out-of-distribution tasks.

[AI-142] A Note on the Strategic Confinement Problem

链接: https://arxiv.org/abs/2606.09931
作者: Christian Schroeder de Witt
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Lampson’s confinement problem asks how to prevent a program that processes confidential information from leaking it to a third party. We introduce the strategic confinement problem, which arises when the communicating parties are strategic agents with shared coordination resources. In this setting, residual communication capacity can be concentrated on low-entropy, high-impact predicates of the confidential data. Consequently, bounds on information leakage need not induce corresponding bounds on worst-case harm: a channel with negligible capacity may still suffice to select damaging outcomes. We argue that systems of learnt strategic agents naturally instantiate this problem because they do not admit complete behavioural specifications, their learnt conventions generally cannot be predicted or reproduced by an external observer, and sufficiently capable agents can construct covert communication schemes that are difficult to detect or eliminate. Our contribution is therefore not a new theory of communication, but a reinterpretation of confinement in the presence of strategic agents. Classical confinement bounds what information may flow; strategic confinement highlights that this need not bound what strategic agents can jointly achieve.

[AI-143] Between Amnesia and Chaos: A Memory Stability Expressivity Trilemma for Trainable Dissipative Oscillator Networks

链接: https://arxiv.org/abs/2606.09929
作者: Caleb Munigety
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Physical reservoir computing harnesses nonlinear mechanical dynamics but, by convention, freezes the substrate and trains only a linear readout, presuming the substrate is not usefully trainable. We revisit that premise for networks of nonlinear oscillators whose mass, damping, and stiffness are learned end-to-end through a symplectic integrator. Our central result is a trilemma: memory horizon, gradient stability, and dynamical expressivity cannot be simultaneously maximized, because all three are governed by the damping. The backward gradient decays at a rate set by the damping, capping how far back credit can propagate, while forward sensitivities grow exponentially in the largest Lyapunov exponent, so usable gradients require damping above a stability floor. Since the Lyapunov exponent falls as damping rises while the memory ceiling falls as the horizon grows, stable training is confined to a band that contracts with horizon and closes at a critical point. We test every step on a twenty-oscillator network. A damping sweep finds the largest Lyapunov exponent monotone and crossing zero at a well-defined stability floor, confirming the theorem’s key assumption. A compute-matched comparison of learned versus frozen substrate on delayed recall across nine horizons shows the learned substrate dominating at short horizons and the advantage closing and reversing near a horizon of eleven steps, the predicted signature of band closure; trained models settle near the stability floor, seeking the edge of chaos unprompted. The analytic ceiling overestimates the empirical crossover roughly fivefold, a gap between detectable and learnable gradient that we report rather than tune away. The contribution is a confirmed account of when training a physical substrate beats freezing it.

[AI-144] Forward-Only Convolutional Neural Networks with Learnable Channel-Class Assignment

链接: https://arxiv.org/abs/2606.09928
作者: Mohammadnavid Ghader,Saeed Reza Kheradpisheh,Bahar Farahani,Mahmood Fazlali
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The Forward-Forward (FF) algorithm offers a biologically inspired alternative to backpropagation by replacing gradient-based credit assignment with local, forward-only objectives. While recent extensions have adapted FF to convolutional neural networks (CNNs), existing formulations rely on static channel-class partitions and struggle to perform effectively in complex tasks. In this work, we introduce a learnable channel-class assignment mechanism that enables adaptive, data-driven specialization of convolutional channels, supported by entropy and orthogonality regularization to promote learning performance. We further propose a loss-aware layer contribution strategy that adaptively weights intermediate-layer predictions based on their validation performance, enhancing the effectiveness of forward-only inference. Integrated into residual CNNs, the proposed method achieves consistently superior performance across CIFAR-10, CIFAR-100, and Tiny-ImageNet compared to existing similar forward-only methods. Notably, it establishes new state-of-the-art performance among FF-based models, substantially narrowing the gap with backpropagation. These findings demonstrate that introducing learnable channel specialization and layer contribution weighting significantly enhances the representational capacity of forward-only learning in deep CNNs.

[AI-145] Sample Where You Struggle: Sharpening Base Model Reasoning via Entropy-Guided Power Sampling

链接: https://arxiv.org/abs/2606.09926
作者: Hong Guo,Nianhui Guo,Christoph Meinel,Haojin Yang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Sampling from the sequence-level power distribution p^\alpha elicits RL-level reasoning from base language models without any parameter updates, but the standard Metropolis–Hastings (MH), a Markov Chain Monte Carlo (MCMC) sampler, is both expensive and slow-mixing. We trace both to a structural mismatch: p^\alpha mainly departs from p at a sparse, spatially clustered set of high-entropy decision points, yet MH proposes resampling positions uniformly along the prefix – wasting compute on near-degenerate conditionals while under-mixing precisely where modes diverge. We propose Entropy-Guided Power Sampling (EGPS), a training-free and verifier-free sampler that re-derives its proposal from token-level entropy already in the forward pass. EGPS skips deterministic blocks, localizes each MCMC move to a high-entropy neighborhood, and applies Multiple-Try Metropolis at decision points – making sampling cost scale with \emphentropy mass rather than sequence length. On Qwen2.5-Math-7B, EGPS reaches best or tied-best accuracy on all three benchmarks (MATH500 75.8% , HumanEval 62.2% , GPQA 42.4% ) at up to a 12.6\times wall-clock speedup over the MH baseline.

[AI-146] Sigma-Branch: Hierarchical Single-Path Network Reconstruction for Dynamic Inference with Reduced Active Parameters

链接: https://arxiv.org/abs/2606.09924
作者: Kohga Tanaka,Hiroaki Nishi
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deploying deep neural networks on memory-constrained edge accelerators is bottlenecked by per-inference off-chip weight transfer rather than computation: the dense network cannot be retained on-chip, and every parameter must be loaded for every input. Existing model compression reduces this transfer only at the cost of permanent capacity loss. We propose Sigma-Branch (SigmaB), a framework that restructures a pretrained dense network into a hierarchical binary tree composed of a shared backbone, hierarchical routers, and specialized leaves. Pretrained weights are distributed across the tree via activation-based spherical k-means clustering, which jointly initializes router weights and per-branch channel allocations; soft-routing fine-tuning then aligns each leaf with its routed input subset. At inference, the resulting network executes only a single root-to-leaf path, reducing the active-parameter footprint while storing the complete dense parameter set in memory. Across CIFAR-100 / ResNet-50, ImageNet-1K / ResNet-50, and ModelNet40 / PointNet++, SigmaB-Net reduces per-inference active parameters by 58-60% while remaining within 1.72 percentage points (pp) of the dense baseline Top-1. At comparable ImageNet-1K Top-1, the active-parameter reduction exceeds static structured pruning (FPGM, HRank) by 14-23 pp. The cross-modal evaluation, spanning 2D vision and 3D point-cloud backbones, substantiates a framework-level claim that decouples per-inference memory traffic from the total parameter count.

[AI-147] Conformal Prediction for Neural Operators: Distribution-Free Uncertainty Quantification in Physics Simulation

链接: https://arxiv.org/abs/2606.09923
作者: Michael Chin
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 13 pages, 7 tables, 7 figures. Full-scale experiments on NVIDIA V100

点击查看摘要

Abstract:Neural operators such as the Fourier Neural Operator (FNO) have emerged as powerful surrogates for solving partial differential equations (PDEs), achieving speedups of several orders of magnitude over traditional numerical solvers. However, deploying these models in safety-critical engineering applications – such as thermal management of electronic components and battery systems – requires not only accurate point predictions but also rigorous uncertainty guarantees. Existing uncertainty quantification (UQ) methods for neural operators, including Monte Carlo Dropout and Deep Ensembles, provide only relative uncertainty estimates without formal coverage guarantees. In this work, we propose the first application of split conformal prediction to neural operator-based physics simulation, providing distribution-free prediction intervals with finite-sample coverage guarantees. We further introduce a normalized conformal prediction scheme that leverages MC Dropout uncertainty to produce adaptive-width intervals, yielding tighter intervals in regions of low uncertainty and wider intervals where the model is less certain. Full-scale experiments (33.7M parameters, 800 training samples, 5 ensemble members, NVIDIA V100) on steady-state heat conduction benchmarks demonstrate that our method achieves 89.1% empirical coverage at the target level of alpha=0.1, while producing spatially adaptive prediction intervals that reflect the underlying physical uncertainty structure. We also provide an uncertainty decomposition framework that separates epistemic uncertainty (68% of total) from aleatoric uncertainty (32% of total), offering actionable guidance for data collection and model improvement. Our method is implemented in an open-source platform with REST API endpoints and interactive 3D visualization.

[AI-148] he Bioelectrical Information Theory: Investigating the theoretical compression limit of bioelectrical signals under artificial intelligence

链接: https://arxiv.org/abs/2606.09922
作者: Jiawen Zou,Bo Yan
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Bioelectrical signals are increasingly acquired at scales that challenge the bandwidth of brain-computer interfaces. However, their compression is still often framed as a problem of waveform preservation, limited by the entropy of the raw signal. Here we propose an information-theoretic framework in which the effective information of bioelectrical data is determined not only by signal fidelity, but also by physiological structure, model capacity and downstream task requirements. We formulate bioelectrical compression as a three-level hierarchy. At the signal level, noise is reduced to the information they carry about latent physiological sources. At the physiological level, parametric encoders map purified signals into compact, structured and quantized representations. At the semantic level, task-irrelevant information is discarded, while deep learning models exploit causal dependencies to replace marginal entropy with conditional entropy. This perspective reframes the compression limit of bioelectrical signals as a model- and task-conditioned quantity rather than a fixed property of the waveform. As increasingly expressive models become integrated with neural and physiological interfaces, bioelectrical compression may shift from transmitting signals to transmitting only the residual information required for task-level interpretation.

[AI-149] IntentKV: Cross-Turn Intent-Aware KV Cache Pruning for Agent Inference

链接: https://arxiv.org/abs/2606.09916
作者: Junjie Li,Jiong Lou,Jie Li
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-turn LLM agents fan short queries into long trajectories of tool calls, search results, and intermediate reasoning. Both KV memory and KV read bandwidth grow by orders of magnitude across a single trajectory, making the key-value (KV) cache, not parameter compute, the dominant serving bottleneck for long-horizon agents. We introduce IntentKV, learned KV pruning that keeps the base LLM frozen. IntentKV maintains a session-level QueryMemory of cross-turn intent, scores live history tokens with a memory-attention rule, and adds a zero-initialized residual head with cross-attention over current-query K-vectors. To stay composable with prefix caches, eviction is a slot-map redirection: dropped positions route to a sentinel dead slot while surviving K/V rows, RoPE phases, and slot identities stay in place. IntentKV matches the no-pruning full-cache baseline with almost no accuracy drop under tight KV budgets: at an 8k KV budget, mean peak request tokens drop 23.9% on Qwen3-8B and 30.7% on Qwen2.5-14B. On the 100 longest BCP queries that all methods complete on Qwen2.5-14B, IntentKV-8k further cuts worst-case peak request tokens from 92.3k to 20.5k, a 77.8% reduction, and worst-case raw KV reads from 411M to 31M, a 92.6% reduction.

[AI-150] Mix Dont Pick: Why Synthetic Corpus Composition Matters for Time Series Foundation Model Pretraining ICML2026

链接: https://arxiv.org/abs/2606.09912
作者: Aaryan Nagpal,Debdeep Sanyal,Murari Mandal,Dhruv Kumar,Saurabh Deshpande
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at the ICML 2026 Workshop on Foundation Models for Structured Data (FMSD), Seoul, South Korea

点击查看摘要

Abstract:Choosing the wrong synthetic generator for time-series foundation model pretraining is costly: under identical training budgets, the best and worst generators produce up to a 2\times gap in forecasting error, yet the field has no principled way to make this choice. The problem is compounded by the fact that generator rankings are not stable across architectures: across 11 generator families evaluated on Chronos-T5-Mini and Moirai-Small trained from scratch, we find that which generators are useful depends on the model architecture. Rather than solving the generator selection problem, we sidestep it: a simple equal-weight mixture of all generators matches or beats the best individual generator for both architectures, and composing this mixture with real data yields the strongest pretraining corpora overall. Synthetic pretraining is therefore a corpus composition problem, not a generator selection problem, and composition choices should be validated per model family rather than assumed to transfer.

[AI-151] IDP-Bench: Benchmarking ability of LLM s to protect personal information in interdependent privacy contexts

链接: https://arxiv.org/abs/2606.09908
作者: Ayana Hussain,Soumya Sharma,Golnoosh Farnadi,Nicholas Vincent,Héber Hwang Arcolezi,Ulrich Aïvodji
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are becoming widely deployed as personal AI assistants with access to sensitive user data, making privacy a major challenge for their design and evaluation. Prior work focuses mainly on individual-level risks, overlooking \textbfinterdependent privacy (IDP)–where one person’s data may be revealed by others without their knowledge or consent. We address this gap by introducing \textbfIDP-Bench: the first LLM benchmark for IDP scenarios, grounded in the Contextual Integrity (CI) framework. We evaluate eight open-source LLMs on their understanding of IDP scenarios across three levels of IDP reasoning using two LLM judges. Results show strong co-ownership recognition (6/8 models exceed 90%) but persistent weaknesses in identifying CI parameters (information attribute, primary subject) and IDP-specific parameters such as secondary subjects, where 7/8 models score below 74%. Models also struggle to judge sharing appropriateness (5/8 scoring below 77%). While the ability to judge the appropriateness of sharing improves with scale, performance tends to decline in smaller models, and prompt sensitivity remains high on IDP-specific questions–highlighting the need for more targeted study of IDP in LLM privacy research. Data \ code available \hrefthis https URLhere.

[AI-152] LongMoE: Longitudinal Multimodal Learning via Trajectory-Aware Mixture-of-Experts

链接: https://arxiv.org/abs/2606.09907
作者: Maxx Richard Rahman,Prakhar Kumar,Wolfgang Maass
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal clinical learning is increasingly important for integrating diverse patient data, including imaging, text, and personalised health records. However, it faces two fundamental challenges: i) modality missingness, where arbitrary subsets of modalities are unavailable at a given patient visit, ii) longitudinal dynamics, where the diagnostic significance of an observation depends on the patient’s evolving disease trajectory over time. Existing methods address these challenges in isolation: missing-modality frameworks treat each visit as an independent static snapshot and discard temporal context, while longitudinal models often assume complete modality availability and degrade under systematic modality incompleteness. We propose LongMoE (Longitudinal Mixture-of-Experts), the unified framework to jointly address both challenges. LongMoE combines a context-aware imputation module with an attentional tokenization module that captures frequency-domain temporal patterns across irregular visit sequences, a trajectory-aware encoder for modeling disease progression, and context-conditioned Sparse MoE routing for patient-specific expert selection. Experiments on ADNI, OASIS-3, and MIMIC-IV show that LongMoE improves robustness under missing or weak contemporaneous modalities and remains competitive in full-modality settings, establishing a strong foundation for longitudinally-aware multimodal clinical learning.

[AI-153] he Whale That Outswam Evolution: Swarm Intelligence Maximises Memory in Connectome Reservoirs

链接: https://arxiv.org/abs/2606.09902
作者: Anmol Guragain,Savvas Kakalis,Juan Ignacio Godino-Llorente
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reservoir computing exploits the fixed dynamics of a recurrent network for temporal processing, requiring only a trained linear readout. Biological neural connectomes, shaped by millions of years of evolution, may encode computational structure beyond what random reservoirs provide, yet whether that structure can be further enhanced by principled optimisation remains an open question. We address it by applying four gradient-free, bio-inspired optimisers (Particle Swarm Optimisation, Differential Evolution, Grey Wolf Optimiser, and Whale Optimisation Algorithm) to the edge weights of connectome-based echo-state networks across six species spanning six orders of magnitude in neural complexity: C. elegans (279 neurons), Drosophila (49 nodes), mouse (112), rat (73), macaque (29 regions, continuous FLNe synaptic strengths), and human structural MRI connectivity (83 parcels). Each connectome is evaluated on four canonical reservoir computing benchmarks: Memory Capacity (MC), Lorenz attractor prediction, NARMA-10 system identification, and Mackey-Glass chaotic time-series prediction. All four optimisers consistently outperform unoptimised biological baselines across every task and species when initialised from biological weights. WOA achieves the largest gains on every task: up to a 17x MC improvement (C. elegans: 1.39 to 23.91) and up to 89% NRMSE reduction (Mackey-Glass, human), corresponding to an average 214% improvement across all species and tasks. Crucially, random initialisation on the same topology reliably underperforms biology, establishing biological weight values as an essential inductive bias that topology alone cannot recover. These results position bio-inspired, biologically-initialised optimisation as a principled and broadly effective strategy for connectome reservoir computing across the animal kingdom.

[AI-154] When Attribution Patching Lies: Diagnosis and a Second-Order Correction

链接: https://arxiv.org/abs/2606.09899
作者: Luyang Zhang,Jialu Wang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 30 pages, 12 figures

点击查看摘要

Abstract:A central goal of mechanistic interpretability is to identify which internal components causally drive a language model’s behavior. Because these importance estimates serve as the evidence for identifying circuits, systematic errors can lead to the misidentification of the underlying mechanisms. While activation patching provides a gold-standard causal metric, its computational cost is prohibitive at scale. Practitioners instead rely on attribution patching, a gradient-based, first-order approximation whose reliability remains poorly understood. In this work, we characterize the source of this unreliability, demonstrating that the dominant error stems from the non-linearities in the downstream network rather than local curvature at the patched component. This insight yields three practical tools: (i) a reliability score to detect untrustworthy estimates, (ii) error bounds quantifying potential attribution mis-specifications, and (iii) a Hessian-vector-product (HVP) correction that eliminates the leading-order error with only one additional backward pass. In evaluations across five model families (124M-9B parameters) and both random-token and naturalistic (name-swap) perturbations, HVP is the only second-order correction feasible at larger scale, where standard baselines like Integrated Gradients become computationally prohibitive. In comparative experiments, a multi-step HVP variant matches or exceeds the accuracy of Integrated Gradients at significantly lower compute, outperforming prior second-order baselines. These improvements lead to higher-fidelity circuit recovery on standard benchmarks and support a Screen-Flag-Fix workflow that targets computational effort only toward the components flagged as unreliable.

[AI-155] HMAF: A Hierarchical Multi-Slot GD-RTB Allocation Framework KDD2026

链接: https://arxiv.org/abs/2606.09896
作者: Tianxing Bu,Zhaoqi Zhang,Linyou Cai,Miao Xie,Shengri Xue,Tan Qu,Qianlong Xie,Xingxing Wang,Siqiang Luo,Gao Cong
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted by KDD 2026 Applied Data Science Track

点击查看摘要

Abstract:In modern online advertising platforms, Guaranteed Delivery (GD) contracts coexist and bid with Real-Time Bidding (RTB) auctions. Recent approaches either decouple GD and RTB optimization or rely on heuristic priority rules, and thus fail to effectively balance short-term revenue maximization with long-term contract delivery under complex multi-slot delivery and impression constraints. To address these challenges, we propose HMAF (Hierarchical Multi-Slot Allocation Framework), a unified framework designed to optimize impression allocation in GD–RTB advertising platforms. HMAF employs the Plan–Calibrate–Execute paradigm as its core structure, and integrates offline constraint optimization with online decision-making, balancing offline GD resource planning, dynamically calibrating GD–RTB competitiveness, and making real-time listwise rank decisions across multi-slot environments. HMAF has been implemented in multiple marketing scenarios at Meituan, one of the world’s largest online food delivery platforms, leading to a 3.72% increase in GD delivery rate and a 1.59% increase in total advertisement revenue.

[AI-156] SHAPE: Coalition-Aware Expert Pruning for Sparse Mixture-of-Experts LLM s

链接: https://arxiv.org/abs/2606.09886
作者: Yuhao Zhang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Sparse Mixture-of-Experts (MoE) large language models achieve strong quality with low per-token compute, yet their deployment is often limited by the memory wall: the full expert pool must remain resident to support token-dependent routing. Expert pruning is a direct remedy, but prior criteria typically score experts independently and overlook that MoE inference is inherently \emphcoalitional, where outputs arise from routed top- k expert combinations. We propose \textbfSHAPE, a task-driven pruning framework that explicitly models \emphintra-layer expert cooperation. SHAPE formulates routing traces on a small calibration set as an empirical cooperative game and assigns interaction-aware expert values via a Shapley-style attribution over observed top- k coalitions, enabling the identification of experts that are essential for high-utility collaborations rather than merely frequent. To preserve MoE topology under a global pruning budget, SHAPE further introduces a \emphquality-coverage selection rule that retains, in each layer, the minimal expert subset covering an \alpha fraction of non-negative Shapley mass, while using bisection to match a target keep rate. Experiments on three modern MoE backbones (Qwen3-30B-A3B, GPT-OSS-20B, and DeepSeek-V2-Lite) across diverse benchmarks show that SHAPE consistently improves robustness over global and layer-wise pruning variants, maintaining competitive accuracy under 20% and 40% expert pruning without additional training and delivering clear reductions in peak GPU memory footprint. The open-source code is available at this https URL.

[AI-157] D-Grokking: Learning from Zero-Reward Problems by Training-Time Decomposition

链接: https://arxiv.org/abs/2606.09883
作者: Ningyuan Xi,Hao Xu,Hongsheng Xin,Ning Miao
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have made remarkable progress in reasoning tasks, largely driven by post-training paradigms, especially reinforcement learning with verifiable rewards (RLVR). However, a critical bottleneck persists: RLVR fails on highly challenging zero-reward problems, where all sampled reasoning trajectories yield uniformly failed outcomes, providing no optimization signal to drive model improvement. Prior efforts to address this limitation, such as dense process supervision, partial reward assignment, or prefix-guided exploration, suffer from inherent task constraints or do not fully equip the policy model with the capabilities necessary to solve the original intractable problems. To address this, we propose TD-Grokking, a training-time decomposition framework for zero-reward problems. It recursively decomposes intractable root problems into self-contained, verifiable subproblems, forming hierarchical trees where solvable leaves provide non-zero rewards. Evaluations on mathematical and medical tasks show that TD-Grokking outperforms vanilla GRPO as well as all baseline approaches. Together with detailed analysis, these results confirm that training-time decomposition effectively converts zero-reward examples into usable training signals, enabling consistent performance gains. Our code and datasets are available at this https URL.

[AI-158] Integrating Local and Global Entropy for Uncertainty Quantification in LLM s

链接: https://arxiv.org/abs/2606.09875
作者: Johanne Medina,Tianyi Zhou,Keivin Isufaj,Aristides Gionis,Sanjay Chawla
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 17 pages, 2 figures

点击查看摘要

Abstract:Large language models hallucinate confidently, making uncertainty quantification (UQ) essential for reliable deployment. Existing methods rely predominantly on token-level signals, leaving the geometric structure of intermediate hidden states underused. In this paper, we take the geometric complexity of hidden-state matrices as a measure of the global uncertainty of LLMs, while treating token-level uncertainty estimation as a local metric. We show that hidden-state geometric entropy (global uncertainty) and token-level entropy (local uncertainty) are statistically near-orthogonal, capturing distinct failure regimes for reliability prediction. In particular, global geometry recovers the confident-but-wrong failure mode that local signals systematically miss. Building on this, we propose Global-Local Uncertainty (GLU), an unsupervised, single-pass score that fuses the two signals via a multiplicative gate. Across three model families and six benchmarks, GLU matches or outperforms all unsupervised baselines while requiring only a single forward pass and remaining length-normalized and architecture-agnostic.

[AI-159] Rotate2Think: Geometric Priming via Orthogonal Rotation to Improve Language Model Reasoning

链接: https://arxiv.org/abs/2606.09873
作者: Aditya Sharma,Christopher J. Pal,Amal Zouaq
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reasoning models achieve strong performance on challenging tasks by generating explicit intermediate reasoning traces before producing a final answer. Yet the internal structure of representation space when reasoning remains poorly understood: how do a model’s hidden representations differ during thinking versus the embeddings of the input prompt, and can this structure be exploited to elicit stronger reasoning at inference time? We show that both input embeddings and thinking embeddings (mean-pooled last-layer hidden states over the prompt and reasoning trace, respectively) exhibit extremely high conicity, with all vectors clustering tightly around a single mean direction. Crucially, these mean input and thinking directions are non-collinear, with thinking embeddings occupying a geometrically distinct region of embedding space across many different models and benchmark tasks. This observation motivates casting the input-to-thinking transition as a rotation problem admitting a closed-form solution via orthogonal Procrustes analysis. We propose Rotate2Think, a training-free method that estimates this rotation from a small set of correctly solved examples and injects the resulting synthetic thinking vector between thinking delimiters at inference time, providing a geometric primer at the onset of the reasoning trace. Evaluated across multiple benchmarks and model families, Rotate2Think improves accuracy in 30 of 32 model-benchmark configurations across mathematics, science, and code tasks, and generalizes zero-shot to multimodal reasoning on MATH-Vision.

[AI-160] PatchSTG: Scalable Spatiotemporal Graph Transformers for Traffic Forecasting on Irregular Sensor Networks

链接: https://arxiv.org/abs/2606.09872
作者: Jichao Li,Xuanming Shi
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 22 pages,12 figures

点击查看摘要

Abstract:Traffic forecasting is a fundamental component of intelligent transportation systems, yet remains challenging in real-world settings due to irregular sensor distributions and the high computational cost of modeling large-scale spatiotemporal dependencies. In practical traffic networks, sensors are unevenly distributed across regions, leading to non-uniform spatial structures that limit the effectiveness and scalability of existing graph-based and attention-based models. To address these challenges, we propose PatchSTG, a patch-based spatiotemporal graph Transformer designed for efficient forecasting on irregular sensor networks. The key idea is to introduce a hierarchical spatial representation that partitions sensors into balanced, locality-preserving patches based on geographic information. On top of this structure, a dual attention encoder alternates between intra-patch attention for capturing local interactions and inter-patch attention for modeling global dependencies, reducing computational complexity from quadratic to near-linear scaling. We evaluate PatchSTG on real-world traffic data from Rhode Island and additional large-scale datasets. Experimental results demonstrate that the proposed model achieves stable and competitive forecasting performance across multiple horizons, while significantly improving computational efficiency. Ablation studies further validate the effectiveness of spatial partitioning and dual attention in capturing both local and long-range traffic dynamics. These results suggest that patch-based spatiotemporal modeling provides a scalable and effective framework for traffic forecasting under irregular spatial settings.

[AI-161] QSplitFL: Capability Aware Deep Q-Learning for Optimal Split Point Selection in Split Federated Learning ECML-PKDD2026

链接: https://arxiv.org/abs/2606.09869
作者: Nazmus Shakib Shadin,Xinyue Zhang,Jingyi Wang,Miao Pan
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: Accepted by ECML-PKDD 2026

点击查看摘要

Abstract:Federated Learning (FL) combined with Split Learning (SL) is a privacy preserving paradigm that enables training deep neural networks (DNNs) on resource constrained devices while reducing overall training cost. However, determining the optimal split point, meaning the layer where the model is divided still remains a critical challenge, especially when clients have heterogeneous hardware capabilities. Fixed split points can overload weak devices and increase the communication and server load, which slows convergence and reduces stability. This paper introduces QSplitFL, a novel capability-aware Deep Q-Network (DQN) framework for optimal split point selection in Split learning based Federated Learning (SFL) environments. Unlike existing approaches that rely on high-dimensional model weight representations, QSplitFL employs a lightweight state representation derived directly from client hardware metrics, including CPU utilization, memory, battery level, and network latency. The proposed framework incorporates a decayed loss-drop reward function that prioritizes early convergence, and a committee-based DQN architecture with majority voting to mitigate reward hacking. Extensive experiments on MNIST, Fashion-MNIST, CIFAR-10, and CIFAR-100 datasets using CNN, ResNet50, MobileNetV4, and ConvNeXt architectures demonstrate that our approach achieves better convergence and higher accuracy compared to existing methods, while effectively adapting to heterogeneous device resources. The source code is publicly available at this https URL.

[AI-162] SPACE: Source-free Proxy Anchor Concept Erasure for MLLM s

链接: https://arxiv.org/abs/2606.09868
作者: Zhijing Zhang,Jiaqi Ding,Qianshan Wei,Nan Zhou,Jiaqi Li,Yongliang Wu,Tongxin Zhu,Xiaolin Fang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As Multimodal Large Language Models (MLLMs) face growing privacy risks and regulatory constraints, machine unlearning (MU) has emerged as a crucial solution for removing sensitive data while preserving model performance. However, existing MU methods typically rely on visual data of the target concepts, which is often unavailable due to strict data retention policies, thus creating a demand for source-free unlearning approaches that operate without access to the target data. In this work, we propose Source-free Proxy Anchor Concept Erasure (SPACE), the first source-free unlearning framework specialized for MLLMs. SPACE consists of two stages: (1) Text-Guided Proxy Anchor Selection (TPAS), which retrieves semantically aligned proxy anchors from the shared feature space. (2) Dual-Constraint Semantic Isolation (DCSI), which optimizes these anchors to indirectly erase target concepts. DCSI confines updates to the null space of retained knowledge, ensuring structural integrity. We theoretically prove that SPACE strictly bounds the perturbation on retained knowledge and maximizes feature spectral entropy, thereby maintaining the model’s performance. Furthermore, extensive experiments across six datasets show that SPACE achieves performance comparable to that of state-of-the-art data-dependent methods, validating its effectiveness in source-free MU scenarios. The source code will be released.

[AI-163] EstRTL: Functional Estimation Guided RTL Code Generation

链接: https://arxiv.org/abs/2606.09867
作者: Qi Xiong,Renzhi Chen,Bowei Wang,Yuqing Xiong,Libo Huang,Lei Wang
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Optimizing register transfer level (RTL) code is of vital importance in hardware design. Large language models (LLMs) provide new methods for the automatic generation and optimization of RTL code, offering the potential to significantly accelerate the design process and reduce human effort. However, existing methods for generating RTL code often focus on model fine-tuning and the use of various expansion techniques to enhance the RTL code generation capabilities, lacking attention to the functional correctness. Ensuring that the generated RTL code not only compiles successfully but also behaves as intended in real hardware implementations remains a critical challenge. To address this issue, we propose EstRTL, an LLM-powered collaborative agent framework for RTL code generation based on static functional score estimation. EstRTL operates a three-stage paradigm: Generation, Estimation and Correction. During the stages, the functional estimation agent statically evaluates the generated code based on score and assessment results, and decides whether to output the code directly, return it for regeneration, or forward it to the code correction agent. This framework can be applied to various LLMs that designed for RTL code generation, further enhancing the correctness of the generated code. By providing quantitative scores and human-readable requirements comparisons, it improves the transparency of AI-assisted RTL code generation. Experiments show that EstRTL significantly improves the correctness of RTL code generation by generic LLM by 3.2%-9.0%, demonstrating the practical value of our system. The codes and experimental results are open-sourced at link: this https URL.

[AI-164] wo to Tango: Coupled Task-Reference Selection for Safe LLM Fine-tuning

链接: https://arxiv.org/abs/2606.09866
作者: Xinrui Chen,Jianhao Zhang,Ou Wu,Di Gao
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Fine-tuning safety aligned large language models (LLMs) on downstream data improves adaptation but may erode learned safety behavior. Existing methods use fixed safety examples, global constraints, or one-sided task filtering. Our diagnostics show task updates expose different safety constraints, motivating joint selection of relevant references and compatible task samples. We propose DualSelect, a coupled framework for task and reference selection that refreshes task conditioned safety references before filtering whole task samples compatible with the induced reference direction. Under a minimax view, DualSelect selects safety references with high preservation loss and task conflict, together with compatible task samples, through entropy-regularized scoring surrogates, lazy reference refresh, and gradient correction. On 1B-8B LLMs, DualSelect preserves safety without losing task utility; using the REDORCA judge, it improves Safety Avg. over the strongest baseline by at least 5.10 points and remains highest in Safety Avg. across judges with moderate overhead. This view extends to retention focused continual learning.

[AI-165] Alignment Collapse Under KV Cache Quantization: Diagnosis and Mitigation

链接: https://arxiv.org/abs/2606.09864
作者: Bruce Changlong Xu,Adarsh Kumarappan,Mu Zhou
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注: Preprint. 61 pages, 9 figures

点击查看摘要

Abstract:Key-value (KV) cache quantization is widely used to reduce Large Language Model (LLM) inference memory, yet existing evaluations solely focus on measuring perplexity and accuracy without assessing the safety impact. In this study, we explore alignment preservation under KV cache quantization. Across eleven instruction-tuned models (3.8B-72B) and five benchmarks (1,894 prompts), we find that low-bit quantization can silently destroy safety alignment: Mistral-7B loses 15.2% of its refusals at only 1.03x perplexity, and no universal safe bit-width exists, with sharp model-specific phase transitions invisible to standard metrics. We identify that the root cause is geometric: safety features occupy a low-dimensional activation subspace 10^2-10^3x more vulnerable to quantization noise than the full representation space perplexity averages over. Inspired by this observation, we propose Per-Channel Reduction (PCR), a diagnostic that classifies each model into one of three mechanistic failure modes: outlier-crushes-safety, where safety lives in non-outlier channels collaterally damaged by outlier-driven scale factors; outlier-as-safety, where safety overlaps outlier channels and finer granularity cannot rescue it; and multi-layer dilution, where safety is distributed across many layers and per-layer fixes fail. PCR predicts the correct mitigation direction on all nine primary models and one held-out model from an independent family using 20 calibration prompts. PCR generalizes across unseen prompts, models, and production quantizers, including KIVI with up to 97.2% recovery, succeeding where attention-based allocation methods fail. The resulting training-free protocol, requiring approximately 35 GPU-minutes, recovers up to 97% of lost alignment at minimal memory overhead, addressing vulnerabilities confirmed in production vLLM serving with FP8 KV cache on NVIDIA GPUs.

[AI-166] Blurry Window Attention

链接: https://arxiv.org/abs/2606.09862
作者: Axel Laborieux,Christos Sourmpis,Juan Gabriel Kostelec,Qinghai Guo
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The Softmax Attention operation in Transformer language models has a quadratic complexity in the sequence length and a growing state size in the form of KV cache, which becomes a bottleneck in long context scenarios. To overcome this limitation, alternative architectures with linear complexity and finite state size have been introduced, such as State-Space Models (SSMs), Linear Attention (LA), and Attention with Bounded-memory Control (ABC). Though linear models achieve similar language perplexity as Transformers, they are still behind in tasks which require retrieval or recall of specific information. In this work, we introduce Blurry Window Attention (BLA) a novel ABC method inspired by SSMs. BLA stores a frequency window from which a blurry KV history is reconstructed via interpolation using Dirichlet kernels. BLA can be understood as a generalization of Sliding Window Attention (SWA) depending on the Dirichlet kernels resolution or as a special case of the Gated Slot Attention (GSA), where the decay factor is implemented with Dirichlet kernels. We describe in details the theory and efficient implementation of BLA. On the Multi-Query Associate Recall (MQAR) synthetic task, we show that the state efficiency of BLA is 8 \times better than SWA and is competitive with popular linear attention models, and in the RegBench synthetic task, only BLA and SWA improve their performance as the state size grows among the linear models we tested.

[AI-167] me Series as Language: A Universal Tokenizer for General-Purpose Time Series Foundation Models

链接: https://arxiv.org/abs/2606.09861
作者: Yunhao Zhang,Ruiying Qi,Jiale Zheng,Jianfeng Zhang,Lujia Pan,Junchi Yan
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While Next-Token Prediction (NTP) has unified LLM pretraining, its adaptation to unbounded, continuous time series (TS) remains open. To bridge the gap, we introduce UniTok, a universal tokenizer that transforms TS into discrete tokens, and UniTok-FM, a foundation model pretrained via NTP on these tokens. UniTok-FM is a general-purpose foundation model that supports zero-shot and prompt-boosted forecasting, as well as few-shot generation and classification via training-free in-context inference–a capability not achieved by prior works. Technically, UniTok is a vector-quantized autoencoder incorporating prefix normalization for scale stabilization, a progressive-resolution causal architecture for encoding and decoding, and a structure-preserving reconstruction loss for training. UniTok-FM adopts an off-the-shelf LLM architecture without TS-specific modifications. Instead of pretraining on isolated TS, it performs NTP on context windows formed by multiple series with similar patterns, aiming to capture their shared dynamics. Experiments on forecasting, generation, and classification show that a single unified UniTok-FM consistently outperforms statistical and supervised baselines, achieves competitive performance with task-specific foundation models, and uniquely enables training-free in-context inference across tasks.

[AI-168] Conformal Risk Prediction for Non-Alcoholic Fatty Liver Disease Using Gradient Boosting with Distribution-Free Coverag es

链接: https://arxiv.org/abs/2606.09860
作者: Xinze Zhang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Non-alcoholic fatty liver disease (NAFLD) affects roughly 25% of global adults, posing substantial hepatic and cardiovascular risks. Yet, population-level screening tools remain inadequate. We present Method, a machine-learning framework for NAFLD risk prediction coupling gradient-boosted decision trees with conformal prediction to yield calibrated, distribution-free coverage guarantees on individual risk estimates. It integrates a mutual-information-based stability selection procedure to identify a compact, clinically interpretable feature subset via bootstrap resampling, constructing prediction sets whose marginal coverage provably exceeds a user-specified confidence level. We evaluated Method on a multicenter cohort from Guangzhou, China (primary n=2,187; external validation n=412) using 78 candidate features across demographics, metabolic biomarkers, and lifestyle factors. Method achieves an AUROC of 0.912 internally and 0.891 externally, outperforming deep neural networks, TabNet, support vector machines, and logistic regression. Conformal prediction sets achieve 91.3% empirical coverage at the 90% nominal level. A three-tier risk stratification derived from these scores separates the population into distinct groups, with the high-risk subgroup showing a 12-month progression rate 4.7 times that of the low-risk tier. The selected features – notably waist circumference, ALT, GGT, triglycerides, fasting glucose, and BMI – align with established metabolic risk factors, providing biological plausibility.

[AI-169] Mitigating Manifold Departure: Uncertainty-Aware Subspace Rectification for Trustworthy MLLM Decoding ICML2026

链接: https://arxiv.org/abs/2606.09859
作者: Yingxuan Zhuang,Jingxiao Yang,Miao Pan,Cheng Tan,Yuxiang Cai,Siwei Tan,Chen Zhi,Xuhong Zhang,Jianwei Yin,Jintao Chen
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: ICML 2026 regular

点击查看摘要

Abstract:MLLMs frequently hallucinate objects inconsistent with visual inputs. This issue is typically attributed to the over-reliance on language priors, which can override the visual context. Recent training-free decoding strategies address this by penalizing language priors. However, these methods overlook the dual nature of language priors, where they can be both helpful and harmful depending on the alignment with visual evidence. In particular, blindly suppressing language priors often disrupts the model’s semantic manifold, leading to performance degradation, a phenomenon we term Manifold Departure. To address this, we propose Manifold-Guided Adaptive Projection (MGAP), a geometry-aware, training-free decoding method that mitigates hallucinations while preserving representation structure. MGAP first constructs a language-prior subspace from blind hidden states via SVD. During decoding, MGAP projects each multimodal hidden state onto this subspace and applies a consistency-aware gate to adaptively attenuate only the projected prior component, yielding a subspace-selective update that largely preserves the orthogonal semantic components. Extensive experiments on POPE and CHAIR show that MGAP outperforms prior decoding baselines, achieving stronger hallucination suppression without sacrificing coherence.

[AI-170] Support sufficiency as action-sufficient compression: a single-cycle rate-regret formulation

链接: https://arxiv.org/abs/2606.09858
作者: Mark Walsh
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI)
备注: 22 pages. Submitted to Journal of Mathematical Psychology. Formal single-cycle model of action-sufficient support compression and rate-regret sufficiency

点击查看摘要

Abstract:Robust decision-making requires compression. A system that forms a rich support state cannot usually preserve its full structure at the point of action. It must retain only those distinctions needed to act, verify, abstain, or defer under the current consequence geometry. This paper formalizes support sufficiency as action-sufficient compression. Let H denote a full support state, \mathcalA a finite action set, and Z a consequence geometry specifying payoff structure. For fixed Z , the coarsest exactly action-sufficient compression is the quotient of support space by policy equivalence. Two support states may be merged exactly when they require the same optimal action. This clarifies why content-only and scalar-confidence-only arbitration fail whenever their induced partitions cross action boundaries. Approximate sufficiency is then defined by bounded expected policy regret. In the finite single-cycle setting, this yields a rate-regret problem with source H , reproduction alphabet \mathcalA , and distortion given by consequence-sensitive regret. The optimal stochastic action channel inherits the standard rate-distortion Gibbs form, applied here to support states with regret distortion. The contribution is interpretive: action adequacy is distinguished from reconstruction fidelity, information-bottleneck prediction, and rational inattention. Robust single-cycle arbitration does not require preserving all support, but it does require preserving the distinctions that consequence geometry makes action-relevant.

[AI-171] Flaws in the LLM Automation Narrative

链接: https://arxiv.org/abs/2606.11166
作者: George Perrett,Javae Elliott,Jennifer Hill,Marc Scott
类目: Other Statistics (stat.OT); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly described as performing at the level of human experts on knowledge economy tasks. These claims are primarily based on how LLMs perform on benchmarking tasks that measure average performance across standardized datasets. Primary limitations of many benchmarking tasks are that they often measure performance based on content directly included in LLM training data, and they frequently do not assess the reliability of LLM performance or the magnitude of LLM errors. However, in high stakes contexts, these qualities are critically important. Through a novel LLM benchmarking task that requires writing computer code to complete a data analysis task, we compare the performance of a frontier LLM against submissions from human experts and explicitly measure the variance of responses and the magnitude of errors. Our study reveals that the human experts perform better on average on a range of metrics and demonstrate less variability in performance. Our results provide evidence that LLMs do not consistently perform at the level of human experts and demonstrate the importance of measuring variance and assessing error magnitude in LLM benchmark evaluations.

[AI-172] Data assimilation for subsurface flow using latent diffusion model parameterization: performance of ensemble-Kalman and Monte Carlo techniques

链接: https://arxiv.org/abs/2606.11140
作者: Guido Di Federico,Wenchao Teng,Louis J. Durlofsky
类目: Geophysics (physics.geo-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Data assimilation (DA) in subsurface flow entails calibrating model parameters to match observed data, typically at wells, while preserving geological realism. Latent diffusion models (LDMs) provide efficient mappings from high-dimensional geological model space to a low-dimensional latent variable, reducing the dimensionality of the inverse problem while maintaining plausibility in posterior geomodels. However, the high nonlinearity in the LDM mapping may degrade the performance of Kalman-gain-based ensemble updates. We present a systematic comparison of DA algorithms applied to large-scale 3D channelized geomodels with hierarchical geological uncertainty. We compare model-space and latent-space DA using the ensemble smoother with multiple data assimilation (ESMDA), and demonstrate a key trade-off: model-space updates achieve significant uncertainty reduction but produce geologically unrealistic posterior models, while latent-space updates preserve realism but exhibit limited uncertainty reduction. Motivated by this, we explore rigorous Markov chain Monte Carlo (MCMC) and Sequential Monte Carlo (SMC) algorithms in the 3D-LDM latent space. To accommodate their high computational demands, we develop a fast surrogate flow model that approximates well-rate responses. MCMC and SMC are evaluated against ESMDA across three synthetic test cases, with DA performed in the LDM latent space. All models maintain geological realism due to the LDM parameterization. MCMC and SMC are consistent with one another and achieve lower data mismatch and more uncertainty reduction than latent-space ESMDA. Our overall results demonstrate that ensemble Kalman methods may provide overestimated posterior uncertainty with highly nonlinear parameterizations, while rigorous Monte Carlo sampling, enabled by fast surrogate models, can provide a more reliable alternative.

[AI-173] Optimizing 2D Input Representations and Sub-phase Fusion Strategies for Differential Diagnosis of Asthma and COPD Using CNN- and GRU-Based Networks

链接: https://arxiv.org/abs/2606.10972
作者: Ipek Sen,Ozgur Ozdemir,Elena Battini Sonmez
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This study aims to explore the performance of the VAR model in comparison with mel-frequency cepstral coefficient (MFCC) matrices and log-mel spectrograms using deep learning. In pulmonary sound classification, spectrogram-based representations suffer from inconsistent temporal dimensions due to varying respiratory cycle durations. Along with traditional trimming/zero-padding, adaptive-length windowing was presented to fix their temporal dimensions. Their spectral and temporal dimensions were optimized by testing a range of parameters. Different convolutional neural network (CNN) architectures were employed to extract features from the two-dimensional representations obtained over the sub-phases. The extracted sub-phase features were then fused using various strategies including direct concatenation, gated recurrent unit (GRU) network and GRU with attention mechanism. Model performances were assessed through respiratory cycle-based evaluation and subject-based evaluation comprising multiple respiratory cycles. Several data augmentation techniques were also studied to cope with limitations in data size. The best cycle-based F1-score (0.877) was obtained using the MFCC matrices with thirteen coefficients and 64-point time resolution per sub-phase representation followed by direct feature concatenation, and the best subject-based F1-score (0.855) was obtained using the MFCC matrices with thirteen coefficients and 256-point time resolution per full-cycle representation, both obtained by adaptive-length windowing. Augmentation degraded the performance of models overall, yet mixup augmentation was the best among the methods tested. MFCC outperformed log-mel spectrogram and VAR model in differentiation of asthma and COPD. Sophisticated fusion strategies did not improve the diagnosis. Augmentation did not contribute, demonstrating the significance of authentic data in pulmonary sound studies.

[AI-174] Human-AI Teaming Through the Lens of Calibration

链接: https://arxiv.org/abs/2606.10906
作者: Eric Nalisnick,Chi Zhang,Sophia Qian,Yixin Wang
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 19 pages, 5 figures (including appendix)

点击查看摘要

Abstract:We study models for human-AI teaming through the lens of statistical calibration. We assume the team consists of an AI model and human – both of which are calibrated with respect to some partitioning of the feature space – and expose how the calibration assumptions propagate into the teaming framework. In particular, we consider frameworks that either (i) combine human and model predictions or (ii) delegate prediction responsibility to either a human or model. We show via theoretical and empirical results that existing methods for combination do not preserve the human’s degree of calibration. Methods for delegation (by the very act of delegation) preserve calibration of the downstream predictors but shift the burden onto the rejector meta-model that decides who predicts. The rejector must be calibrated finely enough to locate where each member is superior, a demand that grows with the human’s expertise and becomes unattainable when the human relies on information the system cannot observe.

[AI-175] Spatial-Omni: Spatial Audio Understanding Integration in Multimodal LLM s via FOA Encoding

链接: https://arxiv.org/abs/2606.10738
作者: Zhiyuan Zhu,Yixuan Chen,Yiwen Shao,Wenxiang Guo,Changhao Pan,Yu Zhang,Yuxiang Wang,Wei Liu,Houhua Zhang,Chengkuan Zeng,Wenbo Cheng,Yunxi Liu,Rui Yang,Steve Yves,Liefeng Bo,Zhou Zhao
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent multimodal large language models mainly process audio as monaural signals, thereby discarding the spatial cues contained in spatial audio for sound localization, spatial relation reasoning, and spatial scene understanding. We propose Spatial-Omni, a lightweight method that implements SO-Encoder to inject First-Order Ambisonics (FOA) spatial audio into existing Omni LLMs as an independent modality, without modifying their original audio encoders. SO-Encoder provides spatial tokens with limited additional context cost and improves spatial audio understanding through efficient staged training. To support training and evaluation, we construct SO-Dataset, SO-QA, and SO-Bench from open-source data, real recordings, and simulations, containing 400K FOA spatial audio clips and 2.1M spatial question answering pairs. SO-Bench covers 16 spatial audio understanding subtasks, including basic detection and location estimation, spatial relation understanding, and complex spatial reasoning. Experiments show that Spatial-Omni outperforms existing open-source Large Audio-Language Models (LALMs) and Omni LLM models on spatial audio understanding tasks while retaining a reasonable level of general audio understanding. Code and data are available at this https URL.

[AI-176] owards Critical Branching Mechanism in Recurrent Neural Networks

链接: https://arxiv.org/abs/2606.10384
作者: Feixiang Ren,Ling Feng
类目: Adaptation and Self-Organizing Systems (nlin.AO); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
备注:

点击查看摘要

Abstract:Criticality has been proposed as a key organizing principle in biological neural systems, yet its origin and relevance in artificial neural networks remain unclear. We analyze hidden-state dynamics in trained long short-term memory (LSTM) networks and show that small networks near their optimal training epochs (steps) exhibit scale-free avalanche statistics and branching parameters close to unity, indicative of near-critical dynamics, while larger models remain subcritical. To explain the coexistence of subcritical branching with robust 1/f^\beta noise, we introduce a mixture branching process framework that links heterogeneous branching dynamics to long-range temporal correlations. These results identify critical-like behavior in LSTMs as an emergent, capacity-dependent dynamical regime.

[AI-177] Hyperbolic Neural Population Geometry Benefits Computation ICML2026

链接: https://arxiv.org/abs/2606.10238
作者: Dennis Wu,Yi-Chun Hung,Braden Yuille,James E. Fitzgerald,Han Liu
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
备注: Accepted at ICML 2026, 37 pages, 5 figures

点击查看摘要

Abstract:Neural population geometry shapes downstream computation. Recent empirical findings in neurobiology suggest that a hyperbolic structure underlies population activity in the hippocampus. Here we provide a theoretical framework for this phenomenon. First, we propose a plausible construction of hippocampal tuning curves that statistically induces hyperbolic geometry. Next, we establish a connection between neural decoding and associative memory by demonstrating that the Modern Hopfield Network update rule computes the minimum mean-squared-error (MMSE) estimator. Finally, we introduce a novel associative memory model defined in hyperbolic space that yields significantly larger capacity than leading models. Our results suggest that animals encode spatial information as a latent hyperbolic cognitive map, improving both memory capacity and decoding accuracy.

[AI-178] Integral Field Unit Spectroscopy with One Fiber

链接: https://arxiv.org/abs/2606.10197
作者: Zehao Peng,Biprateep Dey,Chris J. Maddison,Joshua S. Speagle
类目: Astrophysics of Galaxies (astro-ph.GA); Artificial Intelligence (cs.AI)
备注: Accepted for Conference on Physics and AI at Stanford University (PAI 2026)

点击查看摘要

Abstract:Integral field unit (IFU) spectroscopy provides spatially resolved spectra across galaxies, offering crucial insights into their evolution. However, its high observational cost limits current IFU datasets to \sim 10^4 objects. We present a multi-modal, probabilistic foundation model that predicts high-resolution spectra with calibrated uncertainties at arbitrary spatial locations within a galaxy directly from broadband images. Built on a masked autoencoder framework, our architecture injects fiber positional encodings and redshift aware wavelength encodings, enabling spatially conditioned predictions. Trained on 4.7 million images and single fiber spectroscopic observations from the Dark Energy Spectroscopic Instrument (DESI) survey, our model exploits the natural variance of fiber placements and the morphological self-similarity of galaxies to achieve IFU-like capabilities without any IFU training data. Predicted emission line flux maps match independent IFU observations from the Mapping Nearby Galaxies at APO (MaNGA) survey, with performance comparable to a supervised baseline trained directly on IFU data.

[AI-179] DeRA-MOS: Optimizing Text-to-Music Evaluation via Decoupled Listwise Ranking and Modality Alignment

链接: https://arxiv.org/abs/2606.10010
作者: Chien-Chun Wang,Hung-Shin Lee,Hsin-Min Wang,Berlin Chen
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Sound (cs.SD)
备注: Accepted to IEEE Signal Processing Letters (SPL)

点击查看摘要

Abstract:Evaluating text-to-music (TTM) systems remains expensive because music impression (MI) and text alignment (TA) scores rely on human mean opinion scores (MOS). Most automatic MOS estimators are trained with point-wise regression or distributional classification. These objectives do not directly optimize rank-based metrics and provide weak geometric constraints for cross-modal coherence. To address these gaps, we propose DeRA-MOS, a decoupled optimization framework for TTM evaluation. For MI, we introduce a batch-aware listwise ranking loss that models relative order within each mini-batch and better aligns with evaluation based on Spearman’s rank correlation coefficient (SRCC). For TA, we introduce a score-anchored modality alignment loss that maps human scores to target audio-text similarity and regularizes the latent space before fusion. By effectively mitigating the point-wise training mismatch and modality drift, experiments on MusicEval demonstrate that our decoupled framework yields substantial improvements in both MI and TA ranking metrics, establishing a robust paradigm for large-scale TTM evaluation.

[AI-180] Geometry-Aware Anisotropic Boundary Correction for Aerodynamic Simulation

链接: https://arxiv.org/abs/2606.09963
作者: Xin Zhang,Yipeng Huang,Shu Jiang,Zhenzhong Wang,Min Jiang
类目: Fluid Dynamics (physics.flu-dyn); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Aerodynamic simulation is a key component of engineering shape design, where core quantities such as the surface pressure coefficient strongly depend on flow dynamics near solid boundaries. Neural operators provide an efficient alternative to expensive Computational Fluid Dynamics (CFD) solvers. However, conventional methods treat the boundary region isotropically, failing to account for the distinct physical behaviors along the boundaries. In reality, the aerodynamic process exhibits anisotropy: along the tangential direction, flow propagates along the wall; along the normal direction, physical quantities are constrained by the wall. To explicitly model the distinct physical behaviors, we propose GeoABC, a geometry-conditioned anisotropic boundary correction framework. GeoABC leverages the boundary geometries to introduce direction-aware boundary correction into the intermediate representations of neural operators, transforming boundary geometry from static input features into a structural prior that modulates physical prediction. On 2D airfoil and 3D car tasks, GeoABC consistently adapts to multiple neural operator backbones, reducing near-boundary relative L_2 error by \sim 38% on average, narrowing the structural near-wall gap shared by mainstream neural operators, and advancing neural operators toward high-fidelity aerodynamic simulation.

[AI-181] Deep Slice Interpolation for Reducing Through-Plane Anisotropy and Noise in Head CT

链接: https://arxiv.org/abs/2606.09953
作者: Luis Cortés Ferre,Miguel A. Gutiérrez-Naranjo,Marcin Balcerzyk
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Head computed tomography (CT) typically uses sub-millimeter in-plane resolution but 2-5 mm through-plane spacing, creating substantial anisotropy that degrades multiplanar reconstructions, volumetric measurements such as hematoma volume estimation, and downstream algorithms that assume near-isotropic voxels. We present a deep learning system that synthesizes intermediate CT slices from pairs of neighboring axial slices, halving the effective through-plane spacing. The system improves three-dimensional visualization while simultaneously producing inherently denoised outputs, yielding two complementary benefits from a single inference pass. To build a reliable system, we systematically evaluate pixel-wise losses, namely mean squared error (MSE) and mean absolute error (L1); structural-similarity losses, namely the structural similarity index (SSIM) and its multi-scale variant (MS-SSIM); and hybrid combinations. On a held-out test set, all converged models outperform classical interpolation baselines and pretrained video frame interpolation methods (RIFE, FILM) on all structural measures, with MS-SSIM+L1 offering the strongest balanced profile. We also document training instability in SSIM-family losses and identify partial remedies: the standard numerical fixes eliminate the dominant failure mode but leave residual divergence at smaller batch sizes. All results are reported with patient-level bootstrap confidence intervals and paired statistical tests. As an illustration, we apply the system to an out-of-distribution head CT series from Hospital Universitario Virgen del Rocío: the model synthesizes intermediate slices and exhibits on the real slices the implicit-denoising signature predicted by our theoretical analysis, supporting in a single external case that interpolation quality and implicit denoising are not confined to the training distribution. Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2606.09953 [eess.IV] (or arXiv:2606.09953v1 [eess.IV] for this version) https://doi.org/10.48550/arXiv.2606.09953 Focus to learn more arXiv-issued DOI via DataCite

[AI-182] GAGI: A Gini-Adjusted GDP-per-Capita Index for Distribution-Aware Macroeconomic Welfare Monitoring

链接: https://arxiv.org/abs/2606.09944
作者: Sivasathivel Kandasamy
类目: General Economics (econ.GN); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:GDP per capita is the default lens through which governibng bodies track the economic prosperity and consequences of economic events , yet it is blind to two first-order determinants of lived prosperity: income/wealth distribution and inflation impact. Inequality-adjusted income measures are themselves not new but What is missing from the macroeconomic monitoring toolkit specifically is not a welfare concept but an operational monitoring trigger: a statistic minimal enough to compute annually from public data, transparent enough to audit without modelling assumptions, and normalised so that year-on-year, cross-country change ? the quantity a regulator needs to act on? is legible. We assemble such an instrument, the Gini- Adjusted GDP per Capita Index (GAGI): a reproducible, publicly computable formulation that rescales each country’s GDP per capita by its inequality-adjustment factor (1-G) and its price level, normalised to a 2010 baseline. GAGI is a general-purpose welfare index, not inherently specific to AI automation, applicable wherever welfare-adjusted prosperity needs tracking. Applying GAGI to the G7 economies over 2010-2026, we show that welfare-adjusted prosperity has diverged persistently and increasingly from headline GDP growth, that the divergence widens sharply after 2022, temporally coincident with, though not, on this evidence alone, demonstrated to be caused by the after effects of COVID and the acceleration of generative-AI deployment. We argue that GAGI is a necessary complement to GDP-based monitoring: any macroeconomic monitoring instrument that tracks only aggregate output will systematically miss the distributional harm that automation can cause even while reported growth remains strong.

[AI-183] ractogram foundation model

链接: https://arxiv.org/abs/2606.09893
作者: Guikun Chen,Yuqian Chen,Yijie Li,Yogesh Rathi,Nikos Makris,Fan Zhang,Wenguan Wang,Lauren J. O’Donnell
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Diffusion MRI (dMRI) tractography is the only noninvasive approach for mapping white-matter pathways in the living human brain. It represents each brain as a tractogram: a large, unordered set of three-dimensional streamlines that includes information about both local streamline geometry and whole-brain anatomical organization. This structure makes tractograms a natural but challenging target for representation learning. Existing methods treat streamline classification and subject-level prediction as separate problems: streamline classifiers focus on geometric patterns, whereas subject-level prediction often depends on hand-crafted features. As a result, current methods do not learn reusable representations that connect streamline anatomy with whole-brain inter-subject variation. Here we introduce TractFM, a tractogram foundation model that learns reusable representations directly from whole-brain streamline sets. TractFM combines a local streamline encoder with a permutation-equivariant tractogram encoder, allowing all streamlines from a subject to be contextualized jointly in a single forward pass. Pretraining on dense anatomical tract parcellation, i.e., assigning anatomical labels to individual streamlines, yields two complementary representations: contextualized streamline-level embeddings for tract parcellation and compact subject-level descriptors for downstream prediction of subject phenotypes. Across three tractography algorithms and five dMRI datasets, TractFM transfers to both streamline-level and subject-level tasks. Its frozen representations achieve accurate tract parcellation and predict age and sex across independent datasets. These results show that whole-brain geometric context, learned once, can generalize across tractography pipelines, datasets, and prediction tasks.

机器学习

[LG-0] When to Align When to Predict: A Phase Diagram for Multimodal Learning

链接: https://arxiv.org/abs/2606.11190
作者: Ilay Kamai,Hugues Van Assel,Aviv Regev,Hagai B. Perets,Randall Balestriero
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Cross-modal alignment (CA) and cross-modal prediction (CP) are the dominant paradigms for multimodal representation learning, yet there is no systematic understanding of when each succeeds, when each fails, and when cross-modal training helps at all – a gap that leaves practitioners, especially in scientific domains like biomedicine or astrophysics, with heterogeneous instruments and multiple levels of organization and measurement, unable to diagnose why standard methods underperform the best single modality. We develop a unified linear framework that addresses both questions. Under a spiked signal-plus-noise model with structured cross-modal nuisance correlation, we derive separation ratios for both objectives that expose complementary failure modes: alignment whitens each modality and fails when nuisance is strongly correlated across views; prediction encodes whatever is cross-predictable through a one-sided whitening, with recovery governed by source-modality quality. The resulting phase diagram partitions multimodal problems into four regimes: Both, CA only, CP only, and Neither. We present a data-driven procedure to locate real-world datasets in this diagram using a small labeled subsample, identifying the preferred objective and prediction direction before any cross-modal training. Experiments on synthetic data, stereo-vision benchmarks, image-caption pairs, and real astrophysical data validate the predictions in the nonlinear regime, including the Neither regime where cross-modal training is actively harmful. Our framework lets practitioners diagnose their multimodal problem and choose the right objective before committing to training. Code to reproduce the results is available at this https URL.

[LG-1] Predicting Future Behaviors in Reasoning Models Enables Better Steering

链接: https://arxiv.org/abs/2606.11172
作者: Evgenii Kortukov,Piotr Komorowski,Florian Klein,Paula Engl,Gabriele Sarti,Seong Joon Oh,Sebastian Lapuschkin,Wojciech Samek
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deployed large reasoning models (LRMs) often behave unexpectedly. Test-time steering controls LRM outputs by intervening on their hidden representations, but it can degrade output quality. We argue that prior steering work implicitly relies on internal features that detect behavior in already generated text. We show that these detection features are poor predictors of future behavioral outcomes, and thus not the natural intervention target. Instead, we train activation probes to predict future behavior likelihoods from intermediate reasoning steps. These probes predict the most likely behavior with 64%-91% accuracy, revealing a separate type of internal prediction features. Building on these prediction features, we introduce a text-level steering method, Future Probe Controlled Generation. FPCG samples multiple candidate sentences and chooses the best one according to a probe predicting the future behavior likelihood. This enables steering with almost no output quality degradation. FPCG also enables steering in several evaluations where activation steering fails. These results show that distinguishing detection and prediction features enables a more nuanced approach to controlling LRM behaviors.

[LG-2] Algorithmic and Minimax Complexities in Kernel Bandits

链接: https://arxiv.org/abs/2606.11171
作者: Yunbei Xu
类目: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech); Information Theory (cs.IT); Optimization and Control (math.OC); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:Gaussian-process upper confidence bound (GP-UCB) and decision-estimation-coefficient (DEC) methods may appear, at first sight, to belong to different theories. This paper places the two viewpoints in a common algorithmic-information language for frequentist RKHS bandits. GP-UCB fixes an algorithmic, rather than true, Gaussian-process prior and exploits realized-trajectory complexity together with computational tractability, whereas MAMS optimizes a robust class-wide MAIR/DEC envelope. Through the unified MAIR framework and heterogeneous positive-semidefinite algorithmic priors, we generalize both the GP-UCB analysis and the MAMS algorithm, propose a safeguarded master that combines their advantages, and provide a kernel-bandit construction showing that algorithmic complexity can be more informative than class-wide minimax or DEC certificates in overparameterized models. The resulting message is that algorithmic information and class-wide minimax coefficients answer different questions and can lead to different gaps; kernel bandits provide a clean setting in which this distinction becomes mathematically visible.

[LG-3] COGENT: Continuous Graph Emulators with Neural Ordinary Differential Equations for Long-Term Physical Forecasting

链接: https://arxiv.org/abs/2606.11162
作者: Zesheng Liu,Maryam Rahnemoonfar
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this work, we present COGENT, a continuous graph emulator with Neural Ordinary Differential Equations for long-term physical forecasting on irregular geospatial meshes. COGENT encodes a finite history of system states and associated forcing fields and external forcings with a graph-based history encoder, producing node-wise context vectors that capture both local spatial interactions and temporal evolution. These context vectors initialize and condition a latent Neural Ordinary Differential Equation whose dynamics are driven by interpolated future forcings and explicit relative rollout time. By modeling the forecast trajectory as a continuous latent dynamical system, COGENT can generate predictions at arbitrary future times rather than being restricted to a fixed temporal discretization. A residual decoder maps the resulting latent trajectories back to future physical states, enabling direct multi-step forecasting without repeatedly feeding predicted states back into the model. This formulation combines graph-based spatial representation, history-conditioned latent dynamics, and continuous-time rollout in a unified framework for mesh-based physical simulation emulation. In order to stabilize training with long-horizon supervision, we also propose effective rollout-horizon sampling and a progressive rollout-horizon scheduling strategy. We evaluate COGENT on transient ice-sheet simulations generated by the Ice-sheet and Sea-level System Model, demonstrating improved long-range stability over autoregressive graph baselines. These results suggest that continuous graph Neural ODEs provide a promising methodology for scalable physical forecasting on irregular geospatial meshes, particularly in applications that require stable long-horizon predictions and the ability to query system states at arbitrary times.

[LG-4] Efficiently Learning Drifting Halfspaces with Massart Noise ICML2026

链接: https://arxiv.org/abs/2606.11149
作者: Mingchen Ma,Guyang Cao,Jelena Diakonikolas,Ilias Diakonikolas
类目: Machine Learning (cs.LG)
*备注: To appear at ICML 2026

点击查看摘要

Abstract:We study the problem of learning a drifting concept in the presence of Massart noise. In this framework, an online learner has access to a history of independent samples whose labels are noisy versions of a target concept that may change from round to round. The goal is to output, in each round, a hypothesis with small prediction error. We study the complexity of this learning problem for the fundamental class of margin-separable linear classifiers (halfspaces). On the positive side, we give a computationally efficient learner achieving error \eta + \tilde O(\Delta^1/3/\gamma) , where \eta upper bounds the Massart noise rate, \Delta is the drift rate, and \gamma is the margin. Interestingly, in the realizable setting, an adaptation of our techniques yields an efficient learner with an improved error rate over prior work. On the lower-bound side, we provide formal evidence of an information-computation tradeoff, strongly suggesting that our algorithm’s performance is essentially optimal. Specifically, while the information-theoretically optimal error scales with \Delta^1/2 , we prove that \Delta^1/3 -scaling is unavoidable for low-degree polynomial tests, even in the special case of random classification noise.

[LG-5] OncoTraj: a public benchmark for longitudinal resistance prediction in EGFR-mutant non-small-cell lung cancer on osimertinib

链接: https://arxiv.org/abs/2606.11144
作者: Abhijoy Sarkar,Aarchi Singh Thakur
类目: Machine Learning (cs.LG); Genomics (q-bio.GN); Quantitative Methods (q-bio.QM); Applications (stat.AP)
*备注: 24 pages, 7 figures, 4 tables. Code, data, and trained model weights: this https URL . Python package: pip install oncotraj. Dataset: this https URL

点击查看摘要

Abstract:Resistance to first-line osimertinib in EGFR-mutant non-small-cell lung cancer (NSCLC) is the canonical example of predictable clonal evolution under therapeutic pressure, yet no public benchmark exists for training or evaluating computational models on the corresponding longitudinal patient trajectories. We introduce OncoTraj, a public benchmark of 813 EGFR-mutant NSCLC patients receiving first-line osimertinib, harmonized from three real-world clinical-genomic sources: MSK-CHORD (672 patients), AACR Project GENIE BPC NSCLC (34 patients), and the FLAURA molecular-resistance supplement (107 patients). OncoTraj defines three locked tasks: (A) binary classification of progression by a fixed 12-month landmark, (B) regression of time-to-first-progression in days, and © six-class classification of the dominant resistance mechanism. We release the harmonized dataset, patient-level train/validation/test splits with an audited no-leakage guarantee, an open-source evaluation harness, and six reference baselines spanning a majority-class predictor, logistic regression, random forest, XGBoost, an LSTM, and a multi-task transformer. With v1’s single-timepoint snapshot features, no task clears chance on clean within-source evaluation: the uniformity of this ceiling across every model class localizes the limit to the input modality (single-snapshot tissue NGS rather than serial ctDNA), not the algorithm. The benchmark does recover a reproducible literature-consistent association: TP53 co-mutation raises the 12-month progression rate from 29% to 59% cohort-wide. OncoTraj establishes a reproducible, leakage-audited baseline and converts the modality limit into concrete design requirements for a serial-ctDNA-enriched v2.

[LG-6] First-Order Trajectory Matching: Fast Ensemble Predictions of Chaotic Turbulent Stochastic Systems

链接: https://arxiv.org/abs/2606.11138
作者: Shreya Jha,Timo Schorlepp,Nicholas Geissler,Jules Berman,Benjamin Peherstorfer
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:We introduce First-Order Trajectory Matching (FTM), a surrogate-modeling method that learns the first-order local transport of probability mass from trajectories of stochastic systems. By matching the symmetric first-order motion of trajectories, FTM learns the probability current velocity, whose flow preserves time marginals to match ensemble averages, while also capturing current-like trajectory quantities such as fluxes, circulations, and barrier-crossing currents. FTM learns the current velocity directly from trajectories, avoiding drift, diffusion, and score estimation. Our stability analysis separates discretization error from sampling variance and shows that the one-step simulation-free FTM loss is stable when temporal resolution and sample size are properly balanced. Across stochastic dynamical systems and PDE examples, we empirically demonstrate that FTM provides trajectory-aware ensemble predictions at low, deterministic-rollout cost.

[LG-7] Robust Regression of General ReLUs with Queries NEURIPS2025

链接: https://arxiv.org/abs/2606.11130
作者: Ilias Diakonikolas,Daniel M. Kane,Mingchen Ma
类目: Machine Learning (cs.LG)
*备注: Appeared at NeurIPS 2025

点击查看摘要

Abstract:We study the task of agnostically learning general (as opposed to homogeneous) ReLUs under the Gaussian distribution with respect to the squared loss. In the passive learning setting, recent work gave a computationally efficient algorithm that uses poly(d,1/\epsilon) labeled examples and outputs a hypothesis with error O(opt)+\epsilon , where opt is the squared loss of the best fit ReLU. Here we focus on the interactive setting, where the learner has some form of query access to the labels of unlabeled examples. Our main result is the first computationally efficient learner that uses d polylog(1/\epsilon)+\tildeO(\min\1/p, 1/\epsilon) black-box label queries, where p is the bias of the target function, and achieves error O(opt)+\epsilon . We complement our algorithmic result by showing that its query complexity bound is qualitatively near-optimal, even ignoring computational constraints. Finally, we establish that query access is essentially necessary to improve on the label complexity of passive learning. Specifically, for pool-based active learning, any active learner requires \tilde\Omega(d/\epsilon) labels, unless it draws a super-polynomial number of unlabeled examples.

[LG-8] Overcoming Rank Collapse in Feedback Alignment

链接: https://arxiv.org/abs/2606.11123
作者: Gauthier Boeshertz,Razvan Pascanu,Claudia Clopath
类目: Machine Learning (cs.LG)
*备注: 9 pages and 4 figures, 1 table for main text. Total of 28 pages and 13 figures with appendix

点击查看摘要

Abstract:Backpropagation (BP) is widely viewed as biologically implausible, in part because it requires feedback weights to be the transpose of forward weights for error propagation. Interestingly, when training a network with fixed random feedback weights to circumvent this issue, learning aligns the forward weights with the feedback weights, leading the backpropagated error signal to become an approximation of the standard gradient used by BP. This process, called Feedback Alignment (FA), occurs in MLPs and very shallow CNNs but does not scale well to deeper architectures. In this work, we first investigated differences between BP and FA models, trained on CIFAR10, specifically focusing on the effective rank of the signal. We found that the FA error has a considerably lower rank and hence is constrained to a lower-dimensional subspace compared to BP, limiting exploration of the parameter space. Motivated by this observation, we evaluated two mechanisms for increasing the effective dimensionality of FA: Muon, an optimiser that orthogonalises weight updates; and hidden activity normalisation, which promotes activation orthogonality. Across larger architectures and benchmarks, we find that these methods consistently improve over FA baselines, for example, on CIFAR100 with a Resnet-18, accuracy increases by 9 percentage points. Our results identify low-dimensional gradient dynamics as a key obstacle to scaling FA and suggest that inducing higher-dimensional update geometry is a promising route toward scaling alternatives to backpropagation.

[LG-9] Data-Driven Dynamic Assortment in Online Platforms: Learning about Two Sides

链接: https://arxiv.org/abs/2606.11118
作者: Rahul Roy,Nur Sunar,Jayashankar M. Swaminathan
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Probability (math.PR); Applications (stat.AP); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We study a dynamic assortment problem on a two-sided service platform with incomplete information and heterogeneous customers in a discrete-time setting. In each period, a customer arrives seeking service, and the platform chooses an assortment of sellers to display. The customer then proposes a transaction to at most one seller in the assortment according to a multinomial logit choice model. After a fixed number of periods, sellers review the proposals they have received and each chooses at most one customer according to another multinomial logit choice model, after which the cycle repeats. A key challenge is that the platform does not know the choice-model parameters of either customers or sellers in advance. To our knowledge, this is the first study of a dynamic assortment problem in which both sides’ choice parameters are unknown. We develop a data-driven algorithm that learns these parameters while optimizing the platform’s objective over time. We evaluate performance using regret, which measures revenue loss relative to a clairvoyant benchmark that knows all parameters and customer arrivals in advance. We show that the algorithm’s worst-case regret grows polylogarithmically over time, and we derive a matching lower bound, establishing its rate optimality.

[LG-10] Limitations of Learning Tanh Neural Networks with Finite Precision

链接: https://arxiv.org/abs/2606.11104
作者: Philipp Grohs,Matěj Trödler
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We investigate limitations of learning \tanh neural networks from point evaluations under finite-precision computations and L^p accuracy guarantees, building on Berner, Grohs, and Voigtländer (2023). Our approach is based on a novel construction of sharply localized bump functions via iterated \tanh activations. Using this mechanism, we show that, in a finite-precision setting, no adaptive randomized algorithm based on m samples can achieve a convergence rate higher than the Monte Carlo rate O(m^-1/p) in the L^p norm, unless the sampling budget grows exponentially with the size of the network parameters and architecture. The results reveal fundamental limitations imposed by finite precision on the learnability of classes containing localized bump functions, extending previous results for ReLU networks to the \tanh setting.

[LG-11] Do Transformers Actually Help Intrusion Detection? A Temporal Sequence Evaluation on CIC-IDS2017

链接: https://arxiv.org/abs/2606.11098
作者: Zach Moczkodan(1),Hany Ragab(1) ((1) Royal Military College of Canada, Kingston, Canada)
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 11 pages, 9 figures, 9 tables. Preprint. Code: this https URL

点击查看摘要

Abstract:Recent deep learning approaches for network intrusion detection increasingly incorporate temporal architectures such as recurrent networks and Transformers, often reporting near-perfect performance on CIC-IDS2017. However, many existing studies neither supply their temporal modules with genuine sequence inputs nor evaluate under realistic, leakage-free conditions, making it unclear whether reported gains arise from true sequence-modeling capability. In this work, we reformulate CIC-IDS2017 as a temporal intrusion-detection task by constructing ordered flow sequences from network conversations and benchmarking nine classical and deep learning architectures under a random split, two leakage-free splits, and a padding-scheme ablation. The central finding is that padding convention, not architecture, determines the Transformer’s performance: on genuinely sequential (non-padded) windows the Transformer achieves the highest macro-F1 of any model in the experiment (0.89); under zero-pad+mask evaluation it drops markedly (-0.24 macro-F1), while LSTM, GRU, and 1D-CNN remain stable. Under leakage-free group evaluation the Random Forest is the most robust model (+0.009), while the Transformer’s false-alarm rate grows from 0.04% to 2.7%, a 67-fold increase invisible under conventional protocols. These findings demonstrate that evaluation methodology – specifically padding convention and split protocol – has a larger effect on reported performance than architectural choice, and that widely used random splits with repeat-last padding can overestimate model robustness by up to 0.24 macro-F1. We advocate leakage-free splits, explicit padding disclosure, and sequence-aware benchmarking as standard practice in future IDS research. Code and implementation details are available at this https URL.

[LG-12] Exploring the Design Space of Reward Backpropagation for Flow Matching

链接: https://arxiv.org/abs/2606.11075
作者: Ruoyu Wang,Boye Niu,Xiangxin Zhou,Yushi Huang,Tongliang Liu,Chi Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Aligning text-to-image flow matching models with human preferences via direct reward backpropagation is sample-efficient but hampered by two well-known pathologies: activations cannot be stored across the full sampling trajectory at modern model scale, and chained Jacobian products across steps inflate the reward gradient as it travels back to early indices. Connector-based methods, such as LeapAlign, address these issues by replacing the full backward trajectory with a short pinned path, highlighting a useful decoupling between sampling and optimization. However, the quality of the resulting gradient depends on how accurately this short path approximates the full rollout, especially over long intervals. We propose FlowBP, a unified surrogate-trajectory framework that treats the backward trajectory itself as the design object. FlowBP keeps a no-gradient cached rollout for sampling, then builds a lightweight backward surrogate from cached and selectively re-forwarded velocities. This view separates four choices: the reward-model input, active set, integration weights, and bridge coupling, and recovers prior direct-gradient methods as particular settings. Within this framework, we instantiate three variants: FlowBP-Sparse uses sparse Euler reconstruction, FlowBP-Bridge adds controlled bridge coupling, and FlowBP-Lagrange raises the order of leap quadrature. All three bound memory by the active-set size and limit gradient chaining to at most one Jacobian factor. Across SD3.5-M, FLUX.1-dev, and FLUX.2-Klein-base on preference, quality, and compositional metrics, the three variants improve over direct-gradient baselines on most metrics.

[LG-13] GRAFT: Gain-Recalibrated Adapters for Transformer-Based Neural Population Activity Modeling

链接: https://arxiv.org/abs/2606.11066
作者: Xiangsheng Ge,Yang Xie
类目: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:Neural population activity models can recover rich temporal structure from binned spikes, but their read-in and readout layers often remain tied to a fixed set of recorded neurons. This coupling limits reuse in long-term brain-computer interfaces, where recorded neuron identities, counts, and response statistics can change across days. We introduce GRAFT, a Transformer-based neural population activity model that separates reusable temporal dynamics from a recalibratable neuron interface. The neuron interface controls how recorded neurons enter and leave the shared backbone, and auxiliary gain and positional mechanisms support neural activity modeling inside the Transformer. On MC Maze under the standard NLB’21 protocol, GRAFT reaches 0.3866 co-bps as an ensemble, setting a new state of the art on the primary co-bps metric among public and reported NLB’21 results. In a cross-day protocol constructed from the NLB’21 MC Maze dataset series, GRAFT recalibrates from MC Maze to the scaled MC Maze datasets (Large/Medium/Small) by updating only 9.21% of parameters, reaching 0.3749, 0.3112, and 0.3152 co-bps with restricted target-day support sets. These results show that the same interface-backbone separation supports both strong Transformer-based neural population activity modeling and data-efficient cross-day recalibration.

[LG-14] Flexible Kernels for Protein Property Prediction ICML2026

链接: https://arxiv.org/abs/2606.11057
作者: Martin Jankowiak,Yerdos Ordabayev,Rudraksh Tuwani,Henry N. Ward,Hunter Nisonoff,James M. McFarland,Gevorg Grigoryan
类目: Machine Learning (cs.LG); Biomolecules (q-bio.BM); Machine Learning (stat.ML)
*备注: 50 pages; to appear at ICML 2026

点击查看摘要

Abstract:Despite its importance to applications in protein design, predicting protein properties like binding affinity and thermostability from sparse experimental data remains a significant challenge. Accordingly, we introduce a class of sequence kernels that exploit evolutionary substitution matrices as well as local linearity and demonstrate that the resulting Gaussian processes provide data-efficient models of protein property landscapes, frequently outperforming alternatives that rely on foundation model embeddings. Furthermore–by learning what are in effect structure-aware substitution matrices–we show that our kernels can readily incorporate structural information from foundation models. We demonstrate that these structure-conditioned kernels are well suited to multi-task learning across multiple protein property landscapes and can decisively outperform local supervised learning methods.

[LG-15] Flow-DPPO: Divergence Proximal Policy Optimization for Flow Matching Models

链接: https://arxiv.org/abs/2606.11025
作者: Bowen Ping,Xiangxin Zhou,Penghui Qi,Minnan Luo,Liefeng Bo,Tianyu Pang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent work has demonstrated that online reinforcement learning (RL) can substantially improve the quality and alignment of flow matching models for image and video generation. Methods such as Flow-GRPO and CPS cast the denoising process as a Markov Decision Process and apply PPO-style ratio clipping to enforce a trust region. However, we argue that ratio clipping is structurally ill-suited for flow models: the probability ratio between new and old policies is a noisy, single-sample estimate of the true policy divergence, leading to over-constraining in some regions of the trajectory and under-constraining in others. We propose Flow-DPPO (Flow Divergence Proximal Policy Optimization), which replaces ratio clipping with a divergence proximal constraint. A key observation is that the per-step policy in flow models is Gaussian, enabling exact and cheap computation of the KL divergence between old and new policies. Flow-DPPO employs an asymmetric divergence mask that blocks gradient updates only when they simultaneously move away from the trusted region and violate the divergence threshold. Experiments show that Flow-DPPO achieves higher rewards with better KL-proximal efficiency, alleviates catastrophic forgetting, promotes balanced multi-objective optimization, and enables stable multi-epoch training where ratio clipping degrades. Code and models are available at this https URL.

[LG-16] Data-Driven Runway and Taxiway Exits Prediction of Landing Aircraft: A Case Study at Hartsfield-Jackson Atlanta International Airport

链接: https://arxiv.org/abs/2606.11017
作者: Alex Porcayo,Yutian Pang,Maria Thomas,John-Paul Clarke
类目: Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Airport surface operations increasingly constrain performance at high-throughput hubs. This study examines arrival taxi-in decisions at Hartsfield-Jackson Atlanta International Airport (KATL) and proposes a two-stage, data-driven decision aid that mirrors controller workflow. Stage I predicts the runway exit selected by an arriving aircraft. Stage II predicts whether, given that exit, the aircraft will cross the active departure runway at a designated point or use the end-around taxiway. Models are trained using ASDE-X surface trajectories, aircraft characteristics, ramp destinations, short-horizon traffic rates, and weather across multiple look-back windows. We benchmark nine classifiers, including Random Forest, XGBoost, LightGBM, and CatBoost, and evaluate accuracy, macro-F1, precision-recall behavior, confusion matrices, Brier score, and Expected Calibration Error. Across east and west flows, XGBoost and LightGBM outperform Random Forest. Stage I achieves 0.86-0.89 accuracy with macro-F1 scores of 0.40-0.50, while Stage II achieves 0.70-0.74 accuracy with macro-F1 scores of 0.28-0.55. Feature-importance analysis shows that approach speed is the main driver of exit choice. Departure rate, crossing rate, ramp destination, and, for west flow, the selected exit are the strongest predictors of crossing versus end-around routing. Minority classes remain harder to predict because of feature-space overlap, as shown by t-SNE and UMAP analyses. The proposed framework supports controller situational awareness through calibrated, explainable predictions while preserving human responsibility for final routing decisions.

[LG-17] Learning Doubly Sparse Explicitly Conditioned Transforms

链接: https://arxiv.org/abs/2606.10975
作者: Tudor Pistol
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Optimization and Control (math.OC)
*备注: 10 pages, 1 figure, 1 table. Accepted for publication in Procedia Computer Science (30th International Conference on Knowledge-Based and Intelligent Information Engineering Systems - KES 2026; Invited Session: Global and Constrained Optimization: Algorithms and Applications)

点击查看摘要

Abstract:Finding convenient spaces in which certain hypotheses regarding an assumed sparse structure of natural signals hold true has become a desirable result in recent research, its implications being reflected in areas such as data compression, noise reduction and feature extraction. While the extensively used analytical transforms, such as DFT or DCT, already provide efficient algorithms and robust sparse representations, they assume a fixed prior about the data, failing to accurately capture the specific structure of more restrictive classes of signals. To address this, the concept of a data-adaptive, learnt transform has been introduced in the literature, allowing for the reduction of a residual term in the transform domain. More recent studies have shown that the condition number serves as a good metric in this context, where the desired outcome alternates between a generalizing tendency and one that achieves minimal approximation error. Motivated by these considerations, we introduce the learning of a structured, explicitly conditioned transform formulated as the product of a fixed canonical matrix and a refining data-adaptive sparse component. This approach seeks to preserve the advantages of fast and stable analytical transforms, while introducing controllable adaptivity to the data. No references that concern this specific formulation have been identified so far, indicating its novelty. The proposed algorithm is motivated within the framework of inexact proximal methods, leveraging a newly derived closed-form projection operator. Empirical observations demonstrate state-of-the-art results on the doubly sparse transform learning problem and comparable performance with its dense variant at significantly lower computational costs and sometimes faster convergence and better avoidance of bad local minima.

[LG-18] Population-Aware Physics-Informed Neural Particle Flow for Bayesian Update

链接: https://arxiv.org/abs/2606.10959
作者: Batu Candan,Simone Servadio
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Physics-informed neural particle flow (PINPF) learns a deterministic transport field that moves particles from a prior distribution toward a Bayesian posterior while enforcing the governing probability-evolution equation. However, the standard PINPF velocity model processes particles independently and therefore does not explicitly condition its transport decisions on the empirical particle population. This paper introduces population-aware PINPF (PA-PINPF), which augments each particle update with a permutation-invariant Deep Sets representation of the full particle set. We investigate two population encoders. PA-PINPF-State summarizes the particle states, whereas PA-PINPF-Feature summarizes the complete local physics-informed feature vectors, including particle position, pseudo-time, measurement information, likelihood values, and score information. The latter allows the population context to represent not only particle-cloud geometry, but also the population-level Bayesian transport geometry. The methods retain the original unsupervised physics-informed residual objective and require no ground-truth posterior samples during training. Experiments on range-measurement tasks and nonlinear time-difference-of-arrival posterior transport demonstrate that both population-aware variants improve over particle-wise PINPF, while feature-population encoding provides the strongest performance. These results show that population-level physics features provide useful global information for learned Bayesian particle transport.

[LG-19] Express Language Modeling

链接: https://arxiv.org/abs/2606.10944
作者: Albert Gong,Annabelle Michael Carrell,Raaz Dwivedi,Lester Mackey
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Statistics Theory (math.ST); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We introduce a new tool, Express, for converting a non-causal attention approximation into a causal approximation with matching approximation guarantees. When combined with the state-of-the-art Thinformer approximation, Express improves upon the best known causal attention guarantees, delivering \log^3/2(n)/s approximation error with only O(s) memory and O(s^2 \log^2(n)) compression overhead for a sequence of length n . We pair these developments with an efficient I/O-aware Triton implementation, demonstrate substantial speedups over FlashAttention 2, and use Express to overcome four resource bottlenecks in the language modeling pipeline: long-context prefill, KV cache compression, long-form memory-constrained decoding, and long-form compute-constrained decoding.

[LG-20] A Systematic Approach for Selecting Trajectories for Data Augmentation

链接: https://arxiv.org/abs/2606.10938
作者: Adam Nordling
类目: Machine Learning (cs.LG)
*备注: 39 pages, 4 figures, Masters project

点击查看摘要

Abstract:Trajectory data augmentation is a promising approach to mitigate data scarcity in machine learning applications, but its utility has been limited by the complexity of preserving spatio-temporal coherence. Although prior work demonstrated the viability of geometric perturbation, it relied on naive random selection, leaving a critical gap in understanding which trajectories should be augmented for maximal benefit. This thesis addresses this gap by developing a systematic and scalable framework to evaluate five systematic selection strategies: Outlierness, Diversity, Representativeness, Uncertainty, and Random selection. These strategies were rigorously tested across four datasets covering animal behavior (Foxes and Starkey), maritime traffic (AIS), and urban traffic (Car) using a suite of linear and non-linear machine learning models. As part of this evaluation, an Optuna-based hyperparameter optimization loop was integrated to empirically identify the best-performing augmentation parameters for each dataset within the explored search space. The results indicate that, while systematic selection is not a universal solution, it offers distinct advantages over the random baseline. Systematic strategies, particularly Outlierness and Uncertainty, demonstrated higher stability and were less prone to performance degradation observed with random sampling in dense datasets. However, the findings also reveal that the value of augmentation is strictly conditional. Visual analysis via UMAP demonstrates that while systematic augmentation successfully repairs topological fragmentation in sparse datasets, it can act as a corrupting noise signal in high-quality, dense datasets. Furthermore, the study identified physical limitations in high-velocity domains, where standard perturbation techniques lead to divergence in feature space…

[LG-21] ask Robustness via Re-Labelling Vision-Action Robot Data

链接: https://arxiv.org/abs/2606.10918
作者: Artur Kuramshin,Özgür Aslan,Cyrus Neary,Glen Berseth
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Project website: this https URL

点击查看摘要

Abstract:The recent trend in scaling models for robot learning has resulted in impressive policies that can perform various manipulation tasks and generalize to novel scenarios. However, these policies continue to struggle with following instructions, likely due to the limited linguistic and action sequence diversity in existing robotics datasets. This paper introduces Task Robustness via Re-Labelling Vision-Action Robot Data (TREAD), a scalable framework that leverages large Vision-Language Models (VLMs) to augment existing robotics datasets without additional data collection, harnessing the transferable knowledge embedded in these models. Our approach leverages a pretrained VLM through three stages: generating semantic sub-tasks from original instruction labels and initial scenes, segmenting demonstration videos conditioned on these sub-tasks, and producing diverse instructions that incorporate object properties, effectively decomposing longer demonstrations into grounded language-action pairs. We further enhance robustness by augmenting the data with linguistically diverse versions of the text goals. Evaluations on LIBERO demonstrate that policies trained on our augmented datasets exhibit improved performance on novel, unseen tasks and goals. Our results show that TREAD enhances both planning generalization through trajectory decomposition and language-conditioned policy generalization through increased linguistic diversity.

[LG-22] Conservation Laws from Data Symmetry in Neural Networks

链接: https://arxiv.org/abs/2606.10913
作者: Jakob Galley,Vahid Shahverdi,Axel Flinth
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We explore whether intrinsic symmetries of the training data lead to conserved quantities during gradient-flow training of neural networks. Under the assumption that the loss function is analytic and non-polynomial, we prove that data symmetries generically do not induce any additional integrals of motion. For mean squared error (MSE) loss, on the other hand, there are situations in which data augmentation yields extra conserved quantities. We build a framework, utilizing \emphtensorizable networks to describe this phenomenon. Tensorizable networks are a family of architectures whose dependence on parameters and inputs can be separated using an intermediate representation. They include linear and polynomial networks, as well as Lightning Attention.

[LG-23] Non-linear mechanical field reconstruction coupling recurrent neural networks with physics-informed graph neural networks

链接: https://arxiv.org/abs/2606.10909
作者: Manuel Ricardo Guevara Garban,Yves Chemisky,Étienne Prulière,Michaël Clément,Martin Abendroth,Björn Kiefer
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:Reconstructing local stress fields in heterogeneous microstructures under non-linear, history-dependent loading remains a major computational bottleneck in multi-scale simulations. We propose a coupled LSTM-GNN framework that links the temporal and spatial aspects of local stress field reconstruction. A Long Short-Term Memory network encodes macroscopic stress-strain sequences into a compact hidden state that captures the path-dependent constitutive response, while a physics-informed Graph Neural Network reconstructs the spatially-resolved stress field at each time step. We introduce a relative weighting strategy with linear warm-up to balance the data-driven reconstruction loss and a discrete divergence-based equilibrium penalty. This resolves the scale mismatch that prevents fixed-weight formulations from converging in the elasto-plastic regime. The model is trained on 10,000 non-proportional loading paths applied to a periodic plate-with-a-hole microstructure and von Mises elasto-plasticity. The model achieves three orders of magnitude speedup over finite element simulations and generalizes to loading sequences twice the training length, with 1.9% cumulative error. Because the graph relies on mesh connectivity instead of the specific element type, one trained surrogate can be applied directly without retraining to meshes with different element types and to both coarser and finer resolutions, while in all cases reproducing the high-fidelity quad-element FE field used during training. Indeed, the message passing characteristics inherent to GNN and MeshGraphNet architecture render the model mesh-agnostic. Analysis of the LSTM hidden states suggests a low-dimensional structure related to the internal state variables of the constitutive model.

[LG-24] When Do Autoregressive Sequence Models Forecast Physical Wavefields? A Controlled Study on Synthetic Seismograms

链接: https://arxiv.org/abs/2606.10868
作者: Waleed Esmail,Stuart Russell,Jana Klinge,Alexander Kappes,Christine Thomas
类目: Machine Learning (cs.LG); Instrumentation and Methods for Astrophysics (astro-ph.IM)
*备注: 16 pages, 5 figures and 3 tables

点击查看摘要

Abstract:Long-horizon autoregressive forecasting of oscillatory physical signals, such as seismograms, gravitational-wave strain, and similar wavefields is limited by error accumulation: as a causal model is fed its own outputs over hundreds of steps, small per-step errors compound into phase drift that pointwise metrics fail to detect. We ask when such rollout stays stable, using synthetic three-component seismograms as a physically structured testbed and the \textscSeismoGPT autoregressive forecaster as the model under study. Through controlled, intra-architecture ablations evaluated on free-running rollout with paired significance tests, we isolate the contribution of each design choice. Multi-token prediction is the dominant stabilizer, accounting for almost the entire improvement over a single-token baseline ( +0.040 median NCC); a horizon-embedding hybrid prediction head and a cross-horizon STFT-magnitude coherence loss each add a small but consistent further gain. Performance depends sharply on a context-ratio threshold near one, roughly the full P-S interval of observed signal, below which rollout generalization collapses. The dominant residual failure is a polarity inversion that a magnitude-based spectral loss cannot, by construction, penalize, identifying phase-aware objectives as the natural next step. We frame this as a controlled study of rollout stability on oscillatory wavefields, not a benchmark of forecasting architectures.

[LG-25] Embodiment-conditioned Generalist Control for Multirotor Aerial Robots

链接: https://arxiv.org/abs/2606.10857
作者: Orestis Konstantaropoulos,Welf Rehberg,Mihir Kulkarni,Kostas Alexis
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present a generalist position control policy capable of controlling arbitrary multirotor configurations of a certain rotor count (e.g., hexarotors or quadrotors) with a single set of network weights. The policy is conditioned on a physics-grounded embodiment descriptor: a mass and inertia-normalized control allocation matrix that captures how mass-normalized motor thrusts generate linear and angular accelerations in the body-frame. To train the policy, we sample from a broad distribution of arbitrary multirotor configurations, including non-planar and asymmetric systems, and optimize a single, compact network using Proximal Policy Optimization. Training requires only five minutes on an RTX 3090 GPU using a custom NVIDIA Warp-based dynamics simulator. Through extensive simulation experiments, we show that embodiment conditioning enables robust generalist control across arbitrary morphologies. We demonstrate zero-shot real-world transfer of this generalist policy on three diverse hexarotor systems, including a planar robot, a partially symmetric non-planar system, and a random asymmetric, non-planar configuration.

[LG-26] MODIP: Efficient Model-Based Optimization for Diffusion Policies

链接: https://arxiv.org/abs/2606.10825
作者: Zakariae El Asri,Philippe Gratias-Quiquandon,Nicolas Thome,Olivier Sigaud
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diffusion policies (DPs) have emerged as expressive policy representations for robot learning, often used with imitation learning methods such as behavioral cloning (BC). However, while their success has largely been confined to BC, direct reinforcement learning (RL) fine-tuning remains challenging because actions are generated through a multi-step denoising process. In this work, we propose MODIP, a framework for the offline-to-online fine-tuning of DPs. Rather than directly applying RL to the DPs, MODIP leverages a world model (WM) to guide policy adaptation and keeps the simplicity and stability of BC. We utilize model predictive control (MPC) to generate high-quality trajectories within the WM, and use them as supervised targets for fine-tuning the DP. To make MPC planning efficient, MODIP uses a terminal state value instead of a policy-dependent state-action value, reducing inference time. Additionally, MODIP trains critics with policy-independent TD targets, reducing training time. Experiments on D4RL (MuJoCo, Kitchen) and RoboMimic tasks show that MODIP improves diffusion policies beyond BC, and is competitive with or outperforms diffusion policy RL fine-tuning methods and strong model-based baselines such as TD-MPC2.

[LG-27] Encoding the Euler Characteristic Transform

链接: https://arxiv.org/abs/2606.10824
作者: Nello Blaser,Odin Hoff Gardaa,Lars M. Salbu,Elena Xinyi Wang,Bastian Rieck
类目: Machine Learning (cs.LG); Algebraic Topology (math.AT)
*备注:

点击查看摘要

Abstract:The Euler Characteristic Curve (ECC) records the Euler characteristic of a linearly embedded cell complex as a function of filtration height in a given direction, and the Euler Characteristic Transform (ECT) is the injective shape descriptor obtained by collecting ECCs over many directions. How the ECT is encoded for a neural network is itself an inductive bias, conventionally fixed by discretizing each ECC. We introduce a continuous encoding: for each direction and each vertex it records the net Euler-characteristic change attributed to that vertex, producing a per-direction token sequence that a small transformer maps to a feature vector. We separate the resulting pipeline into two stages on orthogonal axes: an ECC encoder that acts within each direction, mapping its curve to a fixed-length vector, and an ECT representation that acts across directions, aggregating the per-direction vectors into one. We study six ECT representation architectures spanning a range of inductive biases, from a structure-agnostic feedforward baseline to convolutional and complex-valued models that preserve equivariance under planar rotations. Across six classification benchmarks covering point clouds, graphs, cubical complexes, and meshes, the continuous encoding improves accuracy on five of six datasets, and control experiments attribute the gain to the tokenization itself rather than to the added transformer capacity. The representation architecture matters less than the encoding, and the payoff from its inductive biases depends on the encoding: a feedforward network performs best under continuous encoding but is less robust under discretization than convolutional architectures.

[LG-28] CITRAS-FM: Tiny Time Series Foundation Model for Covariate-Informed Zero-Shot Forecasting

链接: https://arxiv.org/abs/2606.10798
作者: Yosuke Yamaguchi,Issei Suemitsu,Yuki Kajihara,Wenpeng Wei
类目: Machine Learning (cs.LG)
*备注: Accepted to EUSIPCO 2026

点击查看摘要

Abstract:Pretrained time series foundation models (TSFMs) have enabled zero-shot forecasting on unseen target series. However, existing TSFMs often incur high computational cost and provide limited support for diverse variable types, often failing to account for covariates that exogenously influence target variability. To address these challenges, we propose CITRAS-FM, a tiny 7M-parameter TSFM that supports univariate, multivariate, and covariate-informed zero-shot forecasting with real-time CPU inference. Built on a patch-based, decoder-only Transformer, CITRAS-FM introduces Shifted Attention into the cross-variate module to effectively exploit known covariates accessible throughout the forecast horizon. Moreover, to enable covariate-aware pretraining despite the scarcity of covariate-rich corpora, we propose CovSynth, which synthesizes realistic covariates from decomposed components of target series. Experiments on fev-bench, spanning 100 tasks across various settings, demonstrate that CITRAS-FM achieves state-of-the-art zero-shot accuracy among sub-10M TSFMs while delivering sub-0.1-second CPU inference, offering a strong balance between forecasting accuracy and real-time deployability.

[LG-29] Closing the Modality Gap in Zero-Shot HAR: Contrastive Training and Separability-Optimized Prototypes on IMU Data

链接: https://arxiv.org/abs/2606.10789
作者: Anik Ghosh
类目: Machine Learning (cs.LG)
*备注: 17 pages, 7 figures

点击查看摘要

Abstract:Zero-shot learning (ZSL) for inertial measurement unit (IMU)-based human activity recognition (HAR) faces a central challenge: bridging the gap between sensor embeddings and semantic class representations. We systematically evaluate seven configurations combining three inference methods with two training pipelines on the PAMAP2 dataset, using 14 seen and 4 unseen activity classes with subjects 108 and 109 held out for testing. We find that the modality gap is a training-time phenomenon governed by the encoder objective. A temporal convolutional network (TCN) trained with cross-entropy over label-name Sentence- BERT prototypes yields sensor embeddings with a mean cosine similarity of 0.30 to the corresponding text prototypes, while replacing the label-name prototype targets with discriminative activity descriptions raises this to 0.69. This alignment improvement transfers consistently across all three inference methods. The strongest result combines contrastive training with inverted softmax correction, achieving 73.2% accuracy and 0.583 macro F1 on unseen classes, compared to 58.3% accuracy and 0.34 macro F1 for the label-name baseline. A secondary finding is that richer text descriptions reduce inter-prototype separability in Sentence-BERT space, because shared biomechanical vocabulary causes the language model to compress the prototype cloud. This effect does not negate the benefits of contrastive alignment provided prototype descriptions retain sufficient discriminative vocabulary. We also demonstrate that overall accuracy is a misleading primary metric when test-set class distributions are imbalanced, and recommend macro-averaged F1 as the standard reporting metric for ZSL-HAR benchmarks.

[LG-30] Secure Aggregation with Top-K Sparsification in Decentralized Federated Learning

链接: https://arxiv.org/abs/2606.10780
作者: Hengxuan Tang,Jinbao Zhu,Xiaohu Tang
类目: Information Theory (cs.IT); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 6 pages, 1 figure, accepted to IEEE ISIT 2026

点击查看摘要

Abstract:Secure aggregation is a vital component for mitigating gradient leakage in federated learning, but its communication cost conventionally scales with the gradient dimension. This becomes prohibitive for large models and even more pronounced in decentralized federated learning with limited bandwidth and unreliable nodes. Top-K gradient sparsification is an effective approach to reduce communication by transmitting only a few entries of the full gradient, while maintaining competitive model accuracy. Nevertheless, the top-K entries selected by each user are unpredictable and vary across users, which poses a challenge for efficient sparse secure aggregation. This paper studies information-theoretic secure aggregation with top-K sparsification in decentralized federated learning under user dropouts and user collusion. We propose a communication-efficient sparse secure aggregation scheme that offloads dimension-dependent overhead to an offline phase and protects private gradients using random masks and permutations. Experimental results demonstrate that our scheme preserves accuracy comparable to full-gradient aggregation even with only 1% gradient sparsification, while substantially reducing the communication cost.

[LG-31] Can we trust our models? Epistemic calibration in second-order classification

链接: https://arxiv.org/abs/2606.10777
作者: Arthur Hoarau
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Uncertainty estimation is critical for deploying machine learning models in high-stakes settings. However, classical calibration only assesses the reliability of predicted probabilities and does not evaluate whether epistemic uncertainty estimates are themselves trustworthy. This limitation is particularly relevant for second-order classification models. We introduce epistemic calibration, a principled criterion that measures whether reported epistemic uncertainty faithfully reflects the dispersion of model predictions around the ground truth. We show that epistemic calibration is a strictly stronger notion than classical calibration and captures failure modes invisible to standard metrics. We relate this work to the existing literature through an impossibility theorem that holds under the epistemic calibration hypothesis. To operationalize this concept, we propose the Expected Epistemic Calibration Error (EECE), which we prove to be a consistent estimator of a True Epistemic Calibration Error (TECE). Experiments across a broad range of uncertainty quantification methods show that epistemic calibration is a coherent and meaningful criterion and reveal substantial differences across methods, despite similar predictive performance.

[LG-32] Inverse Probability Weighting and Age-of-Information Aggregation for Decentralized Federated Learning under Partial Reception

链接: https://arxiv.org/abs/2606.10774
作者: Chanuka A.S. Hewa Kaluannakkage,Rajkumar Buyya
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 14 pages, 8 figures, research paper for journal submission

点击查看摘要

Abstract:Decentralized Federated Learning (DFL) over lossy wireless networks faces two key challenges: selection bias, where updates from poor-quality links are systematically underrepresented due to partial model reception, and update staleness, where asynchronous nodes contribute outdated information. We show that uniform gossip aggregation with local-fill reconstruction introduces persistent link-quality-induced bias, while completeness-based weighting further amplifies this effect. To address these challenges, we propose DFL-AA (Decentralized Federated Learning with Adaptive AoI-weighted Aggregation), which combines Inverse Probability Weighting with online EWMA-based channel estimation to correct selection bias and Age-of-Information-based weighting to mitigate staleness without requiring global synchronization. We theoretically show that DFL-AA removes link-quality distortion in expectation and experimentally demonstrate consistent improvements over state-of-the-art baselines across varying loss rates, network sizes, and heterogeneous wireless conditions.

[LG-33] MemVenom: Triggered Poisoning of Multimodal Memories in Web Agents

链接: https://arxiv.org/abs/2606.10742
作者: Yv Zhang,Hao Sun,Hao Fang,Kuofeng Gao,Fan Mo,Bin Chen,Shu-Tao Xia,Yaowei Wang
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: Preprint. 27 pages, 6 figures, 6 tables

点击查看摘要

Abstract:External memory has become a core component of modern web agents, enabling long-horizon reasoning through the retrieval of past experiences. However, this paradigm introduces a critical vulnerability: malicious content injected into memory can be persistently recalled and repeatedly influence agent behavior. In this work, we identify and systematically study multimodal memory poisoning, an overlooked yet practical attack surface in web-agent systems. We propose MemVenom, a unified black-box attack framework that poisons graph-structured external memory with coordinated text-image evidence. Our method consists of a two-stage design: (1) a trigger-conditioned retrieval attack that ensures high-probability recall of malicious memory, and (2) a post-retrieval attack induction that leverages adversarial perturbations and stealthy OCR injection to override the original user objective. Unlike prior attacks that operate on prompts or text-only memory, our approach enables persistent, reusable, and goal-agnostic attacks without modifying model parameters or re-optimizing malicious tasks. Experiments across multiple web-agent frameworks and vision-language models demonstrate that MemVenom achieves strong end-to-end attack success with minimal impact on benign performance, reaching up to 99.15% on GPT-5-family web agents, while transferring effectively across architectures and model scales.

[LG-34] SPACR: Single-Pass Adaptive Training of Uncertainty-Aware Conformal Regressors

链接: https://arxiv.org/abs/2606.10734
作者: Soundouss Messoudi,Sylvain Rousseau,Sébastien Destercke
类目: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Conformal Prediction (CP) provides robust uncertainty guarantees for predictive models, but is typically applied post hoc, which misaligns model training with the conformal goal of producing efficient (i.e, narrow) intervals. We propose SPACR (Single-Pass Adaptive Conformal Regressor), a novel method for directly training uncertainty-aware regressors within a differentiable loss. SPACR jointly optimizes efficiency and validity without batch-splitting or a predefined confidence levels during training. As a result, a single SPACR model yields valid prediction intervals at multiple confidence levels during inference, avoiding the costly retraining required by methods like DOICR. Experiments on diverse datasets show that SPACR consistently gives tighter intervals and better coverage-efficiency trade-offs compared to standard CP and DOICR, while significantly reducing computational costs.

[LG-35] Do LLM sMakeNeural Distinguishers Wise?

链接: https://arxiv.org/abs/2606.10692
作者: Tatsuya Sakagami,Masashi Hisai,Naoto Yanai
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Neural distinguishers are a cryptanalysis method for symmetric-key cryptography that trains machine learning models on pairs of plaintexts and ciphertexts with specific differences in order to recover a secret key. To the best of our knowledge, no existing work has explored the use of large language models (LLMs) for neural distinguishers. In this paper, we propose LLM-based neural distinguishers through a prompt design and conduct extensive experiments with them on SPECK-32/64 to investigate whether LLMs can strengthen neural distinguishers. We then found three key insights. First, by comparing the results of LLM-based neural distinguishers with ResNet in the existing work, we demonstrate that LLMs provide no observable improvement in the performance of neural distinguishers. Second, we confirm that, at high rounds, the choice of differences is no longer effective for LLM-based neural distinguishers as well as ResNet. Third, we show that the performance of LLM-based neural distinguishers can be significantly improved by incorporating only the XOR operation results as a prompt design.

[LG-36] PL-KKT-hPINN: Enforcing Nonlinear Equality Constraints on Neural Networks via Piecewise-Linear Projection

链接: https://arxiv.org/abs/2606.10682
作者: Fateme Mohammad Mohammadi,Hector Budman,Joshua L. Pulsipher
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:While physics-informed neural networks (PINNs) have shown strong potential for process modeling, physical equations are only enforced as soft constraints during training, and thus, they do not guarantee constraint satisfaction at inference. We propose a framework, called piecewise-linear Karush–Kuhn–Tucker hard-constrained PINNs (PL-KKT-hPINNs), that strictly enforces nonlinear equality constraints through piecewise-linear projection. This extends the KKT-hPINN framewor, which exactly enforces linear equalities through the Karush–Kuhn–Tucker (KKT) conditions associated with orthogonally projecting neural network outputs onto the constraint feasible region. The method is demonstrated on a continuous stirred-tank reactor (CSTR) case study for both one and two inputs. Results show that PL-KKT-hPINN preserves predictive accuracy comparable to that of a standard neural network while achieving substantially lower constraint violations. In addition, the proposed model shows improved robustness in low-data regimes, yielding lower RMSE than the unconstrained neural network for limited training sample sizes. These results demonstrate that PL-KKT-hPINN provides a computationally efficient and physically consistent framework for surrogate modeling of nonlinear chemical engineering systems.

[LG-37] One Step Closer to Ground Truth: A Multi-Scale Residual-Aware Representation Learning Pipeline for Predicting Time Series Data KDD KDD’26

链接: https://arxiv.org/abs/2606.10678
作者: Amrijit Biswas,Mustafa Kamal,Robin Krambroeckers,M. M. Lutfe Elahi,Sifat Momen,Nabeel Mohammed,Shafin Rahman
类目: Machine Learning (cs.LG)
*备注: Accepted at the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (KDD '26)

点击查看摘要

Abstract:Transformer-based models have emerged as leading paradigms in time-series forecasting in recent years, employing self-attention mechanisms to capture long-range dependencies. Despite their success, these single-stage forecasting architectures exhibit persistent systematic residual biases arising from structural discrepancies, unmodeled stochastic components, or inadequate multi-scale temporal representations. This limitation persists when residuals are treated as irreducible noise, precluding adaptive correction of structured error patterns. To address this limitation, we introduce a two-stage, model-agnostic framework that explicitly decouples forecasting and residual learning into distinct stages of representation learning. A base transformer first generates the initial predictions. Subsequently, a dedicated meta-corrector dynamically models structured error patterns across multivariate channels, preserves cross-variable dependencies, and iteratively refines the residual bias of the base transformer. By formalizing this pipeline as a hypothesis space expansion, our framework addresses approximation limitations inherent in single-stage architectures, removes reliance on restrictive assumptions, and enables end-to-end learning of complex error dynamics. Evaluated on eight popular benchmark datasets using established protocols, our approach achieves state-of-the-art performance, with significant improvements in standard metrics (MSE, MAE). The results demonstrate the framework’s ability to mitigate systematic biases and enhance robustness to complex temporal dynamics, advancing the practical applicability of transformer-based forecasting models.

[LG-38] PhysMetrics.Weather: An Evaluation Framework for Physical Consistency in ML Weather Models

链接: https://arxiv.org/abs/2606.10642
作者: Emma Kasteleyn,Timo Maier,Axel Lauer,Veronika Eyring,Pierre Gentine,Ana Lucic
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注: Preprint

点击查看摘要

Abstract:Machine learning weather prediction (MLWP) models have achieved impressive forecasting performance at a small fraction of the computational costs required for traditional physics-based methods. However, they are primarily (1) data-driven and (2) evaluated using pixel-wide error metrics (e.g., RMSE), so there are no guarantees that their forecasts are consistent with known physical laws. We introduce this http URL, an evaluation framework that assesses the physical realism of MLWP models across three types of metrics: conservation, spectral, and dynamical. By quantifying physical realism, this tool guides the development of physics-informed architectures and helps evaluate whether MLWP models are reliable for operational use. Our framework is available on Github at this https URL.

[LG-39] oward Proactive RF Charging Scheduling: Generative AI for Decision Support

链接: https://arxiv.org/abs/2606.10600
作者: Amirhossein Azarbahram,Osmel M. Rosabal,David Ernesto Ruiz-Guirola,Melike Erol-Kantarci,Kaibin Huang,Onel L. A. López
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Radio frequency wireless power transfer (RF-WPT) is an enabling technology for supporting uninterrupted communications in future Internet of Things systems by reducing the need for battery replacement and mitigating battery-waste-related issues. For large-scale RF-WPT deployment, one of the main challenges is the scheduler-level resource allocation. Specifically, the transmitter must decide how much energy to deliver, when, and to whom, under limited charging resources, incomplete receiver-side information, and uncertain near-future charging conditions. This article positions generative artificial intelligence (GenAI) as a promising tool for this setting because it can foresee multiple plausible charging scenarios conditioned on coarse operational context and receiver-side information. We propose GenAI to act as an uncertainty-aware support layer for the RF-WPT scheduler rather than as a standalone forecasting or decision-making tool. To this end, we first revisit the main challenges of RF-WPT scheduling, and discuss how major GenAI families can support uncertainty-aware charging decisions by generating scenario-based inputs for downstream tasks. We then present a warehouse-style case study showing that preserving uncertainty through the sampling capability of generative models can improve robust charging decisions compared with deterministic prediction and simple non-learning baselines, especially under risk-sensitive objectives. Finally, we identify key open challenges and present some directions for future research.

[LG-40] Dirichlet-Guided Group Forecasting for Alleviating Over-smoothing in Time Series Forecasting

链接: https://arxiv.org/abs/2606.10592
作者: Xingyu Zhang,Jingyao Wang,Xin Yu,Zeen Song,Jianqi Zhang,Changwen Zheng,Wenwen Qiang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Time series forecasting often suffers from over-smoothing, especially when future dynamics are multi-modal. Forecasts may follow the coarse trend of the observed future, but fail to preserve sharp changes, oscillations, turning points, and regime transitions that define plausible dynamic evolution. In this work, we revisit over-smoothing from the perspective of latent dynamical mode compression: under partial observation and single-realization supervision, multiple plausible future modes can be weakened, merged, or averaged during forecasting. Based on this view, we propose Dirichlet-Guided Group Forecasting (DGF), a mode-preserving forecasting framework that explicitly models multiple mode-conditioned predictive distributions and uncertainty over their selection probabilities. DGF uses a Dirichlet-guided hierarchical sampling mechanism and reward-based optimization to encourage forecasts that are accurate, dynamically consistent, and mode-distinct. Extensive experiments on real-world forecasting benchmarks show that DGF reduces over-smoothing while improving forecasting accuracy, diversity, and dynamical consistency.

[LG-41] rading Utility for Dynamic Fairness in Multiple Resource Division with Sequential Demand

链接: https://arxiv.org/abs/2606.10472
作者: Kaiqi Jiang,Karim El Husseini,Wenzhe Fan,Xinhua Zhang
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Dynamic multi-resource allocation is a central problem in shared computing environments, where users’ demands arrive sequentially and resources must be distributed fairly without knowledge of future demands. Existing methods emphasize fairness guarantees such as Sharing Incentive, Envy Freeness, and Dynamic Pareto Optimality, but often overlook system utility. Moreover, these fairness criteria are mutually incompatible, preventing strict enforcement of them at the same time. We propose a neural allocation mechanism that reconciles fairness with utility through multi-objective optimization during sequential rollout. We first formalize fairness in the dynamic setting via stepwise loss functions for Sharing Incentive, Envy Freeness, and Dynamic Pareto Optimality, enabling differentiable training. Leveraging non-wastefulness, we parameterized the solutions by constraining allocations to the subspace of demand while allowing elastic over-allocation when resources remain available. Empirical results demonstrate that our learned allocator achieves substantially higher utility at comparable levels of fairness, uncovering clear Pareto-frontier-like tradeoffs across metrics. Subjects: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG) Cite as: arXiv:2606.10472 [cs.GT] (or arXiv:2606.10472v1 [cs.GT] for this version) https://doi.org/10.48550/arXiv.2606.10472 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-42] ASTRA-sim 3.0: Next-Level Distributed Machine Learning Simulations via High-Fidelity GPU and Infrastructure Modeling

链接: https://arxiv.org/abs/2606.10440
作者: William Won,Jinsun Yoo,Tuan Ta,Moumita Dey,Andy Balogh,Pradosh Datta,Furkan Eris,Conor Green,Winston Liu,Changhai Man,Kingshuk Mandal,Amos Rai,Vinay Ramakrishnaiah,Ruchi Shah,David Sidler,Harsh Sikhwal,Hanjiang Wu,Tushar Krishna,Bradford M. Beckmann
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注: 10 pages, 15 figures, one table

点击查看摘要

Abstract:Distributed machine learning (ML) is a key paradigm for today’s large-scale artificial intelligence applications. As model inference arises as an important use case, faithful modeling of latency-sensitive collective communication has never been more important. Capturing the device architecture and modeling control and data paths at high fidelity is therefore a necessity today. Having a common, detailed representation for distributed ML infrastructure is also crucial. We revisit the promising open-source, community-driven simulator: ASTRA-sim. In this work, we identify limitations of the current ASTRA-sim simulator and augment it with new features. To this end, we enable fine-grained, high-fidelity simulation with a standardized infrastructure representation, opening new design space exploration opportunities. We propose the simulation at cache-line-sized load-store granularity, with a detailed graphics processing unit (GPU) execution model, to balance simulation scalability and fidelity. We also introduce InfraGraph, a standardized representation to capture distributed ML network infrastructure in detail. Using the updated ASTRA-sim 3.0 simulator, we showcase interesting design space explorations for designing optimized collective algorithms, network requirements, and GPU architectures.

[LG-43] A Comprehensive Inference-Time Augmentation Framework in Physiological Signals: Application to PPG-Based AF Detection

链接: https://arxiv.org/abs/2606.10410
作者: Davood Fattahi,Runze Yan,Saurabh Kataria,Zhaoliang Chen,Xiao Hu
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Quantitative Methods (q-bio.QM)
*备注: 22 pages, 11 figures, 4 tables. Under review at Physiological Measurement

点击查看摘要

Abstract:Objective: Accurate classification of physiological signals in real-world deployments is challenged by sensor noise, motion artifacts, and distribution shifts between training and deployment data. Inference-time augmentation (ITA), which applies augmentations during inference rather than retraining, offers a simple, model-agnostic mechanism to improve robustness. However, ITA application to physiological signals has remained narrow in scope, relying on limited augmentation methods with fixed, unoptimized parameters. This work proposes a unified ITA framework to address that gap. Approach: The framework incorporates 13 augmentation methods spanning time-domain, amplitude-domain, frequency-domain, and artifact-injection transformations, with hyperparameters optimized via Bayesian optimization. We evaluate on atrial fibrillation (AF) detection from 30-second PPG signals using GPT-PPG and ResNet across five datasets comprising more than 400 patients and \sim 9,800 hours of recording. Main results: Standard ITA consistently improved AUROC (up to 8.5% for GPT-PPG and 0.7% for ResNet) and AUPRC (up to 10.6% for GPT-PPG and 0.8% for ResNet). Selective ITA further reduced average FPR by up to 4.4% (GPT-PPG) and 1.3% (ResNet) on non-AF datasets. Significance: These findings establish ITA as a practical, model-agnostic approach for improving PPG-based AF classification reliability in deployment settings where retraining is not feasible, with broader applicability to physiological signal analysis. Comments: 22 pages, 11 figures, 4 tables. Under review at Physiological Measurement Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP); Quantitative Methods (q-bio.QM) Cite as: arXiv:2606.10410 [cs.LG] (or arXiv:2606.10410v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2606.10410 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Davood Fattahi [view email] [v1] Tue, 9 Jun 2026 04:35:16 UTC (2,448 KB)

[LG-44] Validation-Stage Combinatorial Fusion Analysis for Imbalanced Credit-Card Fraud Detection

链接: https://arxiv.org/abs/2606.10393
作者: Xiao Han,Chenyu Wu
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*备注:

点击查看摘要

Abstract:Credit-card fraud detection is difficult because fraudulent transactions are rare, costly, and unevenly distributed. Strong gradient-boosted tree models already perform well on structured transaction data, so the value of another fusion method is not obvious. This paper examines whether Combinatorial Fusion Analysis (CFA), which searches over model subsets and rank-score fusion rules, can still add value on the IEEE-CIS Fraud Detection benchmark. Using a leakage-free 60/20/20 train/validation/test protocol, we evaluate 480 fusion configurations built from seven base classifiers. The best test-set result comes from diversity-weighted score fusion of Random Forest, XGBoost, and LightGBM (DEF WtScore), with AUC-ROC = 0.9405, AUPRC = 0.6699, and F1 = 0.6373. Bootstrap confidence intervals from 1,000 resamples show that the gains over the strongest single model exclude zero for all three metrics. CFA matches soft voting on AUC-ROC, improves AUPRC and F1, and outperforms stacking in this setting. A CTGAN augmentation experiment gives a negative result: synthetic fraud samples degrade both individual models and CFA. Overall, CFA is most useful here not as a way to combine every classifier, but as a validation-stage method for choosing a small, complementary subset and assigning diversity-aware weights.

[LG-45] Beyond Explaining Predictions: Logic-Based Explanations for Confidence in Machine Learning Models

链接: https://arxiv.org/abs/2606.10347
作者: Vinícius Peixoto Chagas,Carlos Henrique Leitão Cavalcante,Thiago Alves Rocha
类目: Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
*备注:

点击查看摘要

Abstract:Machine learning is increasingly used in critical domains, where both predictions and their associated confidence levels influence important decisions. To enhance transparency in such scenarios, it is important to understand why a model is confident or uncertain about its predictions. Recent logic-based approaches provide abductive explanations, minimal subsets of features sufficient to preserve the predicted class, with correctness guarantees. However, these methods focus solely on classification behavior and may produce explanations that cover instances with low predictive confidence. In this work, we introduce the concept of Minimum Confidence Threshold (MCT), which quantifies the weakest confidence guarantee provided by an abductive explanation. Building upon this concept, we propose confidence-aware abductive explanations, which preserve not only the predicted class but also a user-specified confidence guarantee. We formulate MCT computation as an optimization problem and introduce an algorithm for generating minimal explanations that satisfy a desired confidence threshold. We evaluate the proposed framework on boosted trees for binary classification, although the approach is applicable to other machine learning models that provide confidence scores. Experimental results show that traditional abductive explanations often provide substantially weaker confidence guarantees than the confidence associated with the explained instance itself. In contrast, confidence-aware explanations consistently improve the minimum confidence guaranteed by an explanation while requiring only a modest increase in explanation length. These properties make the proposed approach particularly suitable for applications where both predictive correctness and confidence are essential for trustworthy decision making.

[LG-46] Privacy-Preserving Credit Risk Prediction with Alternative Data

链接: https://arxiv.org/abs/2606.10333
作者: Hongzhe Zhang,Jiarong Xu,Jing He,Xiao Fang
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Credit risk prediction is a critical problem in the consumer credit industry. Traditionally, financial institutions construct credit risk prediction models using borrowers’ demographic, financial, and credit history data, collectively referred to as traditional data. Recent studies have demonstrated that alternative data, such as borrowers’ mobile phone communication data, enable lenders to acquire fuller and more accurate profiles of borrowers’ creditworthiness, thereby improving credit risk prediction performance. Nevertheless, alternative data are held by external entities independent of financial institutions. Directly sharing alternative data with financial institutions infringe on consumer privacy, yet existing credit risk prediction studies largely overlook this issue. To address this gap, we define a new problem, namely privacy-preserving credit risk prediction with alternative data, which simultaneously considers three practical constraints: the privacy-preserving constraint that protects consumer privacy, the model-confidentiality constraint that learns and stores the model centrally at the financial institution, and the lossless constraint that maintains the performance of the learned model. To solve this problem, we develop PrivacyCredit, a novel privacy-preserving machine learning method. We then theoretically demonstrate the privacy-preserving, model-confidential, and lossless properties of PrivacyCredit. Through extensive experiments using a real-world credit dataset linked with alternative data, we demonstrate the predictive value of securely incorporating alternative data into credit risk prediction and show that PrivacyCredit achieves the same predictive performance as the model learned from the insecure plaintext combination of traditional and alternative data. We further evaluate its model-confidentiality property and computational efficiency.

[LG-47] Rank Collapse Fixed Points and the Renormalization Group Structure of MLP Residual Networks

链接: https://arxiv.org/abs/2606.10324
作者: Parviz Haggi-Mani,Irina Rish
类目: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (stat.ML)
*备注: 16 pages, 9 figures

点击查看摘要

Abstract:The analogy between deep neural network forward passes and renormalization group (RG) flows has been repeatedly noted in the literature, but existing treatments remain qualitative: depth is described as a coarse-graining scale, attention is likened to a partition function, and representations are said to flow toward fixed points. No existing work has defined a measurable RG order parameter, tested it under controlled variation of the input distribution, or made quantitative predictions that are empirically verified. We study the simplest architecture for which the analogy is tractable: a pure MLP residual stack trained on masked token prediction over synthetic Markov chain sequences with known spectral properties. We report three findings. (i) The effective rank of the residual stream decreases monotonically with depth after training, consistent with progressive integration of irrelevant degrees of freedom. (ii) This rank collapse is selective: it occurs for chains with short correlation length approximately 1 but is absent for chains with long correlation length approximately 7, measured at the position level to control for mean-pooling artifacts. The network preserves exactly the degrees of freedom relevant to the prediction task, the content of the RG relevance criterion. (iii) Inter-layer kernel drift is concentrated at one or two specific transitions, with the remainder of the network near a fixed point, consistent with a discrete fixed-point plateau. Together these findings constitute the first quantitative, position-level evidence that MLP residual networks implement a selective coarse-graining procedure governed by the spectral structure of the input distribution.

[LG-48] Revisiting Positive Samples in Graph Contrastive Learning: From the Perspective of Message Passing

链接: https://arxiv.org/abs/2606.10284
作者: Lianze Shan,Ningchong Wang,Jitao Zhao,Di Jin,Dongxiao He
类目: Machine Learning (cs.LG)
*备注: 24 pages,6 figures

点击查看摘要

Abstract:Graph Contrastive Learning (GCL), which trains graph encoders by maximizing similarity between positive samples and minimizing it between negative ones, has emerged as a mainstream graph pre-training paradigm. It is widely recognized that positive samples are essential in GCLs. Ideally, maximizing the similarity of positive samples enables graph encoders to capture intrinsic semantic and patterns of graph data. However, we discover an interesting phenomenon: GCLs can achieve competitive performance even without positive samples. This motivates us to revisit the fundamental mechanism of positive samples in GCLs. From the perspective of Dirichlet energy, we theoretically finds that message passing, a key mechanism in graph encoders, trivializes the maximization of positive samples, preventing GCLs from effectively learning from positive samples. To address this, we propose SPGCL to mitigate the trivialization caused by message passing and restore the learning efficacy of positive samples. Specifically, we find that high Dirichlet energy features help positive samples provide effective learning signals while low Dirichlet energy features contribute little to positive learning signal but is useful for positive sampling. Based on this, SPGCL propagates only high Dirichlet energy features and uses low energy features to construct a probability matrix for reliable positive sampling. Extensive experiments demonstrate the effectiveness of SPGCL.

[LG-49] A Unified Adaptive Feature Composition Framework for Multi-Task Generalization in Wireless Foundation Models

链接: https://arxiv.org/abs/2606.10277
作者: Yuxuan Shi,Tingting Yang,Kangning Ma,Liwen Jing,Yuwei Wang,Mengfan Zheng,Li Sun
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Though wireless foundation models (WFMs) have shown strong potential in learning universal channel representations, their adaptation to various downstream tasks remains constrained by existing paradigms. Fine-tuning strategies introduces substantial computational and storage overhead, while frozen feature extraction leads to sub-optimal performance across diverse downstream tasks. To address this issue, we propose a unified adaptive feature composition framework for multitask generalization in WFMs, where the key component is the Routing Adapter for Feature Composition (RAFC). Instead of extracting only the final-layer output, this router treats the hidden states from different Transformer depths as a reusable pool of multi-level hidden features, and employs a lightweight task-driven feature composition network to generate layer-wise aggregation weights, then adaptively combine hierarchical representations through weighted summation. This design enables each downstream task to access suitable mixture of low-, mid-, and high-level wireless features without modifying the pretrained backbone. Extensive experiments on four representative wireless tasks demonstrate that RAFC consistently outperforms conventional adaptation baselines while introducing fewer than 50K additional parameters. Moreover, the learned routing weights provide interpretable evidence of task-specific layer preferences, making the proposed framework a low-complexity, scalable, and explainable interface for adapting WFMs to diverse downstream scenarios.

[LG-50] When Design Rules Break: Benchmark Composition Determines Whether Label Informativeness Predicts GNN Aggregator Choice

链接: https://arxiv.org/abs/2606.10249
作者: Neha Sharma,Ritesh Sharma
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:We examine whether graph neural network (GNN) design rules generalize across benchmark families by studying aggregator selection (sum, mean, max) on 24 node-classification datasets spanning citation, heterophilic, LINKX Facebook-100, co-purchase, and co-authorship graphs. Edge homophily is only weakly predictive of the GIN-Sum versus GIN-Mean performance gap. Label informativeness predicts this gap well on legacy benchmarks but degrades substantially when Facebook-100 graphs are included. In these dense friendship networks, near-zero label informativeness coexists with a strong preference for sum aggregation, producing gains of 7-10% and up to 13% under extended training. Stochastic block model ablations, including degree-corrected variants matching Facebook-100 degree scales, fail to reproduce this behavior, indicating that mean degree alone does not explain the effect. Among several label-independent graph statistics, the spectral gap uniquely distinguishes these graphs from other low-informativeness datasets, with the effect localized to one-hop neighborhoods and replicated across architectures. We further identify training regimes that interact with aggregator choice and show that PNA can underperform the best single-aggregator GIN on standard citation benchmarks. Our results suggest that benchmark composition, rather than numerical insufficiency, determines whether design rules appear to generalize, and that the Facebook-100 regime provides a concrete target for future adaptive aggregation methods.

[LG-51] DUET – Dual User Embedding Transformers for Offsite Conversion Prediction

链接: https://arxiv.org/abs/2606.10243
作者: Reazul Hasan Russel,Mingwei Tang,Rostam Shirani,Xinlong Liu,Navid Madani,Leo Ding,Yawen He,Xiangyu Wang,Mustafa Acar,Ashish Katiyar,Yuhai Li,Alan Yang,Metarya Ruparel,Derek Qiang Xu,Rupert Wu,Rui Yang,Liang Tao,Xinyi Zhao,Larry Zhang,Sri Reddy,Rob Malkin
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Offsite conversion rate (OCVR) prediction is an important ranking problem in computational recommendation systems. This task presents a modeling challenge: click signals are abundant and exhibit short temporal horizons, whereas conversion signals are inherently sparse, long-delayed, and frequently unattributed. Despite these statistical disparities, both signal types must inform models that operate within strict serving-latency constraints. Prior pre-training approaches address this heterogeneity with a single, undifferentiated encoder applied uniformly across both data streams. We propose DUET (Dual User Embedding Transformers), a framework that explicitly partitions user behavioral data into two domain-coherent streams – clicks and conversions – and pre-trains dedicated transformer encoders with architectures tailored to each stream’s statistical characteristics: multi-layer self-attention for the dense click stream and interleaved cross- and self-attention for the sparse conversion stream. The resulting complementary embeddings are jointly consumed by a downstream ranker without exceeding serving-latency budgets. Evaluation demonstrates up to 0.38% normalized entropy (NE) reduction relative to the strongest baseline, and A/B test shows consistent improvements in OCVR prediction accuracy.

[LG-52] What Demonstration Curation Metrics Do to Your Policy

链接: https://arxiv.org/abs/2606.10229
作者: Aarav Bedi
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 6 pages, 1 figure, 2 tables

点击查看摘要

Abstract:We study whether demonstration-curation metrics that detect defective training episodes also improve the downstream behavior-cloning policy that trains on the curated data. On a contact-rich LIBERO pick-and-place benchmark with a controlled structural defect (early gripper release during the carry phase), we find that the two quantities are sharply decoupled. The metric with the highest defect-detection AUROC (0.804) produces the worst curated policy (13.3% task success), while a metric with a substantially lower AUROC (0.638) produces a policy that nearly matches the oracle trained on ground-truth clean data (90.0% vs. 93.3%). We further show that five of the seven metrics we evaluate exploit episode length as a trivial proxy for the defect label, a confound that inflates reported AUROCs to near-perfect values and disappears once episode length is controlled. Across all conditions, the contaminated baseline succeeds on only 3.3% of rollouts, and the two best curation methods close this to within 3 percentage points of the 93.3% oracle ceiling. Our results argue that curation methods should be evaluated by the policy they produce, not the defects they flag, and that any curation benchmark must control for episode length before reporting detection accuracy. We release the testbed, all metric implementations, and the evaluation pipeline.

[LG-53] Spatiotemporal Graph Transformer for 3D Neighborhood Interaction and Quality Prediction in Metal Additive Manufacturing

链接: https://arxiv.org/abs/2606.10227
作者: Joyce Karen Pelaez,Siqi Zhang,Hoo Sang Ko
类目: Machine Learning (cs.LG)
*备注: Submitted to Journal of Intelligent Manufacturing, 23 pages, 10 figures, 2 tables

点击查看摘要

Abstract:Metal additive manufacturing enables the fabrication of complex parts, but achieving consistent build quality remains challenging due to interactions induced by repeated layer-wise melting, solidification, and reheating across the 3D build. Advanced sensing provide a great opportunity to collect rich observations of the actual manufacturing process for real-time quality monitoring and control. Yet, existing methods often have limited ability to represent multi-layer interactions and quantify their contributions to quality. In this paper, we develop a novel spatiotemporal graph transformer for modeling 3D neighborhood interactions and learn their effects on build quality in metal additive manufacturing. Specifically, we first introduce a weighted network representation of the manufacturing process, where fusing locations are modeled as nodes, and their spatial- and process-dependent relationships are encoded as edge weights. This representation also enables the integration of multimodal data (e.g., geometric design, process settings, and in-situ sensing data) into a unified structure for downstream learning tasks. Building on this network, we further design a dual-attention graph transformer that captures both within-node feature dependencies and cross-node neighborhood interactions for quality representation learning. Experimental results show that the proposed framework significantly outperforms image-based, sequence-based, and graph-based models in characterizing process-quality relationships. More importantly, the incorporation of cross-layer interactions is critical for improving quality prediction performance. This framework is broadly applicable to other tasks involving network modeling and graph-based representation learning.

[LG-54] Alignment Defends LLM s from Property Inference Attacks

链接: https://arxiv.org/abs/2606.10217
作者: Pengrun Huang,Chhavi Yadav,Ruihan Wu,Kamalika Chaudhuri
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly fine-tuned on domain-specific datasets that may contain sensitive, dataset-level properties. Recent work has shown that such dataset-level information can be effectively extracted through property inference attacks, posing a confidentiality risk. Existing defenses against these attacks primarily operate by modifying the training data distribution and hence require access to the original data and retraining the model, limiting their applicability to settings where data is unavailable or models are already deployed. In this work, we propose alignment-based defenses for mitigating property inference attacks in LLMs. Our approach reshapes the model’s output distribution towards a target property ratio via post-training alignment, without modifying the training data. In particular, we adapt two widely used RLHF frameworks–Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO)–as our defenses by constructing preference pairs and defining a specific reward function respectively. Through comprehensive experiments, we show that our alignment based defenses effectively mitigate property inference attacks while maintaining a strong utility confidentiality tradeoff.

[LG-55] Learning Entropy and Spatial Adaptation Dynamics of Multilayer Perceptrons for Structural Point Extraction

链接: https://arxiv.org/abs/2606.10170
作者: Jan Glaser,Ivo Bukovsky,Marcel Jirina
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper extends the concept of Learning Entropy (LE) from temporal adaptive systems to spatial learning in multilayer perceptron networks (MLPs) applied to image data. Instead of evaluating image structure directly from gradients or covariance operators, as local neighborhood methods do, the proposed approach analyzes the learning process itself through Learning Entropy. An MLP is trained to predict the intensity of a center pixel from its surrounding spatial context, while LE is evaluated from the incremental adaptation of neural weights during learning across image-derived samples. The resulting Spatial Learning Entropy Maps (SLEM) identify unusual image points and regions that induce strong adaptation of the neural network and therefore have an important role in the learning process. The results indicate that spatial Learning Entropy provides a complementary perspective to conventional feature extraction and explainability methods by highlighting spatial locations that are particularly informative for network learning. Spatial Learning Entropy provides a complementary perspective to conventional feature extraction and explainability methods by identifying image points and regions according to their learning impact rather than their local structural properties. The proposed framework may open new directions for learning-driven image or scene analysis in computer vision, manufacturing, and robotics.

[LG-56] Quality Is Not a Safety Proxy Under Quantization

链接: https://arxiv.org/abs/2606.10154
作者: Sahil Kadadekar
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 21 pages, 6 figures. Preprint

点击查看摘要

Abstract:Quantized checkpoints are often screened first with quality metrics and only later, if at all, with direct safety tests. This paper audits that shortcut on a matched 51-row matrix spanning 6 models, 4 families, a 7-level GGUF ladder, and AWQ/GPTQ INT4 checkpoints. In this matrix the shortcut fails: all 36 quality-safety pairings split direction across models, and 9 hidden-danger rows plus 1 near-hidden-danger row show quality stable or improved while refusal falls by 12-68 percentage points. Seven of the 11 AWQ/GPTQ rows are hidden-danger. A four-probe mechanistic follow-up over the 17 Hugging Face-backed FP16/AWQ/GPTQ cells does not rescue it: entropy, refusal-direction, and calibration probes are weak or null separators of dangerous rows, and although probe-identified safety-associated neurons absorb 1.39 \times more quantization error overall ( p 5 \times 10^-7 ), the effect is not regime-specific. Claude Sonnet 4 relabels 11,470 items in a predefined stratified set, agrees with the primary gemma3:12b judge on 89.9% of rows ( \kappa = 0.873 , 95% CI [0.866, 0.881]), and changes 0/10 hidden-danger cells. A calibrated study-internal behavioral screen – the Refusal Template Stability Index (RTSI), built from four refusal-template drift features and calibrated on this matrix – routes 10/10 hidden- or near-hidden-danger rows to direct safety testing (Wilson 95% CI lower bound 0.72) while leaving 23 of 45 non-baseline rows in a low-risk bucket under both in-sample scoring and row-level leave-one-out validation; on the same matrix, the best single-feature baselines (unique-prefix-rate-delta, raw refusal-rate delta) recover 9/10 and 8/10 respectively at matched bucket size, and cross-stack transfer requires recalibration. For the quantized checkpoints, model families, and safety outcomes studied here, retained quality cannot waive direct safety evaluation.

[LG-57] Compositional Generative Modeling from Decentralized Data ICML2026

链接: https://arxiv.org/abs/2606.10153
作者: Mashrur M. Morshed,Vishnu Naresh Boddeti
类目: Machine Learning (cs.LG)
*备注: ICML 2026

点击查看摘要

Abstract:Learning the compositional nature of the physical world requires joint observation of interacting factors. However, because practical data is often decentralized, these factors are fragmented across isolated silos. Existing decentralized generative approaches focus only on modeling the union of siloed data, overlooking novel combinations implied by the collective whole. To bridge this gap, we introduce Decentralized Compositional Flow Matching (DCFM), a framework that enforces structural constraints across the global set of generative factors, without exchanging any raw data. DCFM enables novel combinations to emerge through peer interactions, even when no single data source can independently support the composition. Empirically, DCFM substantially outperforms federated learning and mixture-of-experts baselines across conditional image generation, robotic spatial planning, and medical attribute co-occurrence modeling.

[LG-58] Ambiguous Strategic Classification

链接: https://arxiv.org/abs/2606.10137
作者: Ivri Hikri,Nir Rosenfeld
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A common assumption in strategic classification is that the classifier is public knowledge. However, it remains unclear whether, and why, a system would choose to commit to full disclosure. We study a setting in which regulation requires the system to disclose some, but not all, of the information. This induces a learning task in which the learner must jointly optimize the classifier and the uncertainty surrounding it. To this end, we adopt from robust mechanism design the notion of ambiguity, which in our setting allows the learner to reveal a set or range of possible classifiers, while privately choosing which of them to ultimately realize. We investigate how ambiguity affects the learning task, develop efficient algorithms for computing best-responses and training, and empirically explore strategic learning and its outcomes in this novel setting and using our approach.

[LG-59] Discovering Interpretable Multi-Parameter Control Policies for Evolutionary Algorithms Using Deep Reinforcement Learning

链接: https://arxiv.org/abs/2606.10129
作者: Tai Nguyen,Phong Le,Carola Doerr,Nguyen Dang
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: arXiv admin note: text overlap with arXiv:2505.12982

点击查看摘要

Abstract:While deep Reinforcement Learning (deep-RL) has been increasingly applied to parameter control in evolutionary algorithms, rigorous theoretical analysis of parameter control remains largely restricted to single-parameter settings, owing to the difficulty of deriving effective, interpretable multi-parameter policies amenable to formal study. We demonstrate how deep-RL can be leveraged to overcome this barrier, using the (1+( \lambda , \lambda ))-genetic algorithm optimizing OneMax, one of the few problems where a super-constant speedup of dynamic control has been formally proven, as a representative case study. We first show that standard approaches struggle to converge in this multi-parameter setting, and introduce algorithm-agnostic enhancements targeting action-space decomposition, reward shifting, and long-horizon discounting. With these in place, we compare common deep-RL methods and find that Double Deep Q-Networks uniquely avoid the policy collapse observed in Proximal Policy Optimization, yielding trajectories suitable for downstream analysis. Crucially, we move beyond the ``black-box’’ nature of neural networks by distilling the learned behaviors into a transparent, symbolic control policy. This resulting policy does not only offer interpretability for future theoretical analysis but also yields exceptional performance, consistently outperforming existing baselines across a wide range of problem sizes.

[LG-60] Nonlinear Estimator: Dual Bayesian Affine Estimators for Parameter Learning

链接: https://arxiv.org/abs/2606.10111
作者: Sasan Vakili,Daniël Woonings,Pradyumna Paruchuri,Peyman Mohajerin Esfahani
类目: Machine Learning (cs.LG); Systems and Control (eess.SY); Machine Learning (stat.ML)
*备注: 32 pages, 9 figures

点击查看摘要

Abstract:This paper presents a nonlinear parameter estimator for Wiener-type state-space models obtained as a fixed-point architecture that couples two affine minimum mean-squared error (MMSE) estimators: one for the unknown parameters and one for latent variables. The architecture retains the functional structure of the optimal affine MMSE parameter estimator while incorporating Dynamic Basis Statistics (DBS) estimates that summarize nonlinear basis-function evaluations. Two DBS construction strategies are developed, leading to two nonlinear estimator frameworks. The dual basis-parameter estimator combines an affine basis estimator with the affine parameter estimator, whereas the dual state-parameter estimator first computes affine state estimates and their covariances, then maps these state-estimate statistics through a Gaussian DBS operator to obtain DBS estimates. Both dual estimators admit fixed-point characterizations that alternate between estimating each component using the updated prior of the other, obtained from that component’s plug-in estimate statistics from the previous iteration. The efficacy of the proposed methods is examined via extensive Monte Carlo experiments, showing that the dual basis-parameter estimator attains parameter mean-squared errors comparable to those of the purely affine parameter estimator, while the dual state-parameter estimator achieves the lowest parameter mean-squared error, outperforming both the dual basis-parameter and purely affine parameter estimators, as well as sequential Monte Carlo variants of classical Particle Gibbs and Expectation-Maximization schemes.

[LG-61] Decision-Making under Combinatorial Risk

链接: https://arxiv.org/abs/2606.10092
作者: Yifan Hong,Hongmiao Fan,Chen Wang
类目: Machine Learning (cs.LG); General Economics (econ.GN)
*备注:

点击查看摘要

Abstract:Decision-making under risk is typically studied through single-shot lottery choices. Yet many real decisions involve combinatorial risk, where risk arises from multiple risky components, so the lottery over outcomes is induced rather than given outright and can be costly to evaluate exactly. We introduce an investment-allocation task to study decision under combinatorial risk, where investing in a component raises its success probability and thereby reshapes the outcome distribution. Participants favor the option with the larger probability increment, and, when increments are equal, the option with the higher initial success probability. Revealing the induced probability mass function (PMF) substantially changes behavior, making participants less responsive to combinatorial-risk features and reducing choice variance. To explain these patterns, we move beyond standard benchmarks and hand-crafted hypotheses with symbolic regression to discover compact descriptive models. The discovered models rely mainly on combinatorial-risk features, such as the after-investment success probability, rather than exact evaluation of the full induced distribution. Behavior under the displayed PMF is then well explained by augmenting this model with a prospect-theoretic residual model. The results show that people navigate combinatorial risk primarily through its core features, shifting toward lottery valuation only when the induced PMF is displayed.

[LG-62] SoK: Colluding Adversaries in Machine Learning Pipelines USENIX-SECURITY

链接: https://arxiv.org/abs/2606.10091
作者: Vasisht Duddu,Lipeng He,Asim Waheed,N. Asokan
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: USENIX Security Symposium, 2026

点击查看摘要

Abstract:Machine learning (ML) models are susceptible to various security, privacy, and fairness risks. Adversaries with different characteristics (i.e., objectives, knowledge, and capabilities) can collude by executing one attack to amplify others. Existing work lacks a systematic framework to explore collusion among adversaries, and to study the implications of the adversaries’ characteristics. We present a framework covering collusion (a) between train- and inference-time adversaries, and (b) among inference-time adversaries. Our framework accounts for factors enabling collusion between adversaries. We propose a guideline to conjecture about the potential for collusion using enabling factors. We use it to explain prior work, conjecture about unexplored collusions, and empirically validate five such cases. Finally, we discuss how adversaries’ characteristics influence the potential for collusion.

[LG-63] Structured Adaptive Tensor Prediction for Streaming Data

链接: https://arxiv.org/abs/2606.10085
作者: Zhen Qin,Yang Chen
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Matrix-valued time series arise in a wide range of applications, such as spatio-temporal data from medical imaging and geophysics. Existing methods are mainly designed for static settings and lack adaptability to streaming and time-varying environments. Adaptive filtering techniques have also been largely limited to data with scalar or vector values, leaving adaptive forecasting for matrix-valued time series inadequately understood. To bridge these gaps, we develop an adaptive tensor regression framework that includes Matrix-on-Matrix (MoM) and Tensor-on-Matrix (ToM) formulations for streaming matrix-valued prediction. The two formulations differ in whether to directly model matrix-valued outputs or to exploit temporal structure via higher-order tensor representations. For the proposed tensor regression framework, we develop stochastic gradient descent (SGD) algorithms for online learning. We show that stacking multiple responses across time into higher-order tensors improves performance; in particular, the ToM achieves lower steady-state error and stronger denoising capability than MoM, motivating our focus on the ToM model. We further characterize the tracking behavior of SGD under time-varying dynamics. From a statistical perspective, we establish fixed-time recovery guarantees for ToM under general low-dimensional structures, including sparsity, low-rankness, and their joint sparselow-rank models.

[LG-64] Spatiotemporal Seismic Hazard Assessment Using VQ-VAE and Seismic Statistical Features

链接: https://arxiv.org/abs/2606.10069
作者: Wei Quan,Denise Gorse
类目: Machine Learning (cs.LG); Geophysics (physics.geo-ph)
*备注:

点击查看摘要

Abstract:In this paper we build upon a previous study in which we demonstrated, using XGBoost and earthquake catalogue data from Japan and Chile, that a set of 60 seismic statistical features (SSFs) had much greater predictive value than a set of 428 generic time series features from the tsfresh package. We here extend this previous work in two key ways, focusing on data from Japan as a large dataset is necessary in order to allow for the training of a deep learning (autoencoder) model. First, we move from whole-region prediction (considering, for each candidate event, the likelihood of an event M \geq 5.0 anywhere in the region in the next 15 days) to localised predictions in which both the region of feature computation and the region of prediction are restricted to a circle of radius 24 km around the candidate event, and we show that performance remains excellent, similar to our previous whole-region study for the same area. Second, we here couple this proven set of SSFs, based on one-dimensional (catalogue) data, with a novel feature based on two-dimensional seismic maps, obtained by training a VQ-VAE model to reproduce such maps as output and identifying a measure of its error in doing so with a localised build-up of crustal stress. We show that while localised prediction based on SSFs can be effective alone, with test AUC values as high as those obtained in the case of Japan in our previous whole-region study, the inclusion of the new natively-spatial VQ-VAE-derived feature, top-ranked by SHAP analysis, can enhance performance and additionally appears to near-wholly replace the traditionally-computed b -value in terms of feature usage.

[LG-65] Spiking Neural Network inference on FPGAs with hls4ml

链接: https://arxiv.org/abs/2606.10008
作者: Barry M. Dillon
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Spiking Neural Networks (SNNs) provide a naturally temporal machine-learning framework. Their neurons maintain an internal state and propagate information through discrete spikes, enabling low-latency temporal inference. Although SNNs are often associated with asynchronous neuromorphic processors, many scientific real-time inference systems rely on conventional synchronous field-programmable gate arrays (FPGAs) and high-level synthesis (HLS) workflows. In this paper we present an extension of hls4ml that enables clock-driven deployment of SNNs trained in pytorch onto FPGA firmware. We demonstrate the workflow using a dense quantised SNN trained on the Heidelberg Spiking Digits dataset where it achieves inference latencies of approximately 34\mu s. We validate the generated design through software reference comparisons, HLS C simulation, HLS synthesis, export, and Vivado synthesis reports. This work opens up the hls4ml toolkit to neuromorphic computing, allowing streamlined optimisation, synthesis, and deployment of SNN models for real-time inference.

[LG-66] Data-aware Static Analysis: Improving Detection of Semantic Faults in Machine Learning Code Using Data Characteristics ICSE

链接: https://arxiv.org/abs/2606.09957
作者: Willem Meijer,Kristian Sandahl,Dániel Varró
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注: 6 pages, 3 figures, 2 listings, 1 table; To be published in “2026 IEEE/ACM 48th International Conference on Software Engineering (ICSE-NIER '26)”

点击查看摘要

Abstract:Semantic faults specific to the use of machine learning models are a common problem for machine learning developers, causing suboptimal predictions, high computational cost, or incorrect outputs. For example, one may erroneously use unscaled data to train a scale-sensitive model. Machine learning developers detect these faults after training their models and manually analyzing the results, making it an inefficient process. We propose a novel data-aware static analysis approach to detect semantic faults in machine learning code, allowing developers to reveal these bugs while writing code instead of after training the model. Our approach uses combined data and control flow analysis, and API contracts, enabling data-aware reasoning about machine learning code at a high level of abstraction. We highlight the potential of our solution by analyzing a sample of real-world machine learning notebooks, finding that we can detect faults that require a data-aware approach.

[LG-67] Multi-task LLM s for Bug Classification: Efficient Inference with Auxiliary Decoding Heads

链接: https://arxiv.org/abs/2606.09956
作者: Nikolai Rozanov
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注: 8 pages, 6 pages appendix

点击查看摘要

Abstract:The rapid adoption of LLM-powered code generation has dramatically accelerated software development, yet effective verification methods remain severely underdeveloped. Existing bug localization techniques are either prohibitively expensive, requiring minutes of agentic reasoning and thousands of generated tokens per file, and/or operate at coarse function-level granularity unsuitable for precise debugging. While works that focus on line-level granularity and are more light-weight are often limited in their performance or context size. We introduce a novel line-level bug localization approach that addresses these limitations through three key contributions: (1) a token alignment algorithm that overcomes fundamental tokenization challenges in previous work, (2) a lightweight multi-task LLM for bug localization (MLC) enabling efficient line-level bug classification, and (3) an optimized training recipe for multi-line prediction. Our method achieves state-of-the-art performance among similar setups on line-level bug localization with full-file context. At the same time we reach comparable performance to agentic approaches on Defects4J and PypiBugs benchmarks while reducing inference latency by orders of magnitudes, requiring only a single generated token per file. We further demonstrate strong generalization by introducing and evaluating on a small out-of-domain evaluation datasets in Python. We will open source our code, models, and datasets upon acceptance.

[LG-68] Hasse Diagrams for Attention: A Partial Order Framework for Designing Transformer Masks

链接: https://arxiv.org/abs/2606.09951
作者: Chentao Li,Han Guo
类目: Machine Learning (cs.LG)
*备注: 21 pages, 9 figures. Theoretical framework for attention mask design; no experiments included

点击查看摘要

Abstract:During the training of large Transformer models, attention masks regulate the scope and direction of information flow across a sequence. Numerous mask variants exist, and operators such as FlexAttention already support arbitrary attention masks. Nevertheless, a systematic formal analysis of the information-flow structure induced by arbitrary masks has been missing. This paper develops a complete theoretical framework. We prove that, with sufficient depth, the information flow of a multi-layer Transformer converges to a Hasse diagram – a directed acyclic graph representing a partial order. Building on this, we recast the design of parallel training tasks as the problem of finding a minimal common supergraph of Hasse diagrams, and we establish a criterion for the minimal common supergraph. This yields a constructive method to derive attention masks directly from a family of tasks. Applying the framework, we design two novel masks: a block-generation attention mask that ensures training-inference consistency (Block Two-Stream Attention), and a fully supervised bidirectional attention mask (Butterfly Attention). These results demonstrate the framework’s capacity to discover new structures.

[LG-69] Integrating Out Twice:The Open-System Case That Neural-Network Ensemble Theory Is Missing

链接: https://arxiv.org/abs/2606.09950
作者: Jin Lei
类目: Machine Learning (cs.LG); Nuclear Theory (nucl-th); Computational Physics (physics.comp-ph); Data Analysis, Statistics and Probability (physics.data-an)
*备注:

点击查看摘要

Abstract:Averaging a neural network over its random parameters and marginalizing a Gaussian sector are the same operation, the Schur complement of the eliminated block, and when that block is closed it returns a covariance and its inverse. That is all a network ensemble produces, the closed case. The open case is missing, and nuclear reaction theory has it worked out. Projecting a scattering problem onto a chosen set of channels, with the rest carrying probability irreversibly to a continuum, leaves a non-Hermitian effective generator that conserves and itemizes exactly what it loses: the nuclear optical model and its generalized optical theorem. I set the two cases side by side using only the moments of a distribution, the algebra of Gaussians, and block inversion, no field theory, and give the closed-case dictionary in full: the neural tangent kernel is the Fisher sensitivity kernel, the infinite-width Gaussian limit is the Gaussian-process emulator, and the lazy-to-feature transition is the validity boundary of a reduced-basis emulator. I then test the open export on a truncated attention map, a token-level transfer operator, and a sparse expert router, and report a mostly negative result. The conserved flux ledger ports wherever openness is genuinely present, but its distinctive content is absent, an artifact of the chosen partition, or pinned near a floor by the training objective, and the operationally useful uncertainty turns out to be epistemic, living in the closed half of the correspondence, not the open one. The negative has a structural reason this note makes precise: the open case needs an eliminated sector with a continuous spectrum and wave-like, not relaxational, dynamics, which mainstream learning’s finite or dissipative objects do not supply. This is a note, not a result; its main finding is that negative one, and its value is the map that locates it.

[LG-70] nCMD: Benign-Anchored Feature Selection for Imbalanced Network Intrusion Detection

链接: https://arxiv.org/abs/2606.09934
作者: Abu Fuad Ahmad,Istiaque Ahmed
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 6 pages, IEEE double columns

点击查看摘要

Abstract:Feature selection is critical for network intrusion detection systems (NIDS) operating under high-dimensional, highly imbalanced traffic, as found in operational and defense networks. Traditional filter methods rank features using global statistics computed symmetrically across classes and thus fail to capture the asymmetry of intrusion detection, where attacks are best characterized as deviations from dominant benign traffic. We propose benign-anchored Classwise Mean Deviation (nCMD), a lightweight and interpretable method that scores feature relevance based on the deviation of attack-class distributions from the benign-class mean, rather than a globally biased reference. This approach aligns feature selection with the operational semantics of NIDS at no additional computational cost. Across four benchmark datasets (CICIDS2017, CICDDoS2019, NSL-KDD, and UNSW-NB15), multiple feature budgets, and three downstream classifiers, nCMD matches or exceeds classical filter baselines in macro-averaged F1-score. It achieves the best result on three of the four datasets and under every classifier, with the strongest improvements observed under tight feature budgets and severe class imbalance. These results support benign-anchored ranking as a scalable and interpretable preprocessing component for resource-constrained NIDS. Comments: 6 pages, IEEE double columns Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR) Cite as: arXiv:2606.09934 [cs.LG] (or arXiv:2606.09934v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2606.09934 Focus to learn more arXiv-issued DOI via DataCite

[LG-71] Compile Once Differentiate Everywhere: A Differentiable Meta-Circular Interpreter

链接: https://arxiv.org/abs/2606.09930
作者: Lucas Sheneman
类目: Programming Languages (cs.PL); Machine Learning (cs.LG); Symbolic Computation (cs.SC)
*备注:

点击查看摘要

Abstract:The boundary between program execution and gradient-based optimization has long limited the use of code itself as a learnable scientific model. We present a compiler that translates a self-hosting subset of Scheme into differentiable computation graphs for autograd backends. Because the subset can compile its own evaluator, this yields differentiable meta-circular interpretation (DMCI): a compiled Scheme interpreter executes programs supplied as data, while reverse-mode autodiff propagates gradients to continuous constants embedded in those programs. The interpreter is compiled once, so new programs inherit differentiability without recompilation or custom gradient machinery, while retaining closures, recursion, and data structures. We prove that gradients through the compiled interpreter are correct almost everywhere and show that they match direct compilation to numerical precision across 171 recursive and higher-order program-seed pairs. We then use DMCI for program-and-parameter co-search, where a large language model proposes Scheme programs and exact gradients calibrate their continuous parameters through a single frozen interpreter. This enables OpenEvolve-style program search in which an outer loop proposes discrete program structures and DMCI supplies exact gradient-based calibration of each candidate’s continuous parameters. On battery capacity-fade data, the search recovers a knee-like degradation structure and improves held-out extrapolation over hand-crafted baselines on the harder early-extrapolation split, matching them on the later split. On a high-dimensional El Nino inverse problem, DMCI optimizes an interpreted Kalman-filter likelihood where gradient-free search fails. These results extend symbolic regression and neurosymbolic search from closed-form expressions to executable, stateful programs, making model-generated code directly optimizable against data.

[LG-72] SPDM: Geometry-Modulated State Space Modeling with Manifold Constraints for Time Series Forecasting

链接: https://arxiv.org/abs/2606.09917
作者: Xingsheng Chen,Siu-Ming Yiu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multivariate time series forecasting requires capturing the continuously evolving correlation structure among interacting variables. Existing state-space models process time series by scanning tokenized temporal or spatial sequences, discarding the evolutionary geometric structure. We address this limitation by introducing manifold constraints into state-space modeling: treating the cross-variable correlation structure as a continuous trajectory on the symmetric positive definite manifold, whose Riemannian geometric features, tangent space linearity, and Frechet mean centrality act as a principled geometric regularizer that guides and stabilizes the selective scanning dynamics of SSMs. We propose SPDM, a geometry-aware SSM architecture that realizes this principle through two cooperating mechanisms: a manifold trajectory path that projects dynamically evolving covariance matrices from the SPD manifold to a Euclidean tangent space, and a geometric gating scheme that directly modulates SSM’s internal selective parameters based on geometric signals derived from the manifold trajectory. The parameterization preserves the linear-time complexity of the Mamba parallel scan while embedding rich structural constraints, making the architecture preserve prediction accuracy and computational efficiency simultaneously. Extensive experiments on eleven real-world benchmark datasets establish state-of-the-art forecasting performance, and further studies confirm that geometrically constrained state-space dynamics are the dominant architectural factor behind its performance gains.

[LG-73] LMT: A Bayesian Framework for Causal Discovery from Textual Alarm Records in Manufacturing Systems

链接: https://arxiv.org/abs/2606.09892
作者: Xiaofeng Xiao,Jianhong Chen,Qiuzhuang Sun,Naichen Shi,Xubo Yue
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注: 19 pages

点击查看摘要

Abstract:Textual event records, such as alarm logs, have become an increasingly common data source in engineering and manufacturing systems. Beyond identifying correlations or recurring patterns, engineers are often interested in understanding which types of events causally trigger or influence other events during system operation. Textual event descriptions may contain semantic clues about such causal relationships, and recent large language models (LLMs) provide a promising tool for extracting these signals. However, relying solely on LLM-encoded textual information is insufficient for accurate causal discovery, since semantic patterns do not directly reveal causal mechanisms and may confuse causation with correlation or frequent sequential patterns. To address these challenges, we propose \textbfLMT, a Bayesian causal discovery framework for engineering event data that jointly leverages textual descriptions and timestamps. Specifically, LMT first uses LLMs to extract semantic causal signals from event descriptions and constructs a prior distribution over causal graphs among event types or event clusters. It then incorporates temporal evidence through a Poisson-process-based likelihood, allowing the LLM-informed prior to be refined by timestamp-based statistical evidence. By integrating the textual and temporal information, LMT produces a causal graph that is both interpretable and data-supported. Simulation studies show that the proposed framework is effective across different settings and is especially advantageous in small-sample alarm-event scenarios.

[LG-74] Optuna Constrained Tree-Structured Parzen Estimator Is a Joint Density Generalization of c-TPE

链接: https://arxiv.org/abs/2606.09889
作者: Shuhei Watanabe,Kaichi Irie
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Constrained hyperparameter optimization (HPO) is common in practice, yet Optuna’s widely used constrained TPE lacks algorithmic analysis. While c-TPE proposes an expected constrained improvement (ECI) approach assuming independence between the objective and constraints, Optuna uses a single joint density over both. We show that Optuna’s constrained TPE is joint c-TPE – the same ECI acquisition function using a joint likelihood. We demonstrate joint c-TPE is invariant to constraint duplication whereas independent c-TPE degrades as the product accumulates duplicated factors. We outline practical tradeoffs between the formulations and directions for future study.

[LG-75] SinkRec: Mitigating Semantic State Sink in Long Sequence Recommendation with Memory-Conditioned Gated Delta Networks

链接: https://arxiv.org/abs/2606.09888
作者: Zhuang Zhuang,Zhipeng Wei,Ji Dai,Jie Chen,Fei Pan,Peng Jiang,Kun Gai
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Linear attention provides an efficient backbone for long-sequence recommendation by avoiding the quadratic cost of standard Transformers, but its compressed recurrent state can be dominated by repetitive behavior patterns. We identify this phenomenon as semantic state sink, where recurring semantics over-occupy the recurrent state and bias subsequent readouts. To mitigate semantic state sink, we propose SinkRec, a hybrid memory-transition looped architecture that decouples collaborative behavioral pattern storage from dynamic transition modeling. SinkRec externalizes recurring local patterns into a learnable conditional memory through residual vector quantization, reinjects the retrieved codes, and exposes memory key-value pairs to the attention block. It further introduces Temporal-Aware State-Relation Differential Gated DeltaNet (TDGD), which uses memory to purify recurrent writing and reading by suppressing memory-covered updates and removing memory-aligned readout responses. This design turns recurring semantics from state-competing signals into memory-retrievable patterns, allowing the recurrent state to focus on dynamic transitions and alleviating semantic state sink with linear-time efficiency. Experiments on public and industrial datasets demonstrate the effectiveness and efficiency of SinkRec.

[LG-76] ENP: Trapezoidal Expert Neuron Pruning For Mixture-of-Experts

链接: https://arxiv.org/abs/2606.09885
作者: Jiangyang He,Shaolin Zhu,Deyi Xiong
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Mixture-of-Experts large language models (LLMs) scale efficiently through sparse activation, yet their deployment is fundamentally constrained by the large static parameter footprint of experts. Existing compression approaches either remove entire experts, disrupting routing topology and harming performance, or rely on unstructured weight pruning with limited practical efficiency. To address the limitations, we propose TENP, a structured Trapezoidal ExpertNeuron Pruning framework. Using a few samples, we identify and retain important experts, while applying expert neuron pruning (ENP) to less important experts, reserving model parameters in a trapezoidal pattern from shallow to deep layers. When evaluating expert importance, we jointly consider both the magnitude of the expert output and its ability to change the direction of the input vector. For ENP, we measure each neuron’s projected contribution to the expert output to identify and retain important neurons. We conduct extensive experiments on the Qwen and DeepSeek models. Under a routing expert sparsity of 40% and an average of 63.76% activated expert parameters, the DeepSeek model suffers only a 1-point drop in accuracy compared to the full-parameter model. Moreover, it outperforms the full-parameter model by 10% on code generation tasks.

[LG-77] Hyperparameter Learning for Latent Factorization of Tensors for Representation Learning to Large-scale Dynamic Weighted Directed Network

链接: https://arxiv.org/abs/2606.09880
作者: Yaqian Zhan,Jialan He,Tianzhu Chen
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large-scale dynamic weighted directed networks (DWDNs) are widely used to model time-varying interactions among nodes. Latent factorization of tensors (LFT) extracts target knowledge from DWDNs via low-rank embedding. However, similar to many machine learning models, the performance of LFT heavily depends on the selection of hyperparameters. In practice, these parameters are often tuned manually or through grid search, which requires significant computational resources and human effort. Motivated by this challenge, this paper proposes an automated hyperparameter optimization framework based on Differential Evolution (DE) for LFT (DE-LFT). The proposed method integrates DE into the training process of the LFT model to automatically learn optimal regularization parameters \lambda_1 , \lambda_2 and \lambda_3 . As a result, the model can adaptively search the hyperparameter space and improve prediction accuracy. Experimental results on four real-world datasets demonstrate that the proposed approach achieves lower MAE and RMSE compared with manually tuned baselines while reducing the need for extensive parameter tuning.

[LG-78] Operator Fusion for LLM Inference on the Tensix Architecture

链接: https://arxiv.org/abs/2606.09879
作者: Qingbo Wu,Ke Li,Wenzhu Wang,Jie Yu,Ruian Zhang,Lili Liu
类目: Machine Learning (cs.LG)
*备注: 11 pages, 5 figures

点击查看摘要

Abstract:This study addresses on-device inference bottlenecks of Transformer models on Tenstorrent’s Tensix architecture and proposes an operator fusion strategy that enhances data locality. RMSNorm is fused with matrix multiplication in self-attention and in the FFN, enabling back-to-back execution of memory-bound and compute-bound operators in on-chip SRAM to significantly reduce DRAM reads/writes of intermediate results and scheduling overhead. To support multi-core parallelism, a NoC-based multicast mechanism is leveraged in which row/column master nodes efficiently distribute inputs and weights across the core mesh, alleviating DRAM bandwidth contention. Experiments on the Wormhole platform with Qwen2.5-0.5B, Qwen3-0.6B, and Qwen3-4B show up to 37.44% latency reduction for attention and 15.89% for MLP, with up to 7.91% reduction per decoder layer, while Pearson Correlation Coefficient (PCC) remains above 98.75%, confirming significant end-to-end efficiency gains under numerical consistency.

[LG-79] FailureScope: Cross-Regime Behavioral Diagnosis of Language Model Weaknesses

链接: https://arxiv.org/abs/2606.09878
作者: Nicholas Saban
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Standard benchmarks report aggregate accuracy, but practitioners need to know which specific capabilities a model lacks. We introduce FailureScope, a behavioral-diagnosis method that clusters evaluation probes by their cross-model pass/fail patterns (leave-one-model-out, LOMO), and show it yields stable, interpretable failure taxonomies across three regimes usually studied separately: single-turn benchmarks, multi-turn dialogue, and adversarial agent attacks. On 2,664 single-turn tasks across 18 models, taxonomy-conditioned sampling reaches Kendall’s tau = 0.81 at 50 tasks (versus 0.34 for random selection), and cross-model failure prediction reaches AUC 0.88. The same primitive recovers interpretable clusters on a 363-task multi-turn corpus and on 630 adversarial agent traces, where it exposes a meta-failure mode: a 73-100 percentage-point gap between LLM-judge ASR and real execution. Cluster cohesion remains strong across all three regimes, which we take as evidence that behavioral clustering is a portable diagnosis primitive that generalizes beyond any single benchmark. We release the pipeline, three annotated corpora, and the cross-regime taxonomies.

[LG-80] Calibrating Overconfidence Without Sacrificing Confidence: Probe-Conditioned Head Intervention for LLM s

链接: https://arxiv.org/abs/2606.09876
作者: Ke Li,Chongzhe Zhang,Zifan Zeng,Feng Liu,Qunli Zhang,Zheng Hu
类目: Machine Learning (cs.LG)
*备注: 11 pages, 4 figures

点击查看摘要

Abstract:Large language models often express high confidence in answers that are wrong. Standard calibration remedies typically act globally or at the score level, reducing unwarranted confidence but also risking erosion of warranted confidence on correct answers. We introduce Probe-Conditioned Head Intervention (PCHI), an inference-time method that uses a frozen probe to detect likely wrong-but-confident responses and conditionally rescales downstream attention-head outputs during confidence generation. On Qwen3-4B-Instruct solving OpenMathInstruct problems with a structured binary confidence field, readout-token PCHI converts 82.2% of originally wrong-yes confidence readouts to \textttno , while a joint intervention across upstream confidence-template tokens reduces ECE from 21.9% to 9.2% and damages only 5.1% of originally correct-yes readouts. The readout-token effect also appears on Gemma3-4B, though upstream interventions are weaker and more mask-dependent. These results show that verbalized overconfidence can be selectively reduced through conditionally applied internal intervention, partially decoupling the suppression of unwarranted confidence from the loss of warranted confidence.

[LG-81] Disjoint or Overlapping? Inference Windowing for Reconstruction-Based Time Series Anomaly Detection

链接: https://arxiv.org/abs/2606.09874
作者: Guillaume Coulaud(UM, IROKO),Reza Akbarinia(IROKO),Florent Masseglia(IROKO)
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Reconstruction-based methods are widely used for time series anomaly detection, where models are trained to reconstruct subsequences, and anomalies are identified through reconstruction errors. However, reported results are often hard to compare due to heterogeneous evaluation practices and underspecified inference procedures. In this paper, we revisit reconstruction-based anomaly detection in the univariate offline setting and study the role of the inference stride, which controls whether subsequences are processed as disjoint windows or with overlap. We propose a unified training, tuning, and multi-seed evaluation protocol on the curated TSB-AD benchmark, and study how overlapping inference affects anomaly detection performance for a range of reconstruction models, including PCA-based baselines, DLinear, an AutoEncoder, TimesNet, and Transformer variants. The results show that across all models, overlapping windows yield consistent improvements, with average relative gain up to +28%, and can alter method rankings. We further analyze variability across datasets, random seeds, and hyperparameter configurations. Finally, we complement the benchmark study with an evaluation on the full UCR archive using localization criteria aligned with sliding-window reconstruction. Overall, our results highlight that reconstruction-based anomaly detection performance depends not only on model architecture and training, but also on inference choices, motivating a clear and reproducible protocol. Our results show that reconstructionbased baselines achieve strong performance on both TSB-AD and UCR benchmarks, supporting them as competitive and practical approaches for univariate time series anomaly detection.

[LG-82] From Confident Closing to Silent Failure: Characterizing False Success in LLM Agents ICML2026

链接: https://arxiv.org/abs/2606.09863
作者: Laksh Advani
类目: Machine Learning (cs.LG)
*备注: Accepted to FAGEN@ICML2026

点击查看摘要

Abstract:LLM agents can fail silently by asserting task completion when the environment state shows otherwise. We study this failure mode, false success, across two agent benchmarks: 9,876 tau2-bench trajectories from 8 model families and 1,879 AppWorld trajectories from 4 model families with text-independent ground truth. False success is common but varies by setting: 45–48% of failures in single-control tau2-bench domains, 3% in dual-control telecom, and 75.8% among AppWorld self-assessing coding-agent trajectories with explicit status claims. LLM judges fail reliably: no configuration across 5 judges, 5 prompt strategies, and full task specifications exceeds AUROC 0.65 on tau2-bench, and the same judges reach only 0.54 AUROC on AppWorld API-call traces. Judges rely on surface completion proxies – confident closing language in tau2-bench and coarse action-sequence volume in AppWorld – rather than verified state changes. Lightweight TF-IDF detectors achieve task-disjoint AUROC 0.83 on tau2-bench and 0.95 on AppWorld, recovering 4–8x more false successes than the best judge at the same flag rate with 3,300x lower latency. These results suggest that production monitoring should use lightweight, domain-calibrated detectors as triage signals rather than relying on LLM judges as the primary monitor for false success.

[LG-83] Uncertainty-aware Multi-fidelity Closure via Conditional Normalizing Flows

链接: https://arxiv.org/abs/2606.09857
作者: Jice Zeng,Shady E. Ahmed,David Barajas-Solano,Panos Stinis
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: No comments

点击查看摘要

Abstract:Reduced-order models (ROMs) provide an efficient surrogate for complex multiscale systems, but their predictive accuracy is often compromised by truncation errors and the inadequate representation of interactions between resolved and unresolved scales. The missing effect of truncated (unresolved) scales on ROM (resolved) scales is often denoted as the closure problem. In this work, we formulate ROM closure modeling as a multi-fidelity (MF) learning problem and propose an uncertainty-aware MF framework based on conditional normalizing flow to enhance ROM predictive accuracy. The proposed approach learns a probabilistic mapping from low-fidelity (LF) ROM coefficients to high-fidelity (HF) coefficients, thereby improving predictive fidelity while quantifying the uncertainty associated with the learned closure. Two correction strategies are investigated: direct learning, in which HF coefficients are predicted directly from LF inputs, and residual learning, which learns the discrepancy between LF and HF coefficients and uses it to recover the corrected HF solution. The framework is demonstrated on a vortex merging problem governed by the two-dimensional Navier Stokes equations. Results show that both correction strategies improve ROM accuracy over uncorrected ROM, with residual learning achieving consistently better performance than direct learning. Moreover, the two proposed deep generative model-based strategies provide uncertainty quantification for the corrected ROM coefficients, which is critical for assessing prediction confidence and supporting the reliable use of ROMs in practical applications.

[LG-84] SynIB: Informational Bottleneck for Maximizing Synergy in Multimodal Learning

链接: https://arxiv.org/abs/2606.09853
作者: Konstantinos Kontras,Teodora Gagaleska,Thomas Strypsteen,Christos Chatzichristos,Matthew Blaschko,Maarten De Vos,Paul Pu Liang
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注:

点击查看摘要

Abstract:A central objective in multimodal learning is to capture synergy: task-relevant information that arises only from the joint use of multiple modalities, and is not available from any single modality alone. While most approaches operate at the architectural level through larger or more complex fusion models, we propose a complementary axis: shaping the training objective itself. Standard training often emphasizes unimodal or redundant information, falling short on examples that require cross-modal reasoning. We formalize multimodal synergy through information theory and introduce the Synergistic Information Bottleneck (SynIB), a scalable objective that targets synergy directly. To prioritize learning synergy, SynIB motivates the model to predict accurately from all modalities while penalizing confidence when information from any modality is withheld. Alongside the standard task loss, the model runs forward passes with one modality masked at a time and is penalized for remaining confident, which would indicate reliance on unimodal cues rather than cross-modal interactions. We validate SynIB in two regimes. On synthetic XOR tasks where the ground-truth synergy is known by construction, standard training fails to recover it while SynIB does. On five real-world benchmarks, including three MultiBench affective tasks, Hateful Memes with CLIP-ViT and DeBERTa backbones, and a controllable irony extension of CREMA-D we introduce, SynIB improves accuracy on synergy-dependent examples by up to 7.8% and overall accuracy by up to 3.8%.

[LG-85] Itô maps for any-step SDEs

链接: https://arxiv.org/abs/2606.11156
作者: Zhengkai Pan,Peter Potaptchik,Wenxi Yao,Michael S. Albergo,Jakiw Pidstrigach
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent one-step generative models accelerate sampling by learning deterministic flow maps of the underlying dynamics. These methods rely on learning from ordinary differential equations, leaving open how to define an exact distillation procedure for stochastic dynamics. We introduce the Itô map, an any-step stochastic flow map that takes an intermediate state and Brownian path and predicts future states in a single pass. The Itô map formulation yields novel estimators for inference-time control by providing cheap, differentiable access to posterior samples. Empirically, Itô maps produce diverse, conditionally valid endpoint samples from fixed intermediate states and support strong steering performance on synthetic and image-generation benchmarks. These results establish any-step SDE integration as a useful primitive for posterior sampling and stochastic control.

[LG-86] DMT: Demographic Conditioning Morphology-Enhanced Transformer for Cuffless Blood Pressure Estimation from PPG Signals

链接: https://arxiv.org/abs/2606.11125
作者: Yidan Shen,Neville Mathew,Maham Rahimi,Deependra Dhakal,George Zouridakis,Xin Fu,Renjie Hu
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Blood pressure (BP) is a key marker for cardiovascular risk assessment and therapeutic decision-making, and Photoplethysmography (PPG) enables low-cost, wearable-friendly cuffless BP estimation. However, even with recent progress, many PPG-based models are trained with BP regression alone and may rely on amplitude-dominated shortcuts. In addition, demographic covariates that systematically modulate vascular compliance are often incorporated only via late fusion, limiting subject-specific representation learning. We propose a Transformer-based network for cuffless BP estimation from PPG signal, leveraging self-attention to capture long-range dependencies across multiple cardiac cycles. To account for subject-specific vascular differences, the model is conditioned on demographics via FiLM-style feature modulation applied through the attention and feed-forward sublayers of Transformer blocks. In addition, we add an auxiliary morphology head to guide the model to attend to BP-relevant waveform morphology associated with arterial stiffness and wave reflection. Under calibration-based evaluation protocols on the large-scale PulseDB dataset, the proposed method achieves MAE of 4.56 mmHg for systolic BP and 2.62 mmHg for diastolic BP, reducing errors by 47% and 50% compared with prior demographic-enhanced PPG baselines. The resulting lightweight, single-sensor model supports scalable and clinically grounded cuffless BP estimation in calibration-enabled deployment settings.

[LG-87] Generalized Conformal Predictive Systems Under Distributional Shifts

链接: https://arxiv.org/abs/2606.11044
作者: Jef Jonkers,Johanna Ziegel
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 27 pages, 10 figures

点击查看摘要

Abstract:Conformal predictive systems (CPS) output calibrated bands of CDFs under exchangeability. We extend generalized CPS to non-exchangeable settings by encoding distributional shifts through observation-specific permutation weights. This yields shift-aware predictive systems that remain valid whenever the test point is, conditionally on the unordered sample, a weighted draw from the observed atoms. Since such weights are typically estimated, we introduce weight-uncertainty boxes and construct robust CPS envelopes with finite-sample or asymptotic confidence guarantees. We derive efficient computation for conformity-measure CPS, conformal binning, and conformal isotonic distributional regression. Experiments under covariate shift and feedback-driven biomolecular design show calibrated predictive bands that widen under stronger shifts and tighten as sample size increases.

[LG-88] Range Penalization: Theoretical Insights with Applications in Federated Learning

链接: https://arxiv.org/abs/2606.10916
作者: Yiyuan She,Zhaojun Hu,Yifan Sun
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:This paper introduces range regularization for federated learning with linear systematic components to enhance statistical accuracy and induce cross-client regularity conducive to quantization, coding, and resource efficiency. Our approach identifies features with shared weights across different clients and adaptively clusters the weights of personalized features at extreme values, a process we refer to as polar clustering. Theoretical analysis of the associated estimators poses significant challenges due to the seminorm nature and non-decomposability of the regularizer. We develop new proof techniques for the nonasymptotic analysis of statistical accuracy and faithful pattern recovery. Moreover, a fast optimization algorithm that leverages varying degrees of local strong convexity is proposed to reduce iteration complexity. Experiments support the efficacy and efficiency of the proposed approach.

[LG-89] Sleep EEG Signal Criticality as a Non-Invasive Predictor of Cognitive Decline in Dementia

链接: https://arxiv.org/abs/2606.10889
作者: Stanisław Narębski,Tomasz Komendziński,Tomasz M. Rutkowski
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
*备注: 4 pages, 2 figures, accepted for publication in the Proc. 48th Annu. Int. Conf. IEEE EMBS (EMBC 2026), Toronto, Canada, July 20-24, 2026

点击查看摘要

Abstract:Early detection of neurodegeneration remains a critical clinical challenge. This study investigates whether sleep EEG signal criticality, quantified via Multifractal Detrended Fluctuation Analysis (MFDFA), serves as a non-invasive biomarker for future cognitive decline. We analyzed longitudinal data from the National Sleep Research Resource (NSRR) Study of Osteoporotic Fractures (SOF) cohort, comparing baseline sleep EEG dynamics between women who remained cognitively normal and those who later progressed to dementia-related impairment ( 3MS 78 ).Our results reveal significant group-level differences in Hurst exponent H(q) distributions, particularly during non-REM stages N2 and N3. Cognitively healthy individuals exhibited signal dynamics significantly closer to an optimally critical state across all electrode locations ( p \leqslant 0.001 ), supporting the Brain Criticality Hypothesis. Supervised UMAP projections confirmed clear spatial separation between groups throughout the overnight sleep this http URL dementia group demonstrated a shift in DFA exponents toward 1.0 , suggesting that a reconfiguration of scale-free neural dynamics during sleep precedes clinical symptoms. These findings highlight the potential for MFDFA-derived measures to be integrated into automated, sleep-based screening tools, enabling earlier preventative interventions during the prodromal window of dementia.

[LG-90] On-sky demonstration of reinforcement learning for adaptive optics control

链接: https://arxiv.org/abs/2606.10771
作者: Jalo Nousiainen,Vincent Chambouleyron,Benoit Neichel,Sylvain Cetre,Jean-Francois Sauvage,Angelie Alagao,Markus Kasper,Jonathan Dray,Romain Fetick,Byron Engler
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 11 pages, 12 figures accepted by AA

点击查看摘要

Abstract:Reinforcement learning (RL)-based algorithms have recently emerged as a promising approach for adaptive optics (AO) control. In simulations and laboratory experiments, they have demonstrated robustness to real-world effects such as photon and detector noise, misregistration, vibrations, and rapid variations in seeing conditions. However, their performance has not yet been validated on sky. We report the first on-sky demonstration of a reinforcement learning controller for adaptive optics, named Policy Optimization for AO (PO4AO). We further analyze its on-sky behavior and identify directions for improving the algorithm and its implementation.PO4AO was implemented and deployed on the Papyrus adaptive optics system installed at the Coudé focus of the 1.52 m telescope (T152) at the OHP. A Python-based implementation was interfaced with the existing real-time controller (DAO RTC) via shared-memory buffers. The performance of PO4AO was compared to that of a standard integrator controller over several nights, covering a range of flux levels and atmospheric conditions. PO4AO consistently outperformed the standard integrator in all tested configurations. The controller successfully learned and compensated for vibration patterns and demonstrated strong robustness to measurement noise. Once tuned for Papyrus, PO4AO operated in a turnkey fashion, using a single set of hyperparameters across varying observing conditions and science targets. These performance gains were achieved despite a non-optimized Python implementation introducing approximately 750,\mu\texts of additional latency, along with control jitter and occasional frame drops. When properly implemented and optimized, PO4AO constitutes a robust and high-performance turnkey controller for single-conjugate adaptive optics systems, paving the way for broader adoption of reinforcement learning strategies in on-sky AO operations.

[LG-91] Correcting Variable Importance Scored by Random Forests

链接: https://arxiv.org/abs/2606.10770
作者: Guancheng Zhou,Haiping Xu,Jason Liu,Donghui Yan
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注: 22 pages, 10 figures

点击查看摘要

Abstract:Variable importance produced by Random Forests (RF) is used widely in statistical data analysis, and has played an important role in a variety of tasks such as assisting model interpretation, model selection and diagnosis, and cost-bounded learning etc. However, the calculation of variable importance in RF does not take into account of the correlations among variables, and variables that are correlated to many other variables tend to receive a lower importance index or being completely masked (i.e., with an importance index near zero) by other strongly correlated variables. To prevent influence from unwanted correlated variables in calculating variable importance, we propose to group variables by their conditional correlations (conditional on the response variable). We explore two computationally efficient options, with one grouping variables individually, and then separates the variable of interest from all correlated variables, while the other uses clustering to group variables according to their pair-wise conditional correlations. Our experiments show that both lead to sensible corrections to the importance of variables.

[LG-92] Efficient AI-Inspired Reduction of Feynman Integrals via Tube Seeding

链接: https://arxiv.org/abs/2606.10698
作者: Justin Berman,Francois Charton,Andres Luna,Matthias Wilhelm,Mao Zeng
类目: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); High Energy Physics - Theory (hep-th)
*备注: 61 pages, 25 figures, 11 tables

点击查看摘要

Abstract:In this paper, we use machine learning to discover a new seeding strategy for integration-by-parts reduction of Feynman integrals, which is a frequent bottleneck in state-of-the-art calculations in theoretical particle and gravitational-wave physics. Our strategy allows us to reduce multi-loop integrals with large numerator powers via essentially the standard Laporta algorithm but with a sparse selection of seed integrals that grows only linearly with the numerator power, whereas existing strategies lead to growth with a polynomial power that increases with the complexity of the integral being reduced. The seeds are restricted to a thin tube-like region that connects the target integral to the master integrals along a zigzag path. We demonstrate the power of our approach by reducing non-planar 2-loop 5-point integrals of rank 20 with numerical kinematics over a finite field, which is prohibitively difficult for the Laporta algorithm with conventional seeding. Going beyond individual integrals, we further demonstrate the reduction of a complete set of top-level rank-10 integrals by dividing the target integrals into several chunks, each of which can be solved by our sparse seeding strategy with considerably less time and a significantly lower memory footprint than other state-of-the-art strategies, making the approach well-suited for phenomenological applications. We provide a proof-of-principle implementation on GitHub at this https URL.

[LG-93] An adaptive framework for the axisymmetric pulsar magnetosphere using physics-informed Kolmogorov-Arnold networks

链接: https://arxiv.org/abs/2606.10686
作者: Spyros Rigas,Ioannis Contopoulos,Georgios Alexandridis,Antonios Nathanail
类目: Computational Physics (physics.comp-ph); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注: 25 pages, 10 figures. Submitted to Journal of Computational Physics

点击查看摘要

Abstract:The pulsar magnetosphere has only recently been addressed using Physics-Informed Neural Networks (PINNs), by deploying a domain-decomposition approach and treating the separatrix and equatorial current sheet as infinitesimally thin discontinuities. However, this baseline requires extensive manual hyperparameter tuning, achieves limited final accuracy and demands several hours of training. We refine this framework by introducing domain-specific neural architectures based on Kolmogorov-Arnold networks, an automated adaptive training pipeline and a physics-based convergence criterion that eliminate the need for manual calibration. The proposed methodology delivers self-consistent axisymmetric magnetosphere solutions with mean squared errors of the PDE residuals at O(1e-6) in double precision - an improvement of two orders of magnitude over the baseline - while achieving convergence in under 20 minutes in single precision. Importantly, the method reliably resolves stellar radii reduced by up to 80% compared to the baseline, overcoming the severe spatial scale disparities that also challenge traditional solvers. Furthermore, by varying the flux that opens to infinity, we provide a correction to the equation that connects it to the equatorial T-point’s position. The complete framework is released as the open-source library PulsarX.

[LG-94] ClusBench: The Clustering Benchmark Data Resource Youve All Been Waiting For (?)

链接: https://arxiv.org/abs/2606.10673
作者: David P. Hofmeyr
类目: Other Statistics (stat.OT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Although some very common test beds exist for assessing the performance of clustering methods, large scale benchmarking is typically limited to relatively simplistic simulation set-ups. Here we describe the production and curation of close to 3000 synthetic data sets, derived from more than 200 publicly available data sets; the majority of which arose from real-world applications. By fitting a flexible non-parametric distribution to each base data set we are able to retain much of the nuance in real-world data which is difficult to reproduce in standard simulations, while also producing data sets whose sizes are sometimes substantially greater than the data sets from which they are derived. The synthetic data sets, plus an accompanying R package, are available for download from this https URL.

[LG-95] Accelerating SAV-based optimization via randomized low-rank Hessian approximation

链接: https://arxiv.org/abs/2606.10562
作者: Ryo Sagawa,Daisuke Furihata,Yuto Miyatake
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 25 pages, 4 figures

点击查看摘要

Abstract:We propose a new optimization method, the Nyström-enhanced relaxed scalar auxiliary variable method (N-RSAV), which incorporates curvature information into the RSAV framework to accelerate convergence while preserving an unconditional modified energy dissipation law. Existing RSAV-based methods rely solely on first-order information and often suffer from slow convergence, particularly for ill-conditioned problems such as those arising in physics-informed neural networks (PINNs). To address this limitation, we design the linear operator in the RSAV scheme using approximate Hessian information obtained from a randomized low-rank Nyström approximation. To preserve the dissipation structure, we enforce positive semidefiniteness through eigenvalue truncation. Furthermore, we introduce an adaptive strategy that reuses the approximate Hessian based on the deviation between the original and modified energies, significantly reducing computational cost. We also provide a convergence analysis of the RSAV scheme with a general positive semidefinite operator under the Polyak-Lojasiewicz (PL) condition and establish corresponding convergence guarantees for N-RSAV under the PL condition and an additional convexity assumption. Numerical experiments on ill-conditioned problems with effectively low-rank structure, including convex quadratic problems and training of PINNs, demonstrate that the proposed methods achieve substantially faster convergence than conventional RSAV-based approaches.

[LG-96] Unsupervised Deep Learning for Limited-Angle STEM-EDX Tomography – Application to 3D Chemical Analysis of Phase-Change Memory Devices

链接: https://arxiv.org/abs/2606.10547
作者: Daniel del Pozo Bueno,Serge Brosset,Theo Monniez,Gabriele Navarro,Philippe Ciuciu,Zineb Saghi
类目: Image and Video Processing (eess.IV); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Instrumentation and Detectors (physics.ins-det)
*备注: 29 pages (17 main manuscript + 12 supplementary information), 4 figures, 8 supplementary figures, 1 table, and 4 supplementary tables

点击查看摘要

Abstract:Energy Dispersive X-ray (EDX) tomography in Scanning Transmission Electron Microscopy (STEM) enables 3D compositional and elemental mapping at the nanoscale, but its use is limited by restricted tilt ranges and low-dose conditions required to avoid beam damage. Limited-angle acquisition introduces missing-wedge artefacts such as elongation and anisotropic resolution, while noisy low-dose data further degrade reconstruction quality and quantitative reliability. Here, we introduce an unsupervised deep learning framework based on Deep Image Prior with total variation regularization (DIP-TV) for limited-angle STEM-EDX tomography. We extend it to a multi-channel formulation (DIPm-TV) that jointly reconstructs multiple elemental maps by exploiting spatial correlations. Using a synthetic 3-channel phantom, we show that the method compensates for severe missing-wedge artefacts corresponding to approximately 100^\circ of missing angular range under moderate noise, outperforming simultaneous iterative reconstruction technique and compressed sensing approaches. We apply the method to 3D chemical analysis of Ge-Sb-Te (GST) memory devices in virgin (as-fabricated) and SET (crystalline) operational states. Samples were prepared as cross-sectional focused ion beam lamellae and acquired under a limited-angle tilt range from -40^\circ to +40^\circ with 5^\circ steps and a dose of 2.0\times10^5 e^-/Ang^2 . The multi-channel approach enables voxel-by-voxel elemental reconstruction using only EDX signals without external structural priors such as high-angle annular dark-field imaging. The reconstructed volumes show near-isotropic spatial resolution and reveal compositional heterogeneities associated with device operation. This approach enables 3D chemical characterization in experimentally accessible sample geometries where conventional methods fail due to severe angular limitations.

[LG-97] Bidirectional Random Projections

链接: https://arxiv.org/abs/2606.10377
作者: Chao Lan,Luyuan Yang
类目: atistics Theory (math.ST); Machine Learning (cs.LG)
*备注: Statistics Probability Letters (Elsevier)

点击查看摘要

Abstract:This paper analyzes bidirectional random projections for ordinary least squares (OLS) regression under the fixed design setting. Let (X,Y) \in \mathbbR^n \times p \times \mathbbR^n be a sample and R \in \mathbbR^n_1 \times n, W \in \mathbbR^p \times p_1 be two properly distributed random projections. We develop an expected excess loss bound for the OLS estimator built on (WXR, WY) . Compared to an established bound for OLS estimator built on (XR, Y) , the gap is approximately O\left( p_1 + C \frac1p_1 \right) , where C scales with n_1/n and can be negative for small n_1/n . Its implications are confirmed by numerical results on real-world data.

[LG-98] Near-Exponential Convergence Rates for kNN Classification based on Boltzmann Margin UAI

链接: https://arxiv.org/abs/2606.10361
作者: Luyuan Yang,Shayan Shafaei,Chao Lan
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Conference on Uncertainty in Artificial Intelligence (UAI)

点击查看摘要

Abstract:Convergence-rate analysis for classifiers is often conducted under either Tsybakov margin or Massart margin. The former is a relatively weak condition that typically yields polynomial rates, while the latter is substantially stronger but can guarantee exponential rates. In this paper, we introduce a new condition, called Boltzmann margin, that bridges the gap between these two regimes. It is weaker than Massart margin, generally stronger than Tsybakov margin, and can imply many of their properties under suitable conditions. We apply Boltzmann margin to the analysis of kNN classifiers and establish the first near-exponential convergence rates for kNN classification. We also present extensions of the main results and provide numerical evidence supporting the main theoretical implications.

[LG-99] Magnetic HIP-NN for spin dynamics in disordered itinerant magnets

链接: https://arxiv.org/abs/2606.10349
作者: Supriyo Ghosh,Yunhao Fan,Sheng Zhang,Kipton Barros,Gia-Wei Chern
类目: Disordered Systems and Neural Networks (cond-mat.dis-nn); Strongly Correlated Electrons (cond-mat.str-el); Machine Learning (cs.LG)
*备注: 12 pages, 5 figures

点击查看摘要

Abstract:We present a magnetic extension of the Hierarchically Interacting Particle Neural Network (HIP-NN) that enables large-scale simulations of electron-mediated spin dynamics in disordered itinerant magnets. The resulting magnetic HIP-NN (mHIP-NN) incorporates rotationally invariant spin correlations directly into hierarchical message-passing layers, enabling the network to learn emergent magnetic energy landscapes and effective local fields from coupled geometric-spin environments while preserving spin-rotation symmetry. As a benchmark application, we consider structurally disordered itinerant s - d exchange models in which the effective magnetic forces arise dynamically from the instantaneous electronic structure and are computationally prohibitive to evaluate using conventional exact-diagonalization-based approaches. We show that mHIP-NN accurately reproduces the local torques governing Landau-Lifshitz-Gilbert dynamics and faithfully captures the nonequilibrium evolution of spatial spin correlations following thermal quenches. Our results establish symmetry-aware hierarchical message-passing networks as an efficient and scalable framework for large-scale simulations of frustrated itinerant spin systems and nonequilibrium magnetic dynamics. More broadly, because the learned energy functional remains fully differentiable with respect to both atomic coordinates and spin variables, the framework also provides a natural foundation for spin-dependent interatomic potentials and coupled atom-spin dynamics.

[LG-100] k-Nearest Neighbors in Gromov–Wasserstein Space

链接: https://arxiv.org/abs/2606.10295
作者: Kaitlyn Hohmeier,Nicolas Fraiman,Caroline Moosmueller
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:The Gromov–Wasserstein (GW) distance provides a framework for comparing metric measure spaces, regardless of their underlying structure or geometry. For network-based data, it enables direct comparisons of graphs with different numbers of nodes, without requiring an embedding or other abstraction. Furthermore, through a variant of GW known as fused Gromov–Wasserstein (fGW), it is also possible to incorporate node features in addition to graph structure. In this work, we implement k -nearest neighbors ( k -NN) classification using the GW and fGW distances. We prove the universal consistency of the GW- k -NN classifier on the space of equivalence classes of metric measure spaces with finite support and uniform probability measure. By viewing graphs as finitely supported metric measure spaces equipped with the pairwise distance metric and a uniform probability measure on the nodes, we obtain universal consistency of GW- k -NN for the space of graphs. Likewise for fGW- k -NN, we prove universal consistency on the space of weak isomorphism classes of structured objects consisting of metric measure spaces with finite support and uniform probability measure and feature maps into Euclidean space, thus establishing universal consistency on the space of node-attributed graphs. Our numerical experiments show that GW- k -NN and fGW- k -NN consistently perform well across multiple graph datasets, suggesting that metric classifiers such as k -NN work well in the GW framework.

[LG-101] ANCHOR: Autoregressive Non-intrusive Chunk-Ordered Refinement for Joint Multi-Resolution Speech Quality Modeling INTERSPEECH2026

链接: https://arxiv.org/abs/2606.10233
作者: Zhuoyan Tao,Jiatong Shi,Hye-jin Shim,Shinji Watanabe
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注: Accepted at Interspeech 2026

点击查看摘要

Abstract:While speech quality is typically assessed on complete utterances, streaming and generative systems require incremental estimation from partial audio. Existing predictors assume full context, degrading on prefix-constrained inputs. Extending ARECHO, we propose ANCHOR, reformulating incremental assessment as a multi-resolution autoregressive task. It models chunk- and utterance-level quality within a single decoder using dual-resolution tokens and a resolution-aware hierarchy for coarse-to-fine refinement. Experiments show substantial robustness under partial input, including a 48% PLCMOS error reduction on 2-second prefixes. Convergence analysis reveals a 4-6 s effective perceptual context horizon. A stress test further isolates structured extrapolation biases under localized corruption. Results demonstrate that hierarchical supervision improves incremental prediction and elucidates how perceptual quality accumulates over time.

[LG-102] Decision-Calibrated Conformal Uncertainty for Pacing Decisions in Streaming Advertising

链接: https://arxiv.org/abs/2606.10187
作者: Prashant Shekhar,Caroline Howard
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We develop a decision-calibrated conformal framework for pacing decisions in streaming advertising. Pacing depends on uncertain future inventory, demand pressure, incremental response, and member-experience load. Instead of calibrating a generic forecast residual, the framework measures forecast error by its largest impact on the policies that could actually be deployed. The main theorem shows that the proposed score is the smallest valid uncertainty measure that uniformly protects all deployable pacing policies. Geometrically, it is the support function of the signed policy sensitivity set. Split conformal calibration gives finite-sample coverage for this score. A high-dimensional separation theorem shows that traditional residual calibration can be arbitrarily more conservative by paying for nuisance inventory dimensions, and a robust pacing result combines inventory, response, and experience uncertainty. On public-data-calibrated pacing replays built from Criteo Uplift and KuaiRand datasets, traditional conformal pacing remains unresolved with high residual radii of 7236.7 on Criteo and 4629.4 on KuaiRand. With the proposed decision calibration approach, the uncertainty radii are reduced to 18.4 and 278.6 respectively, with separate margins for value, delivery, budget, and member load. On Criteo, the proposed method certifies a less aggressive pacing policy than the point-forecast baseline, and reduces held-out any-violation rate from 16.7% to 3.3%, with zero budget and member-load violations. On KuaiRand, the choice remains unresolved. In a nutshell, the paper establishes that forecasts, response estimates, and member-experience models should be judged by whether they shrink the uncertainty that the pacing decision uses, as this leads to confident decisions that are not overly conservative.

[LG-103] rainability of IQP Quantum Circuit Born Machines Under Gaussian Initialization

链接: https://arxiv.org/abs/2606.10179
作者: Gennaro De Luca
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 23 pages

点击查看摘要

Abstract:Quantum Circuit Born Machines (QCBMs) offer a natural approach to generative machine learning by leveraging the Born rule. Recent work has provided a method to classically train QCBMs with Instantaneous Quantum Polynomial (IQP) circuits via the Maximum Mean Discrepancy (MMD) loss. Despite the assumed intractability of sampling from IQP circuits classically, their expectation values can be computed classically, enabling training of these IQP QCBMs. However, quantum machine learning (QML) models have various other challenges, including trainability issues caused by exponential concentration or barren plateaus. While these issues have been explored for parameters sampled from a uniform distribution, little work has been done to rigorously treat the use of arbitrary Gaussian initialization schemes. This work leverages Stein’s lemma and Lipschitz concentration bounds for Gaussian random variables to provide an analytical lower bound of the variance of the gradient and a probabilistic concentration bound of the deviation of the gradient from its mean. It discusses strategies to either avoid or encourage exponential concentration, as well as the conditions under which barren plateaus are more likely to occur.

[LG-104] Effective Training Principles of Physical Reservoirs

链接: https://arxiv.org/abs/2606.10130
作者: Sobhi Saeed,Mehmet Müftüoglu,Glitta R. Cheeran,Juliane Heim,Bennet Fischer,Mario Chemnitz
类目: Optics (physics.optics); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
*备注: 19 pages, 7 figures

点击查看摘要

Abstract:Reservoir computers benefit from the inherent complexity of optical phenomena, which provide rich, often nonlinear dynamics. However, training directly on the reservoir’s output renders the system prone to overfitting and computationally inefficient during the training phase. In this work, we investigate strategies to mitigate overfitting and reduce computational overhead through output pruning and regularization. We compare loss-minimizing search methods (Equal Search and Branch and Bound) against an output-oriented statistical filtering approach (Variance Filter) and random pruning, highlighting advantages and disadvantages of each approach and the overall importance of informed reservoir output sampling, particularly for a shrinking latent space. We further demonstrate that enforcing readout selection across the full output spectrum improves performance, especially for non-iterative methods. Additionally, we examine L1 and L2 regularization techniques (LASSO and ridge regression), both of which significantly enhance performance on highly nonlinear tasks such as the Spiral Benchmark. While our methods are of general use, results are obtained from and discussed exemplarily for a nonlinear fiber-optical extreme learning machine. Overall, this study provides a deep analysis of the reservoirs’ hidden-layer filtering mechanisms and the output-layer training, enabling optimized performance in physical reservoir computing systems.

[LG-105] Robust Active Learning for Few-Shot Example Selection in Text-to-SQL

链接: https://arxiv.org/abs/2606.10125
作者: Arash Pourhabib
类目: Machine Learning (stat.ML); Databases (cs.DB); Machine Learning (cs.LG)
*备注: 31 pages, 4 figures, 5 tables

点击查看摘要

Abstract:Few-shot example retrieval is the dominant paradigm for grounding large language models (LLMs) in domain-specific text-to-SQL systems. However, the quality of the annotated example bank directly governs system accuracy, and expert annotation is prohibitively expensive. We formalize the active selection of these examples as a constrained experimental design problem over the intrinsic, low-dimensional manifold of semantic query embeddings. Unlike standard active learning frameworks, our setting introduces three critical challenges: varying, query-dependent annotation reliability (heteroscedasticity), strict requirements for spatial diversity across semantic topics (partition matroid constraints), and the inherent reality that the true covariance structure of the embedding space is unknown (misspecification). To address these, we propose a stratified greedy algorithm that maximizes a heteroscedastic mutual information objective. We prove that this objective remains submodular and approximately monotonic on the intrinsic manifold, yielding a theoretical constant-factor approximation guarantee. We establish a spectral bound demonstrating that this approximation guarantee degrades gracefully, rather than catastrophically, when the assumed surrogate kernel diverges from the true underlying data-generating process. Empirical results demonstrate that the proposed strategy significantly reduces labeling effort while maintaining high text-to-SQL retrieval accuracy.

[LG-106] Convergence Rates for Neural-Network Estimation with Current-Status Data

链接: https://arxiv.org/abs/2606.10119
作者: Yuan Wu,Tianhui Zhou
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:Current-status data arise when an event time is observed only through an indicator of whether it occurred before an examination time. This paper studies a nonparametric neural-network sieve maximum likelihood estimator of the conditional cumulative distribution function of the event time. Under Hölder smoothness assumptions, we establish an explicit convergence rate by combining approximation theory for rectified linear unit neural networks with empirical-process arguments. This result provides theoretical support for neural-network estimation and subsequent inference under current-status observation.

[LG-107] Learning the Universe: Posterior Reliability of Neural Generative Models in High-Dimensional Field-Level Inference of Cosmic Initial Conditions

链接: https://arxiv.org/abs/2606.10023
作者: Ludvig Doeser,Jens Jasche
类目: Cosmology and Nongalactic Astrophysics (astro-ph.CO); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注: This is a Learning the Universe publication. 19 pages, 18 figures

点击查看摘要

Abstract:Accurate posterior estimation is central to scientific inference, as uncertainties determine what can be reliably learned from observational data. While Markov chain Monte Carlo methods provide asymptotic convergence guarantees, they are computationally demanding in high-dimensional settings. Neural network-based generative models for entire discretized 3D fields enable fast amortized inference but often lack convergence guarantees and principled accuracy assessment. Using Hamiltonian Monte Carlo to obtain reference posterior samples, we conduct a controlled field-level evaluation of an implicit generative model (Stochastic Interpolants) and an explicit likelihood-based model (GLOW normalizing flows). This comparison, unavailable in typical applications, enables the detection of posterior geometry failures that standard metrics cannot capture. As a case study, we consider the cosmological inverse problem of inferring cosmic initial conditions from present-day large-scale structure. To match the precision of modern cosmological data, this problem increasingly relies on complex, non-linear, and non-differentiable simulators, which are incompatible with gradient-based inference frameworks. Generative models offer a route to address these challenges, provided their inferred posteriors are reliable. In this work, we show that matching posterior means, marginal distributions, or achieving high cross-correlation does not imply correct uncertainty structure, as revealed by posterior variance fields and sample-based evaluations. Through this work, we aim to raise awareness of the challenges of uncertainty estimation in high-dimensional field-level settings, highlighting the importance of careful design and validation of neural generative approaches for scientific applications.

[LG-108] JGRA: Jacobian Geometry Robustness Assessment in NISQ Noise-Aware Quantum Neural Networks CCL2026

链接: https://arxiv.org/abs/2606.09964
作者: Gianluca Scanu,Luca Barletta,Stefano Rini
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: Accepted at IEEE qCCL 2026. Author accepted manuscript. 6 pages

点击查看摘要

Abstract:The NISQ era places stringent constraints on quantum computation, where noise and decoherence fundamentally limit performance. In classical deep learning, model robustness and resilience to perturbations are well studied: deep neural networks (DNNs) maintain high performance despite pruning, noise injection, and structural perturbations due to inherent redundancy in their representations. A central challenge in quantum machine learning is to transfer this notion of robustness to quantum neural networks (QNNs) under realistic NISQ noise. While classical deep learning exhibits robustness through structural redundancy, analogous principles for QNNs remain underdeveloped. We propose JGRA: a framework for assessing robustness in noise-aware QNNs via Jacobian geometry, capturing model sensitivity to parameter perturbations induced by noise. Our method includes entropy-matched noise calibration, noise-aware training, and noise-conditioned Jacobian extraction, yielding geometric descriptors that link clean-regime structure to noisy inference behaviour. We also empirically demonstrate that these descriptors encode predictive information about robustness under unseen noise.

[LG-109] Stochastic weather generators for high-frequency wind vector time series

链接: https://arxiv.org/abs/2606.09941
作者: Mingshi Cui,Kevin Eng,Justin T. Greene,Zern Ke,Abolfazl Sodagartojgi,Zhiqiu Xia,Gemma E. Moran,Michael L. Stein
类目: Applications (stat.AP); Machine Learning (cs.LG); Other Statistics (stat.OT)
*备注:

点击查看摘要

Abstract:Surface winds can vary substantially from one minute to the next, so there is scope for studying its variation on this fine time scale. Restricting to the month of June to minimize seasonality, this work develops a range of machine learning models for generating realistic time series of surface wind vectors at a site in Lamont, Oklahoma based on more than 30 years of high quality measurements at the minute time scale. Such a generator could be used as an input into models from a range of disciplines, notably for wind energy, but also wildfire spread and aviation, among others. The data show complex diurnal structures in both wind speed and direction that would be challenging to capture with standard time series models, so we consider a number of machine learning approaches to producing a stochastic wind generator based on time vector-quantized variational autoencoders. We consider generating a day’s worth of data at a time and generating a day of wind vectors conditional on the previous day’s winds. We also study methods for incorporating a discrete weather state variable in the generator. We evaluate the generators using a wide range of formal and informal methods. The best of these generators can capture many but not all of the complex features present in the observational data. In particular, the best of our approaches accurately mimic diurnal changes in wind volatility but struggle to match the observed distribution of extreme wind speeds.

[LG-110] Sampling Triangulations and Calabi-Yau Threefolds with Autoregressive GNNs

链接: https://arxiv.org/abs/2605.27770
作者: Nate MacFadden
类目: High Energy Physics - Theory (hep-th); Machine Learning (cs.LG)
*备注: 50 pages, 27 figures, 3 tables

点击查看摘要

Abstract:We introduce dualGNN', an autoregressive message-passing GNN for sampling fine, regular triangulations (FRTs) of convex polytopes. dualGNN operates on a generalization of the dual graph of a triangulation, with edges labeled by signed circuits’ – combinatorial invariants from oriented matroid theory which we show are both necessary and sufficient for exposing regularity. The model is independent of the number of points in the polytope and invariant under the polytope’s orientation-preserving symmetries ( \mathrmSL(d,\mathbbZ) \ltimes \mathbbZ^d ). When implemented with a certain masking procedure, one can also guarantee that every rollout produces a fine triangulation (in 2 D). On unseen polygons with N_\mathrmpts \leq 40 , dualGNN is the most uniform FRT sampler we tested, and even a model trained on a single polygon generalizes well to other polygons. The model is small ( \sim92 k parameters), trains in \sim7.5 hours on a single consumer GPU, and runs without modification on an M1 MacBook Pro. We apply dualGNN to string theory, uniformly sampling Calabi-Yau threefolds at h^1,1=86 and consistent with uniformity at h^1,1=128 . This is an order of magnitude beyond previous learned methods with a model \sim1000\times smaller. Code, training scripts, and pretrained models are available at this https URL .

附件下载

点击下载今日全部论文列表

Arxiv今日论文 | 2026-06-10

目录

概览 (2026-06-10)

多智能体系统

自然语言处理

信息检索

人机交互

计算机视觉

人工智能

机器学习

附件下载